Kùzu 0.0.7 Release

Kùzu Team
Kùzu Team
Developers of Kùzu Inc.

We are very happy to release Kùzu 0.0.7 today! This release comes with the following new main features and improvements: To install the new version, please visit the download section of our website and getting started guide. The full release notes are here.

Macro and UDF

Create Macro Statements

In this release, we’ve added the support of CREATE MACRO statement to define customized scalar functions, i.e., those that return only a single value, through Cypher.

Here is an example of defining a macro to add two input parameters. The second parameter b:3 is an example of how to provide a default value for a parameter in case the parameter is absent.

// Create a macro which adds two parameters. If the second parameter b is not provided, the default value of 3 will be used instead.
CREATE macro addWithDefault(a,b:=3) AS a + b;
// Executes the macro without providing the default value.
RETURN addWithDefault(2);  // returns 5 (2 + 3)
// Executes the macro by providing the default value (actual parameter value will be used).
RETURN addWithDefault(4, 7);  // returns 11 (4 + 7)

See more details on supported macro expression types here.

C++ UDFs

We are also introducing two C++ interfaces, createScalarFunction and createVectorizedFunction in the Connection class of the C++ API to define both scalar and vectorized UDFs.

createScalarFunction provides a way for users to define scalar functions in C++ and use it in Kùzu as if they’re built-in functions. Here is an example of a unary scalar function that increments the input value by 5:

static int32_t addFiveScalar(int32_t x) {
    return x + 5;
}
// Register the unary scalar function using the createScalarFunction API.
conn->createScalarFunction("addFiveScalar", &addFiveScalar);
// Issue a query using the UDF.
conn->query("MATCH (p:person) return addFiveScalar(to_int32(p.age))");

For users familiar with internals of our intermediate result representation, they can make use of createVectorizedFunction to create vectorized function over our ValueVectors to achieve better performance. See our doc here for more details.

Data Update and Return Clauses

Merge Clause

This release implements the MERGE clause, which is an updating clause that will first try to match the given pattern and, if not found, create the pattern. At a high level, MERGE <pattern> can be interpreted as If MATCH <pattern> then RETURN <pattern> ELSE CREATE <pattern>.Additionally, one can further specify the SET operation based on whether the pattern is found or not through ON CREATE and ON MATCH.

For example, the following query tries to merge a user node with name “Adam”. Suppose a node with name “Adam” exists in the database already. In this case, we update the same node’s age property and return the node (so no new node gets inserted).

MERGE (n:User {name : 'Adam'}) ON MATCH SET n.age = 35 RETURN n.*;
------------------
| n.name | n.age |
------------------
| Adam   | 35    |
------------------

Here is another example where we try to merge a Follows edge with since property equal to 2022 between Adam and Karissa. Suppose no such edge exists in the database, then the statement create the edge and set the since property to 1999.

MATCH (a:User), (b:User) 
WHERE a.name = 'Adam' AND b.name = 'Karissa' 
MERGE (a)-[e:Follows {since:2022}]->(b) 
ON CREATE SET e.since = 1999
RETURN e;
---------------------------------------------------------
| e                                                     |
---------------------------------------------------------
| (0:0)-{_LABEL: Follows, _ID: 0:5, since: 1999}->(0:1) |
---------------------------------------------------------

See our doc here for more details.

Multi-label Set/Delete

Kùzu now allows set/delete on nodes and relationship variables that can be binding to multiple labels. For example, to delete all nodes in database (assuming all edges have been deleted).

MATCH (n) DELETE n;

Similarly, to set since property of all relationships in the database.

MATCH ()-[f]->() SET f.since = 2023

Note that when evaluating this query, tuples in tables that don’t have since property will be ignored.

See our docs in Set and Delete for more details.

Return After Update

We are also enabling return after updating clause starting from this release. That is updated value will be returned in queries that update values. Here are some examples:

MATCH (u:User)
WHERE u.name = 'Adam' SET u.age = NULL
RETURN u.*;
------------------
| u.name | u.age |
------------------
| Adam   |       |
------------------
MATCH (u1:User), (u2:User)
WHERE u1.name = 'Adam' AND u2.name = 'Noura' 
CREATE (u1)-[e:Follows {since: 2011}]->(u2)
RETURN e;
---------------------------------------------------------
| e                                                     |
---------------------------------------------------------
| (0:0)-{_LABEL: Follows, _ID: 0:5, since: 2011}->(0:3) |
---------------------------------------------------------

See our docs in Set and Delete for more examples.

Return with .*

Kùzu now provides syntactic sugar for returning all properties of a node or relationship with *.

MATCH (a:User) RETURN a.*;
-------------------
| a.name  | a.age |
-------------------
| Adam    | 30    |
-------------------
| Karissa | 40    |
-------------------
| Zhang   | 50    |
-------------------
| Noura   | 25    |
-------------------

See our doc here for more details.

Data Export

Kùzu now supports exporting query results to CSV files using the COPY TO command. For example the following COPY TO statement could return the below CSV file.

COPY (MATCH (u:User) RETURN u.*) TO 'user.csv';

CSV file:

u.name,u.age
"Adam",30
"Karissa",40
"Zhang",50
"Noura",25

See Data Export for more information.

New Data Types and APIs

MAP

A MAP is a dictionary of key-value pairs where all keys have the same type and all values have the same type. Different from STRUCT, MAP doesn’t require the same key to be present in each row. Therefore, MAP is more suitable when the schema is not determined.

RETURN map([1, 2], ['a', 'b']) AS m;
--------------
| m          |
--------------
| {1=a, 2=b} |
--------------

See map for more information.

UNION

Kùzu’s UNION is implemented by taking DuckDB’s UNION type as a reference. Similar to C++ std::variant, UNION is a nested data type that is capable of holding multiple alternative values with different types. The value under key “tag” is considered as the value being currently hold by the UNION.

See union for more information.

Converting Query Results to Arrow

In previous releases, we supported converting query result to Arrow tables in our Python API. In this release, converting to Arrow arrays are now also available in Rust, C (see kuzu_query_result_get_arrow_schema and kuzu_query_result_get_next_arrow_chunk), and C++ (see getArrowSchema and getNextArrowChunk) APIs.

NodeGroup Based Node Table Storage

This release introduces changes the storage layout of node tables. Before this release, we used to store each column in a node table contiguously in separate files. Each column contains one data file (e.g., n-1.col) and one null file (e.g., n-1.null) if the column may contain null values. This design posed two problems: 1) it requires maintaining many files in the database directory, which may lead to too many open files error; 2) it is not suitable for data compression. Although we still don’t implement compression yet (this will wait until the next few releases), this design would force us to adopt a single compression technique for the entire column.

Instead, partitioning each column into multiple chunks can offer more flexibility as each column chunk can be compressed and decompressed independently. In this release, we introduced the concept NodeGroup, which is equivalent to RowGroup and represents a horizontal partition of a table.1 With node group-based storage design, we also store data of all columns in a single file data.kz.2 This will enable more powerful compression schemes, e.g., constant compression, bit-packing, dictionary compression in the coming releases. For details on our new design, please visit this issue.

Unnesting Arbitrary Subqueries

Consider the following query that finds the name of users a who have at least 1 user b who is younger than a:

MATCH (a:User) 
WHERE EXISTS { MATCH (a)-[:Follows]->(b:User) WHERE a.age > b.age} 
RETURN a.name;

The query inside EXISTS is a correlated subquery and very expensive to evaluate because the inner subquery needs to be evaluated for each a with a nested loop join operator (which is often an inefficient way to evaluate joins). In this release, we implemented an optimization that unnests correlated subqueries based on the techniques adopted from this paper Unnesting Arbitrary Queries by Neumann and Kemper. This allows us to use hash joins instead of nested loop joins and execute these queries much faster. More details will come in a separate blog post on both this technique and how much gains we obtain.

Footnotes

  1. We use the term NodeGroup mainly due to that we also partition rel tables based on their src/dst nodes, instead of number of rows.

  2. Primary key index files are still kept separately, but eventually they will also be merged into the data.kz file.