We are very happy to release Kùzu 0.0.7 today! This release comes with several new main features and improvements, described in the sections below. To install the new version, please visit the download section of our website and our getting started guide. The full release notes are here.
Macro and UDF
Create Macro Statements
In this release, we’ve added support for the CREATE MACRO statement to define customized scalar functions, i.e., those that return only a single value, through Cypher.
Here is an example of defining a macro that adds two input parameters. The second parameter, b:=3, shows how to provide a default value that is used when the argument is omitted.
// Create a macro which adds two parameters. If the second parameter b is not provided, the default value of 3 will be used instead.
CREATE MACRO addWithDefault(a,b:=3) AS a + b;
// Execute the macro without passing b, so the default value of 3 is used.
RETURN addWithDefault(2); // returns 5 (2 + 3)
// Execute the macro with both arguments, so the provided value of b overrides the default.
RETURN addWithDefault(4, 7); // returns 11 (4 + 7)
See more details on supported macro expression types here.
C++ UDFs
We are also introducing two C++ interfaces, createScalarFunction and createVectorizedFunction, in the Connection class of the C++ API to define scalar and vectorized UDFs, respectively.
createScalarFunction lets users define scalar functions in C++ and use them in Kùzu as if they were built-in functions.
Here is an example of a unary scalar function that increments the input value by 5:
static int32_t addFiveScalar(int32_t x) {
    return x + 5;
}
// Register the unary scalar function using the createScalarFunction API.
conn->createScalarFunction("addFiveScalar", &addFiveScalar);
// Issue a query using the UDF.
conn->query("MATCH (p:person) return addFiveScalar(to_int32(p.age))");
Users familiar with the internals of our intermediate result representation can use createVectorizedFunction to create vectorized functions over our ValueVectors and achieve better performance.
See our doc here for more details.
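To illustrate why a batch-oriented function can outperform a per-value one, here is a minimal sketch in plain C++. It deliberately does not use Kùzu's ValueVector type or the actual createVectorizedFunction signature, which are described in the linked documentation; std::vector is only a stand-in for a batch of values, and the name addFiveBatch is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar style: called once per input value.
int32_t addFiveScalar(int32_t x) {
    return x + 5;
}

// Batch ("vectorized") style: called once per batch of values. The caller
// passes a pre-sized result buffer, and the tight loop amortizes the call
// overhead over the whole batch.
// Assumption: std::vector stands in for a real value vector type.
void addFiveBatch(const std::vector<int32_t>& input, std::vector<int32_t>& result) {
    assert(result.size() == input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        result[i] = input[i] + 5;
    }
}
```

With the scalar version, the engine pays one function call per value; with the batch version, it pays one call per batch, which is the main idea behind vectorized UDFs.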
Data Update and Return Clauses
Merge Clause
This release implements the MERGE clause, an updating clause that first tries to match the given pattern and, if no match is found, creates the pattern. At a high level, MERGE <pattern> can be interpreted as IF MATCH <pattern> THEN RETURN <pattern> ELSE CREATE <pattern>. Additionally, one can specify SET operations that depend on whether the pattern was found, through ON CREATE and ON MATCH.
For example, the following query tries to merge a user node with name “Adam”. Suppose a node with name “Adam” exists in the database already. In this case, we update that node’s age property and return the node (so no new node gets inserted).
MERGE (n:User {name : 'Adam'}) ON MATCH SET n.age = 35 RETURN n.*;
------------------
| n.name | n.age |
------------------
| Adam | 35 |
------------------
Here is another example where we try to merge a Follows edge with the since property equal to 2022 between Adam and Karissa. Suppose no such edge exists in the database. Then the statement creates the edge and sets the since property to 1999.
MATCH (a:User), (b:User)
WHERE a.name = 'Adam' AND b.name = 'Karissa'
MERGE (a)-[e:Follows {since:2022}]->(b)
ON CREATE SET e.since = 1999
RETURN e;
---------------------------------------------------------
| e |
---------------------------------------------------------
| (0:0)-{_LABEL: Follows, _ID: 0:5, since: 1999}->(0:1) |
---------------------------------------------------------
See our doc here for more details.
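The match-or-create behavior described above can be sketched in plain C++ over an in-memory list. This is only an illustration of the semantics, not Kùzu's implementation: the User struct, the mergeUser helper, and the callback parameters are all hypothetical, and for simplicity the sketch stops at the first matching entry.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a User node table.
struct User {
    std::string name;
    std::optional<int64_t> age;
};

// MERGE-like helper: look for a user with the given name. If one exists, run
// onMatch on it; otherwise append a new user and run onCreate on it.
// Returns a reference to the matched or newly created user. The reference is
// only valid until the vector is modified again.
User& mergeUser(std::vector<User>& users, const std::string& name,
                const std::function<void(User&)>& onMatch,
                const std::function<void(User&)>& onCreate) {
    for (User& u : users) {
        if (u.name == name) {
            if (onMatch) onMatch(u);   // corresponds to ON MATCH SET ...
            return u;
        }
    }
    users.push_back(User{name, std::nullopt});
    if (onCreate) onCreate(users.back());  // corresponds to ON CREATE SET ...
    return users.back();
}
```

Calling mergeUser for an existing name only runs the onMatch callback and leaves the list size unchanged, which mirrors the first MERGE example; calling it for a missing name appends an entry and runs onCreate, which mirrors the second.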
Multi-label Set/Delete
Kùzu now allows SET and DELETE on node and relationship variables that can bind to multiple labels. For example, the following query deletes all nodes in the database (assuming all edges have already been deleted):
MATCH (n) DELETE n;
Similarly, the following query sets the since property of all relationships in the database:
MATCH ()-[f]->() SET f.since = 2023;
Note that when evaluating this query, tuples in tables that don’t have a since property are ignored.
See our docs in Set and Delete for more details.
Return After Update
Starting from this release, we also support a RETURN clause after updating clauses. That is, queries that update values can return the updated values. Here are some examples:
MATCH (u:User)
WHERE u.name = 'Adam' SET u.age = NULL
RETURN u.*;
------------------
| u.name | u.age |
------------------
| Adam | |
------------------
MATCH (u1:User), (u2:User)
WHERE u1.name = 'Adam' AND u2.name = 'Noura'
CREATE (u1)-[e:Follows {since: 2011}]->(u2)
RETURN e;
---------------------------------------------------------
| e |
---------------------------------------------------------
| (0:0)-{_LABEL: Follows, _ID: 0:5, since: 2011}->(0:3) |
---------------------------------------------------------
See our docs in Set and Delete for more examples.
Return with .*
Kùzu now provides syntactic sugar for returning all properties of a node or relationship by appending .* to the variable name.
MATCH (a:User) RETURN a.*;
-------------------
| a.name | a.age |
-------------------
| Adam | 30 |
-------------------
| Karissa | 40 |
-------------------
| Zhang | 50 |
-------------------
| Noura | 25 |
-------------------
See our doc here for more details.
Data Export
Kùzu now supports exporting query results to CSV files using the COPY TO command. For example, the following COPY TO statement produces the CSV file shown below.
COPY (MATCH (u:User) RETURN u.*) TO 'user.csv';
CSV file:
u.name,u.age
"Adam",30
"Karissa",40
"Zhang",50
"Noura",25
See Data Export for more information.
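The layout of the exported file above (a header row of column names, double-quoted strings, unquoted numbers) can be reproduced with a small plain C++ sketch. This is not how Kùzu writes CSV files internally; UserRow and toCsv are hypothetical names, and the sketch assumes names contain no embedded double quotes, so it does no escaping.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical row type mirroring the u.name and u.age columns.
struct UserRow {
    std::string name;
    int64_t age;
};

// Serialize rows in the same shape as the sample file: a header line,
// then one line per row with the string column wrapped in double quotes.
// Assumption: names contain no double quotes, so nothing is escaped.
std::string toCsv(const std::vector<UserRow>& rows) {
    std::ostringstream out;
    out << "u.name,u.age\n";
    for (const UserRow& row : rows) {
        out << '"' << row.name << '"' << ',' << row.age << '\n';
    }
    return out.str();
}
```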
New Data Types and APIs
MAP
A MAP is a dictionary of key-value pairs where all keys have the same type and all values have the same type. Unlike STRUCT, MAP doesn’t require the same keys to be present in each row. Therefore, MAP is more suitable when the schema is not known in advance.
RETURN map([1, 2], ['a', 'b']) AS m;
--------------
| m |
--------------
| {1=a, 2=b} |
--------------
See map for more information.
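For readers who think in C++ terms, a MAP value behaves much like a std::map with one fixed key type and one fixed value type, built here from two parallel lists in the same way as the map([1, 2], ['a', 'b']) call. This is an analogy only; makeMap is a hypothetical helper, and keeping the first value for a duplicate key is an assumption of this sketch rather than documented Kùzu behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Build a map from a list of keys and a list of values of equal length.
// All keys share one type and all values share one type, but different
// maps (rows) are free to contain different sets of keys.
// Assumption: on a duplicate key, the first value is kept.
std::map<int64_t, std::string> makeMap(const std::vector<int64_t>& keys,
                                       const std::vector<std::string>& values) {
    assert(keys.size() == values.size());
    std::map<int64_t, std::string> result;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        result.emplace(keys[i], values[i]);
    }
    return result;
}
```

A STRUCT, by contrast, is closer to a C++ struct: every row has the same named fields, fixed when the type is declared.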
UNION
Kùzu’s UNION is implemented by taking DuckDB’s UNION type as a reference. Similar to C++ std::variant, UNION is a nested data type capable of holding one of several alternative values with different types. The value under the key “tag” is considered the value currently held by the UNION.
See union for more information.
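Since the comparison to std::variant may be unfamiliar to some readers, here is a short C++17 sketch of a value that holds exactly one of two alternatives at a time. The alias NumOrStr and the function describe are hypothetical names chosen for illustration; the sketch shows std::variant itself, not Kùzu's UNION API.

```cpp
#include <cstdint>
#include <string>
#include <variant>

// A value that holds either a 64-bit integer or a string, never both.
using NumOrStr = std::variant<int64_t, std::string>;

// Report which alternative is currently held, together with its value.
std::string describe(const NumOrStr& v) {
    if (std::holds_alternative<int64_t>(v)) {
        return "num=" + std::to_string(std::get<int64_t>(v));
    }
    return "str=" + std::get<std::string>(v);
}
```

Assigning a value of another alternative type replaces the held value, and index() reports the position of the active alternative, which plays a role similar to a tag.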
Converting Query Results to Arrow
In previous releases, we supported converting query results to Arrow tables in our Python API.
In this release, converting query results to Arrow arrays is also available in the Rust, C (see kuzu_query_result_get_arrow_schema and kuzu_query_result_get_next_arrow_chunk), and C++ (see getArrowSchema and getNextArrowChunk) APIs.
NodeGroup Based Node Table Storage
This release changes the storage layout of node tables.
Before this release, we stored each column in a node table contiguously in separate files.
Each column had one data file (e.g., n-1.col) and, if the column may contain null values, one null file (e.g., n-1.null).
This design posed two problems: 1) it requires maintaining many files in the database directory, which may lead to a “too many open files” error; 2) it is not suitable for data compression. Although we don’t implement compression yet (this will come in the next few releases), this design would force us to adopt a single compression technique for the entire column.
Instead, partitioning each column into multiple chunks can offer more flexibility as each column chunk can be compressed and decompressed independently.
In this release, we introduced the concept of a NodeGroup, which is equivalent to a RowGroup and represents a horizontal partition of a table.
With the node group-based storage design, we also store the data of all columns in a single file, data.kz.
This will enable more powerful compression schemes, e.g., constant compression, bit-packing, and dictionary compression, in the coming releases.
For details on our new design, please visit this issue.
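To make the benefit of chunking concrete, the following plain C++ sketch splits one column into fixed-size chunks and picks an encoding for each chunk independently. It is a simplified illustration under our own assumptions, not Kùzu's storage code: ColumnChunk, Encoding, and partitionColumn are hypothetical names, and the only "compression" decision modeled is detecting chunks whose values are all identical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical per-chunk encoding choice.
enum class Encoding { Constant, Uncompressed };

// One column's slice of a node group.
struct ColumnChunk {
    std::vector<int64_t> values;
    Encoding encoding;
};

// Split a column into chunks of at most groupSize values and choose an
// encoding per chunk: Constant when every value in the chunk is identical.
std::vector<ColumnChunk> partitionColumn(const std::vector<int64_t>& column,
                                         std::size_t groupSize) {
    assert(groupSize > 0);
    std::vector<ColumnChunk> chunks;
    for (std::size_t start = 0; start < column.size(); start += groupSize) {
        const std::size_t end = std::min(start + groupSize, column.size());
        std::vector<int64_t> values(
            column.begin() + static_cast<std::ptrdiff_t>(start),
            column.begin() + static_cast<std::ptrdiff_t>(end));
        const int64_t first = values.front();
        const bool constant = std::all_of(values.begin(), values.end(),
            [first](int64_t v) { return v == first; });
        const Encoding encoding = constant ? Encoding::Constant : Encoding::Uncompressed;
        chunks.push_back(ColumnChunk{std::move(values), encoding});
    }
    return chunks;
}
```

With a single encoding per column, one non-constant region would rule out constant compression for the entire column; with per-chunk decisions, the constant regions can still benefit.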
Unnesting Arbitrary Subqueries
Consider the following query that finds the names of users a who follow at least one user b who is younger than a:
MATCH (a:User)
WHERE EXISTS { MATCH (a)-[:Follows]->(b:User) WHERE a.age > b.age}
RETURN a.name;
The query inside EXISTS is a correlated subquery and is very expensive to evaluate because the inner subquery needs to be evaluated for each a with a nested loop join operator (which is often an inefficient way to evaluate joins). In this release, we implemented an optimization that unnests correlated subqueries based on techniques adopted from the paper Unnesting Arbitrary Queries by Neumann and Kemper. This allows us to use hash joins instead of nested loop joins and execute these queries much faster. More details on both this technique and the performance gains we obtain will come in a separate blog post.
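The difference between the two evaluation strategies can be sketched in plain C++ over in-memory data. This is a simplified illustration of the general idea (re-evaluating the subquery per outer row versus building a hash table once and probing it), not Kùzu's optimizer or the algorithm from the paper; the User and Follows structs and both function names are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct User { int64_t id; std::string name; int64_t age; };
struct Follows { int64_t src; int64_t dst; };

// Nested-loop style: for every user a, re-scan the edges and users to decide
// whether the EXISTS subquery holds.
std::vector<std::string> existsNestedLoop(const std::vector<User>& users,
                                          const std::vector<Follows>& edges) {
    std::vector<std::string> names;
    for (const User& a : users) {
        bool exists = false;
        for (const Follows& e : edges) {
            if (e.src != a.id) continue;
            for (const User& b : users) {
                if (b.id == e.dst && a.age > b.age) { exists = true; break; }
            }
            if (exists) break;
        }
        if (exists) names.push_back(a.name);
    }
    return names;
}

// Unnested style: build a hash table on user id once, make a single pass over
// the edges to collect qualifying source ids, then filter the users.
std::vector<std::string> existsHashBased(const std::vector<User>& users,
                                         const std::vector<Follows>& edges) {
    std::unordered_map<int64_t, int64_t> ageById;
    for (const User& u : users) ageById.emplace(u.id, u.age);

    std::unordered_set<int64_t> qualifying;
    for (const Follows& e : edges) {
        const auto src = ageById.find(e.src);
        const auto dst = ageById.find(e.dst);
        if (src != ageById.end() && dst != ageById.end() && src->second > dst->second) {
            qualifying.insert(e.src);
        }
    }

    std::vector<std::string> names;
    for (const User& a : users) {
        if (qualifying.count(a.id) > 0) names.push_back(a.name);
    }
    return names;
}
```

Both functions return the same names in the same order, but the first does work proportional to users × edges × users in the worst case, while the second does work roughly proportional to users + edges.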