Kuzu 0.0.4 Release

We are very happy to release Kuzu 0.0.4 today! This release comes with the following new main features and improvements:

Data Ingestion Improvements

We continue to improve our data ingestion in this release. We still rely on Apache Arrow to parse parquet and csv files. Several bottlenecks in our earlier implementation are identified and optimized now, including copying from arrow arrays and construction of hash indexes. We now also store null bits separately, which simplifies our loading logic and makes it faster.

Here are some benchmark numbers for loading two node and two rel tables that only contain primitive types or strings from the LDBC benchmark:

CPU: MAC M1 MAX
Disk: 2TB SSD
System Memory: 32GB
Dataset: LDBC-100
Number of thread: 10

Files	# lines	file size	v0.0.3	v0.0.4
comment.csv	220M	22.49 GB	890s	108s (8.2x)
post.csv	58M	7.68 GB	304s	32s (9.5x)
likesComment.csv	242M	13 GB	772s	142s (5.4x)
knows.csv	20M	1.1 GB	80s	21s (3.8x)

Besides performance improvement, we now also allow interrupting COPY statements in the shell. You can interrupt long running COPY statements without crashing the shell.

We will continue to improve our data ingestion to make it more efficient and robust as we’re moving to the new storage design in the coming releases. Please stay tuned!

New Cypher Features

Undirected Relationships in Queries

Kuzu now supports undirected relationships in Cypher queries. An undirected relationship is the union of both in-coming and out-going relationships. This feature is mostly useful in the following two cases.

Case 1: Relationship is undirected by nature Relationships between nodes in Kuzu are currently directed (though we are internally debating to add a native undirected relationship type). A relationship file must contain FROM and TO columns each of which refers to a primary key column of a node table. However, sometimes the nature of the relationships are undirected, e.g., an isFriendOf relationships in a social network.

Currently, you have two options: (1) you can either store each friendship twice, e.g., Alice isFriendOf Bob and Bob isFriendOf Alice. This is a bad choice because internally Kuzu will index each edge twice (in the forward and backward) edges, so this one fact ends up getting stored 4 times. Or (2) you can store it once, say Alice isFriendOf Bob.

The advantage of option (1) was that in Kuzu v 0.0.3, if you want to find all friends of Alice, you could simply ask this query:

MATCH (a:Person)-[:isFriendOf]->(b:Person)
WHERE a.name = 'Alice' RETURN b;

Instead, if you chose option (2), you would have to ask two queries, one to MATCH (a:Person)-[:isFriendOf]->(b:Person) and the other to MATCH (a:Person)<-[:isFriendOf]-(b:Person), and UNION them, which gets messy if you want to do more with those neighbors (e.g., find their neighbors etc.).

With undirected edge support, you can now choose option (2) and find Alice’s friends with:

MATCH (a:Person)-[:isFriendOf]-(b:Person)
WHERE a.name = 'Alice'
RETURN b;

So if you do not specify a direction in your relationships, Kuzu will automatically query both the forward and backward relationships for you.

Note from Kuzu developers: As noted above, we are debating a native undirected relationship type. That seems to solve the problem of, in which fake direction should an undirected relationship be saved at? Should be a Alice-[isFriendOf]->Bob or vice versa. Happy to hear your thoughts on this.

Case 2: Relationship direction is not of interest Although relationship is stored in a directed way, its direction may not be of interest in the query. The following query tries to find all comments that have interacted with comment Kuzu. These comments could be either replying to or replied by Kuzu. The query can be asked naturally in an undirected way.

MATCH (c:Comment)-[:replyOf]-(other:Comment)
WHERE c.author = 'Kuzu'
RETURN other;

Recursive Queries: Shortest Path Queries and Improved Variable-length Queries

This release brings in the beginnings of a series of major improvements we will do to recursive joins. The two major changes in this release are:

Multilabeled and undirected Variable-length Join Queries Prior to this release we supported variable-length join queries only in the restricted case when the variable-length relationship could have a single relationship label and was directed. For example you could write this query:

MATCH (a:Person)-[:knows*1..2]->(b:Person)
WHERE a.name = 'Alice' 
RETURN b

But you couldn’t ask for arbitrary labeled variable-length relationships between Persons a and b (though you could write the non-recursive version of that query: MATCH (a:Person)-[:knows]->(b:Person) .... Similarly we did not support undirected version of the query: MATCH (a:Person)-[:knows*1..2]-(b:Person). Kuzu now supports multi-label as well as undirected variable-length relationships. For example, the following query finds all nodes that are reachable within 1 to 3 hops from Alice, irrespective of the labels on the connections or destination b nodes:

MATCH (a:Person)-[e:*1..3]-(b)
WHERE a.name = 'Alice'
RETURN b;

Shortest path

Finally, we got to implementing an initial version of shortest path queries. You can find (one of the) shortest paths between nodes by adding the SHORTEST keyword to a varible-length relationship. The following query asks for a shortest path between Alice and all active users that Alice follows within 10 hops and return these users, and the length of the shortest path.

MATCH (a:User)-[p:Follows* SHORTEST 1..10]->(b:User)
WHERE a.name = 'Alice' AND b.state = 'Active'
RETURN b, p, length(p)

The p in the query binds to the sequences of relationship, node, relationship, node, etc. Currently we only return the internal IDs of the relationships and nodes (soon, we will return all their properties).

New Data Types

`SERIAL`

This release introduces SERIAL data type. Similar to AUTO_INCREMENT supported by many other databases, SERIAL is mainly used to create an incremental sequence of unique identifier column which can serve as a primary key column.

Example:

person.csv

Alice
Bob
Carol

CREATE NODE TABLE Person(ID SERIAL, name STRING, PRIMARY KEY(ID));
COPY Person FROM `person.csv`;
MATCH (a:Person) RETURN a;

Output:

-------------------------------------------
| a                                       |
-------------------------------------------
| (label:Person, 3:0, {ID:0, name:Alice}) |
-------------------------------------------
| (label:Person, 3:1, {ID:1, name:Bob})   |
-------------------------------------------
| (label:Person, 3:2, {ID:2, name:Carol}) |
-------------------------------------------

When the primary key of your node tables are already consecutive integers starting from 0, you should omit the primary key column in the input file and make primary key a SERIAL type. This will improve loading time significantly. Similarly, queries that need to scan primary key will also get faster. That’s because internally we will not store a HashIndex or primary key column so any scan over primary key will not trigger a disk I/O.

`STRUCT`

Kuzu now supports STRUCT data type similar to composite type in Postgres. Here is an example:

WITH {name:'University of Waterloo', province:'ON'} AS institution
RETURN institution.name AS name;

Output:

--------------------------
| name                   |
--------------------------
| University of Waterloo |
--------------------------

We support storing structs as node properties for now. For example you can create: CREATE NODE TABLE Foo(name STRING, exStruct STRUCT(x INT16, y STRUCT(z INT64, w STRING)), PRIMARY KEY (name)). We will support storing structs on relationships soon. As shown in the CREATE NODE example above, you can store arbitrarily nested structs, e.g., structs that contain structs as a field, on nodes. One missing feature we have for now is storing and processing a LIST<STRUCT> composite type.

Note: Updating STRUCT column with update statement is not supported in this release but will come soon.

Client APIs

Windows compatibility

Developers can now build Kuzu from scratch on Windows platform! Together with this release we also provide pre-built libraries and python wheels on Windows.

C

We provide official C language binding in this release. Developers can now embed Kuzu with native C interfaces.

Node.js

We provide official Node.js language binding. With Node.js API, developer can leverage Kuzu analytical capability in their Node.js projects. We will soon follow this blog post with one (or a few) blog posts on developing some applications with Node.js.