Kùzu 0.1.0 Release

Kùzu Team
Developers of Kùzu Inc.

We are very happy to release Kùzu 0.1.0 today! This is a major release with the following set of new features and improvements:

NodeGroup-Based Storage

With this release, we have completed the major features of our NodeGroup-based storage design, which was outlined in this issue. The primary goal of this design was a storage layout that is conducive to implementing compression and zone map optimizations. Conceptually, a NodeGroup is equivalent to a Parquet RowGroup: it represents a horizontal partition of a table consisting of k nodes (k = 64 x 2048 = 131,072 for now). The data of each group of k nodes is managed and compressed as a unit in the on-disk files. In release v0.0.7, we completed the first part of this design and changed our node table storage to use NodeGroups. In this release, we have completed the second part, and relationship tables are now also stored as NodeGroups. That means we now compress the relationships of k nodes together.
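
As a rough illustration of the partitioning arithmetic, the sketch below maps a node offset to a NodeGroup. It assumes nodes are assigned to groups by consecutive offset; the function names are hypothetical and this is not Kùzu's actual code.

```python
# Assumption: nodes are assigned to NodeGroups by consecutive offset,
# so group membership is integer division by the group size.
NODE_GROUP_SIZE = 64 * 2048  # k = 131072 nodes per NodeGroup (from the text)


def node_group_of(node_offset: int) -> tuple[int, int]:
    """Return (node_group_index, position_within_group) for a node offset."""
    if node_offset < 0:
        raise ValueError("node offsets are non-negative")
    return divmod(node_offset, NODE_GROUP_SIZE)


def num_node_groups(num_nodes: int) -> int:
    """Number of NodeGroups needed to hold num_nodes nodes (ceiling division)."""
    return -(-num_nodes // NODE_GROUP_SIZE)
```

Under this assumption, a table with 131,073 nodes would need two NodeGroups, the second holding a single node.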

We also store all column data in a single file, data.kz, which has significantly reduced the number of database files we maintain.

String Compression

We have extended our compression to compress strings in the database using dictionary compression. For each string “column chunk” (which is a partition of an entire column in a table storing one NodeGroup’s values), each string s is stored once in a dictionary, and for each record that has value s, we store a pointer to s. The same design applies when storing string properties on relationship tables. The encoding uses 3 column chunks in total. Two of them store the dictionary: a “raw strings” column chunk stores all the unique strings in the column chunk one after another, and an “offsets” column chunk identifies the beginning index of each string. A third, “index” column chunk stores the pointers to the strings as indices into the “offsets” column chunk. The offsets and index column chunks are bitpacked in the same manner as integer columns.
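
The three-chunk layout described above can be sketched in a few lines of Python. This is a simplified model for intuition only: the function names are hypothetical, byte offsets into a UTF-8 buffer are an assumption, and the real on-disk format and bitpacking are not shown.

```python
def dictionary_encode(values: list[str]) -> tuple[bytes, list[int], list[int]]:
    """Encode one string column chunk into (raw, offsets, indices)."""
    raw = bytearray()          # "raw strings" chunk: unique strings back to back
    offsets: list[int] = []    # "offsets" chunk: start byte of each unique string
    indices: list[int] = []    # "index" chunk: per record, position in offsets
    seen: dict[str, int] = {}  # string -> its position in offsets
    for value in values:
        if value not in seen:
            seen[value] = len(offsets)
            offsets.append(len(raw))
            raw.extend(value.encode("utf-8"))
        indices.append(seen[value])
    return bytes(raw), offsets, indices


def dictionary_decode(raw: bytes, offsets: list[int], indices: list[int]) -> list[str]:
    """Rebuild the original values; a string ends where the next one begins."""
    ends = offsets[1:] + [len(raw)]
    return [raw[offsets[i]:ends[i]].decode("utf-8") for i in indices]


def bits_needed(max_value: int) -> int:
    """Bit width a bitpacked integer chunk would need for values up to max_value."""
    return max(1, max_value.bit_length())


browsers = ["Chrome", "Firefox", "Chrome", "Safari", "Chrome"]
raw, offsets, indices = dictionary_encode(browsers)
print(raw, offsets, indices, bits_needed(max(indices)))
```

A column with few distinct values, such as a browser name, ends up with a tiny dictionary and an index chunk that needs only a couple of bits per record, which is consistent with the large reduction reported for Browser Used in the benchmark.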

String Compression Benchmark

Here is a micro-benchmark using the Comment table in LDBC100. To compare the compression rate of each column individually, we construct a new table Tx for each string column x in the Comment table, e.g., Browser Used. Tx consists of the column x and a serial primary key, which allows us to avoid storing any materialized hash index. We report the size of the data.kz file and compare against a previous version v0.0.10 of Kùzu.

| Column | Version 0.0.10 | Version 0.1.0 | Difference |
|---|---|---|---|
| Browser Used | 4.2 GB | 272 MB | -93.5% |
| Content | 9.7 GB | 7.5 GB | -22.7% |
| Location IP | 5 GB | 1.6 GB | -68.0% |
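
For readers who want to check the Difference column, here is a small sketch of the arithmetic. It assumes the column is the size change relative to v0.0.10 and that 272 MB equals 0.272 GB (decimal units); both are assumptions, not statements from the benchmark itself.

```python
def relative_difference(before: float, after: float) -> float:
    """Size change from before to after, as a percentage of before."""
    return (after - before) / before * 100


# Sizes in GB from the table above; 272 MB is taken as 0.272 GB (assumption).
sizes_gb = {
    "Browser Used": (4.2, 0.272),
    "Content": (9.7, 7.5),
    "Location IP": (5.0, 1.6),
}

for column, (v0_0_10, v0_1_0) in sizes_gb.items():
    print(f"{column}: {relative_difference(v0_0_10, v0_1_0):.1f}%")
```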

We also report the entire LDBC100 database size, including all database files (data.kz, indices, metadata, catalog), of v0.1.0 and a slightly older version v0.0.8, which included compression of nodes. So this experiment reports both improvements that come from storing relationship tables in compressed form as well as storing strings of both node and relationship tables in compressed form.

| Database | Version 0.0.8 | Version 0.1.0 | Difference |
|---|---|---|---|
| LDBC100 | 127 GB | 94 GB | -26.0% |

Data Ingestion Improvements

Moving our relationship table storage to a NodeGroup-based one also improved our data ingestion times. The following benchmark reports the loading time of the LDBC100 likesComment.csv relationship records. The file contains 242M records and takes 13 GB in raw CSV format. Below we compare v0.1.0 against v0.0.10 using a machine with 2 Intel Xeon Platinum 8175M CPUs, each of which has 48 physical CPU cores. We used 300 GB of the 380 GB total RAM during this test.

| | Version 0.0.10 | Version 0.1.0 | Difference |
|---|---|---|---|
| 8 threads | 266.8 s | 229.8 s | -13.9% |
| 4 threads | 312.5 s | 246.8 s | -21.0% |
| 2 threads | 446.7 s | 335.6 s | -24.8% |
| 1 thread | 700.8 s | 581.9 s | -17.0% |

New Features

Direct Scans of DataFrames

We now support scanning Pandas DataFrames directly. Consider the following person DataFrame, which contains two columns, id and height (in centimetres); only the latter is used in the example:

import numpy as np
import pandas as pd

id = np.array([0, 2, 3, 5, 7, 11, 13], dtype=np.int64)
height_in_cm = np.array([167, 172, 183, 199, 149, 154, 165], dtype=np.uint32)
person = pd.DataFrame({'id': id, 'height': height_in_cm})

The query below finds all students who are taller than the average height of the records in the person DataFrame:

# conn is an open kuzu.Connection; a triple-quoted string lets the query span lines.
query = """CALL READ_PANDAS("person")
         WITH avg(height / 2.54) AS height_in_inch
         MATCH (s:student)
         WHERE s.height > height_in_inch
         RETURN s"""
results = conn.execute(query)

Details of this feature can be found here.
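
To make the WITH clause concrete, the plain-Python sketch below reproduces the value that avg(height / 2.54) evaluates to for the DataFrame above. It assumes avg is the arithmetic mean over all rows and that no values are NULL; the helper name is hypothetical and no Kùzu API is involved.

```python
# Heights in centimetres, copied from the person DataFrame above.
heights_cm = [167, 172, 183, 199, 149, 154, 165]

CM_PER_INCH = 2.54


def average_height_in_inches(heights: list[int]) -> float:
    """Mirror avg(height / 2.54): convert each value, then take the mean."""
    converted = [h / CM_PER_INCH for h in heights]
    return sum(converted) / len(converted)


threshold = average_height_in_inches(heights_cm)
print(f"threshold: {threshold:.2f} in")
```

Every student row is then compared against this single threshold value in the WHERE clause.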

Copy

This release comes with several new features related to Cypher’s COPY clause.

Copy To Parquet Files

Query results can now be exported to Parquet files.

COPY ( MATCH (a:Person) RETURN a.* ) TO "person.parquet";

Copy To CSV Files

We added several configuration options for exporting to CSV files.

COPY ( MATCH (a:Person) RETURN a.* ) TO "person.csv" (delim = '|', header=true);
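
To show what the delim and header options do to the output, here is a small sketch using Python's standard csv module. The rows and the column names a.id and a.name are hypothetical stand-ins for a query result; the header names Kùzu actually writes may differ.

```python
import csv
import io


def write_csv(rows: list[dict], delim: str = ",", header: bool = False) -> str:
    """Serialize rows to CSV text with a configurable delimiter and optional header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), delimiter=delim,
                            lineterminator="\n")
    if header:
        writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical query result standing in for MATCH (a:Person) RETURN a.*
people = [
    {"a.id": 0, "a.name": "Alice"},
    {"a.id": 1, "a.name": "Bob"},
]
print(write_csv(people, delim="|", header=True))
```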

We also improved the performance of the CSV writer. Below is a micro benchmark of exporting the LDBC100 Comment table to CSV format.

COPY (MATCH (p:Comment) RETURN p.*) TO 'comment.csv';

| | Version 0.0.10 | Version 0.1.0 |
|---|---|---|
| Runtime | 1239.3 s | 104.56 s |

Optional column_names Argument in Copy From Statements

Users can now load data into a subset of the columns in a table. Previously, to load an empty table T from a file F (e.g., a CSV or Parquet file), F had to contain (1) exactly as many columns as T and (2) those columns in the same order as in T. Users can now add an optional column_names argument to COPY FROM statements, which relaxes both restrictions: (1) F can contain a subset of T's columns, and (2) the columns can appear in any order, which is specified in the column_names argument. Here is an example:

CREATE NODE TABLE Person (id INT64, name STRING, comment STRING, PRIMARY KEY(id));
COPY Person (name, id) FROM "person.csv";

The code above first creates a Person table with 3 columns and then loads two of them from a file whose columns hold the name and id values, in that order. The third column, comment, is set to NULL for all imported records. The details of this feature can be found here.
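
The column mapping can be pictured with the following sketch, which models the behaviour described above in plain Python. The function, the file contents, and the absence of a header row are all assumptions made for illustration; values are kept as strings and no type conversion is attempted.

```python
import csv
import io

TABLE_COLUMNS = ["id", "name", "comment"]  # schema of the Person table above


def load_rows(csv_text: str, column_names: list[str]) -> list[dict]:
    """Map each file column to the table column named at the same position.

    Table columns missing from column_names are filled with None (NULL).
    """
    unknown = set(column_names) - set(TABLE_COLUMNS)
    if unknown:
        raise ValueError(f"not columns of the table: {sorted(unknown)}")
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if len(record) != len(column_names):
            raise ValueError("file row width does not match column_names")
        row = dict.fromkeys(TABLE_COLUMNS)  # every column starts as NULL
        row.update(zip(column_names, record))
        rows.append(row)
    return rows


# Hypothetical person.csv contents: name first, then id, no header row.
person_csv = "Alice,0\nBob,1\n"
for row in load_rows(person_csv, ["name", "id"]):
    print(row)
```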

Updates

Detach Delete

Kùzu now supports Cypher’s DETACH DELETE clause, which deletes a node and all of its relationships together. Previously users could only use the DELETE command, which deleted nodes that had no relationships. For example, the following query deletes a User node with name Adam and all of its edges.

MATCH (u:User) WHERE u.name = 'Adam' DETACH DELETE u;
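
The difference between the two commands can be modelled with a toy in-memory graph. This is only a sketch of the semantics described above, not of how Kùzu stores or deletes data; the class and its node names are hypothetical, and raising an error from plain DELETE is an assumption.

```python
class TinyGraph:
    """Toy in-memory graph; not how Kùzu stores data."""

    def __init__(self) -> None:
        self.nodes: set[str] = set()
        self.edges: set[tuple[str, str]] = set()  # (source, destination)

    def add_edge(self, src: str, dst: str) -> None:
        self.nodes.update((src, dst))
        self.edges.add((src, dst))

    def delete(self, node: str) -> None:
        """DELETE: refuse to remove a node that still has relationships."""
        if any(node in edge for edge in self.edges):
            raise RuntimeError(f"{node} still has relationships")
        self.nodes.discard(node)

    def detach_delete(self, node: str) -> None:
        """DETACH DELETE: drop the node's relationships, then the node."""
        self.edges = {edge for edge in self.edges if node not in edge}
        self.nodes.discard(node)


graph = TinyGraph()
graph.add_edge("Adam", "Karissa")
graph.add_edge("Zhang", "Adam")
graph.add_edge("Karissa", "Zhang")
graph.detach_delete("Adam")
print(sorted(graph.nodes), sorted(graph.edges))
```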

Return Deleted Rows

RETURN clauses can now return variable bindings that were used in the DELETE command. For example, you can return the nodes that were deleted by the preceding DELETE clause as follows:

MATCH (a:Person) DELETE a RETURN a;

Details of this feature can be found here.

Other Changes

SQL-style Cast Function

We have implemented a SQL-style cast function cast(input, target_type) to cast values between different types. The cast function will convert the input argument to the target_type if casting of the input value to the target type is defined. For example:

RETURN cast("[1,2,3]", "INT[]");
--------------------------
| CAST([1,2,3], INT32[]) |
--------------------------
| [1,2,3]                |
--------------------------

Along with this, we are deprecating our previous way of doing casts with separate functions, e.g., STRING(1.2) or to_int64("32"). Details of the cast function can be found here.

Recursive Relationship Node Filter

Since v0.0.5 we have supported filtering the intermediate relationships that can bind to recursive relationships, based on the properties of these intermediate relationships. With the current release, we now support filtering the intermediate nodes that are bound to recursive relationships. As we did for filtering intermediate relationships, we adopt Memgraph’s syntax for this feature as follows:

MATCH p = (a:User)-[:Follows*1..2 (r, n | WHERE n.age > 21)]->(b:User)
RETURN p;

The first variable r inside the recursive relationship above binds to the intermediate relationships, while the second variable n binds to the intermediate nodes. The | symbol can be followed by a WHERE clause in which these variables can be used to express a filtering expression. This query finds all 1- to 2-hop paths between two User nodes where the intermediate nodes of these paths have age properties greater than 21. Details of this feature can be found here.
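
The sketch below enumerates such filtered paths over a tiny hypothetical graph. It assumes that "intermediate nodes" means the nodes strictly between the two endpoints, so the source and destination are never filtered; Kùzu's actual evaluation strategy may differ, and all names and data are made up for illustration.

```python
# Hypothetical data: Follows edges and an age property per User.
FOLLOWS = {
    "Adam": ["Karissa", "Zhang"],
    "Karissa": ["Zhang"],
    "Zhang": ["Noura"],
    "Noura": [],
}
AGE = {"Adam": 30, "Karissa": 40, "Zhang": 20, "Noura": 25}


def filtered_paths(src, min_hops, max_hops, node_filter):
    """Yield paths of min_hops..max_hops edges whose intermediate nodes pass node_filter."""
    stack = [[src]]
    while stack:
        path = stack.pop()
        hops = len(path) - 1
        if hops >= min_hops:
            yield path
        if hops == max_hops:
            continue
        # Extending the path turns its current last node into an intermediate
        # node, so the filter is checked before going further (the source
        # node itself is never filtered).
        if hops > 0 and not node_filter(path[-1]):
            continue
        for neighbour in FOLLOWS[path[-1]]:
            stack.append(path + [neighbour])


for path in sorted(filtered_paths("Adam", 1, 2, lambda n: AGE[n] > 21)):
    print(" -> ".join(path))
```

In this toy data, the 2-hop path through Zhang is dropped because Zhang would be an intermediate node with age 20, while the 1-hop path ending at Zhang is kept.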

Count Subquery

We have added support for count subqueries, which count the number of matches for a given pattern in the graph. The resulting count can be bound to a variable with an alias. For example, the following query counts the number of followers of each user in the graph.

MATCH (a:User)
RETURN a.name, COUNT { MATCH (a)<-[:Follows]-(b:User) } AS num_follower
ORDER BY num_follower;

The details of count subqueries can be found here.
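
For intuition, the query above computes roughly what the following plain-Python sketch does over a hypothetical edge list. The data and names are made up, and the tie order among users with equal counts is an artifact of this sketch rather than a guarantee of the query.

```python
# Hypothetical Follows edges as (follower, followed) pairs.
FOLLOW_EDGES = [
    ("Adam", "Karissa"),
    ("Adam", "Zhang"),
    ("Karissa", "Zhang"),
    ("Zhang", "Noura"),
]
USERS = ["Adam", "Karissa", "Zhang", "Noura"]


def follower_counts(users, follows):
    """For each user, count matches of (a)<-[:Follows]-(b), sorted by count."""
    counts = {user: 0 for user in users}  # users with no match still get 0
    for _follower, followed in follows:
        counts[followed] += 1
    return sorted(counts.items(), key=lambda item: item[1])


for name, num_follower in follower_counts(USERS, FOLLOW_EDGES):
    print(name, num_follower)
```

Note that a user with no followers still appears in the result with a count of 0, since the subquery is evaluated once per matched user.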

New INT128 Data Type

Finally, we now have support for 16-byte signed huge integers.
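
A 16-byte signed integer covers the range from -2^127 to 2^127 - 1. The sketch below shows that range and a round trip through 16 bytes using Python's arbitrary-precision integers. It assumes a two's-complement encoding; the little-endian byte order and the helper names are illustrative choices, not a description of Kùzu's storage format.

```python
INT128_MIN = -(2 ** 127)
INT128_MAX = 2 ** 127 - 1


def to_int128_bytes(value: int) -> bytes:
    """Encode value as 16 little-endian two's-complement bytes."""
    # int.to_bytes raises OverflowError when value does not fit in 16 bytes.
    return value.to_bytes(16, byteorder="little", signed=True)


def from_int128_bytes(data: bytes) -> int:
    """Decode 16 little-endian two's-complement bytes back into an int."""
    if len(data) != 16:
        raise ValueError("expected exactly 16 bytes")
    return int.from_bytes(data, byteorder="little", signed=True)


print(INT128_MIN, INT128_MAX)
```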

Development

Nightly Build

We have set up a nightly build pipeline for Kùzu users who want to access our latest feature set. Here is how you can use the latest nightly version of Kùzu:

  • For the Python API, the latest nightly version can be installed with pip install --pre kuzu.
  • For the Node.js API, the latest nightly version can be installed with npm i kuzu@next.
  • For the Rust API, the latest nightly version can be found at crates.io.
  • For the CLI, C and C++ shared library, and the Java JAR, the latest nightly version can be downloaded from the latest run of this GitHub Actions pipeline.

Reduced Binary Size

With this release, we removed our Apache Arrow dependency, which significantly reduces our binary size. Additionally, we now strip the shared library and CLI binaries of the symbols that are not needed by our client APIs, which further reduces our binary sizes. For example, on a macOS arm64 platform, these two improvements achieve the following cumulative binary size reduction:

| | Version 0.0.10 | Version 0.1.0 |
|---|---|---|
| Binary Size | 27.2 MB | 10.3 MB |

Stripping of our other libraries (e.g. Python) is a work in progress.

Closing Remarks

As usual, we would like to thank everyone in the Kùzu engineering team, especially our interns, for making this release possible. We look forward to your feedback!

Enjoy Kùzu v0.1.0 and the upcoming holiday season, which in this part of the world 🇨🇦 coincides with the coming of the cold but cozy winter 🤗.