We are very happy to release Kùzu 0.1.0 today! This is a major release with the following set of new features and improvements:
NodeGroup-Based Storage
With this release, we have completed the major features of our NodeGroup-base storage design, which was outlined in this issue. The primary goal of this design was to have a storage design that is conducive to implementing compression and zone maps optimization. Conceptually, a NodeGroup is equivalent to a Parquet RowGroup, which represents a horizontal partition of a table consisting of k many nodes (k=64x2048 for now). Each k nodes’ data are managed and compressed as a unit on disk files. In release v0.0.7, we had completed the first part of this design and changed our node table storage to use NodeGroups. In this release, we have completed the second part of this design and now relationship tables are also stored as NodeGroups. That means we now compress the relationships of k many nodes together.
We also stores all column data in a single file data.kz
which has significantly reduced the number of database files we now maintain.
String Compression
We have extended our compression to compress strings in the database using dictionary compression. For each string “column chunk” (which is a partition of an entire column in a table storing one NodeGroup’s values), each string s is stored once in a dictionary, and for each record that has value s, we store a pointer to s. This design applies when storing string properties on relationship tables. This is done by using 3 column chunks in total. 2 column chunks store the dictionary as follows. One “raw strings” column chunk stores all the unique strings in the column chunk one after another, and another “offsets column chunk” identifies the beginning indices of each string. Then, one additional “index column chunk” stores the pointers to the strings as indices to the “offsets” column to identify the strings. The offset and index columns are bitpacked in the manner of integer columns.
String Compression Benchmark
Here is a micro-benchmark using the Comment table in LDBC100. To compare the compression rate of each column individually,
we construct a new table Tx
for each string column x
in the Comment table, e.g., Browser Used
. Tx consists of the
column x and a serial primary key, which allows us to avoid storing any materialized hash index. We report the size of the data.kz
file
and compare against a previous version v0.0.10 of Kùzu.
Column | Version 0.0.10 | Version 0.1.0 | Difference |
---|---|---|---|
Browser Used | 4.2 GB | 272 MB | -93.5% |
Content | 9.7 GB | 7.5 GB | -22.7% |
Location IP | 5 GB | 1.6 GB | -68.0% |
We also report the entire LDBC100 database size, including all database files (data.kz
, indices, metadata, catalog), of v0.1.0
and a slightly older version v0.0.8, which included compression of nodes. So this experiment reports
both improvements that come from storing relationship tables in compressed form as well as
storing strings of both node and relationship tables in compressed form.
Database | Version 0.0.8 | Version 0.1.0 | Difference |
---|---|---|---|
LDBC100 | 127 GB | 94 GB | -26.0% |
Data Ingestion Improvements
Moving our relationship table storage to a NodeGroup-based one also improved our
data ingestion times. The following benchmark reports the loading time of the LDBC100 likesComment.csv
relationship records.
The file contains 242M records and takes 13 GB in raw CSV format. Below we compare v0.1.0 against v0.0.10 using a machine with
2 Intel Xeon Platinum 8175M CPUs, each of which has 48 physical CPU cores. We used 300 GB of the 380GB total RAM during this test.
Version 0.0.10 | Version 0.1.0 | Difference | |
---|---|---|---|
8 threads | 266.8 s | 229.8 s | -13.9% |
4 threads | 312.5 s | 246.8 s | -21.0% |
2 threads | 446.7 s | 335.6 s | -24.8% |
1 threads | 700.8 s | 581.9 s | -17.0% |
New Features
Direct Scans of DataFrames
We now support scanning Pandas DataFrames directly. Consider the following person
DataFrame
that contains two columns, id
and height_in_cm
(only the latter will be used in the example):
id = np.array([0, 2, 3, 5, 7, 11, 13], dtype=np.int64)
height_in_cm = np.array([167, 172, 183, 199, 149, 154, 165], dtype=np.uint32)
person = pd.DataFrame({'id': id, 'height': height_in_cm})
The query below finds all students who are taller than the average height of the records in the person
DataFrame:
query = 'CALL READ_PANDAS("person")
WITH avg(height / 2.54) as height_in_inch
MATCH (s:student)
WHERE s.height > height_in_inch
RETURN s'
results = conn.execute(query)
Details of this feature can be found here.
Copy
This release comes with several new features related to Cypher’s COPY
clause.
Copy To Parquet Files
Query results can now be exported to Parquet files.
COPY ( MATCH (a:Person) RETURN a.* ) TO "person.parquet";
Copy To CSV Files
We added serveral configuration options when exporting to CSV files.
COPY ( MATCH (a:Person) RETURN a.* ) TO "person.csv" (delim = '|', header=true);
We also improved the performance of the CSV writer. Below is a micro benchmark of exporting the LDBC100 Comment table to CSV format.
COPY (MATCH (p:Comment) RETURN p.*) to ‘comment.csv’;
Version 0.0.10 | Version 0.1.0 | |
---|---|---|
Runtime | 1239.3s | 104.56s |
Optional column_names
Argument in Copy From Statements
Users can now load data to a subset of the columns in a table. Previously, we required that if
users are going to load an empty table T from a file F,
e.g., a CSV or Parquet file, then F must contain: (1) as many columns as the columns in T; and (2) in the same order as
table T. Now users can optionally add a column_names
argument in COPY FROM
statements,
which relaxes both of these restrictions: (1) F can now contain a subset of the columns; and (2) in arbitrary
order, which needs to be specified in the column_names
argument. Here is an example:
CREATE NODE TABLE Person (id INT64, name STRING, comment STRING, PRIMARY KEY(id));
COPY Person (name, id) FROM "person.csv";
The code above first creates a Person
table with 3 columns, and then loads two of its columns from a file
that contains name
and id
values of the columns respectively.
The third comment
column in the table will be set to NULL
for all imported records. The details
of this feature can be found here.
Updates
Detach Delete
Kùzu now supports Cypher’s DETACH DELETE clause,
which deletes a node and all of its relationships together.
Previously users could only use the DELETE
command, which deleted nodes that had no relationships.
For example, the following query deletes a User
node with name
Adam and all of its edges.
MATCH (u:User) WHERE u.name = 'Adam' DETACH DELETE u;
Return Deleted Rows
RETURN
clauses can now return variable bindings that were used in the DELETE
command. For example,
you can return nodes that were deleted in the previous DELETE statement as follows:
DELETE (a:Person) RETURN a;
Details of this feature can be found here.
Other Changes
SQL-style Cast Function
We have implemented a SQL-style cast
function cast(input, target_type)
to cast values between different
types. The cast function will convert the input
argument to the target_type
if
casting of the input value to the target type is defined. For example:
RETURN cast("[1,2,3]", "INT[]");
--------------------------
| CAST([1,2,3], INT32[]) |
--------------------------
| [1,2,3] |
--------------------------
Along with this, we are deprecating our previous way of doing casts with separate functions, e.g., STRING(1.2)
or to_int64("32")
.
Details of the cast
function can be found here.
Recursive Relationship Node Filter
Since v0.0.5 we have supported filtering the intermediate relationships that can bind to recursive relationships, based on the properties of these intermediate relationships. With the current release, we now support filtering the intermediate nodes that are bound to recursive relationships. As we did for filtering intermediate relationships, we adopt Memgraph’s syntax for this feature as follows:
MATCH p = (a:User)-[:Follows*1..2 (r, n | WHERE n.age > 21)]->(b:User)
RETURN p;
The first variable r
that is inside the recursive relationship above binds to the intermediate relationships while
the second variable n
binds to the intermediate nodes. The |
symbol can be followed with a WHERE
clause
where these variables can be used to express a filtering expression. This query finds all 1 to 2-hop paths between
two User
nodes where the intermediate nodes of these paths have age
properties greater than 21.
Details of this feature can be found here.
Count Subquery
We have added support for counting subqueries, which checks the number of matches for the given pattern in the graph. The output of this counting can be bound to a variable with aliasing. For example, the following query counts the number of followers of each user in the graph.
MATCH (a:User)
RETURN a.name, COUNT { MATCH (a)<-[:Follows]-(b:User) } AS num_follower
ORDER BY num_follower;
The details of count subqueries can be found here.
New INT128 Data Type
Finally, we now have support for 16-byte signed huge integers.
Development
Nightly Build
We have setup a nightly build pipeline for Kùzu users who want to access our latest feature set. Here is how you can use the latest nightly version of Kùzu:
- For the Python API, the latest nightly version can be installed with
pip install --pre kuzu
. - For the Node.js API, the latest nightly version can be installed with
npm i kuzu@next
. - For the Rust API, the latest nightly version can be found at crates.io.
- For the CLI, C and C++ shared library, and the Java JAR, the latest nightly version can be downloaded from the latest run of this GitHub Actions pipeline.
Reduced Binary Size
With this release, we removed our Apache Arrow dependency, which significantly reduces oure binary size. Additionally, we now strip the shared library and CLI binaries of the symbols that are not needed by our client APIs. This further reduces our binary sizes. For example, on a MacOS arm64 platform, these two improvements achieve the following cumulative binary size reductions:
Version 0.0.10 | Version 0.1.0 | |
---|---|---|
Binary Size | 27.2 MB | 10.3 MB |
Stripping of our other libraries (e.g. Python) is a work in progress.
Closing Remarks
As usual, we would like to thank everyone in the Kùzu engineering team, especially our interns, for making this release possible. We look forward to your feedback!
Enjoy Kùzu v 0.1.0 and the upcoming holiday season, which in this part of the world 🇨🇦🇨🇦 coincides with coming of the cold but cozy winter 🤗🤗.