Kùzu 0.0.9 Release

Kùzu Team
Kùzu Team
Developers of Kùzu Inc.

We are very happy to release Kùzu 0.0.9 today! This release comes with the following new main features and improvements:

New Features

Load From

Kùzu now supports loading directly from a file without importing into the database through the LOAD FROM clause. For instance, the following query counts the number of rows whose first column starts with ‘Adam’.

LOAD FROM "user.csv"
WHERE column0 =~ 'Adam*'
RETURN COUNT(*)

LOAD FROM can also be used as the input source for a bulk update.

LOAD FROM "user.csv"
CREATE (:Person {name: column0, age: to_int64(column1)});

Details can be found in the LOAD FROM documentation page.

Header Schema

By default, Kùzu will read the header of the file to detect column names and types. If no header is available it will use auto-generated names and all columns will be strings. To manually specify the header, you can use LOAD WITH HEADERS ... FROM ....

For example, the following query will load name as a string type for the first column and age as an INT64 type for the second column.

LOAD WITH HEADERS (name STRING, age INT64) FROM "user.csv"
WHERE name =~ 'Adam*'
RETURN name, age;

If a header is manually specified, Kùzu will try to cast to the given type and throw exceptions if casting fails. More information can be found here.

Transaction Statement

This release replaces the beginReadTransaction(), beginWriteTransaction(), commit() and rollback() APIs in all language bindings with explicit statements.

BEGIN TRANSACTION;
CREATE (a:User {name: 'Alice', age: 72});
MATCH (a:User) RETURN *;
COMMIT;

The above sequence of statements starts a write transaction, adds a new node, and within the same transaction also reads all of the tuples in User table before committing the transaction. More info on the new transaction statement can be found here.

Comment on Table

You can now add comments to a table using the COMMENT ON TABLE statement. The following query adds a comment to the User table.

COMMENT ON TABLE User IS 'User information';

Comments can be extracted through the new SHOW_TABLES() function.

CALL SHOW_TABLES() RETURN *;
----------------------------------
| name | type | comment          |
----------------------------------
| User | NODE | User information |
----------------------------------
| City | NODE |                  |
----------------------------------

Recursive Relationship Projection

This release expands recursive relationship patterns and enables projection on intermediate nodes and relationships. Previously, Kùzu only supported returning all node and relationship properties on the path.

MATCH (a:User)-[e:Follows*1..2 (r, n | WHERE r.since > 2020)]->(b:User)
RETURN nodes(e), rels(e);

This incurs a significant computational overhead when a user is only interested in a subset of properties on the path. Also, returning all properties makes the result harder to interpret.

Kùzu now allows projection inside recursive relationship patterns using a list-comprehension-like syntax.

MATCH (a:User)-[e:Follows*1..2 (r, n | WHERE r.since > 2020 | {r.since}, {n.name})]->(b:User)
RETURN nodes(e), rels(e);

The query above finds all paths between two users which are between 1 and 2 hops, and where the users followed each other after 2020. The query returns the since property of any Follow relationships and the name of any intermediate users.

For more information, check out the new documentation.

The performance improvements are shown in the Performance Improvements section.

CREATE REL GROUP1

We have received a lot of feedback regarding the limitation that a relationship can only be defined over a single pair of node tables. This release introduces a CREATE REL GROUP statement which has a similar syntax to CREATE REL TABLE, but allows multiple FROM ... TO ... clauses. This statement will create a relationship table for each pair internally. When querying, a relationship group is simply syntatic sugar for any of the relationships in the group.

For example, the following statement creates a group containing a Knows_User_User relationship and a Knows_User_City relationship.

CREATE REL TABLE GROUP Knows (FROM User To User, FROM User to City, year INT64);

To query with the group, simply treat it as any other relationship, so:

MATCH (a:User)-[:Knows]->(b) RETURN *;

The query above is equivalent to

MATCH (a:User)-[:Knows_User_User|:Knows_User_city]->(b) RETURN *;

Note

  • For COPY FROM and CREATE, we currently don’t support using a relationship group so you need to explicitly specify a single relationship table.

See Create Table for more information.

Data Types & Functions

We introduced a few more numerical data types:

  • INT8: 1 byte signed integer
  • UINT64: 8 byte unsigned integer
  • UINT32: 4 byte unsigned integer
  • UINT16: 2 byte unsigned integer
  • UINT8: 1 byte unsigned integer

We have also added several casting and list functions. See functions for more information.

Performance Improvements

New CSV and Parquet Reader

In this release, we have started replacing arrow’s CSV and Parquet reader with our own lightweight and customized implementations.

Following DuckDB’s implementation, we’ve replaced arrow’s streaming CSV reader with a parallel one. The parallel CSV reader assumes there are no multi-line strings and provides a large performance boost on multi-threaded machines.

If multi-line strings are present, the CSV reading will fail, and you will need to fall back to single thread mode by setting parallel=false. See Data Import from CSV Files.

We demonstrate the performance of our parallel csv reader through the new LOAD FROM feature as follows.

LOAD FROM "ldbc-100/comment_0_0.csv" (header = true, delim = '|') RETURN COUNT(*);
# Threads124816
Time (s)297.19170.71 (1.7x)109.38 (2.7x)69.01 (4.3x)53.28 (5.6x)

Bitpacking Compression

With this release, we have implemented our first compression algorithm! We are introducing the bitpacking compression algorithm for integers. It is useful when using a large integer type (e.g., INT32 or INT64) for storing small integers, which can be encoded more compactly with fewer bits. This helps both storage and query processing times.

To show the difference, we take the length column from LDBC Comment table as an example, which is of type INT32 and whose values range from 2 to 1998. Together with an auto-increment ID column as the primary key, we create a node table (ID INT64, length INT32, PRIMARY KEY(ID)). The loaded data file size, and loading time is listed in the below table. Data file size is largely reduced from 2.6GB to 1.1GB (2.4x), while the data loading time stays the same (75.69s vs. 75.84s).

Reduced data file size also helps reduce disk I/O operations, which can improve query scan performance. We show that with a query that sums all the lengths.

MATCH (l:length) RETURN sum(l.length);

The query time improved from 1.64s to 0.45s (3.6x)!

Data sizeLoading timeQuery time
Without compression2.6GB75.69s1.64s
With compression1.1GB (2.4x)75.84s0.45s (3.6x)

More compressions on integers, floats, and string values will be coming soon. Please stay tuned!

Note: The compression is currently only done on node tables. It will be adapted to rel tables in our next release. By default, we turn on compression for all node tables. To disable it, we provide an option when starting the database. For example, starting our CLI with --nocompress option can disable compression on all write statements to node tables.

General Data Loading Improvement

Data loading time is improved due the following changes:

  • Parallel csv reader.
  • Compression means we write less data to disk.
  • Removed line counting when copying rel tables.
  • Dedicated casting functions to avoid string copy.
  • Reduced hash index file size.
Files# LinesCSV file sizev0.0.8v0.0.9
comment.csv220M22.49 GB187.76s131.48s
person.csv0.45M43.6M1.16s0.78s
likesComment.csv242M13 GB250.64s210.72s
knows.csv20M1.1 GB24.40s19.54s

Projection Pushdown for Recursive Joins

The following two queries both compute paths along the Knows relationship with 1 to 3 hops from a single starting point, and then returns the firstName of all nodes along the path.2

Without Projection:

MATCH (a:Person)-[e:Knows*1..3]->(b:Person)
WHERE a.ID = 933
RETURN properties(nodes(e), 'firstName');

With Projection:

MATCH (a:Person)-[e:Knows*1..3 (r, n | {}, {n.firstName})]->(b:Person)
WHERE a.ID = 933
RETURN properties(nodes(e), 'firstName');
With projectionWithout projection
471.9 ms3412.8 ms

With projection, the optimizer can completely avoid materializing a hash table for relationship properties which is a major bottleneck in computation.


Footnotes

  1. This is an experimental feature and might be changed in the future.

  2. This experiment was carried out on an M1 Macbook Pro with 16GB of memory and 8 threads. Sideway information passing is disabled.