We are very happy to release Kùzu 0.0.9 today! This release comes with the following new main features and improvements:
New Features
Load From
Kùzu now supports loading directly from a file without importing into the database through the LOAD FROM
clause. For instance, the following query counts the number of rows whose first column starts with ‘Adam’.
LOAD FROM "user.csv"
WHERE column0 =~ 'Adam*'
RETURN COUNT(*)
LOAD FROM
can also be used as the input source for a bulk update.
LOAD FROM "user.csv"
CREATE (:Person {name: column0, age: to_int64(column1)});
Details can be found in the LOAD FROM documentation page.
Header Schema
By default, Kùzu will read the header of the file to detect column names and types. If no header is available it will use auto-generated names and all columns will be strings. To manually specify the header, you can use LOAD WITH HEADERS ... FROM ...
.
For example, the following query will load name
as a string type for the first column and age
as an INT64 type for the second column.
LOAD WITH HEADERS (name STRING, age INT64) FROM "user.csv"
WHERE name =~ 'Adam*'
RETURN name, age;
If a header is manually specified, Kùzu will try to cast to the given type and throw exceptions if casting fails. More information can be found here.
Transaction Statement
This release replaces the beginReadTransaction()
, beginWriteTransaction()
, commit()
and rollback()
APIs in all language bindings with explicit statements.
BEGIN TRANSACTION;
CREATE (a:User {name: 'Alice', age: 72});
MATCH (a:User) RETURN *;
COMMIT;
The above sequence of statements starts a write transaction, adds a new node, and within the same transaction also reads all of the tuples in User table before committing the transaction. More info on the new transaction statement can be found here.
Comment on Table
You can now add comments to a table using the COMMENT ON TABLE
statement. The following query adds a comment to the User
table.
COMMENT ON TABLE User IS 'User information';
Comments can be extracted through the new SHOW_TABLES()
function.
CALL SHOW_TABLES() RETURN *;
----------------------------------
| name | type | comment |
----------------------------------
| User | NODE | User information |
----------------------------------
| City | NODE | |
----------------------------------
Recursive Relationship Projection
This release expands recursive relationship patterns and enables projection on intermediate nodes and relationships. Previously, Kùzu only supported returning all node and relationship properties on the path.
MATCH (a:User)-[e:Follows*1..2 (r, n | WHERE r.since > 2020)]->(b:User)
RETURN nodes(e), rels(e);
This incurs a significant computational overhead when a user is only interested in a subset of properties on the path. Also, returning all properties makes the result harder to interpret.
Kùzu now allows projection inside recursive relationship patterns using a list-comprehension-like syntax.
MATCH (a:User)-[e:Follows*1..2 (r, n | WHERE r.since > 2020 | {r.since}, {n.name})]->(b:User)
RETURN nodes(e), rels(e);
The query above finds all paths between two users which are between 1 and 2 hops, and where the users followed each other after 2020. The query returns the since
property of any Follow
relationships and the name of any intermediate users.
For more information, check out the new documentation.
The performance improvements are shown in the Performance Improvements section.
CREATE REL GROUP1
We have received a lot of feedback regarding the limitation that a relationship can only be defined over a single pair of node tables. This release introduces a CREATE REL GROUP
statement which has a similar syntax to CREATE REL TABLE
, but allows multiple FROM ... TO ...
clauses. This statement will create a relationship table for each pair internally. When querying, a relationship group is simply syntatic sugar for any of the relationships in the group.
For example, the following statement creates a group containing a Knows_User_User relationship and a Knows_User_City relationship.
CREATE REL TABLE GROUP Knows (FROM User To User, FROM User to City, year INT64);
To query with the group, simply treat it as any other relationship, so:
MATCH (a:User)-[:Knows]->(b) RETURN *;
The query above is equivalent to
MATCH (a:User)-[:Knows_User_User|:Knows_User_city]->(b) RETURN *;
Note
- For
COPY FROM
andCREATE
, we currently don’t support using a relationship group so you need to explicitly specify a single relationship table.
See Create Table for more information.
Data Types & Functions
We introduced a few more numerical data types:
- INT8: 1 byte signed integer
- UINT64: 8 byte unsigned integer
- UINT32: 4 byte unsigned integer
- UINT16: 2 byte unsigned integer
- UINT8: 1 byte unsigned integer
We have also added several casting and list functions. See functions for more information.
Performance Improvements
New CSV and Parquet Reader
In this release, we have started replacing arrow’s CSV and Parquet reader with our own lightweight and customized implementations.
Following DuckDB’s implementation, we’ve replaced arrow’s streaming CSV reader with a parallel one. The parallel CSV reader assumes there are no multi-line strings and provides a large performance boost on multi-threaded machines.
If multi-line strings are present, the CSV reading will fail, and you will need to fall back to single thread mode by setting parallel=false
. See Data Import from CSV Files.
We demonstrate the performance of our parallel csv reader through the new LOAD FROM feature as follows.
LOAD FROM "ldbc-100/comment_0_0.csv" (header = true, delim = '|') RETURN COUNT(*);
# Threads | 1 | 2 | 4 | 8 | 16 |
---|---|---|---|---|---|
Time (s) | 297.19 | 170.71 (1.7x) | 109.38 (2.7x) | 69.01 (4.3x) | 53.28 (5.6x) |
Bitpacking Compression
With this release, we have implemented our first compression algorithm! We are introducing the bitpacking compression algorithm for integers. It is useful when using a large integer type (e.g., INT32 or INT64) for storing small integers, which can be encoded more compactly with fewer bits. This helps both storage and query processing times.
To show the difference, we take the length
column from LDBC Comment
table as an example, which is of type INT32
and whose values range from 2 to 1998.
Together with an auto-increment ID
column as the primary key, we create a node table (ID INT64, length INT32, PRIMARY KEY(ID))
. The loaded data file size, and loading time is listed in the below table. Data file size is largely reduced from 2.6GB to 1.1GB (2.4x), while the data loading time stays the same (75.69s vs. 75.84s).
Reduced data file size also helps reduce disk I/O operations, which can improve query scan performance. We show that with a query that sums all the lengths.
MATCH (l:length) RETURN sum(l.length);
The query time improved from 1.64s to 0.45s (3.6x)!
Data size | Loading time | Query time | |
---|---|---|---|
Without compression | 2.6GB | 75.69s | 1.64s |
With compression | 1.1GB (2.4x) | 75.84s | 0.45s (3.6x) |
More compressions on integers, floats, and string values will be coming soon. Please stay tuned!
Note: The compression is currently only done on node tables. It will be adapted to rel tables in our next release. By default, we turn on compression for all node tables. To disable it, we provide an option when starting the database. For example, starting our CLI with --nocompress
option can disable compression on all write statements to node tables.
General Data Loading Improvement
Data loading time is improved due the following changes:
- Parallel csv reader.
- Compression means we write less data to disk.
- Removed line counting when copying rel tables.
- Dedicated casting functions to avoid string copy.
- Reduced hash index file size.
Files | # Lines | CSV file size | v0.0.8 | v0.0.9 |
---|---|---|---|---|
comment.csv | 220M | 22.49 GB | 187.76s | 131.48s |
person.csv | 0.45M | 43.6M | 1.16s | 0.78s |
likesComment.csv | 242M | 13 GB | 250.64s | 210.72s |
knows.csv | 20M | 1.1 GB | 24.40s | 19.54s |
Projection Pushdown for Recursive Joins
The following two queries both compute paths along the Knows relationship with 1 to 3 hops from a single starting point, and then returns the firstName of all nodes along the path.2
Without Projection:
MATCH (a:Person)-[e:Knows*1..3]->(b:Person)
WHERE a.ID = 933
RETURN properties(nodes(e), 'firstName');
With Projection:
MATCH (a:Person)-[e:Knows*1..3 (r, n | {}, {n.firstName})]->(b:Person)
WHERE a.ID = 933
RETURN properties(nodes(e), 'firstName');
With projection | Without projection |
---|---|
471.9 ms | 3412.8 ms |
With projection, the optimizer can completely avoid materializing a hash table for relationship properties which is a major bottleneck in computation.