Kuzu 0.6.0 Release

It’s been a short while since our last release, but we’re back with a new version of Kuzu: 0.6.0! This release comes with several bug fixes, CLI updates and a much awaited feature: in-memory mode for Kuzu to quickly create temporary databases in memory. Many users had asked for this feature, so we hope it simplifies and possibly speeds up some of your workloads. In this post, we’ll give an overview of the in-memory feature and provide some insights about the performance benefits to expect when using this mode in Kuzu. We’ll also highlight an improvement to the CLI that allows you to change the output mode of query results.

In-memory mode

Opening an in-memory database

Kuzu now supports both “on-disk” and “in-memory” modes. As you create your databases, if you do not specify a database path, specify an empty string, or explicitly specify:memory:, Kuzu will be opened under in-memory mode.

Here’s how to do this using the CLI (simply run the kuzu command in your terminal):

❯ kuzu
Opened the database under in-memory mode.
Enter ":help" for usage hints.
kuzu>

In Python, you can leave the database path empty in the Database constructor:

import kuzu

# Leave the database path empty to open it under in-memory mode
db = kuzu.Database()
conn = kuzu.Connection(db)

# Create node and relationship tables and insert data under in-memory mode

For the other language APIs, you can pass an empty string or :memory: to the Database constructor.

The main differences between using in-memory mode and on-disk mode are:

There are no writes to the write-ahead-log (WAL) during transactions, so no data is persisted to disk (so CHECKPOINT will do nothing).
All data is lost when the process finishes.

Importantly, your databases under in-memory mode are temporary, which can be useful in many scenarios that require performing quick graph querying and analysis on subsets of records, without the need to persist the data.

Performance characteristics

Due to the above differences, in-memory mode and on-disk mode can present different performance characteristics. The table below shows performance numbers for four experiments we ran:

COPY of the LDBC 100 Comment table
1M insertions (each insert of a node is an auto-transaction) into a node table named nodeT
Full table scan over the Comment node table
2-hop join over the LDBC 100 Knows table

All experiments were run on a server with 384 GB RAM, 2TB SSD, and 2 Intel Xeon Platinum 8175M CPUs.

// 1. COPY from CSV file
COPY Comment FROM 'ldbc/ldbc-100/csv/comment_0_0.csv' (DELIM="|", HEADER=true);

// 2. Insert each record's values as parameters via an individual transaction using a client API
CREATE NODE TABLE nodeT(id INT64, name STRING, age INT64, net_worth FLOAT, PRIMARY KEY (id));
CREATE (:nodeT {id: $id, name: $name, age: $age, net_worth: $net_worth});

// 3. Full table scan
MATCH (c:Comment)
RETURN MIN(c.ID), MIN(c.creationDate), MIN(c.locationIP), MIN(c.browserUsed), MIN(c.content), MIN(c.length);

// 4. Perform a 2-hop join
MATCH (a:Person)-[:Knows]->(b:Person)-[:Knows]->(c:Person)
RETURN MIN(a.birthday), MIN(b.birthday), MIN(c.birthday);

Experiment	On-disk (s)	In-memory (s)
`COPY`	34.58	14.79
Insert	79.31	47.81
Scan	5.33 (cold) / 1.80 (warm)	1.89
2-hop Join	0.95 (cold) / 0.90 (warm)	0.90

The key takeaways are:

The performance of COPY and large scans during cold runs are much improved (from 34.6s to 14.8s in this experiment) under the in-memory mode compared to the on-disk mode. This is due to avoiding all disk I/Os that the on-disk mode has to do to persist the data.
Similarly, the performance of insertions is significantly improved (from 79.3s to 47.8s in this experiment) because there are no writes to the WAL (which would require writing and syncing the disk file).
For “cold” scans, i.e., the initial scans that are done when the database starts and the buffer manager is empty, also improve significantly (from 5.3s to 1.9s).
For large scans during warm runs, the performance difference between the two modes is negligible (1.89s vs. 1.8s), since the required pages are already cached in the buffer manager.
For 2-hop joins, where the performance bottleneck is in the joins and not the scans, the performance difference is negligible in both cold and warm runs of the query.

Overall, you can expect the in-memory mode to improve the performance of your data ingestion pipelines, such as a COPY statement or your write-heavy transaction workloads. You can also expect visible performance improvements if you are running a query only once before closing your database. Scenarios where you only need a temporary database to run a few queries, a few times are where you can expect good performance benefits using in-memory mode.

See our documentation page for more details on how to work with in-memory databases.

CLI output mode

The CLI now supports changing the output mode of query results via the :mode [mode] command. By default, the output mode is set to box, but you can change it to any one the modes listed below. To display all available output modes, simply type the :mode command without any arguments when you are in the Kuzu shell.

kuzu> :mode
Available output modes:
    box (default):    Tables using unicode box-drawing characters
    column:    Output in columns
    csv:    Comma-separated values
    html:    HTML table
    json:    Results in a JSON array
    jsonlines:    Results in a NDJSON format
    latex:    LaTeX tabular environment code
    line:    One value per line
    list:    Values delimited by "|"
    markdown:    Markdown table
    table:    Tables using ASCII characters
    tsv:    Tab-separated values
    trash:    No output

Let’s see this feature in action with a simple example. We’ll first create a node table of persons and then query the table to display the results in different output modes.

CREATE NODE TABLE Person (name STRING, age INT64, PRIMARY KEY(name));
CREATE (p:Person {name: 'Alice'}) SET p.age = 30;
CREATE (p:Person {name: 'Bob'}) SET p.age = 25;
CREATE (p:Person {name: 'Charlie'}) SET p.age = 35;

By default the results of a MATCH query are displayed inside a box:

kuzu> MATCH (p:Person) RETURN p.*;
┌─────────┬───────┐
│ p.name  │ p.age │
│ STRING  │ INT64 │
├─────────┼───────┤
│ Alice   │ 30    │
│ Bob     │ 25    │
│ Charlie │ 35    │
└─────────┴───────┘

Here’s the same query but with the output mode set to csv:

kuzu> :mode csv
mode set as csv
kuzu> MATCH (p:Person) RETURN p.*;
p.name,p.age
Alice,30
Bob,25
Charlie,35

And here’s the same query but with the output mode set to jsonlines (newline-delimited JSON):

kuzu> :mode jsonlines
mode set as jsonlines
kuzu> MATCH (p:Person) RETURN p.*;
{"p.name":"Alice","p.age":"30"}
{"p.name":"Bob","p.age":"25"}
{"p.name":"Charlie","p.age":"35"}

Depending on your use case downstream, you can set the output mode to the one that best suits your needs. Read more details about this feature on the documentation page.

Closing remarks

The in-memory feature from this release is the first of many more usability and performance improvements in our upcoming roadmap for Kuzu. You can check the release notes on GitHub for a comprehensive list of the bugfixes and updates in this release. Once you give these features a try, come on over to our Discord with your feedback. Till next time, have fun using Kuzu!