It’s been a short while since our last release, but we’re back with a new version of Kùzu: 0.6.0! This release comes with several bug fixes, CLI updates and a much-awaited feature: in-memory mode for Kùzu to quickly create temporary databases in memory. Many users had asked for this feature, so we hope it simplifies and possibly speeds up some of your workloads. In this post, we’ll give an overview of the in-memory feature and provide some insights about the performance benefits to expect when using this mode in Kùzu. We’ll also highlight an improvement to the CLI that allows you to change the output mode of query results.
In-memory mode
Opening an in-memory database
Kùzu now supports both “on-disk” and “in-memory” modes.
As you create your databases, if you do not specify a database path, specify an empty string, or explicitly specify :memory:, Kùzu will be opened under in-memory mode.
Here’s how to do this using the CLI (simply run the kuzu command in your terminal):
❯ kuzu
Opened the database under in-memory mode.
Enter ":help" for usage hints.
kuzu>
In Python, you can leave the database path empty in the Database constructor:
import kuzu
# Leave the database path empty to open it under in-memory mode
db = kuzu.Database()
conn = kuzu.Connection(db)
# Create node and relationship tables and insert data under in-memory mode
For the other language APIs, you can pass an empty string or :memory: to the Database constructor.
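The path convention described above can be summarized in a few lines of Python. This is only an illustrative sketch: `is_in_memory_path` is a hypothetical helper, not part of the Kùzu API, and it does not claim to reflect how Kùzu performs the check internally.

```python
from typing import Optional

# Hypothetical helper (NOT part of the Kùzu API): mirrors the path
# convention described in this post for selecting in-memory mode.
IN_MEMORY_MARKER = ":memory:"


def is_in_memory_path(database_path: Optional[str] = None) -> bool:
    """Return True if the given path would select in-memory mode."""
    # No path, an empty string, or the explicit ":memory:" marker
    return database_path is None or database_path in ("", IN_MEMORY_MARKER)


print(is_in_memory_path())             # no path given
print(is_in_memory_path(""))           # empty string
print(is_in_memory_path(":memory:"))   # explicit marker
print(is_in_memory_path("./demo_db"))  # any other path selects on-disk mode
```

A check like this can be handy in application code that needs to warn users when their data will not be persisted.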
The main differences between using in-memory mode and on-disk mode are:
- There are no writes to the write-ahead log (WAL) during transactions, so no data is persisted to disk (and CHECKPOINT will do nothing).
- All data is lost when the process finishes.
Importantly, your databases under in-memory mode are temporary, which can be useful in many scenarios that require performing quick graph querying and analysis on subsets of records, without the need to persist the data.
Performance characteristics
Due to the above differences, in-memory mode and on-disk mode can present different performance characteristics. The table below shows performance numbers for four experiments we ran:
- COPY of the LDBC 100 Comment table
- 1M insertions (each insert of a node is an auto-transaction) into a node table named nodeT
- Full table scan over the Comment node table
- 2-hop join over the LDBC 100 Knows table
All experiments were run on a server with 384 GB RAM, 2 TB SSD, and 2 Intel Xeon Platinum 8175M CPUs.
// 1. COPY from CSV file
COPY Comment FROM 'ldbc/ldbc-100/csv/comment_0_0.csv' (DELIM="|", HEADER=true);
// 2. Insert each record's values as parameters via an individual transaction using a client API
CREATE NODE TABLE nodeT(id INT64, name STRING, age INT64, net_worth FLOAT, PRIMARY KEY (id));
CREATE (:nodeT {id: $id, name: $name, age: $age, net_worth: $net_worth});
// 3. Full table scan
MATCH (c:Comment)
RETURN MIN(c.ID), MIN(c.creationDate), MIN(c.locationIP), MIN(c.browserUsed), MIN(c.content), MIN(c.length);
// 4. Perform a 2-hop join
MATCH (a:Person)-[:Knows]->(b:Person)-[:Knows]->(c:Person)
RETURN MIN(a.birthday), MIN(b.birthday), MIN(c.birthday);
Experiment | On-disk (s) | In-memory (s) |
---|---|---|
COPY | 34.58 | 14.79 |
Insert | 79.31 | 47.81 |
Scan | 5.33 (cold) / 1.80 (warm) | 1.89 |
2-hop Join | 0.95 (cold) / 0.90 (warm) | 0.90 |
The key takeaways are:
- The performance of COPY and of large scans during cold runs is much improved under in-memory mode compared to on-disk mode (COPY goes from 34.6s to 14.8s in this experiment). This is due to avoiding all the disk I/O that on-disk mode has to do to persist the data.
- Similarly, the performance of insertions is significantly improved (from 79.3s to 47.8s in this experiment) because there are no writes to the WAL (which would require writing and syncing the disk file).
- “Cold” scans, i.e., the initial scans that are done when the database starts and the buffer manager is empty, also improve significantly (from 5.3s to 1.9s).
- For large scans during warm runs, the performance difference between the two modes is negligible (1.89s vs. 1.80s), since the required pages are already cached in the buffer manager.
- For 2-hop joins, where the performance bottleneck is in the joins and not the scans, the performance difference is negligible in both cold and warm runs of the query.
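To make the relative gains easier to compare, the timings from the table can be converted into speedup ratios and, for the insert experiment, into approximate throughput. The short sketch below only does arithmetic on the numbers reported above; the `speedup` function and variable names are introduced here for illustration.

```python
# Timings in seconds, copied from the table above: (on-disk, in-memory).
# For Scan and 2-hop Join, the on-disk value is the cold run.
timings = {
    "COPY": (34.58, 14.79),
    "Insert": (79.31, 47.81),
    "Scan (cold)": (5.33, 1.89),
    "2-hop Join (cold)": (0.95, 0.90),
}


def speedup(on_disk_s: float, in_memory_s: float) -> float:
    """Ratio of on-disk to in-memory time (> 1 means in-memory is faster)."""
    return on_disk_s / in_memory_s


for name, (on_disk_s, in_memory_s) in timings.items():
    print(f"{name}: {speedup(on_disk_s, in_memory_s):.2f}x")

# The Insert experiment performs 1M single-row transactions, so the
# timings also give an approximate throughput in inserts per second.
NUM_INSERTS = 1_000_000
for mode, seconds in (("on-disk", 79.31), ("in-memory", 47.81)):
    print(f"{mode}: {NUM_INSERTS / seconds:,.0f} inserts/s")
```

On these numbers, in-memory mode is roughly 2.3x faster for COPY, 1.7x faster for inserts, and 2.8x faster for cold scans, while the 2-hop join is essentially unchanged.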
Overall, you can expect in-memory mode to improve the performance of your data ingestion pipelines, such as a COPY statement, or your write-heavy transaction workloads. You can also expect visible performance improvements if you are running a query only once before closing your database. Scenarios where you only need a temporary database to run a few queries, a few times, are where you can expect good performance benefits from in-memory mode.
See our documentation page for more details on how to work with in-memory databases.
CLI output mode
The CLI now supports changing the output mode of query results via the :mode [mode] command. By default, the output mode is set to box, but you can change it to any one of the modes listed below.
To display all available output modes, simply type the :mode command without any arguments when you are in the Kùzu shell.
kuzu> :mode
Available output modes:
box (default): Tables using unicode box-drawing characters
column: Output in columns
csv: Comma-separated values
html: HTML table
json: Results in a JSON array
jsonlines: Results in a NDJSON format
latex: LaTeX tabular environment code
line: One value per line
list: Values delimited by "|"
markdown: Markdown table
table: Tables using ASCII characters
tsv: Tab-separated values
trash: No output
Let’s see this feature in action with a simple example. We’ll first create a node table of persons and then query the table to display the results in different output modes.
CREATE NODE TABLE Person (name STRING, age INT64, PRIMARY KEY(name));
CREATE (p:Person {name: 'Alice'}) SET p.age = 30;
CREATE (p:Person {name: 'Bob'}) SET p.age = 25;
CREATE (p:Person {name: 'Charlie'}) SET p.age = 35;
By default, the results of a MATCH query are displayed inside a box:
kuzu> MATCH (p:Person) RETURN p.*;
┌─────────┬───────┐
│ p.name │ p.age │
│ STRING │ INT64 │
├─────────┼───────┤
│ Alice │ 30 │
│ Bob │ 25 │
│ Charlie │ 35 │
└─────────┴───────┘
Here’s the same query but with the output mode set to csv:
kuzu> :mode csv
mode set as csv
kuzu> MATCH (p:Person) RETURN p.*;
p.name,p.age
Alice,30
Bob,25
Charlie,35
And here’s the same query but with the output mode set to jsonlines (newline-delimited JSON):
kuzu> :mode jsonlines
mode set as jsonlines
kuzu> MATCH (p:Person) RETURN p.*;
{"p.name":"Alice","p.age":"30"}
{"p.name":"Bob","p.age":"25"}
{"p.name":"Charlie","p.age":"35"}
Depending on your use case downstream, you can set the output mode to the one that best suits your needs. Read more details about this feature on the documentation page.
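As one example of downstream use, the jsonlines output shown above can be parsed with Python’s standard json module. The sketch below uses the sample output from this post as an inline string; how you capture the CLI output (for example, by saving it to a file) is left open, and `parse_jsonlines` is a helper name introduced here for illustration.

```python
import json
from typing import Dict, List

# Sample output copied from the jsonlines example above. In practice this
# text would come from wherever you saved or piped the CLI output.
cli_output = """\
{"p.name":"Alice","p.age":"30"}
{"p.name":"Bob","p.age":"25"}
{"p.name":"Charlie","p.age":"35"}
"""


def parse_jsonlines(text: str) -> List[Dict[str, str]]:
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_jsonlines(cli_output)

# In the sample above, the INT64 column is rendered as a JSON string,
# so convert it explicitly before doing arithmetic on it.
ages = {rec["p.name"]: int(rec["p.age"]) for rec in records}
print(ages)
```

Because each line is an independent JSON object, this format also lends itself to streaming large result sets line by line rather than loading them all at once.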
Closing remarks
The in-memory feature from this release is the first of many more usability and performance improvements in our upcoming roadmap for Kùzu. You can check the release notes on GitHub for a comprehensive list of the bug fixes and updates in this release. Once you give these features a try, come on over to our Discord with your feedback. Till next time, have fun using Kùzu!