Kùzu 0.8.0 Release

Kùzu Team
Kùzu Team
Developers of Kùzu Inc.

It’s 2025, and we’re kicking off the year with the exciting release of Kùzu 0.8.0. This release brings a new feature that distinguishes Kùzu from any other graph database out there — Kùzu-WASM for in-browser graph analytics. You can now run your graph database while keeping all data and compute within your browser session! Our extension ecosystem also has an exciting addition: fts extension for full-text search. You can now run keyword-based search queries using BM25 in Kùzu.

In addition to these new features, we’ve streamlined the developer workflow during relationship table creation by unifying CREATE REL TABLE GROUP into a single, flexible CREATE REL TABLE syntax.

Finally, we’ve significantly improved our execution of aggregation queries via a new parallel hash aggregation mechanism. In the following sections, we dive into details.

New Features

Kùzu-WASM

Starting from version 0.8.0, we are happy to release a WebAssembly (WASM) version of Kùzu that runs within web browsers on a variety of devices.

Why is this exciting? With Kùzu-WASM, you can:

  • Perform interactive data analytics directly in the browser.
  • Achieve low latency with fast in-browser graph analytics.
  • Ensure data privacy if you need data to stay entirely in the browser.

This makes it very useful for use cases like building interactive in-browser graph analytics and visualization in sensitive data environments, where users can analyze their data privately without transferring it to a server. To use Kùzu-WASM, you have to install the kuzu-wasm package first:

npm i kuzu-wasm

Then, you can follow our API documentation to integrate it into your projects. To see Kùzu-WASM in action, we have developed a WebAssembly version of Kùzu Explorer, which is Kùzu’s browser-based graph visualization and CLI. You can check out this demo here.

Full-Text Search (fts) Extension

We’re introducing an fts extension to enable full-text search capabilities using the BM25 scoring algorithm in Kùzu. The FTS index can be built on one or multiple columns in a Kùzu node table. Similar to other GDBMSs, we interpret each node as a “document”. Our implementation of FTS is native, i.e., we do not use any separate libraries, similar to DuckDB’s FTS feature, and is based on the paper Old Dogs Are Great at New Tricks (read the paper for a simple and elegant approach to support FTS natively in columnar systems!).

In our implementation, we leveraged several graph-native capabilities of Kùzu: words from source documents are stored as a Kùzu node table, while the occurrences of words in “documents” are stored as a Kùzu relationship table. This means our default CSR join index on relationship tables serves as an inverted index from words to documents, and allows Kùzu to answer queries very quickly (expect sub-second latencies).

To utilize the FTS index, you have to first install and load the fts extension.

INSTALL fts;
LOAD EXTENSION fts;

The fts extension introduces three functions: CREATE_FTS_INDEX, QUERY_FTS_INDEX and DROP_FTS_INDEX for creating, querying and dropping the index, respectively.

We use an example to demonstrate how the extension can be used to build an index on a Books table and search from it.

// Create the book table and insert data to it
CREATE NODE TABLE Book (ID SERIAL, abstract STRING, author STRING, title STRING, PRIMARY KEY (ID));
CREATE (b:Book {abstract: 'An exploration of quantum mechanics.', author: 'Alice Johnson', title: 'The Quantum World'});
CREATE (b:Book {abstract: 'An introduction to machine learning techniques.', author: 'Emma Brown', title: 'Learning Machines'});
CREATE (b:Book {abstract: 'A fantasy tale of dragons and magic.', author: 'Charlotte Harris', title: 'The Dragon\'s Call'});

// Build a fts-index named `book_index` on the book table
CALL CREATE_FTS_INDEX(
    'Book',   // Table name
    'book_index',   // Index name
    ['abstract', 'author', 'title'],   // Properties to build FTS index on
    stemmer := 'porter'   // Stemmer to use (optional: if left out, the English snowball stemmer is used)
)

// Query the `book_index` table
CALL QUERY_FTS_INDEX('Book', 'book_index', 'quantum machine')
RETURN node.title, score
ORDER BY score DESC;

The following result is returned:

┌───────────────────┬──────────┐
│ node.title        │ score    │
│ STRING            │ DOUBLE   │
├───────────────────┼──────────┤
│ The Quantum World │ 0.857996 │
│ Learning Machines │ 0.827832 │
└───────────────────┴──────────┘

Additionally, we support the conjunctive option for querying the index, which searches for documents containing all the keywords in the query.

To give a sense of the performance, here is a benchmark on our current performance. We used the ms-passage dataset, which contains 8.8M documents that take 2.9GB in raw size. We used a machine with 2 AMD EPYC 7551 CPUs with 64 cores, using 409GB of buffer manager. The index creation, which is not yet optimized, takes 16 minutes. We will work on bringing this down. Queries in the benchmark however generally have sub-second, ~0.5s, latencies.

There are also some limitations for now — FTS indices can only be built on node tables and are immutable. To refresh the index, you need to drop and recreate the index. We will work on removing these limitations in future releases. For more details on how to use FTS in Kùzu, you can check out our documentation here.

Usability Improvements

Deprecation of REL TABLE GROUP

Until now, users had two ways to define relationships between node tables: CREATE REL TABLE and CREATE REL TABLE GROUP. The former was limited to a single FROM and TO node table pair. With the latter, you had the flexibility to define relationships between multiple FROM and TO node table pairs in a single command.

Rather than imposing an additional cognitive load on the user while constructing the database schema, we’ve decided to deprecate the CREATE REL TABLE GROUP syntax, and instead let the CREATE REL TABLE syntax be used more generally with multiple FROM and TO pairs.

The example below illustrates this. Say you have the following node tables:

CREATE NODE TABLE Comment(id INT64 PRIMARY KEY, content STRING);
CREATE NODE TABLE Post(id INT64 PRIMARY KEY, author STRING);

You can now create a relationship table between multiple node table pairs using a single CREATE REL TABLE command (in prior versions you would have had to use the CREATE REL TABLE GROUP command to achieve this).

// Simultaneously create a relationship table with multiple `FROM` and `TO` node table pairs
CREATE REL TABLE IS_REPLY_OF(FROM Comment TO Comment, FROM Comment TO Post, creation TIMESTAMP);

For copying data into a relationship table that has multiple FROM and TO pairs, we provide a new syntax that allows you to specify the FROM and TO node table names explicitly, as shown below:

COPY IS_REPLY_OF FROM 'comment_isReplyOf_comment.csv' (from='Comment', to='Comment');
COPY IS_REPLY_OF FROM 'comment_isReplyOf_post.csv' (from='Comment', to='Post');

Note that if the relationship table has only one source and destination node table, you do not have to specify the source and target node table names. That is, the COPY FROM commands you used in the previous versions of Kùzu are still valid. You can find more detailed documentation here.

DataFrame usability improvements

Skipping erroneous rows when copying from DataFrame

In our prior release, we had introduced a feature that allows users to skip erroneous or malformed rows when copying data into tables from CSV and Parquet files. We now extend this support to DataFrames. Your COPY FROM commands from Pandas or Polars DataFrames can now be more robust to failure due to incorrectly parsed column data types or other problems. When copying from a Pandas or Polars DataFrame into a Kùzu table, you can specify an ignore_errors=true parameter, allowing you to skip rows that might trigger an exception.

COPY User FROM df (ignore_errors = true)

More details can be found in our docs. The same functionality also works for using LOAD FROM (see here), i.e., when you scan the data from the DataFrame while ignoring errors.

SKIP and LIMIT when scanning from DataFrame

The SKIP and LIMIT parameters are now available when scanning data using LOAD FROM for Pandas and Polars DataFrames.

// Skip the first row
LOAD FROM df (skip = 1) RETURN *
// Load the first 10 rows
LOAD FROM df (limit = 10) RETURN *

For more details on skipping rows, check our docs.

Performance improvements

We are a highly performance-oriented system and we continued optimizing the core query processor of the system in this release as well.

Improved parallel aggregations

We’ve implemented a new parallel hash aggregation mechanism that significantly improves the performance of aggregation-heavy queries. In the new implementation, each thread first locally performs aggregation with a fixed-sized hash table. When the local hash table is full, tuples are partitioned into N groups and flushed to N global queues. Each queue maintains tuples for a single group. After the thread has exhausted its sources to perform local aggregation, it turns to read from a global queue to merge the final result for the group until all groups are aggregated.

We did a benchmark evaluation on a few queries adapted from ClickBench. The timing numbers below are for experiments run on a machine with 2 AMD EPYC 7551 processors (32 cores) and 512GB DDR4 memory and 2TB SSDs.

// Q1
MATCH (h:hits) RETURN h.UserID, COUNT(*) ORDER BY COUNT(*) DESC LIMIT 10;
// Q2
MATCH (h:hits) RETURN h.UserID, h.SearchPhrase, COUNT(*) ORDER BY COUNT(*) DESC LIMIT 10;
// Q3
MATCH (h:hits) WHERE h.SearchPhrase <> '' RETURN h.SearchEngineID, h.SearchPhrase, COUNT(*) AS c ORDER BY c DESC LIMIT 10;
// Q4
MATCH (h:hits) WHERE h.SearchPhrase <> '' RETURN h.SearchPhrase, COUNT(*) AS c ORDER BY c DESC LIMIT 10;
QueryRelease1 thread4 threads16 threads64 threads
Q10.8.011.6s5.4s (2.1x)2.6s (4.5x)1.5s (7.7x)
Q10.7.111.3s6.5s (1.7x)5.7s (2.0x)5.9s (1.9x)
Q20.8.018.3s7.5s (2.4x)3.79s (4.8x)2.47s (7.4x)
Q20.7.118s10.3s (1.7x)8.5s (2.1x)8.3s (2.2x)
Q30.8.07.7s3.1s (2.5x)1.55s (5.0x)1.1s (7.0x)
Q30.7.18.2s4.4s (1.9x)3.8s (2.2x)3.8s (2.2x)
Q40.8.07.1s2.8s (2.5x)1.25s (5.7x)0.76s (9.3x)
Q40.7.18.4s4.3s (2.0x)3.67s (2.3x)3.65s (2.3x)

The speedup numbers (e.g., 2.1x, 4.5x, etc.) in each row indicate how much faster the query runs compared to the single-threaded baseline from the same release. The new release (0.8.0) scales much better across multiple threads compared to 0.7.1. For example, Q4 at 64 threads in 0.8.0 runs 9.3 times faster, whereas in 0.7.1, it only achieves 2.3x speedup.

Closing remarks

We extend our thanks to the entire Kùzu team and our incredible interns who have worked very hard to make this release possible. We are thrilled to see Kùzu becoming a more and more feature-rich database, catering to a broader user base and a wide range of use cases. As usual, please share your experience with us – your feedback continues to shape Kùzu’s evolution. From the sometimes unreasonable cold of 🇨🇦, where it was -23°C last week, we hope you enjoy the new release! 🎉 🎉

Till next time!

PS: Oh, and we promise, no tariffs to 🇺🇸; Kùzu is free and open-source!