Kùzu 0.4.0 Release

Kùzu Team
Developers of Kùzu Inc.

With the warmer weather in 🇨🇦 approaching, there’s cause for excitement on more than one front - we’ve just released version 0.4.0 of Kùzu! This is a significant release, as it introduces a new storage layer along with a host of additional features, improvements and extensions, detailed in this post. Let’s gear up!

Features

New extensions: Scanning DuckDB and PostgreSQL

The first feature we discuss is great news for users who want to move data from external relational databases to Kùzu, but don’t want to do additional ETL using intermediate files. Postgres and DuckDB database extensions are here!

In v0.2.0, we introduced the Kùzu extensions framework along with our first extension, httpfs. In v0.4.0, we are happy to introduce two brand-new extensions that connect Kùzu to external databases: DuckDB and PostgreSQL. For now, these extensions are read-only, allowing you to directly scan data from either database (no write support).

Using them is remarkably simple: You first install the required extension and then load it into your Kùzu session. The following snippet shows how this is done via the CLI:

INSTALL postgres;
LOAD EXTENSION postgres;

To use the duckdb extension, replace postgres with duckdb in the commands above. Next, you attach a Postgres database and its associated tables to your Kùzu CLI or client session, providing the necessary connection string parameters:

ATTACH 'dbname=university user=postgres host=localhost password=yourpassword port=5432' AS pg_db_uni (dbtype 'postgres');
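Attaching a DuckDB database works the same way, except that you point to a database file rather than a connection string. A minimal sketch, assuming a local file named employee.duckdb:

ATTACH 'employee.duckdb' AS duckdb_employee (dbtype 'duckdb');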

The following Cypher query passes the results from scanning the Postgres person table to the COPY FROM command in Kùzu, so that you can more easily build graphs from data in external relational databases (using a LOAD FROM subquery in a COPY FROM statement is also a new feature of this release; see below).

CREATE NODE TABLE Person(name STRING, age INT64, PRIMARY KEY (name));
COPY Person FROM (LOAD FROM pg_db_uni.person RETURN *);
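You can also scan an attached table directly, with optional filters, without copying anything into Kùzu. A sketch, assuming the person table has name and age columns as above:

LOAD FROM pg_db_uni.person
WHERE age > 25
RETURN name, age;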

You can list the databases attached to a given Kùzu session at any time with the show_attached_databases() call:

CALL show_attached_databases() RETURN *;
// Output
-----------------------------------
| name            | database type |
-----------------------------------
| pg_db_uni       | POSTGRES      |
-----------------------------------
| duckdb_employee | DUCKDB        |
-----------------------------------

We are planning to develop more extensions like these in the future to fulfill our vision of using Kùzu to seamlessly do graph modeling, querying, and analysis over your raw records wherever they may be residing.

Import/Export database

Because our storage layer is still evolving, migrating between Kùzu versions used to require a manual process: exporting the node and relationship tables of your old Kùzu database to CSV or Parquet files, then, with the newer Kùzu version, creating a new database with the same tables and copying those files back in. You can now migrate databases between different Kùzu versions without this manual process. We have introduced two new commands, EXPORT DATABASE and IMPORT DATABASE, to streamline it. The EXPORT DATABASE command exports the contents of the database to a specified directory. The query below exports the database to an absolute path, /path/to/export, using the same configuration parameters as COPY FROM statements.

EXPORT DATABASE '/path/to/export' (FORMAT="csv", HEADER=true);
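The command above exports the data to CSV with headers included. You can also export to Parquet by changing the format option; a minimal sketch, assuming the same option syntax:

EXPORT DATABASE '/path/to/export' (FORMAT="parquet");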

Alongside the data files, we also generate several files that contain the Cypher commands needed to import the database, including the node and relationship tables and macros, back into Kùzu. You can import the database from /path/to/export into the database your current CLI or client session is connected to with the IMPORT DATABASE command:

IMPORT DATABASE '/path/to/export';

COPY FROM with subquery

COPY FROM is Kùzu’s fastest way to do bulk insertion of records into node and relationship tables. Previously, this feature could only be used to insert data from raw files, such as CSV or Parquet. In v0.4.0, we added support for using a subquery as the source of a COPY FROM statement. This allows you to first perform a task like a MATCH and then use the results of that query as input to the COPY FROM command.

For example, consider a graph with a User node label and a Follows relationship type. We want to create a new Person node table and a Knows relationship table, where a Person “knows” another Person if they follow them. We can use the COPY FROM command with subqueries to achieve this as follows:

// Define node/rel tables
CREATE NODE TABLE Person(name STRING, PRIMARY KEY (name));
CREATE REL TABLE Knows(FROM Person TO Person);
// Run COPY FROM with a subquery
COPY Person FROM (MATCH (a:User) RETURN a.name);
COPY Knows FROM (MATCH (a:User)-[r:Follows]->(b:User) RETURN a.name, b.name);

An alternate use case is directly ingesting data from an existing object, such as a Pandas DataFrame, and using the results as input to the COPY FROM command. This can be combined with predicate filters as follows:

# Assumes that you have a Kùzu connection object named `conn`
# Also assumes that you created a node table named `Person` with columns `name` and `age`
import pandas as pd

df = pd.DataFrame({
    "name": ["Adam", "Karissa", "Zhang", "Noura"],
    "age": [30, 40, 50, 25]
})

conn.execute("COPY Person FROM (LOAD FROM df WHERE age < 30 RETURN *)")

Using subqueries with COPY FROM opens up a wider range of possibilities for data manipulation and transformation prior to inserting data into the database.

Bulk insert into a non-empty table

Recall again that COPY FROM is Kùzu’s fastest way to insert records into tables. In prior releases, the COPY FROM command could only be used to bulk insert data into an empty table. This restriction has now been removed. In Kùzu v0.4.0, you can also bulk insert data into a non-empty table, making it both easier and faster to append data to an existing table.

Below, we show an example of how COPY FROM might be used in conjunction with the subquery feature described earlier. We bulk insert two CSV files with the same structure into a single node table named Person.

// Create node table
CREATE NODE TABLE Person(name STRING, age INT64, PRIMARY KEY (name));
// Run COPY FROM with a subquery
COPY Person FROM (LOAD FROM "person1.csv" RETURN *);
COPY Person FROM (LOAD FROM "person2.csv" RETURN *);

Note that the usual primary key constraints still apply; i.e., if the file person2.csv contains a record whose primary key already exists in the Person table, it will produce a RuntimeError and the transaction will be rolled back. From a performance perspective, you should expect some slowdown in records inserted per second for subsequent bulk inserts (because the system needs to do more I/O to scan the data that is already stored on disk), but it will still be much faster than inserting records one at a time via CREATE commands. We recommend this approach whenever you’re inserting large amounts of data into your database.

Scan from Pandas PyArrow backend

Earlier versions of Kùzu provided the ability to scan data from a Pandas DataFrame using the NumPy backend. In v0.4.0, we added support for PyArrow-backed Pandas DataFrames as well. Make sure to run pip install -U pyarrow pandas before trying the example below.

import kuzu
import pandas as pd

db = kuzu.Database("persons")
conn = kuzu.Connection(db)

# Convert the Pandas DataFrame to a PyArrow-backed DataFrame
df = pd.DataFrame({
    "name": ["Adam", "Karissa", "Zhang", "Noura"],
    "age": [30, 40, 50, 25]
}).convert_dtypes(dtype_backend="pyarrow")

# Scan the PyArrow-backed Pandas DataFrame in Kùzu by referencing the DataFrame object
result = conn.execute("LOAD FROM df RETURN *;")
print(result.get_as_df())
# Output:
#       name  age
# 0     Adam   30
# 1  Karissa   40
# 2    Zhang   50
# 3    Noura   25

How does this work under the hood? Internally, Kùzu uses a layout similar to Apache Arrow’s Array, allowing it to ingest data with a memcpy per memory block, which is far more efficient than a conventional value-by-value copy. Using memcpy means we read directly from the memory blocks of the underlying Arrow objects, avoiding the overhead of converting each value through intermediate Python objects.

As Pandas 2.0 evolves, it is adding more and more support for Arrow-backed DataFrames in Python. Using the PyArrow backend in Pandas offers numerous benefits over the NumPy backend, including better support for strings and nulls, improved performance and better interoperability with other full-fledged Arrow-backed DataFrame libraries (like Polars, cuDF, etc.).
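If you want to check which backend a given DataFrame uses, inspecting its dtypes is enough; an illustrative snippet:

import pandas as pd

df = pd.DataFrame({"name": ["Adam"], "age": [30]})
print(df.dtypes)  # name: object, age: int64 (NumPy backend)
print(df.convert_dtypes(dtype_backend="pyarrow").dtypes)  # name: string[pyarrow], age: int64[pyarrow]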

Better integration with Polars

Although this feature came out in a minor release just prior to this one (v0.3.2), it’s worth mentioning here. Kùzu now allows directly outputting the results of a Cypher query as a Polars DataFrame. The query results are first converted to an Arrow table via our get_as_arrow() method and then seamlessly passed to Polars via its from_arrow() method. This feature was made possible by an external contributor in relatively few lines of code, shown below.

import polars as pl

class QueryResult:
    # ...
    def get_as_pl(self) -> pl.DataFrame:
        # Convert the Arrow table returned by get_as_arrow() into a Polars DataFrame
        return pl.from_arrow(data=self.get_as_arrow())
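
In practice, the user-facing call looks like this (assuming a Kùzu connection conn and the Person table from the earlier examples):

result = conn.execute("MATCH (p:Person) RETURN p.name, p.age;")
pl_df = result.get_as_pl()  # a polars.DataFrame
print(pl_df)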

Apache Arrow is becoming the de facto standard for columnar data interchange in the Python ecosystem. In a future release, we also plan to support native scanning of Polars DataFrames in Python, in a similar way to how we now scan PyArrow-backed Pandas DataFrames.

New data type: ARRAY and additional functions

In v0.4.0, we consolidated the naming of our LIST data types for fixed- and variable-length lists to be more in line with other databases. This introduces some breaking changes for users who were using VAR-LIST (now named LIST) and FIXED-LIST (now named ARRAY). Going forward, we will use the term ARRAY for fixed-length lists, which are intended for use with machine learning workloads that require vector embeddings of a pre-determined length.

We also introduced similarity search functions that operate on ARRAY types: cosine similarity, cross product, and inner (dot) product. This feature is particularly useful for users who want to perform search & retrieval using embeddings stored in Kùzu tables.
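As a sketch of how this might look (the exact function names and cast syntax here are assumptions; check the documentation), consider a 3-dimensional embedding column:

CREATE NODE TABLE Item(id INT64, vec FLOAT[3], PRIMARY KEY (id));
CREATE (:Item {id: 1, vec: CAST([0.12, 0.87, 0.33], 'FLOAT[3]')});
CREATE (:Item {id: 2, vec: CAST([0.54, 0.21, 0.69], 'FLOAT[3]')});
MATCH (i:Item)
RETURN i.id, array_cosine_similarity(i.vec, CAST([0.1, 0.9, 0.3], 'FLOAT[3]')) AS sim
ORDER BY sim DESC;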

Internal ID compression

We now apply compression to the internal IDs in the storage layer. Internally, for each relationship, we store, in each direction, its source and destination node IDs along with a unique relationship ID. All node and relationship IDs are represented as internal IDs, which are now compressed as integer values. Compressing internal IDs can significantly reduce the size of a Kùzu database: for LDBC SF100, we observed a 45% reduction in the size of the data.kz file within the Kùzu database directory.

Version    Size of data.kz for LDBC SF100
0.3.0      126 GB
0.4.0      69 GB

Closing Remarks

This post highlighted just a few of the many features and improvements that came with the 0.4.0 release. We recommend checking out our release notes on GitHub for a more comprehensive list.

We are excited to bring these enhancements to the ever-growing Kùzu user community. As always, our many thanks go out to everyone in the Kùzu team, including our interns and our external contributors for their excellent work in making this release possible. We encourage you to try out the latest release on your own workflows and engage with us on Discord!