With the warmer weather in 🇨🇦 approaching, there’s cause for excitement on more than one front - we’ve just released version 0.4.0 of Kùzu! This is a significant release, as it introduces a new storage layer along with a host of additional features, improvements and extensions, detailed in this post. Let’s gear up!
Features
New extensions: Scanning DuckDB and PostgreSQL
The first feature we discuss is great news for users who want to move data from external relational databases to Kùzu, but don’t want to do additional ETL using intermediate files. Postgres and DuckDB database extensions are here!
In v0.2.0, we introduced the Kùzu extensions framework and our first extension, httpfs. In v0.4.0, we are happy to introduce two brand new extensions that connect Kùzu to two external databases: DuckDB and PostgreSQL.
For now, these extensions are read-only: you can directly scan data from either database, but not write to it.
Using them is remarkably simple: You first install the required extension and then load it into your Kùzu session. The following snippet shows how this is done via the CLI:
INSTALL postgres;
LOAD EXTENSION postgres;
To use the duckdb extension, simply replace postgres in the above commands with duckdb. Next, you attach a Postgres database and its associated tables to your Kùzu CLI or client session, providing the necessary connection string parameters:
ATTACH 'dbname=university user=postgres host=localhost password=yourpassword port=5432' AS pg_db_uni (dbtype 'postgres');
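The duckdb extension works the same way, except that you attach a DuckDB database file by its path rather than by a connection string. Below is a minimal sketch using the Python client; the database directory graph_db and the file employee.duckdb are hypothetical (the alias matches the duckdb_employee entry shown in the output further down).
import kuzu

db = kuzu.Database("graph_db")  # hypothetical Kùzu database directory
conn = kuzu.Connection(db)

# Install and load the DuckDB extension, then attach a local DuckDB file
conn.execute("INSTALL duckdb;")
conn.execute("LOAD EXTENSION duckdb;")
conn.execute("ATTACH 'employee.duckdb' AS duckdb_employee (dbtype 'duckdb');")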
The following Cypher query passes the results from scanning the Postgres person table to the COPY FROM command in Kùzu, so that you can more easily build graphs from data in external relational databases (using a LOAD FROM subquery in a COPY FROM statement is also a new feature of this release; see below).
CREATE NODE TABLE Person(name STRING, age INT64, PRIMARY KEY (name));
COPY Person FROM (LOAD FROM pg_db_uni.person RETURN *);
You can list the databases attached to a given Kùzu session at any time with the show_attached_databases() call:
CALL show_attached_databases() RETURN *;
// Output
-----------------------------------
| name            | database type |
-----------------------------------
| pg_db_uni       | POSTGRES      |
-----------------------------------
| duckdb_employee | DUCKDB        |
-----------------------------------
We are planning to develop more extensions like these in the future to fulfill our vision of using Kùzu to seamlessly do graph modeling, querying, and analysis over your raw records, wherever they reside.
Import/Export database
Because our storage layer is still evolving, migrating between Kùzu versions previously required a manual process: exporting your old database's node and relationship tables to CSV or Parquet files, then creating a new database with the same tables in the newer Kùzu version and copying the files back in. You can now migrate databases between Kùzu versions without this manual process.
We have introduced two new commands, EXPORT DATABASE and IMPORT DATABASE, to streamline this process. The EXPORT DATABASE command allows you to export the contents of the database to a specific directory. The query below exports the database to an absolute path, /path/to/export, utilizing the same configuration parameters as COPY FROM statements.
EXPORT DATABASE '/path/to/export' (FORMAT="csv", HEADER=true);
The data is exported to CSV with headers included, but you can also export to Parquet if desired. We also generate several files that contain the Cypher commands needed to import the database (its node and relationship tables, as well as any macros) back into Kùzu.
You can import the database from /path/to/export into the database your current CLI or client session is connected to with the IMPORT DATABASE command:
IMPORT DATABASE '/path/to/export';
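Putting the two commands together, a version migration might look like the following sketch using the Python API. The paths here are placeholders; the export must be run with the old Kùzu version installed and the import with the new one.
import kuzu

# Step 1: with the *old* Kùzu version installed, export the existing database
old_db = kuzu.Database("./old_db")  # placeholder path to the old database
old_conn = kuzu.Connection(old_db)
old_conn.execute("""EXPORT DATABASE '/path/to/export' (FORMAT="csv", HEADER=true);""")

# Step 2: after upgrading Kùzu, import into a fresh database
new_db = kuzu.Database("./new_db")  # placeholder path for the new database
new_conn = kuzu.Connection(new_db)
new_conn.execute("IMPORT DATABASE '/path/to/export';")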
COPY FROM with subquery
COPY FROM is Kùzu's fastest way to do bulk insertion of records into node and relationship tables. Previously, this feature could only be used to insert data from raw files, such as CSV or Parquet. In v0.4.0, we added support for using subqueries in COPY FROM statements. This feature allows you to first perform a query like a MATCH and then use its results as input to the COPY FROM command.
For example, consider a graph with a User node label and a Follows relationship type. We want to create a new Person node table and a Knows relationship table, where the goal is to state that a Person “knows” another Person if one follows the other. We can use the COPY FROM command with a subquery to achieve this as follows:
// Define node/rel tables
CREATE NODE TABLE Person(name STRING, PRIMARY KEY (name));
CREATE REL TABLE Knows(FROM Person TO Person);
// Run COPY FROM with a subquery
COPY Person FROM (MATCH (a:User) RETURN a.name);
COPY Knows FROM (MATCH (a:User)-[r:Follows]->(b:User) RETURN a.name, b.name);
An alternate use case is when you want to directly ingest data from an existing object, such as a Pandas DataFrame, and use the results as input to the COPY FROM command. This can be combined with predicate filters as follows:
# Assumes that you have a Kùzu connection object named `conn`
# Also assumes that you created a node table named `Person` with columns `name` and `age`
import pandas as pd
df = pd.DataFrame({
"name": ["Adam", "Karissa", "Zhang", "Noura"],
"age": [30, 40, 50, 25]
})
conn.execute("COPY Person FROM (LOAD FROM df WHERE age < 30 RETURN *")
Using subqueries with COPY FROM opens up a wider range of possibilities for data manipulation and transformation prior to inserting data into the database.
Bulk insert into a non-empty table
Recall again that COPY FROM is Kùzu's fastest way to insert records into tables. In prior releases, the COPY FROM command could only be used to bulk insert data into an empty table. This restriction has now been removed: in Kùzu v0.4.0, you can also bulk insert data into a non-empty table, making it both easier and faster to append data to an existing table.
Below, we show an example of how COPY FROM might be used in conjunction with the subquery feature described earlier. We have a single table named Person and two CSV files that have the same structure.
// Create node table
CREATE NODE TABLE Person(name STRING, age INT64, PRIMARY KEY (name));
// Run COPY FROM with a subquery
COPY Person FROM (LOAD FROM "person1.csv" RETURN *);
COPY Person FROM (LOAD FROM "person2.csv" RETURN *);
Note that the usual primary key constraints still apply; i.e., if the file person2.csv contains a record whose primary key already exists in the Person table, the copy will produce a RuntimeError and the transaction will be rolled back. From a performance perspective, you should expect some slowdown in records inserted per second for subsequent bulk inserts (because the system needs to do more I/O to scan the data that is already stored on disk), but it will still be much faster than inserting records one at a time via CREATE commands. We recommend that you use this approach if you're inserting large amounts of data into your database.
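If you want to guard an append against such primary key conflicts, you can catch the error from the Python API. A minimal sketch, assuming the Person table and person2.csv from the example above:
# Assumes an existing Kùzu connection object named `conn`
try:
    conn.execute('COPY Person FROM (LOAD FROM "person2.csv" RETURN *);')
except RuntimeError as e:
    # The whole copy is rolled back, so no partial rows are left behind
    print(f"Bulk insert failed: {e}")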
Scan from Pandas PyArrow backend
Earlier versions of Kùzu provided the ability to scan data from a Pandas DataFrame using the NumPy backend.
In v0.4.0, we added support for PyArrow-backed Pandas DataFrames as well.
Make sure to run pip install -U pyarrow pandas before trying the example below.
import kuzu
import pandas as pd
db = kuzu.Database("persons")
conn = kuzu.Connection(db)
# Convert the Pandas DataFrame to a PyArrow-backed DataFrame
df = pd.DataFrame({
"name": ["Adam", "Karissa", "Zhang", "Noura"],
"age": [30, 40, 50, 25]
}).convert_dtypes(dtype_backend="pyarrow")
# Scan the PyArrow-backed Pandas DataFrame in Kùzu by referencing the DataFrame object
result = conn.execute("LOAD FROM df RETURN *;")
print(result.get_as_df())
name age
0 Adam 30
1 Karissa 40
2 Zhang 50
3 Noura 25
How does this work under the hood? Internally, Kùzu uses a layout similar to Apache Arrow's Array, allowing it to perform a memcpy operation, which is more efficient than a conventional copy. Using memcpy means we directly access the values in the memory blocks of the underlying Arrow objects, avoiding the need to move data from the DataFrame's location in memory.
As Pandas 2.0 evolves, it is adding more and more support for Arrow-backed DataFrames in Python. Using the PyArrow backend in Pandas offers numerous benefits over the NumPy backend, including better support for strings and nulls, improved performance and better interoperability with other full-fledged Arrow-backed DataFrame libraries (like Polars, cuDF, etc.).
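As a quick check that a DataFrame is actually Arrow-backed, you can inspect its dtypes after the conversion; with the pyarrow backend, Pandas reports dtypes like string[pyarrow] and int64[pyarrow]:
import pandas as pd

# Convert a small DataFrame to the PyArrow backend and inspect its dtypes
df = pd.DataFrame({"name": ["Adam"], "age": [30]}).convert_dtypes(dtype_backend="pyarrow")
print(df.dtypes)
# name    string[pyarrow]
# age      int64[pyarrow]
# dtype: object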
Better integration with Polars
Although this feature came out in a minor release just prior to this one (v0.3.2), it's worth mentioning here. Kùzu now allows directly outputting the results of a Cypher query as a Polars DataFrame. The query results are converted to an Arrow table via our get_as_arrow() method, and then seamlessly passed to a Polars DataFrame via the Polars from_arrow() method. This feature was made possible thanks to an external contributor, via relatively few lines of code, shown below.
import polars as pl
# class QueryResult:
# ...
# ...
def get_as_pl(self) -> pl.DataFrame:
    return pl.from_arrow(data=self.get_as_arrow())
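Using it is a one-liner on any query result. For example, assuming a database with the Person node table from the earlier examples:
import kuzu

db = kuzu.Database("persons")
conn = kuzu.Connection(db)
result = conn.execute("MATCH (p:Person) RETURN p.name, p.age;")
pl_df = result.get_as_pl()  # a polars.DataFrame
print(pl_df)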
Apache Arrow is becoming the de facto standard for columnar data interchange in the Python ecosystem. In a future release, we also plan to support native scanning of Polars DataFrames in Python, in a similar way to how we now scan PyArrow-backed Pandas DataFrames.
New data type: ARRAY and additional functions
In v0.4.0, we consolidated the naming of our LIST data types for fixed/variable length lists to be more in line with other databases. This introduces some breaking changes for users who were using VAR-LIST (now named LIST) and FIXED-LIST (now named ARRAY). Going forward, we will use the term ARRAY for fixed-length lists, which are intended for use with machine learning algorithms that require vector embeddings of a pre-determined length.
We also introduced similarity search functions that operate on ARRAY types: cosine similarity, dot product, cross product, and inner product. This feature is particularly useful for users who want to perform search & retrieval using embeddings stored in Kùzu tables.
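An ARRAY column is declared with a fixed length in the table schema. Here is a minimal sketch via the Python client; the database directory, table name, and dimensionality are hypothetical, and the exact names of the new similarity functions are listed in the documentation.
import kuzu

db = kuzu.Database("vectors")  # hypothetical database directory
conn = kuzu.Connection(db)

# An ARRAY (fixed-length list) column holding 256-dimensional embeddings
conn.execute("CREATE NODE TABLE Doc(id INT64, embedding FLOAT[256], PRIMARY KEY (id));")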
Internal ID compression
We now apply compression to the internal IDs in the storage layer. Internally, for each relationship, we store, in each direction, its source and destination node IDs along with a unique relationship ID. All node and relationship IDs are represented as internal IDs, which are now compressed as integer values.
Applying compression on internal IDs can result in a significant reduction in the size of a Kùzu database. For LDBC SF100, we observed a 45% reduction in the size of the data.kz file within the Kùzu database directory.
Version | Size of data.kz for LDBC SF100
--- | ---
0.3.0 | 126 GB
0.4.0 | 69 GB
Closing Remarks
This post highlighted just a few of the many features and improvements that came along with the 0.4.0 release. We recommend checking out our release notes on GitHub for a more comprehensive list.
We are excited to bring these enhancements to the ever-growing Kùzu user community. As always, our many thanks go out to everyone in the Kùzu team, including our interns and our external contributors for their excellent work in making this release possible. We encourage you to try out the latest release on your own workflows and engage with us on Discord!