Kuzu 0.10.0 Release

Welcome back! It’s the start of summer in 🇨🇦, and we’re excited to kick it off with the release of Kuzu 0.10.0! Here are the highlights of this release:

Graph algorithms: The main highlight is our new algo extension, which brings graph algorithms to Kuzu. Graph algorithms, such as PageRank, Louvain, or k-core decomposition, are useful for extracting meaningful insights from your connected data. Now you can run them natively in Kuzu with a single CALL function. Whether you’re detecting fraud patterns in financial transactions, optimizing supply chain networks, or analyzing social media interactions, running graph algorithms natively in Kuzu can help you reveal patterns in your graphs and make data-driven decisions.
Neo4j migration extension: With the neo4j extension, you can now migrate your Neo4j databases over to Kuzu with another single CALL function (and a few parameters).
Android support: Kuzu now supports Android devices with ARMv8-A architecture!
Scanning compressed CSV files: The LOAD FROM and COPY FROM commands now support scanning compressed CSV files.
Free-space management: We introduce a new free space management mechanism that reclaims unused disk space. This can significantly reduce the database size on disk after frequent data modifications.

As always, we’ve continued making Kuzu faster and more scalable, and you should see your recursive queries and JSON scans get faster, and the disk size of your database files get smaller! See some performance numbers below. 👇🏽

Graph algorithms

Kuzu’s graph algorithms extension comes with the following key features:

Parallel: Graph algorithms execute in parallel (except those algorithms that cannot be parallelized). If you’re interested in the implementation details, Kuzu internally uses a vertex- and edge-centric parallelization abstraction, similar to the one in the Ligra paper, and runs on the system’s own thread pool (instead of, say, OpenMP).
Disk-based: The algorithms run on “projected graphs”, which can be the entire graph in your database or a subgraph of it. Importantly, the algorithms are disk-based, meaning that they can run on projected graphs that are larger than your available memory. So you don’t need to worry about memory limits as you scale your graph workloads.
Native Cypher integration: You run the algorithms via a CALL function in Cypher and can combine the outputs with arbitrary Cypher queries.
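
To give a feel for what a vertex-centric, frontier-based computation looks like, here is a minimal Python sketch of weakly connected components via minimum-label propagation. It is written in the general spirit of the Ligra model and is not Kuzu's implementation: the function name, the chunking scheme, and the toy edge list are assumptions for illustration, and Python threads only show the structure here rather than real parallel speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def weakly_connected_components(num_nodes, edges, num_threads=4):
    """Label every node with the smallest node id in its weak component."""
    neighbors = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)  # edge direction is ignored for weak connectivity

    labels = list(range(num_nodes))    # each node starts in its own component
    frontier = list(range(num_nodes))  # nodes whose label may still spread

    def propose(chunk):
        # Read-only pass: each active node offers its label to larger-labelled neighbors.
        return [
            (nbr, labels[node])
            for node in chunk
            for nbr in neighbors[node]
            if labels[node] < labels[nbr]
        ]

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        while frontier:
            chunks = [frontier[i::num_threads] for i in range(num_threads)]
            # Collect all proposals first, then apply them on a single thread.
            results = list(pool.map(propose, chunks))
            next_frontier = set()
            for proposals in results:
                for node, label in proposals:
                    if label < labels[node]:
                        labels[node] = label
                        next_frontier.add(node)
            frontier = sorted(next_frontier)
    return labels


# Hypothetical toy graph: components {0, 1, 2}, {3, 4}, and the isolated node 5.
component_labels = weakly_connected_components(6, [(0, 1), (1, 2), (3, 4)])
print(component_labels)
```

Only nodes whose label changed in a round stay in the frontier, so the work per round shrinks as the labels settle.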

The initial release includes the following algorithms in our algo extension:

Weakly connected components
Strongly connected components (a parallel BFS-based and a single-threaded one based on Kosaraju’s algorithm)
PageRank
K-Core decomposition
Louvain

In addition, you can compute weighted shortest paths directly in your Cypher path patterns in the MATCH clause using the WSHORTEST keyword, e.g., ()-[:Label* WSHORTEST(property) 1..max]->(). See the documentation for weighted shortest paths here. We will be implementing many more algorithms in future releases, so stay tuned!

Using graph algorithms

All algorithms are available via the algo extension, except for weighted shortest paths, which is built into the MATCH clause. Let’s look at an example usage of graph algorithms using the well-known LDBC SNB dataset. The example below finds the top 10 influencers in the LDBC SNB graph based on the (person)-[:knows]->(person) subgraph and counts how many posts they published.

// Install and load the graph algorithm extension
INSTALL algo;
LOAD algo;

// Create a projected graph with person and knows tables.
CALL project_graph(
    'KnowsGraph', 
    ['person'], 
    ['knows']
);

// Run PageRank on projected graph to find the top 10 influencers.
// Then traverse `postHasCreator` to extract all posts
CALL page_rank('KnowsGraph')
WITH node AS person, rank
ORDER BY rank DESC
LIMIT 10
MATCH (person)<-[:postHasCreator]-(post:Post)
RETURN person, COUNT(*);

We first install and load the algo extension. Then, we create a projected graph from the person and knows tables, on which the algorithm is executed. The projected graph is a subgraph that contains only the nodes and relationships specified. It is evaluated only when the algorithm is executed. Kuzu does not materialize projected graphs in memory, and all data is scanned from disk on-the-fly.
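
To make the rank value returned by page_rank more concrete, here is a minimal Python sketch of the textbook power-iteration form of PageRank on a toy knows graph. It is not Kuzu's implementation; the damping factor of 0.85, the iteration count, and the toy edge list are assumptions chosen for illustration.

```python
def page_rank(edges, damping=0.85, iterations=100):
    """Textbook power-iteration PageRank over a list of (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_neighbors = {node: [] for node in nodes}
    for src, dst in edges:
        out_neighbors[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the same "teleport" share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # Nodes without outgoing edges spread their rank over all nodes.
            targets = out_neighbors[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: three people know alice, and alice knows bob.
toy_knows = [("bob", "alice"), ("carol", "alice"), ("dave", "alice"), ("alice", "bob")]
ranks = page_rank(toy_knows)
top = sorted(ranks, key=ranks.get, reverse=True)
print(top[0])  # alice
```

The person with the most incoming knows edges ends up with the highest rank, which is the intuition behind using PageRank to find influencers in the query above.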

Using graph algorithms on filtered graphs

Graph algorithms can also run on a subset of nodes and relationships using a filtered graph. From the example above, we can find influencers within a certain age group, e.g. 30-45, with a filtered projected graph.

// Create a filtered graph selecting person within the age group
CALL project_graph(
    'FilteredKnowsGraph', 
    { 'person': 'n.birthday >= date("1980-01-01") AND n.birthday <= date("1995-12-31")' },
    ['knows']
);

// Run PageRank on filtered projected graph
CALL page_rank('FilteredKnowsGraph')
WITH node AS person, rank
ORDER BY rank DESC
LIMIT 10
MATCH (person)<-[:postHasCreator]-(post:Post)
RETURN person, COUNT(*);

Using these techniques, we can run graph algorithms efficiently and scalably all within Kuzu! Give Kuzu’s graph algorithms extension a try, and check out the docs here.

Performance benchmarks

Let’s next demonstrate the performance and scalability of our graph algorithms in Kuzu on two benchmark datasets under several thread configurations. The two datasets come from different domains and range from tens of millions to billions of edges.

Dataset	# Nodes	# Edges
soc-LiveJournal1	4.8M	68M
datagen-sf10k	100M	9.4B

We report the end-to-end runtimes on a machine with 2xAMD EPYC 9J14 CPUs and 768GB RAM, using 4, 16 and 64 threads for execution.

soc-LiveJournal1

Algorithm	4 threads	16 threads	64 threads
WCC	3.6s	0.9s	0.3s
SCC	9.6s	2.6s	0.9s
SCC-ko	8.3s	8.2s	8.2s
PageRank	19.9s	6.9s	5.1s
K-Core	46.7s	13.3s	9.2s
Louvain	102.2s	42.7s	21.2s
WSP	13.6s	3.8s	1.1s

datagen-sf10k¹

Algorithm	4 threads	16 threads	64 threads
WCC	201.4s	54.3s	19.7s
SCC	9905.6s	2563.7s	849.3s
SCC-ko	273.5s	271.2s	280.0s
PageRank	992.5s	299.4s	146.5s
K-Core	585.6s	151.8s	52.7s
WSP	4.8s	2.5s	0.7s

Observe that, except for Kosaraju’s algorithm, which is single-threaded, our algorithms scale very well as the number of threads increases. Even on a graph with 9.4 billion edges, we’re able to compute some of the batch algorithms within seconds!
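
To put a number on the scaling, the short Python snippet below computes the speedup from 4 to 64 threads for each algorithm on soc-LiveJournal1. The runtimes are copied from the table above; the variable and function names are only illustrative. With 16 times as many threads, perfectly linear scaling would correspond to a 16x speedup.

```python
# End-to-end runtimes in seconds on soc-LiveJournal1 for (4, 16, 64) threads,
# copied from the benchmark table in this post.
runtimes = {
    "WCC": (3.6, 0.9, 0.3),
    "SCC": (9.6, 2.6, 0.9),
    "SCC-ko": (8.3, 8.2, 8.2),
    "PageRank": (19.9, 6.9, 5.1),
    "K-Core": (46.7, 13.3, 9.2),
    "Louvain": (102.2, 42.7, 21.2),
    "WSP": (13.6, 3.8, 1.1),
}


def speedup_4_to_64(name):
    """Ratio of the 4-thread runtime to the 64-thread runtime."""
    four_threads, _, sixty_four_threads = runtimes[name]
    return four_threads / sixty_four_threads


speedups = {name: round(speedup_4_to_64(name), 1) for name in runtimes}
for name, value in speedups.items():
    print(f"{name}: {value}x")
```

The single-threaded Kosaraju variant (SCC-ko) stays at roughly 1x, as expected.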

Neo4j migration extension

A lot of Kuzu users have previously used or are using Neo4j. To make it easier to migrate graph data from Neo4j to Kuzu, we introduce a neo4j extension, which automatically imports nodes and relationships from Neo4j to Kuzu databases.

Install the neo4j extension in Kuzu and then run the following query that specifies the Neo4j database host, username, password, and the node and relationship labels you want to migrate.

CALL neo4j_migrate(
    'url',
    'user_name',
    'password',
    ['node_label_1', 'node_label_2', ...],
    ['rel_label_1', 'rel_label_2', ...]
)

After the function executes, you’ll have your Neo4j database replicated in Kuzu! Note that the Neo4j migration functionality requires Neo4j’s APOC extension to be installed in your Neo4j instance. You can find more details on setting up the APOC extension in Neo4j here. Read more on Kuzu’s Neo4j extension here.

Android support

If you’re interested in building Android mobile applications using Kuzu, we’re excited to now support Android devices with ARMv8-A architecture! You can use Kuzu’s Java API, which works directly with Android Studio projects. Alternatively, you can use the C/C++ dynamic library and our shell.

Scanning compressed CSV

Many of our users have been requesting the ability to scan or copy their compressed CSV files directly in Kuzu. Starting with this release, Kuzu can scan or copy gzip-compressed CSV files (such as user.csv.gz) directly.

Consider a user.csv file. Let’s first compress it with gzip into a new file named user.csv.gz, keeping the original file, as follows:

gzip -k user.csv

The compressed CSV file can now be scanned or copied into a Kuzu table with the same commands you use for copying or scanning uncompressed CSV files:

// scan
LOAD FROM 'user.csv.gz' RETURN *;
// copy
COPY User FROM 'user.csv.gz';

Scanning compressed CSV files directly in Kuzu is useful when importing data from external systems that store data in compressed CSV format, such as object stores. See here for more details on this feature.
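
If you produce your CSV files from a script rather than the shell, the same compression step can be done with Python's standard gzip module. The sketch below is only an illustration: the helper name gzip_keep, the temporary directory, and the name and age columns are assumptions, not part of Kuzu or of the example dataset.

```python
import csv
import gzip
import os
import shutil
import tempfile


def gzip_keep(src_path):
    """Compress src_path to src_path + '.gz' and keep the original, like `gzip -k`."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


# Write a small, hypothetical user.csv into a temporary directory.
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "user.csv")
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows([["name", "age"], ["Adam", 30], ["Karissa", 40]])

gz_path = gzip_keep(csv_path)

# Read the compressed file back to confirm the content survived the round trip.
with gzip.open(gz_path, "rt", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

The resulting user.csv.gz can then be passed to LOAD FROM or COPY FROM exactly as in the commands above.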

Performance Improvements

You’ll notice that Kuzu is now faster than ever, particularly for recursive queries and JSON scans. Additionally, this release includes a new free space management mechanism to help reduce disk space usage, so you can keep your disk utilization in check.

Free Space Management

Reclaiming disk space has been a long-awaited feature in Kuzu. Up until this release, databases became bloated after frequent data modifications, such as updates and deletions. We introduce a new free space management mechanism that is able to reclaim free space in three key scenarios:

When tables are dropped
When tuples within a node group are all deleted
When column chunks are rewritten due to updates or insertions

If you have workloads in which you create and drop tables or insert and delete large numbers of tuples from tables, you can see a significant reduction in disk space usage from this feature!
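
The general idea behind reclaiming space can be sketched with a toy page allocator that keeps a list of freed pages and hands them out again before growing the file. This is a conceptual Python illustration only and says nothing about how Kuzu implements it; the class name, the page-level granularity, and the page counts are all assumptions.

```python
class ToyPageAllocator:
    """Toy model of a database file that reuses freed pages before growing."""

    def __init__(self):
        self.num_pages = 0    # current size of the file, in pages
        self.free_pages = []  # pages released by dropped or rewritten data

    def allocate(self, count):
        pages = []
        # Reuse previously freed pages first.
        while self.free_pages and len(pages) < count:
            pages.append(self.free_pages.pop())
        # Grow the file only when no free pages are left.
        while len(pages) < count:
            pages.append(self.num_pages)
            self.num_pages += 1
        return pages

    def free(self, pages):
        self.free_pages.extend(pages)


allocator = ToyPageAllocator()
table_a = allocator.allocate(100)  # create and fill a table
allocator.free(table_a)            # drop the table
table_b = allocator.allocate(80)   # create another table
print(allocator.num_pages)         # 100: the file did not grow
```

Without the free list, the second table would have pushed the file to 180 pages, which is the kind of bloat described above.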

Faster recursive queries

In v0.9.0, we introduced a parallel framework for recursive queries. This framework used to default to using dense data structures to store the intermediate data for recursive queries. By “dense”, we mean that we allocated space for each node in the graph during the recursive query execution, even if the recursive query traversed only a small fraction of the database. For some queries, this was becoming a bottleneck.

In v0.10.0, we eliminate the bottleneck by starting with a small allocation and dynamically growing the data structure. Expect to see some of your recursive queries getting much faster in this release!
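
The difference between the two approaches can be illustrated with a small Python sketch that runs the same breadth-first traversal twice: once with a distance array sized for every node, and once with a dictionary that grows only as nodes are reached. This is a simplified illustration and not Kuzu's data structure; the function names and the toy graph are assumptions.

```python
from collections import deque


def bfs_dense(adjacency, num_nodes, source):
    """BFS that allocates one distance slot per node in the graph up front."""
    distance = [-1] * num_nodes  # cost is proportional to the whole graph
    distance[source] = 0
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adjacency.get(node, []):
            if distance[nbr] == -1:
                distance[nbr] = distance[node] + 1
                queue.append(nbr)
    return distance


def bfs_sparse(adjacency, source):
    """BFS that starts small and grows only as new nodes are visited."""
    distance = {source: 0}  # cost is proportional to the visited nodes
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adjacency.get(node, []):
            if nbr not in distance:
                distance[nbr] = distance[node] + 1
                queue.append(nbr)
    return distance


# Hypothetical graph with one million nodes where the query only touches four.
num_nodes = 1_000_000
adjacency = {0: [1], 1: [2], 2: [3]}

dense = bfs_dense(adjacency, num_nodes, source=0)
sparse = bfs_sparse(adjacency, source=0)
print(len(dense), len(sparse))  # 1000000 4
```

Both versions compute the same distances, but the sparse one allocates four entries instead of one million when the traversal touches only a small part of the graph.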

JSON scan

In v0.10.0, we’ve also significantly improved the performance of JSON scanning. Below, we show a micro-benchmark that scans the LDBC-100 Comment table in JSON format.

LOAD FROM 'comment.json' RETURN ID, content;

Version	1 thread	2 threads	4 threads
v0.9.0	190s	99s	52s
v0.10.0	56s	27s	15s

If you’re loading data from JSON sources, you should see a noticeable performance improvement in your workload when you migrate to v0.10.0.

Closing remarks

At Kuzu, we want to make it easier for developers to work with large graphs. With the addition of a fully native graph algorithms extension, we now provide a more complete feature set for your graph analytics needs. All these new features and performance improvements have been a long time in the making, and we’re excited to see users push the boundaries of what’s possible with graph databases. If you’re working with large graphs and want to run graph algorithms such as PageRank, give the latest release a try!

We recently crossed 2000 stars on GitHub, and we’re seeing more activity on our Discord server than ever before. Thank you all for your questions and feedback — we’re always looking for ways to make Kuzu even better. Keep them coming, and see you in the next release!

— The Kuzu Team

¹ Our Louvain implementation currently requires building an in-memory graph, which is memory-intensive and needs more than the default buffer manager size on datagen-sf10k, so we omit Louvain for this dataset. This will be improved in future releases.