We’re just a couple months into 2025, and we are happy to announce a significant new minor release: v0.8.2. Along with several bugfixes, this release is feature-packed, warranting its own blog post. One of the highlights is the introduction of an experimental unity_catalog extension, which allows you to scan/copy from Delta Lake tables managed by Unity Catalog.
We’ve also improved our existing extensions. For those of you on Google Cloud, we have some exciting news! We now support scanning, copying from, and writing to files hosted on Google Cloud Storage (GCS). This update builds on our existing httpfs extension. We also added a usability enhancement to our CLI, which now excludes confidential information, such as S3 access keys, from the command history file. This helps prevent sensitive data from leaking into your command-line history and keeps your credentials secure.
Our full-text search extension now supports customizing the stopwords list used when building an index, which is helpful in specialized domains where words that are not in the default list need to be excluded from the index.
From a performance perspective, we’ve significantly improved our execution of distinct aggregation queries via a new parallel distinct hash aggregation mechanism. In the following sections, we dive into details.
New features
Unity Catalog extension
Unity Catalog, developed by Databricks, is an open data governance solution that allows you to manage your data catalogs, data access, and data sharing. It provides a unified interface for managing data across different data storage systems, including databases, data lakes, and data warehouses.
For Kuzu users who are entrenched in the data lake ecosystem, we’ve introduced an experimental unity_catalog extension to enable scanning/copying from Delta Lake tables managed by Unity Catalog.
A quickstart guide on how to set up a local Unity Catalog server can be found here.
To attach a Unity Catalog, you first have to install and load the unity_catalog extension:
INSTALL unity_catalog;
LOAD EXTENSION unity_catalog;
Similar to attaching to a relational database, a Unity Catalog can be attached using the ATTACH statement. The example below attaches to the catalog named unity:
ATTACH 'unity' AS unity (dbtype uc_catalog);
Once the unity catalog is attached, you can scan from Delta tables hosted by the unity catalog under the default schema as follows:
// Scan the numbers table
LOAD FROM unity.default.numbers
RETURN *
One important use case of the Unity Catalog extension is to facilitate seamless data transfer from tables in Unity Catalog to Kuzu.
// Migrate data from the unity.default.numbers table to the `numbers` table in Kuzu
CREATE NODE TABLE numbers (id INT32, score DOUBLE, PRIMARY KEY(id));
COPY numbers FROM unity.numbers;
Check out the docs for more information on how to use Kuzu with Unity Catalog.
Please note that the unity_catalog extension is experimental and currently only supports scanning from Delta Lake tables.
There are numerous upstream issues that are actively being resolved, so if you are using Kuzu with Unity Catalog, do reach out to us on Discord and let us know how it’s going!
Google Cloud Storage (GCS) support
We’re happy to announce that our httpfs extension now supports working with files on Google Cloud Storage. Any operation that we previously supported with Amazon S3 now also works with GCS. This includes:
- Scanning (COPY/LOAD FROM) from files hosted on GCS
- Writing to (COPY TO) files hosted on GCS
- Attaching to a read-only Kuzu database hosted on GCS
To work with GCS, you must configure Kuzu to use your GCS HMAC keys, per the official docs:
CALL gcs_access_key_id='${GCS_ACCESS_KEY_ID}';
CALL gcs_secret_access_key='${GCS_SECRET_ACCESS_KEY}';
After configuring your credentials, you can access files on GCS by using URLs in the format gcs://${BUCKET_NAME}/${PATH_TO_FILE_IN_BUCKET}:
// scan from GCS
LOAD FROM 'gs://kuzudb-test/user.csv' RETURN *;
// upload to GCS
COPY (MATCH (p:Person) RETURN p) TO 'gcs://kuzudb-test/person.csv' (header=true);
// attach remote DB in GCS
ATTACH 'gcs://kuzudb-test/tinysnb/' AS tinysnb (dbtype kuzu);
You can check out the documentation for our GCS feature here.
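As a small illustration of how the ${...} placeholders above get filled in, the sketch below builds the two credential statements and a gcs:// URL from values you supply. The helper names gcs_setup_statements and gcs_url are hypothetical (they are not part of Kuzu); the option names and URL format are the ones shown in this section. Values containing single quotes would need escaping, which this sketch does not attempt.

```python
import os


def gcs_setup_statements(env=os.environ):
    """Return the two CALL statements that set the GCS HMAC credentials.

    Hypothetical helper: it only formats strings, it does not talk to Kuzu.
    """
    key_id = env["GCS_ACCESS_KEY_ID"]
    secret = env["GCS_SECRET_ACCESS_KEY"]
    return [
        f"CALL gcs_access_key_id='{key_id}';",
        f"CALL gcs_secret_access_key='{secret}';",
    ]


def gcs_url(bucket, path):
    """Build a URL in the gcs://BUCKET/PATH format described above."""
    return f"gcs://{bucket}/{path.lstrip('/')}"
```

The resulting strings can then be run through whichever Kuzu client you use, followed by a LOAD FROM or COPY TO statement that references the URL.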
Stopwords customization in Full-text Search (FTS)
Stopwords in full-text search (FTS) are commonly occurring words that are excluded from indexing and query processing to improve efficiency and relevance. Common words like “the”, “and”, “is”, “in”, etc. (which can be language-specific), are deemed non-essential for search relevance and are ignored during the index creation step.
By default, Kuzu uses a pre-defined list of common English stopwords. However, you may need to customize the stopwords list to better suit your domain and use cases. In v0.8.2, a custom stopword list can be passed as an optional parameter when an FTS index is built, as follows:
CALL CREATE_FTS_INDEX(
'Book', // Table name
'book_index', // Index name
['abstract', 'title'], // Properties to build FTS index on
stemmer := 'porter', // Stemmer to use (optional)
stopwords := 'https://stopwords/porter.csv' // Configure customized stopwords list (optional)
);
The stopwords parameter can be either:
- A file path or URL (as in the example above) pointing to a CSV or Parquet file with a single string column, with each stopword on a separate row.
- A node table with a single string column, with each stopword on a separate row.
This should make it possible for you to tailor the stopword list prior to index creation for your unique use cases.
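If you maintain your domain-specific stopwords in code, a file in the single-column shape described above can be produced with a few lines of Python. This is a minimal sketch: write_stopwords_csv is a hypothetical helper, and it assumes the file should contain no header row, which you should verify against the FTS documentation.

```python
import csv


def write_stopwords_csv(path, stopwords):
    """Write one stopword per row into a single-column CSV file.

    Blank entries are dropped and duplicates removed (first occurrence wins).
    Assumption: no header row is written.
    """
    cleaned = list(dict.fromkeys(w.strip() for w in stopwords if w.strip()))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for word in cleaned:
            writer.writerow([word])
    return cleaned
```

The path of the resulting file is what you would pass as the stopwords parameter when creating the index.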
Performance improvements
Continuing the aggregation parallelization work started in v0.8.0, we have now also parallelized aggregations that use the DISTINCT keyword, by extending the method used for parallel hash aggregation. Non-grouped aggregates, e.g. RETURN COUNT(DISTINCT ...);, are still executed serially and will be parallelized in a future release.
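To give a feel for why a grouped COUNT(DISTINCT ...) can be parallelized at all, here is a generic sketch of hash-partitioned distinct aggregation. It is not Kuzu's implementation, whose internals this post does not describe; the function name and the partitioning scheme are assumptions chosen for illustration. Rows are routed to partitions by the hash of their group key, so each partition can deduplicate its own groups independently.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def count_distinct_by_group(rows, num_partitions=4):
    """Grouped COUNT(DISTINCT value) over an iterable of (group, value) rows."""
    # Step 1: route each row by the hash of its group key, so that all rows
    # of one group end up in the same partition.
    partitions = [[] for _ in range(num_partitions)]
    for group, value in rows:
        partitions[hash(group) % num_partitions].append((group, value))

    # Step 2: each worker deduplicates its own partition independently.
    def aggregate(partition):
        seen = defaultdict(set)
        for group, value in partition:
            seen[group].add(value)
        return {group: len(values) for group, values in seen.items()}

    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(aggregate, partitions))

    # Step 3: partitions hold disjoint groups, so merging is a plain union.
    result = {}
    for partial in partials:
        result.update(partial)
    return result
```

Python threads do not run this CPU-bound work truly in parallel, so the sketch shows only the structure of the computation, not the speedup a native engine would get from it.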
We ran benchmarks on a set of aggregation queries adapted from ClickBench, shown below.
// Q1
MATCH (h:hits) RETURN h.RegionID, COUNT(DISTINCT h.UserID) AS u ORDER BY u DESC LIMIT 10;
// Q2
MATCH (h:hits) RETURN h.RegionID, SUM(h.AdvEngineID), COUNT(*) AS c, AVG(h.ResolutionWidth), COUNT(DISTINCT h.UserID) ORDER BY c DESC LIMIT 10;
// Q3
MATCH (h:hits) WHERE h.MobilePhoneModel <> '' RETURN h.MobilePhoneModel, COUNT(DISTINCT h.UserID) AS u ORDER BY u DESC LIMIT 10;
// Q4
MATCH (h:hits) WHERE h.MobilePhoneModel <> '' RETURN h.MobilePhone, h.MobilePhoneModel, COUNT(DISTINCT h.UserID) AS u ORDER BY u DESC LIMIT 10;
// Q5
MATCH (h:hits) WHERE h.SearchPhrase <> '' RETURN h.SearchPhrase, COUNT(DISTINCT h.UserID) AS u ORDER BY u DESC LIMIT 10;
// Q6
MATCH (h:hits) WHERE contains(h.Title, 'Google') AND NOT contains(h.URL, '.google.') AND h.SearchPhrase <> '' RETURN h.SearchPhrase, MIN(h.URL), MIN(h.Title), COUNT(*) AS c, COUNT(DISTINCT h.UserID) ORDER BY c DESC LIMIT 10;
Prior to v0.8.2, all of these queries ran almost entirely single-threaded, so for the earlier releases only the 1-thread and 64-thread timings are included. To show the cumulative improvement across this release and the previous one (v0.8.0), the results for v0.7.1 are also included. The timings are from a machine with 2 AMD EPYC 7551 processors (32 cores), 512GB of DDR4 memory, and 2TB SSDs. In the v0.8.2 rows, the figure in parentheses is the speedup relative to the 1-thread time.
Query | Release | 1 thread | 4 threads | 16 threads | 64 threads |
---|---|---|---|---|---|
Q1 | v0.8.2 | 39.1s | 14.6s (2.67x) | 5.3s (7.38x) | 2.84s (13.8x) |
Q1 | v0.8.0 | 32s | — | — | 35s |
Q1 | v0.7.1 | 198s | — | — | 193s |
Q2 | v0.8.2 | 48.7s | 16.6s (2.93x) | 6.1s (7.98x) | 3.20s (15.2x) |
Q2 | v0.8.0 | 43s | — | — | 44s |
Q2 | v0.7.1 | 208s | — | — | 205s |
Q3 | v0.8.2 | 6.7s | 2.3s (2.91x) | 1.0s (6.70x) | 0.79s (8.48x) |
Q3 | v0.8.0 | 6.2s | — | — | 6.5s |
Q3 | v0.7.1 | 16s | — | — | 15.1s |
Q4 | v0.8.2 | 7.4s | 2.4s (3.08x) | 1.1s (6.73x) | 0.81s (9.14x) |
Q4 | v0.8.0 | 6.9s | — | — | 6.8s |
Q4 | v0.7.1 | 28.6s | — | — | 28.6s |
Q5 | v0.8.2 | 24.1s | 7.1s (3.39x) | 2.4s (10.0x) | 1.44s (16.7x) |
Q5 | v0.8.0 | 20s | — | — | 19s |
Q5 | v0.7.1 | 54.7s | — | — | 52.4s |
Q6 | v0.8.2 | 33.3s | 9.3s (3.58x) | 3.1s (10.7x) | 2.14s (15.6x) |
Q6 | v0.8.0 | 33s | — | — | 32s |
Q6 | v0.7.1 | 41.3s | — | — | 33.2s |
As can be seen, these queries are many times faster in v0.8.2. At 64 threads, v0.8.2 is roughly 8x to 15x faster than v0.8.0 and roughly 15x (Q6) to 68x (Q1) faster than v0.7.1.
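For readers who want to check or extend the comparison, the speedups can be recomputed from the table by dividing a baseline time by the time being compared. The sketch below does this for Q1, with the timings copied from the table; the function name speedup is just an illustrative choice. Because the table shows rounded timings, a recomputed figure can differ from the published one in the last digit.

```python
def speedup(baseline_seconds, seconds):
    """How many times faster `seconds` is than `baseline_seconds`."""
    return baseline_seconds / seconds


# v0.8.2 timings for Q1 from the table above, keyed by thread count.
q1_v082 = {1: 39.1, 4: 14.6, 16: 5.3, 64: 2.84}

# Scaling within v0.8.2, relative to the 1-thread run.
for threads, seconds in q1_v082.items():
    print(f"{threads:>2} threads: {seconds:6.2f}s ({speedup(q1_v082[1], seconds):.2f}x)")

# Cross-release comparison for Q1 at 64 threads.
print(f"v0.8.0 -> v0.8.2: {speedup(35.0, q1_v082[64]):.1f}x")
print(f"v0.7.1 -> v0.8.2: {speedup(193.0, q1_v082[64]):.1f}x")
```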
Closing remarks
With every release of Kuzu, we aim not only to fix bugs and address user issues, but also to continually improve our query processor’s performance and our overall usability and developer experience. The GCS support feature was only recently requested by a user, and we’re excited to see it shipped so quickly thanks to the hard work of our engineers and interns. If you are looking to use Kuzu in a production setting or have a real use case for which a feature is missing, don’t be shy: please reach out on GitHub and raise an issue!
We always love hearing from our users, so please share your feedback on the latest features and spread the word about Kuzu. See you in the next release! 🚀