Kùzu 0.0.2 Release

Kùzu Team
Kùzu Team
Developers of Kùzu Inc.

This post is about the second release of Kùzu. However, we want to start with something much more important:

Our hearts, thoughts, and prayers go to all the victims, those who survived and those who passed, in Syria and Türkiye. There will be a very difficult winter for all those who survived so everyone needs to help. Here are two pointers for trustworthy organizations we know of that are trying to help victims on the ground. For Türkiye (where Semih is from), you can donate to Ahbap (Please be aware that the donation currency is in TL and 14 TL = 1 CAD; 19TL = 1 USD); and for Syria you can donate to the White Helmets. Please be generous! We’ll leave pointers to several other organizations below in this footnote1.

Overview of Kùzu 0.0.2

Back to our release. Kùzu codebase is changing fast but this release still has a focus: we have worked quite hard since the last release to integrate Kùzu to import data from different formats and export data to different formats. There are also several important features in the new Cypher clauses and queries we support, additional string processing capabilities, and new DDL statement support. We will give a summary of each of these below.

For installing the new version, please visit the installation guide and the full release notes are here. If you are eager to play with a few Colab notebooks, here are several links:

Exporting Query Results to Pytorch Geometric and NetworkX

Perhaps most excitingly, we have added the first capabilities to integrate with 2 popular graph data science libraries: (i) Pytorch Geometric (PyG) for performing graph machine learning; and (ii) NetworkX for a variety of graph analytics, including visualization.

Pytorch Geometric: QueryResult.get_as_torch_geometric() function

Our Python API now has a new QueryResult.get_as_torch_geometric() function that converts results of queries to PyG’s in-memory graph representation torch_geometric.data. If your query results contains nodes and relationship objects, then the function uses those nodes and relationships to construct either torch_geometric.data.Data or torch_geometric.data.HeteroData objects. The function also auto-converts any numeric or boolean property on the nodes into tensors on the nodes that can be used as features in the Data/HeteroData objects. Any property that cannot be auto-converted and the edge properties are also returned in case you need want to manually put them into the Data/HeteroData objects.

Colab Demonstrations: Here are 2 Colab notebooks that you can play around with to see how you can develop graph learning pipelines using Kùzu as your GDBMSs:

  1. Node property prediction
  2. Link prediction

The examples demonstrate how to extract a subgraph, train graph convolutional or neural networks (GCNs or GNNs), make some node property or link predictions and save them back in Kùzu so you can query these predictions.

NetworkX: QueryResult.get_as_networkx() function

Our Python API now has a new QueryResult.get_as_networkx() function that can convert query results that contain nodes and relationships into NetworkX directed or undirected graphs. Using this function, you can build pipelines that benefits from Kùzu’s DBMS functionalities (e.g., querying, data extraction and transformations, using a high-level query language with very fast performance), and NetworkX’s rich library of graph analytics algorithms.

Colab Demonstration: Here is a Colab notebook that you can play around with that shows how to do basic graph visualization of query results and build a pipeline that computes PageRanks of a subgraph and store those PageRank values back as new node properties in Kùzu and query them.

Data Import from and Export to Parquet and Arrow

We have removed our own CSV reader and instead now use Arrow as our default library when bulk importing data through COPY FROM statements. Using Arrow, we can not only bulk import from CSV files but also from arrow IPC and parquet files. We detect the file type from the suffix of the file; so if the query says COPY user FROM ./user.parquet, we infer that this is a parquet file and parse it so. See the details here.

Multi-labeled or Unlabeled Queries

A very useful feature of the query languages of GDBMSs is their ability to elegantly express unions of join queries. We had written about this feature of GDBMSs in this blog post about What Every Competent GDBMS Should Do (see the last paragraph of Section Feature 4: Schema Querying). In Cypher, a good example of this is to not bind the node and relationship variables to a specific node/relationship labels/tables. Consider this query:

MATCH (a:User)-[e]->(b)
WHERE a.name = 'Karissa'
RETURN a, e, b

This query asks for all types of relationships that Karissa can have to any possible other node (not necessarily of label User) in the query. So if the database contains Likes relationships from Users to Comments, Follows relationships from Users to Users, and LivesIn relationships from Users and Cities, variables e and b can bind to records from all of these relationship and node labels, respectively.

You can also restrict the labels of nodes/rels to a fixed set that contains more than one label using the | syntax. For example you can do:

MATCH (a:User)-[e:Likes|Follows]->(b)
WHERE a.name = 'Karissa'
RETURN a, e, b

This forces e to match to only Likes relationship or Follows relationship records (so excludes the LivesIn records we mentioned above). The | is a syntax adapted from regexes originally and is also used in query languages that support regular path queries.

Kùzu now supports such queries. Our query execution is based on performing scans of each possible node/rel table and index and when a variable x can bind to multiple node/rel tables, L1, L2, ..., Lk, we reserve one vector for each possible property of each node/rel table.
If anyone has any optimizations to do something smarter, it would be very interesting to hear!

Other Important Changes

Enhanced String Features

We’ve added two important features to enhance Kùzu’s ability to store and process strings:

  1. Support of UTF-8 characters. With the help of utf8proc, you can now store string node/relationship properties in Kùzu that has UTF-8 characters;
  2. Support of regex pattern matching with strings. Kùzu now supports Cypher’s =~ operator for regex searches, which will return true if its pattern mathces the entire input string. For example: RETURN 'abc' =~ '.*(b|d).*';.

CASE Expression

We’ve added CASE for conditional expressions. Two forms (Simple Form and General Form) of CASE expression are supported.

ALTER/DROP/SET/DELETE

We added ALTER TABLE and DROP TABLE DDL statements. After creating a new node or relationship table, you can now drop it, rename it, and alter it by adding new columns/properties, renaming or dropping existing columns/properties.

Besides schema level changes, you can change properties of existing nodes/rels with SET statements, and remove existing nodes/rels with DELETE statements.

Disable Relationships with Multiple Source or Destination Labels

We now no longer support defining a relationship between multiple source or destination labels. This is to simplify our storage. But please let us know if you have strong use cases on this.

Enjoy our new release and don’t forget to donate to the earthquake victims.


Footnotes

  1. For Türkiye two other organizations are AFAD, which is the public institute for coordinating natural disaster response and Akut, a volunteer-based and highly organized search and rescue group. For Syria, another campaign I can recommend is Molham Team, which is an organization founded by Syrian refugee students.