This post is about the second release of Kùzu. However, we want to start with something much more important:
Donate to the Victims of Türkiye-Syria Earthquake:
Our hearts, thoughts, and prayers go to all the victims, those who survived and those who passed, in Syria and Türkiye. It will be a very difficult winter for all those who survived, so everyone needs to help. Here are two pointers to trustworthy organizations we know of that are trying to help victims on the ground. For Türkiye (where Semih is from), you can donate to Ahbap (please be aware that donations are in TL, and 14 TL = 1 CAD; 19 TL = 1 USD); for Syria, you can donate to the White Helmets. Please be generous! We’ll leave pointers to several other organizations in footnote 1 below.
Overview of Kùzu 0.0.2
Back to our release. The Kùzu codebase is changing fast, but this release still has a focus: since the last release, we have worked hard on integrating Kùzu with other systems so that it can import data from, and export data to, different formats. There are also several other important additions: new Cypher clauses and queries, additional string processing capabilities, and new DDL statements. We summarize each of these below.
To install the new version, please visit the installation guide; the full release notes are here. If you are eager to play with a few Colab notebooks, here are several links:
- General Kùzu Demo
- Export Query Results to Pytorch Geometric: Node Property Prediction Example
- Export Query Results to Pytorch Geometric: Link Prediction Example
- Export Query Results to NetworkX
Exporting Query Results to Pytorch Geometric and NetworkX
Perhaps most excitingly, we have added the first capabilities to integrate with 2 popular graph data science libraries: (i) Pytorch Geometric (PyG) for performing graph machine learning; and (ii) NetworkX for a variety of graph analytics, including visualization.
Pytorch Geometric: QueryResult.get_as_torch_geometric() function
Our Python API now has a new QueryResult.get_as_torch_geometric() function that converts results of queries to PyG’s in-memory graph representation torch_geometric.data. If your query results contain node and relationship objects, then the function uses those nodes and relationships to construct either torch_geometric.data.Data or torch_geometric.data.HeteroData objects. The function also auto-converts any numeric or boolean property on the nodes into tensors that can be used as features in the Data/HeteroData objects. Any property that cannot be auto-converted, as well as the edge properties, is also returned in case you want to manually put it into the Data/HeteroData objects.
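As a rough sketch of how this might look in a pipeline, the helper below runs a query and unpacks what the function returns. It assumes a four-value return (the Data/HeteroData object, a position-to-index mapping, the unconverted properties, and the edge properties), which is the shape described in more recent Kùzu documentation and may differ in 0.0.2. The helper name query_to_pyg, the database path, and the User/Follows tables are hypothetical.

```python
def query_to_pyg(conn, query):
    """Run `query` on a Kùzu connection and convert the result for PyG.

    `conn` is expected to behave like a kuzu.Connection. The four-value
    return shape unpacked below is an assumption, not confirmed for 0.0.2.
    """
    result = conn.execute(query)
    data, pos_to_idx, unconverted, edge_props = result.get_as_torch_geometric()
    return {
        "data": data,                           # Data or HeteroData object
        "pos_to_idx": pos_to_idx,               # assumed position-to-index mapping
        "unconverted_properties": unconverted,  # properties that were not auto-converted
        "edge_properties": edge_props,          # edge properties, returned separately
    }


# Typical use (requires the kuzu, torch and torch_geometric packages):
#   import kuzu
#   db = kuzu.Database("./demo_db")   # hypothetical database path
#   conn = kuzu.Connection(db)
#   out = query_to_pyg(conn, "MATCH (a:User)-[e:Follows]->(b:User) RETURN a, e, b")
```

Keeping the unconverted and edge properties next to the converted object makes it easier to attach them to the Data/HeteroData object by hand later.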
Colab Demonstrations: The two PyG Colab notebooks listed in the overview above show how you can develop graph learning pipelines using Kùzu as your GDBMS.
The examples demonstrate how to extract a subgraph, train graph convolutional or neural networks (GCNs or GNNs), make some node property or link predictions and save them back in Kùzu so you can query these predictions.
NetworkX: QueryResult.get_as_networkx() function
Our Python API now has a new QueryResult.get_as_networkx() function that can convert query results that contain nodes and relationships into NetworkX directed or undirected graphs. Using this function, you can build pipelines that benefit from Kùzu’s DBMS functionalities (e.g., querying, data extraction and transformations, using a high-level query language with very fast performance), and NetworkX’s rich library of graph analytics algorithms.
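A minimal sketch of such a pipeline step is below. It assumes that the function takes a directed flag to choose between directed and undirected graphs; the helper name query_to_networkx, the database path, and the User/Follows tables are hypothetical.

```python
def query_to_networkx(conn, query, directed=True):
    """Run `query` on a Kùzu connection and return a NetworkX graph.

    `conn` is expected to behave like a kuzu.Connection. The `directed`
    keyword is an assumption about how directed vs. undirected output
    is selected.
    """
    result = conn.execute(query)
    return result.get_as_networkx(directed=directed)


# Typical use (requires the kuzu and networkx packages):
#   import kuzu
#   db = kuzu.Database("./demo_db")   # hypothetical database path
#   conn = kuzu.Connection(db)
#   G = query_to_networkx(conn, "MATCH (a:User)-[e:Follows]->(b:User) RETURN a, e, b")
```

The returned graph can then be handed to NetworkX algorithms, and the computed values written back to Kùzu as node properties.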
Colab Demonstration: Here is a Colab notebook that you can play around with. It shows how to do basic graph visualization of query results and how to build a pipeline that computes the PageRank values of a subgraph, stores them back in Kùzu as new node properties, and queries them.
Data Import from and Export to Parquet and Arrow
We have removed our own CSV reader and now use Arrow as our default library when bulk importing data through COPY FROM statements. Using Arrow, we can bulk import not only from CSV files but also from Arrow IPC and Parquet files. We detect the file type from the suffix of the file; so if the query says COPY user FROM ./user.parquet, we infer that this is a Parquet file and parse it accordingly. See the details here.
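To illustrate the idea of suffix-based detection, here is a small standalone sketch. It is not Kùzu’s implementation: the function name infer_file_format and the suffix table (in particular the .arrow suffix for Arrow IPC files) are assumptions made for the example.

```python
from pathlib import Path

# Hypothetical suffix-to-format table; the exact suffixes Kùzu accepts
# are not listed in this post.
SUFFIX_TO_FORMAT = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow_ipc",
}


def infer_file_format(path):
    """Infer the import format from the file suffix (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUFFIX_TO_FORMAT:
        raise ValueError(f"cannot infer file format from suffix {suffix!r}")
    return SUFFIX_TO_FORMAT[suffix]
```

With this table, infer_file_format("./user.parquet") yields "parquet", and an unknown suffix raises an error instead of guessing.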
Multi-labeled or Unlabeled Queries
A very useful feature of the query languages of GDBMSs is their ability to elegantly express unions of join queries. We wrote about this feature of GDBMSs in our blog post What Every Competent GDBMS Should Do (see the last paragraph of the section Feature 4: Schema Querying). In Cypher, a good example of this is to not bind the node and relationship variables to specific node/relationship labels/tables. Consider this query:
MATCH (a:User)-[e]->(b)
WHERE a.name = 'Karissa'
RETURN a, e, b
This query asks for all types of relationships that Karissa can have to any possible other node (not necessarily of label User). So if the database contains Likes relationships from Users to Comments, Follows relationships from Users to Users, and LivesIn relationships from Users to Cities, variables e and b can bind to records from all of these relationship and node labels, respectively.
You can also restrict the labels of nodes/rels to a fixed set that contains more than one label using the | syntax. For example, you can do:
MATCH (a:User)-[e:Likes|Follows]->(b)
WHERE a.name = 'Karissa'
RETURN a, e, b
This forces e to match only Likes or Follows relationship records (so it excludes the LivesIn records we mentioned above). The | syntax was originally adapted from regexes and is also used in query languages that support regular path queries.
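The binding rules described above can be sketched in a few lines of Python. This is only an illustration over a toy catalog that mirrors the Likes/Follows/LivesIn example; the function candidate_rel_tables and the catalog layout are hypothetical and say nothing about how Kùzu resolves labels internally.

```python
# Toy catalog mirroring the example in the text:
# rel table -> (source node label, destination node label).
REL_TABLES = {
    "Likes": ("User", "Comment"),
    "Follows": ("User", "User"),
    "LivesIn": ("User", "City"),
}


def candidate_rel_tables(src_label=None, rel_labels=None, dst_label=None):
    """Return the rel tables a relationship variable could bind to.

    rel_labels=None models an unlabeled variable such as [e]; a list
    models the [e:Likes|Follows] form. A node label of None means the
    node variable is unlabeled.
    """
    allowed = set(REL_TABLES) if rel_labels is None else set(rel_labels)
    matches = []
    for name, (src, dst) in REL_TABLES.items():
        if name not in allowed:
            continue
        if src_label is not None and src != src_label:
            continue
        if dst_label is not None and dst != dst_label:
            continue
        matches.append(name)
    return matches
```

For (a:User)-[e]->(b), candidate_rel_tables("User") returns all three tables, while candidate_rel_tables("User", ["Likes", "Follows"]) leaves out LivesIn.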
Kùzu now supports such queries. Our query execution is based on performing scans of each possible node/rel table and index. When a variable x can bind to multiple node/rel tables L1, L2, ..., Lk, we reserve one vector for each possible property of each node/rel table. If anyone has ideas for optimizations that do something smarter, it would be very interesting to hear!
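One possible reading of this strategy, written as a toy Python sketch rather than as Kùzu’s actual executor, is shown below. The catalog, the property names, and the functions reserve_property_vectors and scan_union are all hypothetical.

```python
# Toy catalog: table name -> property names (all hypothetical).
TABLE_PROPERTIES = {
    "Likes": ["date"],
    "Follows": ["since"],
    "LivesIn": [],
}


def reserve_property_vectors(tables, catalog=TABLE_PROPERTIES):
    """Reserve one (initially empty) vector per property of each table."""
    return {(t, p): [] for t in tables for p in catalog[t]}


def scan_union(tables, rows_by_table, catalog=TABLE_PROPERTIES):
    """Scan each candidate table in turn and fill only its own vectors."""
    vectors = reserve_property_vectors(tables, catalog)
    for t in tables:
        for row in rows_by_table.get(t, []):
            for p in catalog[t]:
                vectors[(t, p)].append(row[p])
    return vectors
```

The cost of this layout is visible in the sketch: the number of reserved vectors grows with the total number of properties across all candidate tables, even when a query reads only a few of them.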
Other Important Changes
Enhanced String Features
We’ve added two important features to enhance Kùzu’s ability to store and process strings:
- Support of UTF-8 characters. With the help of utf8proc, you can now store string node/relationship properties in Kùzu that contain UTF-8 characters;
- Support of regex pattern matching with strings. Kùzu now supports Cypher’s =~ operator for regex searches, which returns true if its pattern matches the entire input string. For example: RETURN 'abc' =~ '.*(b|d).*';
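The whole-string semantics of =~ are easy to get wrong if you are used to substring searches. The sketch below uses Python’s re module as a stand-in to show the difference; Kùzu’s regex dialect may differ from Python’s, and the function name is hypothetical.

```python
import re


def matches_entire_string(text, pattern):
    """Return True only if `pattern` matches all of `text`, not a substring."""
    return re.fullmatch(pattern, text) is not None


# The pattern from the example above covers the whole string 'abc'.
print(matches_entire_string("abc", ".*(b|d).*"))  # True
# A pattern that only matches a substring is not enough.
print(matches_entire_string("abc", "b"))          # False
```

This is why the example pattern is wrapped in leading and trailing .* parts: without them, 'abc' would not match.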
CASE Expression
We’ve added CASE for conditional expressions. Two forms (Simple Form and General Form) of CASE expression are supported.
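As an illustration of the two forms, here are two query strings wrapped in a small Python helper. The User table, its age property, and the helper run_case_examples are hypothetical; the CASE syntax follows standard Cypher.

```python
# Simple form: compare one expression against candidate values.
SIMPLE_CASE = (
    "MATCH (a:User) "
    "RETURN CASE a.age WHEN 50 THEN 'fifty' ELSE 'other' END AS bucket"
)

# General form: evaluate a boolean condition in each WHEN branch.
GENERAL_CASE = (
    "MATCH (a:User) "
    "RETURN CASE WHEN a.age < 50 THEN 'under 50' ELSE '50 or over' END AS bucket"
)


def run_case_examples(conn):
    """Execute both CASE queries on a kuzu.Connection-like object."""
    return [conn.execute(query) for query in (SIMPLE_CASE, GENERAL_CASE)]
```

The simple form reads well for equality checks against one expression, while the general form is needed for range conditions such as a.age < 50.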
ALTER/DROP/SET/DELETE
We added ALTER TABLE and DROP TABLE DDL statements. After creating a new node or relationship table, you can now drop it, rename it, and alter it by adding new columns/properties, renaming or dropping existing columns/properties.
Besides schema level changes, you can change properties of existing nodes/rels with SET statements, and remove existing nodes/rels with DELETE statements.
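The sequence below is a minimal sketch of how these statements might be combined. The table and property names are hypothetical, the sketch assumes that no relationships or relationship tables reference the affected nodes and table, and the exact statement syntax should be checked against the documentation for your version.

```python
# Hypothetical statements, applied in order to a node table named User.
STATEMENTS = [
    "ALTER TABLE User ADD grade INT64",        # add a new property
    "ALTER TABLE User RENAME grade TO score",  # rename a property
    "ALTER TABLE User DROP score",             # drop a property
    "ALTER TABLE User RENAME TO Person",       # rename the table
    "MATCH (p:Person) WHERE p.name = 'Karissa' SET p.age = 41",  # update a node
    "MATCH (p:Person) WHERE p.name = 'Karissa' DELETE p",        # remove a node
    "DROP TABLE Person",                       # drop the table
]


def apply_statements(conn, statements=STATEMENTS):
    """Execute each statement in order on a kuzu.Connection-like object."""
    for statement in statements:
        conn.execute(statement)
    return len(statements)
```

Order matters here: the SET and DELETE statements refer to the table by its new name, so they must run after the rename.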
Disable Relationships with Multiple Source or Destination Labels
We no longer support defining a relationship between multiple source or destination labels. This is to simplify our storage. But please let us know if you have strong use cases for this.
Enjoy our new release and don’t forget to donate to the earthquake victims.
Footnotes
1. For Türkiye, two other organizations are AFAD, which is the public institute for coordinating natural disaster response, and Akut, a volunteer-based and highly organized search and rescue group. For Syria, another campaign I can recommend is Molham Team, which is an organization founded by Syrian refugee students.