Why Indexes Are Needed in a Graph Database

Indexes are an indispensable feature of a database system, and graph databases are no exception.

An index is essentially a sorted data structure maintained by the database management system. Different database systems adopt different index structures.

Popular index types include:

  • B-Tree index
  • B+Tree index
  • B*-Tree index
  • Hash index
  • Bitmap index
  • Inverted index

Each of them uses its own sorting algorithm.

A database index allows efficient data retrieval. Despite the improvement in query performance, indexes have some disadvantages:

  • It takes time to create and maintain indexes, which scales with dataset size.
  • Indexes need extra physical storage space.
  • It takes more time to insert, delete, and update data because the index also needs to be maintained synchronously.
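
The trade-offs listed above can be seen in a minimal Python sketch. It is not how any particular database implements indexes; the class IndexedTable and its methods are hypothetical names chosen for illustration, and a plain sorted list stands in for a real index structure such as a B+Tree.

```python
import bisect


class IndexedTable:
    """Toy row store with one sorted secondary index (illustrative only)."""

    def __init__(self):
        self.rows = {}   # row_id -> value (the data itself)
        self.index = []  # sorted (value, row_id) pairs: the extra storage

    def insert(self, row_id, value):
        # Assumes each row_id is inserted at most once.
        self.rows[row_id] = value
        # Extra work on every write: keep the index sorted.
        bisect.insort(self.index, (value, row_id))

    def delete(self, row_id):
        value = self.rows.pop(row_id)
        # The index entry has to be removed in step with the data.
        pos = bisect.bisect_left(self.index, (value, row_id))
        del self.index[pos]

    def find_equal(self, value):
        # Binary search to the first entry with this value, then walk forward.
        pos = bisect.bisect_left(self.index, (value,))
        result = []
        while pos < len(self.index) and self.index[pos][0] == value:
            result.append(self.index[pos][1])
            pos += 1
        return result
```

Every insert and delete touches both structures, which is the maintenance cost, while find_equal reaches matching rows without visiting the rest of the table.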

Taking the above into consideration, Nebula Graph now supports indexes for more efficient retrieval of property values.

This post gives a detailed introduction to the design and practice of indexes in Nebula Graph.

Core Concepts to Understand Indexes in Nebula Graph

Below is a list of common Nebula Graph index terms used throughout this post.

  • Tag: A label associated with a list of properties. Each vertex can be associated with multiple tags. A tag is identified by a TagID. You can think of a tag as a node table in SQL.
  • Edge type: Similar to a tag, an edge type is a set of properties on edges. You can think of an edge type as an edge table in SQL.
  • Property: A name-value pair on a tag or edge type. Its data type is determined by the tag or edge type.
  • Partition: The minimum logical storage unit of Nebula Graph. A StorageEngine can contain multiple partitions. The replicas of a partition are divided into a leader and followers. We use Raft to guarantee data consistency between the leader and the followers.
  • Graph space: A physically isolated space for a specific graph. Tags and edge types in one graph space are independent of those in another. A Nebula Graph cluster can have multiple graph spaces.
  • Index: In this post, index refers specifically to an index on the properties of a tag or edge type. Its data type depends on the tag or edge type.
  • TagIndex: An index created for a tag. You can create multiple indexes for the same tag. Cross-tag composite indexes are not yet supported.
  • EdgeIndex: An index created for an edge type. Similarly, you can create multiple indexes for the same edge type. Cross-edge-type composite indexes are not yet supported.
  • Scan Policy: The policy for scanning indexes. There are usually multiple ways to scan indexes when executing one query statement, and the scan policy decides which one is ultimately used.
  • Optimizer: Optimizes query conditions, for example by sorting, splitting, and merging the sub-expression nodes of the WHERE clause's expression tree, to achieve higher query efficiency.
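
To make the idea of a scan policy more concrete, here is a small Python sketch of one possible heuristic: among several indexes on the same tag, prefer the one whose leading properties cover the most equality filters of the WHERE clause. This is an assumption made for illustration, not a description of Nebula Graph's actual policy, and choose_index and the index names are hypothetical.

```python
def choose_index(indexes, equality_columns):
    """Pick the index whose leading columns cover the most equality filters.

    indexes: dict of index name -> ordered list of indexed property names.
    equality_columns: set of property names compared with == in the WHERE clause.
    Returns (index name, matched prefix length), or (None, 0) to signal a full scan.
    """
    best_name, best_len = None, 0
    for name, columns in indexes.items():
        prefix_len = 0
        for column in columns:
            # A sorted composite index is only useful for a leading prefix.
            if column not in equality_columns:
                break
            prefix_len += 1
        if prefix_len > best_len:
            best_name, best_len = name, prefix_len
    return best_name, best_len


# Hypothetical indexes on a "player" tag.
player_indexes = {
    "i_name": ["name"],
    "i_name_age": ["name", "age"],
    "i_team": ["team"],
}
```

With filters on both name and age, the sketch picks i_name_age; with a filter on age alone, no index has a usable leading column, so it falls back to a full scan.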

What’s Required for Indexes to Work in a Graph Database

There are two typical ways to query data in Nebula Graph, or more generally in a graph database:

  1. One is starting from a vertex and retrieving its (N-hop) neighbors along certain edge types.
  2. The other is retrieving vertices or edges that contain specified property values.

In the latter scenario, a high-performance scan is needed to fetch the edges or vertices as well as the property values.

To improve the efficiency of queries on property values, we’ve implemented indexes in Nebula Graph. Because an index keeps the property values of edges or vertices sorted, a query can quickly locate a given property value and avoid a full scan.
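
The two query patterns can be contrasted in a short Python sketch over a toy in-memory graph. All data and function names here are made up for illustration, and a sorted list stands in for the index; Nebula Graph's real storage layout is not shown.

```python
import bisect

# Hypothetical toy graph: vertex ID -> properties of a "player" tag.
players = {
    "p1": {"name": "Tim", "age": 42},
    "p2": {"name": "Tony", "age": 36},
    "p3": {"name": "Manu", "age": 41},
    "p4": {"name": "Kyle", "age": 25},
}
# Outgoing "follow" edges: source vertex ID -> destination vertex IDs.
follow = {"p1": ["p2"], "p2": ["p3"], "p3": ["p1", "p4"], "p4": []}


def n_hop_neighbors(start, hops):
    """Query type 1: start from a vertex and walk `hops` steps along an edge type."""
    frontier = {start}
    for _ in range(hops):
        frontier = {dst for src in frontier for dst in follow[src]}
    return frontier


# Index on player.age: (age, vertex ID) pairs kept in sorted order.
age_index = sorted((props["age"], vid) for vid, props in players.items())


def lookup_age_between(low, high):
    """Query type 2: find vertices by property value via the sorted index."""
    start = bisect.bisect_left(age_index, (low,))
    result = []
    for age, vid in age_index[start:]:
        if age > high:
            break  # sorted order lets the scan stop early
        result.append(vid)
    return result


def full_scan_age_between(low, high):
    """The same query without an index: every vertex has to be examined."""
    return [vid for vid, props in players.items() if low <= props["age"] <= high]
```

Both lookup functions return the same vertices; the indexed version only reads the slice of the index that falls inside the requested range.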

Here’s what we found to be required for indexes to work in a graph database:

  • Support for indexes on the properties of tags and edge types.
  • Support for analyzing and generating index scan strategies.
  • Support for index management, such as creating, rebuilding, and showing indexes.

How Indexes Are Stored in Nebula Graph

Below is a diagram of how indexes are stored in Nebula Graph. Indexes are a part of Nebula Graph’s Storage Service so we place them in the big picture of its storage architecture.

As the figure shows, each Storage Server can contain multiple Storage Engines, and each Storage Engine can contain multiple Partitions.

The replicas of a Partition are synchronized via the Raft protocol. Each Partition contains both data and indexes. The data and indexes of the same vertex or edge are stored in the same Partition.
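
The co-location of data and indexes can be pictured with a minimal Python sketch. It assumes that a vertex is assigned to a partition by hashing its ID modulo the partition count; the hash function, the partition count, and all names below are assumptions for illustration rather than Nebula Graph's actual scheme, and Raft replication is left out entirely.

```python
import bisect
import zlib

NUM_PARTITIONS = 4  # assumed value for the sketch


def partition_of(vertex_id):
    # Assumption: a stable hash of the vertex ID modulo the partition count.
    return zlib.crc32(vertex_id.encode("utf-8")) % NUM_PARTITIONS


# Each partition holds its own data and its own index entries.
partitions = [{"data": {}, "index": []} for _ in range(NUM_PARTITIONS)]


def put_vertex(vertex_id, age):
    # Sketch only: assumes each vertex ID is written once (no updates).
    part = partitions[partition_of(vertex_id)]
    part["data"][vertex_id] = {"age": age}
    # The index entry lands in the same partition as the vertex data.
    bisect.insort(part["index"], (age, vertex_id))


def lookup_age(age):
    # In this sketch, a property lookup consults the index of every partition.
    hits = []
    for part in partitions:
        pos = bisect.bisect_left(part["index"], (age,))
        while pos < len(part["index"]) and part["index"][pos][0] == age:
            hits.append(part["index"][pos][1])
            pos += 1
    return sorted(hits)
```

Because the index entry is written to the same partition as the vertex, a single write never has to span two partitions in this sketch.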
