Edison  Stark

1598535540

How Indexes Work in Nebula Graph - DZone Database

Why Indexes Are Needed in a Graph Database

Indexes are an indispensable function in a database system. Graph databases are no exception.

An index is actually a sorted data structure in the database management system. Different database systems adopt different sorting structures.

Popular index types include:

  • B-Tree index
  • B+-Tree index
  • B*-Tree index
  • Hash index
  • Bitmap index
  • Inverted index

Each of them organizes its entries in its own way.

A database index allows efficient data retrieval from a database. Despite the improvement in query performance, indexes have some disadvantages:

  • It takes time to create and maintain indexes, which scales with dataset size.
  • Indexes need extra physical storage space.
  • It takes more time to insert, delete, and update data because the index also needs to be maintained synchronously.

Taking the above into consideration, Nebula Graph now supports indexes for more efficient retrieval by property.

This post gives a detailed introduction to the design and practice of indexes in Nebula Graph.

Core Concepts to Understand Indexes in Nebula Graph

Below is a list of common Nebula Graph index terms we use across the post.

  • Tag: A label associated with a list of properties. Each vertex can be associated with multiple tags. A tag is identified by a TagID. You can think of a tag as a node table in SQL.
  • Edge type: Similar to a tag, an edge type is a set of properties on edges. You can think of an edge type as an edge table in SQL.
  • Property: A name-value pair on a tag or edge type. Its data type is determined by the tag or edge type.
  • Partition: The minimum logical storage unit of Nebula Graph. A StorageEngine can contain multiple partitions. Each partition has a leader replica and follower replicas, and Raft guarantees data consistency between them.
  • Graph space: A physically isolated space for a specific graph. Tags and edge types in one graph space are independent of those in another. A Nebula Graph cluster can have multiple graph spaces.
  • Index: In this post, index refers specifically to an index on tag or edge type properties. Its data type depends on the tag or edge type.
  • TagIndex: An index created for a tag. You can create multiple indexes for the same tag. Cross-tag composite indexes are not yet supported.
  • EdgeIndex: An index created for an edge type. Similarly, you can create multiple indexes for the same edge type. Cross-edge-type composite indexes are not yet supported.
  • Scan Policy: The policy for scanning indexes. There are usually several ways to scan indexes to execute one query statement; the scan policy decides which one is ultimately used.
  • Optimizer: Optimizes query conditions, for example by sorting, splitting, and merging the sub-expression nodes of the WHERE clause's expression tree, to obtain higher query efficiency.
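
To make the Scan Policy term more concrete, here is a minimal sketch of one common heuristic: pick the index whose leading columns cover the most equality filters in the query. The post does not say this is how Nebula Graph chooses, so treat the rule, the `choose_index` function, and the index names as assumptions made for illustration only.

```python
def choose_index(indexes, equality_columns):
    """Pick the index whose leading columns match the most filter columns.

    indexes: dict of index name -> ordered list of indexed property names
    equality_columns: set of property names filtered with '==' in the query
    Returns the best index name, or None if no index is usable.
    """
    best_name, best_len = None, 0
    for name, columns in indexes.items():
        prefix_len = 0
        for col in columns:
            if col not in equality_columns:
                break  # a composite index is only usable through its prefix
            prefix_len += 1
        if prefix_len > best_len:
            best_name, best_len = name, prefix_len
    return best_name


# Hypothetical indexes on a hypothetical "player" tag
player_indexes = {
    "idx_name": ["name"],
    "idx_name_age": ["name", "age"],
    "idx_team": ["team"],
}

print(choose_index(player_indexes, {"name", "age"}))
print(choose_index(player_indexes, {"age"}))
```

A query filtering on both `name` and `age` is best served by the composite index, while a filter on `age` alone matches no index prefix and would fall back to a full scan.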

What’s Required for Indexes to Work in a Graph Database

There are two typical ways to query data in Nebula Graph, or more generally in a graph database:

  1. One is starting from a vertex, retrieving its (N-hop) neighbors along certain edge types.
  2. Another is retrieving vertices or edges which contain specified property values.

In the latter scenario, a high-performance scan is needed to fetch the edges or vertices as well as the property values.

To improve the efficiency of queries on property values, we've implemented indexes in Nebula Graph. Because the property values of edges or vertices are kept sorted, a query can quickly locate a given property value and avoid a full scan.
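
The difference between a full scan and a lookup in sorted property values can be sketched in a few lines. This is only an illustration in plain Python, not Nebula Graph's storage format; the vertex IDs, the `age` property, and both function names are made up for the example.

```python
import bisect

# Hypothetical vertex data: vertex ID -> value of an integer "age" property
ages = {"v1": 32, "v2": 25, "v3": 41, "v4": 25, "v5": 19}


def full_scan(props, value):
    """Check every vertex; the cost grows linearly with the number of vertices."""
    return sorted(vid for vid, v in props.items() if v == value)


# Build the index once: (property value, vertex ID) pairs kept in sorted order
index = sorted((v, vid) for vid, v in ages.items())


def index_lookup(idx, value):
    """Binary-search the sorted entries, then read only the matching run."""
    lo = bisect.bisect_left(idx, (value,))
    hi = bisect.bisect_left(idx, (value + 1,))  # assumes integer values
    return [vid for _, vid in idx[lo:hi]]


print(full_scan(ages, 25))
print(index_lookup(index, 25))
```

Both calls return the same vertices, but the index lookup touches only the entries around the requested value instead of every vertex.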

Here is what we found is required for indexes to work in a graph database:

  • Supporting indexes for properties on tags and edge types.
  • Supporting analysis and generation of index scanning strategy.
  • Supporting index management such as create index, rebuild index, show index, etc.

How Indexes Are Stored in Nebula Graph

Below is a diagram of how indexes are stored in Nebula Graph. Indexes are part of Nebula Graph's Storage Service, so we place them in the big picture of its storage architecture.

As the figure shows, each Storage Server can contain multiple Storage Engines, and each Storage Engine can contain multiple Partitions.

The replicas of a Partition are synchronized via the Raft protocol. Each Partition contains both data and indexes. The data and indexes of the same vertex or edge are stored in the same Partition.
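
The co-location rule can be sketched as follows. This is a simplified model under stated assumptions: the partition count, the modulo placement in `partition_of`, and the in-memory dictionaries are all illustrative and are not taken from the post.

```python
import bisect
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count for this sketch


def partition_of(vertex_id: int) -> int:
    """Map a vertex ID to a partition (modulo placement is an assumption)."""
    return vertex_id % NUM_PARTITIONS


# Each partition holds its own data entries and its own sorted index entries
partitions = defaultdict(lambda: {"data": {}, "index": []})


def insert_vertex(vertex_id: int, name: str) -> None:
    part = partitions[partition_of(vertex_id)]
    part["data"][vertex_id] = {"name": name}
    # The index entry is written to the same partition as the data it points to
    bisect.insort(part["index"], (name, vertex_id))


insert_vertex(102, "Tim")
insert_vertex(205, "Ann")
insert_vertex(309, "Bob")

for pid in sorted(partitions):
    print(pid, partitions[pid])
```

Because an index entry lives next to the data it refers to, a lookup that hits the index in one partition can read the matching vertex from that same partition.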

#tutorial #graph database #index #database indexes #nebula graph #database


Benchmarking the Mainstream Open Source Distributed Graph Databases

Deep learning and knowledge graph technologies have developed rapidly in recent years. Compared with the “black box” of deep learning, knowledge graphs are highly interpretable and are therefore widely adopted in scenarios such as search recommendations, intelligent customer support, and financial risk management.

Meituan has been digging deep into the connections buried in its huge amount of business data over the past few years and has gradually developed knowledge graphs in nearly ten areas, including cuisine graphs, tourism graphs, and commodity graphs. The ultimate goal is to enhance smart local life services.

Compared with a traditional RDBMS, graph databases can store and query knowledge graphs more efficiently. Choosing a graph database as the storage engine brings a clear performance advantage in multi-hop queries. Currently, there are dozens of graph database solutions on the market.

It is imperative for the Meituan team to select a graph database solution that can meet the business requirements and to use the solution as the basis of Meituan’s graph storage and graph learning platform. The team has outlined the basic requirements as below per our business status quo:

  1. It should be an open-source project that is also business-friendly.

By having control over the source code, the Meituan team can ensure data security and service availability.

  2. It should support clustering and scale horizontally in both storage and computation.

The knowledge graph data at Meituan can reach hundreds of billions of vertices and edges in total, and throughput can reach tens of thousands of QPS. A single-node deployment therefore cannot meet Meituan's storage requirements.

  3. It should work under OLTP scenarios with multi-hop queries at the millisecond level.

To ensure the best search experience for Meituan users, the team strictly limits the timeout values along the whole request chain. Responding to a query at the second level is therefore unacceptable.

  4. It should be able to import data in batches.

The knowledge graph data is usually stored in data warehouses like Hive. The graph database should be able to quickly import data from such warehouses into the graph storage to ensure service effectiveness.

The Meituan team has tried the top 30 graph databases on DB-Engines and found that most well-known graph databases only support single-node deployment with their open-source edition, for example, Neo4j, ArangoDB, Virtuoso, TigerGraph, RedisGraph. This means that the storage service cannot scale horizontally and the requirement to store large-scale knowledge graph data cannot be met.

After thorough research and comparison, the team has selected the following graph databases for the final round: Nebula Graph (developed by a startup team who originally came from Alibaba), Dgraph (developed by a startup team who originally came from Google), and HugeGraph (developed by Baidu).

A Summary of The Testing Process

Hardware Configuration

  1. Database instances: Docker containers running on different machines
  2. Single instance resources: 32 Cores, 64 GB Memory, 1 TB SSD (Intel® Xeon® Gold 5218 CPU @ 2.30 GHz)
  3. Number of instances: Three

#database #tutorial #graph database #database performance #nebula graph #dgraph #graph database adoption

Mikel  Okuneva

1599897600

Data Migration From JanusGraph to Nebula Graph - Practice at 360 Finance

Speaking of graph data processing, we have experience with various graph databases. In the beginning, we used the stand-alone edition of AgensGraph. Later, due to its performance limitations, we switched to JanusGraph, a distributed graph database. I described the migration in detail in my article “Migrate tens of billions of graph data into JanusGraph” (available only in Chinese). As the data size and the number of business calls grew, a new problem appeared: each query took too long. In some business scenarios, a single query took up to 10 seconds, and as the data size increased, a more complicated single query needed two or three seconds. These problems seriously affected the performance of the entire business process and the development of related businesses.

The architecture of JanusGraph is what makes a single query time-consuming. The core reason is that it depends on external storage, which JanusGraph cannot control well. Our production environment uses an HBase cluster, so queries cannot all be pushed down to the storage layer for processing. Instead, data has to be loaded from HBase into the JanusGraph Server's memory and then filtered there.
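
The cost of filtering in server memory instead of in the storage layer can be illustrated with a toy model. Nothing here uses the JanusGraph or HBase APIs; the `rows` list and both query functions are hypothetical stand-ins that only count how many rows would have to leave the storage layer.

```python
# Toy "storage layer": rows that live in the external store
rows = [{"id": i, "age": 20 + i % 50} for i in range(1000)]


def query_without_pushdown(predicate):
    """Pull every row into server memory, then filter there."""
    transferred = list(rows)  # all rows leave the storage layer
    return [r for r in transferred if predicate(r)], len(transferred)


def query_with_pushdown(predicate):
    """Filter inside the storage layer and return only the matches."""
    matched = [r for r in rows if predicate(r)]
    return matched, len(matched)  # only matching rows leave the storage layer


def is_thirty(row):
    return row["age"] == 30


result_a, moved_a = query_without_pushdown(is_thirty)
result_b, moved_b = query_with_pushdown(is_thirty)
print(moved_a, moved_b)
```

Both functions return the same rows, but the version without pushdown moves the whole table to answer a query that matches only a small fraction of it.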

#database #tutorial #graph database #database performance #nebula graph #graph database adoption

Ruth  Nabimanya

1620663480

Which Database Is Right For You? Graph Database vs. Relational Database

At the very beginning of most development endeavors lies an important question: What database do I choose? There is such an abundance of database technologies at this moment, it’s no wonder many developers don’t have the time or energy to research new ones. If you are one of those developers and you aren’t very familiar with graph databases in general, you’ve come to the right place!

In this article, you will learn about the main differences between a graph database and a relational database, what kind of use-cases are best suited for each database type, and what are their strengths and weaknesses.

How Does a Graph Database Differ from a Relational Database?

The Graph Data Model

The Relational Data Model

When to use a Graph Database?

When not to use a Graph Database

Is a Graph Database Worth it?

#graph-database #relational-database #graph-theory #graph-analysis #data-analytics #networks #data #database

Loma  Baumbach

1598022420

Analyzing Relationships in Game of Thrones With NetworkX, Gephi, and Nebula Graph (Part 1)

The hit series Game of Thrones by HBO is popular all over the world. Besides the unexpected plot twists and turns, the series is also known for its complex and highly intertwined character relationships. In this post, we will access the open source graph database Nebula Graph with NetworkX and visualize the complex character connections in Game of Thrones with Gephi.

Introduction to the Dataset

The dataset we used in this article is: A Song of Ice and Fire Volume One to Volume Five[1].

  • Character set (vertices set): Each character in the book is stored as a vertex, and the vertex has only one property, i.e. name.
  • Relation set (edges set): If two characters connect directly or indirectly in the book, there is an edge between them. The edge has only one property, i.e. weight. The weight represents the intimacy level of the relationship.

The preceding vertices set and edges set constitute a graph, which is stored in the graph database Nebula Graph[2].
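
As a rough sketch of the graph shape described above, the snippet below builds a small NetworkX graph with a `name` property on each vertex and a `weight` property on each edge. The rows are placeholders rather than values from the dataset, and reading the data out of Nebula Graph is left out entirely.

```python
import networkx as nx

# Placeholder rows in (source, target, weight) form; not real dataset values
edges = [
    ("character_a", "character_b", 5),
    ("character_b", "character_c", 2),
    ("character_a", "character_c", 1),
]

G = nx.Graph()
for source, target, weight in edges:
    # Each character becomes a vertex with a single 'name' property
    G.add_node(source, name=source)
    G.add_node(target, name=target)
    # Each relation becomes an edge with a single 'weight' property
    G.add_edge(source, target, weight=weight)

print(G.number_of_nodes(), G.number_of_edges())
```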

Community Detection: The Girvan-Newman Algorithm

We used Girvan-Newman, the community detection algorithm built into NetworkX[3], to divide our graph into communities.

Below are some explanations for the algorithm:

_In the network graph, a closely connected part can be regarded as a community. Connections among vertices are relatively close within each community, while the connections between two communities are loose. Community detection is the process of finding the communities contained in a given network graph. Girvan-Newman is a community detection algorithm based on betweenness. Its basic idea is to progressively remove edges from the original network according to their edge betweenness until the entire network is broken down into communities. By removing these edges, the groups are separated from one another and the underlying community structure of the network is revealed. Therefore, the Girvan-Newman algorithm is actually a splitting method. The algorithm's steps for community detection are summarized below:_

_(1) The betweenness of all existing edges in the network is calculated first._

_(2) The edge(s) with the highest betweenness are removed._

_(3) Steps 1 and 2 are repeated until no edges remain._
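
The edge-removal loop can be sketched by hand with NetworkX primitives. This is a simplified illustration, not the library's implementation: the hypothetical `girvan_newman_split` helper performs a single splitting round, stopping as soon as the graph gains one more connected component, whereas the full algorithm keeps going until no edges remain.

```python
import networkx as nx


def girvan_newman_split(graph):
    """Remove highest-betweenness edges until the graph gains a component."""
    g = graph.copy()
    start = nx.number_connected_components(g)
    while g.number_of_edges() > 0 and nx.number_connected_components(g) == start:
        # Compute the betweenness of every remaining edge
        betweenness = nx.edge_betweenness_centrality(g)
        # Remove the edge(s) with the highest betweenness
        top = max(betweenness.values())
        g.remove_edges_from([e for e, b in betweenness.items() if b == top])
        # The loop then recomputes betweenness on the reduced graph
    return [set(c) for c in nx.connected_components(g)]


# Two 4-vertex cliques joined by a single bridge edge (3, 4)
demo = nx.barbell_graph(4, 0)
parts = girvan_newman_split(demo)
print(parts)
```

In the demo graph every shortest path between the two cliques crosses the bridge, so the bridge has the highest betweenness and is removed first, leaving the two cliques as communities.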

With this explanation, let’s see how to use the algorithm.

1. Detect communities with the Girvan-Newman algorithm. The NetworkX sample code is as follows:

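```python
import itertools
import networkx as nx

# G is the NetworkX graph built from the character data.
# girvan_newman yields successively finer partitions of the graph.
comp = nx.algorithms.community.girvan_newman(G)
k = 7
# Keep the partitions with at most k communities and take the finest one
limited = itertools.takewhile(lambda c: len(c) <= k, comp)
communities = list(limited)[-1]
```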

2. Add a community property to each vertex in the graph. The property value is the community number where the vertex is located.

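```python
community_dict = {}
community_num = 0
for community in communities:
    for character in community:
        community_dict[character] = community_num
    # Move to the next number once every member of this community is labeled
    community_num += 1
# Attach the community number to each vertex as the 'community' property
nx.set_node_attributes(G, community_dict, 'community')
```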

Vertex Style: The Betweenness Centrality Algorithm

Next we will adjust the size for the vertex and the size for the character name marked on the vertex. We will use NetworkX’s Betweenness Centrality algorithm to achieve our goals.

The importance of each vertex in the graph can be measured by its centrality. Different networks adopt different definitions of centrality to describe how important their vertices are. Betweenness Centrality judges the importance of a vertex by how many shortest paths pass through it.
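
A tiny example makes the definition easier to see. The path graph below is an illustration chosen for this sketch and is unrelated to the Game of Thrones data.

```python
import networkx as nx

# A path 0 - 1 - 2 - 3 - 4: interior vertices sit on more shortest paths
path = nx.path_graph(5)
scores = nx.betweenness_centrality(path)  # normalized by default

for vertex, score in sorted(scores.items()):
    print(vertex, round(score, 3))
```

The two end vertices lie on no shortest path between other vertices, so they score zero, while the middle vertex lies on the most shortest paths and scores highest.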

1. Calculate the value of the betweenness centrality for each vertex.

```python
betweenness_dict = nx.betweenness_centrality(G)  # Run betweenness centrality
```

2. Add a new betweenness property for each vertex in the graph.

```python
nx.set_node_attributes(G, betweenness_dict, 'betweenness')
```

#graph database #gephi #graph visualization #nebula graph #networkx #database