Agenda

Modeling Publication Data in a Graph
What Are Graph Queries?
Query Structure
Writing Graph Queries
Conclusion

1. Modeling Publication Data in a Graph

This is Part 3 of a series exploring data extraction and modeling. If you’ve followed along with the series so far, welcome back! If you’re new, here is a brief rundown of what we’ve done so far. In Part 1, we explored using NLP and entity extraction on biomedical literature related to Covid-19. In Part 2, we learned how to take our data and model it using TigerGraph Cloud. Check out those articles to get an in-depth look at how we’ve done everything so far.

Now in Part 3, we are going to look at writing graph search queries to easily analyze the data in our graph. You will need a fully formed graph to write queries. You can follow the steps in Part 2 to create a graph from scratch, or you can import the graph we created into TigerGraph Cloud. The graph, along with all of the data we used, can be found here.

To import the graph, create a blank solution following these steps. Once on the homepage of your solution, click Import an Existing Solution.

Import Solution on Homepage

Unfortunately, you still have to map and load the data manually. I walk through exactly how to do this in Part 2, under agenda items 4 and 5.

2. What Are Graph Queries?

Before we start writing queries, we should understand what they are. Graph queries are essentially commands that search through a graph and perform some operation. Queries can be used to find certain vertices or edges, do computations, or even update the graph. Since graphs also have visual representations, all of this can also be done with a UI, like the one provided by TigerGraph Cloud. However, when working with large amounts of data or when trying to create fine-tuned graph searches, clicking through a visual interface is slow and hard to repeat. Instead, we can write queries to quickly traverse a graph and extract or insert whatever data we want.

3. Query Structure

GSQL offers lots of different methods for querying. We will focus on searches. At the core of graph searches is something called a SELECT statement. As the name suggests, the SELECT statement is used to select a set of vertices or edges. It comes with several clauses to narrow down the focus of your search.

The **FROM** clause specifies what type of edge or vertex you are choosing.

The **WHERE** clause lets you declare specific conditions for the vertices or edges.

The **ACCUM** and **POST-ACCUM** clauses let you handle Accumulators, which are special GSQL variables that gather information as you search (the information can be numbers, sets of vertices or edges, etc.).

The **HAVING** clause, similar to the **WHERE** clause, lets you provide additional conditions; however, these are applied after the previous clauses have run, so they can test values gathered by the Accumulators.

The **ORDER BY** clause lets you order the gathered edges or vertices by some attribute value.

Finally, the **LIMIT** clause constrains the number of results of your search.

You can find all of these details, along with other parameters and query methods, on the TigerGraph documentation page.
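To see how the clauses fit together, here is a minimal sketch of a query that uses all of them at once. It assumes the LICENSE and PUBLICATION vertex types and the PUB_HAS_LICENSE edge type from the graph built in Part 2. The query name TopLicenses, the parameters k and minPubs, and the accumulators @pubCount and @@maxCount are hypothetical names made up for this illustration, not part of the original project.

```gsql
CREATE QUERY TopLicenses(INT k, INT minPubs) FOR GRAPH MyGraph {
  /* Hypothetical example: ranks license types by how many
     publications use them. All names here are illustrative. */
  SumAccum<INT> @pubCount;    // local accumulator: one counter per vertex
  MaxAccum<INT> @@maxCount;   // global accumulator: a single shared value

  Seed = {LICENSE.*};

  Licenses = SELECT s
             FROM Seed:s-(PUB_HAS_LICENSE:e)-PUBLICATION:p  // which vertices and edges to traverse
             WHERE s.id != "no-cc"                           // filter applied before accumulating
             ACCUM s.@pubCount += 1                          // runs once per matched edge
             POST-ACCUM @@maxCount += s.@pubCount            // runs once per selected vertex
             HAVING s.@pubCount >= minPubs                   // filter applied after accumulating
             ORDER BY s.@pubCount DESC                       // most-used licenses first
             LIMIT k;                                        // keep at most k results

  PRINT Licenses[Licenses.id, Licenses.@pubCount] AS RankedLicenses;
  PRINT @@maxCount AS LargestCount;
}
```

With k = 3 and minPubs = 1, this should return up to three license types other than no-cc, each with the number of publications attached to it. Note that WHERE can only look at data already stored in the graph, while HAVING can also look at what the accumulators collected during the traversal.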

4. Writing Graph Queries

Almost any search you could think of for a graph can be handled with the SELECT statement and its corresponding clauses. To prove that fact, let’s practice writing some queries.

All of the following queries can be found on my GitHub page.

These queries are in order from simplest to most complex.

Publications With a Given License

**Goal:** Find all publications that fall under a given license type.

Code:

CREATE QUERY LicensePub(STRING l) FOR GRAPH MyGraph {
  /* Finds all publications with a given license type
     Sample Inputs: cc0, cc-by, green-oa, cc-by-nc, no-cc */
  Seed = {LICENSE.*};
  Pubs = SELECT p
         FROM Seed:s-(PUB_HAS_LICENSE:e)-PUBLICATION:p
         WHERE s.id == l;
  PRINT Pubs[Pubs.id] AS Publications;
}

**Explanation:** Let’s break down what our code is doing. We want to select all PUBLICATION vertices that connect to a specific LICENSE vertex. So, we traverse from all LICENSE vertices to all PUBLICATION vertices with the condition that the license id is whatever we specify (e.g. cc0, no-cc, etc.). Then, we just print our results. There are two things to notice in our print statement.

If we simply write PRINT Pubs, our output will include the publications with all of their associated data (title, abstract, etc.). So, to filter the output data, we can specify which attributes we want using brackets. In our example, we only print out the ids by writing PRINT Pubs[Pubs.id].
The AS keyword is purely cosmetic: it just changes the name of the resulting list that is printed. This is useful when you are extracting data to be used in other contexts, but it is not necessary for writing queries.
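The PRINT variants described above can be compared side by side in this short fragment. It is meant to replace the last line inside LicensePub, not to run on its own, and in practice you would keep only one of the statements. The attribute name title in the last line is an assumption about the PUBLICATION schema; substitute whatever attribute names your graph actually uses.

```gsql
// Fragment: alternative PRINT statements for the end of LicensePub.
PRINT Pubs;                            // every attribute of each publication
PRINT Pubs[Pubs.id];                   // only the id attribute
PRINT Pubs[Pubs.id] AS Publications;   // same data, list renamed to "Publications"
PRINT Pubs[Pubs.id, Pubs.title];       // several attributes ("title" is an assumed name)
```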

Now, let’s save and install our code. When we run it, we get an input box that looks like this:

Interface after running license query

As an example, I entered ‘cc0’ as the license code. When I click run query, I get an image that looks like this:

Resulting publication vertices after running license query

This shows each publication vertex that has the license we specified. However, this view is quite messy. We can instead view the JSON output by clicking on the **<…>** icon on the left side. The JSON output should look like this.

JSON output for license query

This looks much cleaner! We can also see the effects of our print statement adjustments. The name of the resulting list is “Publications”, and the only vertex attribute printed is the id.
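If you prefer a command line to the graphical interface, the same steps can be sketched in the GSQL shell. This assumes you have shell access to your solution and that the graph is named MyGraph, as in the query definition above.

```gsql
USE GRAPH MyGraph
INSTALL QUERY LicensePub
RUN QUERY LicensePub("cc0")
```

RUN QUERY prints the JSON result directly, so it should match what the JSON view shows for the same input.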

For the following queries, I will only show the JSON output.
