Create a Graph Database in Neo4j using Python

One of the most common questions I get from data scientists taking their first steps into graphs with Neo4j is how to get data into the database. In a previous post I showed a few ways to do this using the Neo4j Browser UI running in Docker. In this post I will show how you can populate the database with your own data generated in Python. I will also use a different Neo4j setup this time: the Neo4j Sandbox.

A Google Colab notebook with the code for this post can be found here. (That notebook includes instructions on how to connect Colab to Kaggle so you can download the data more quickly.)

Necessary tools

  1. The Neo4j Python driver (version 4.2 at the time of writing)
  2. Jupyter Notebook/Lab or Google Colab Notebook (optional)
  3. Pandas

Data cleaning with Python

Now we can start doing some data munging with Python. For this post we are going to use the arXiv Dataset found on Kaggle, which contains more than 1.7M scholarly STEM papers. (At the time of writing, it is on Version 18.) Go ahead and download that data to your local machine.
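
Once the download finishes, it helps to peek at a small slice of the data before doing anything heavier. Here is a minimal sketch of how that might look, assuming the download is a JSON-lines file (one paper per line) and that it is named arxiv-metadata-oai-snapshot.json; both are assumptions you should check against your own download. The helper name load_arxiv_sample is hypothetical and only for illustration.

```python
import json
from itertools import islice

import pandas as pd


def load_arxiv_sample(path, n_rows=1000):
    """Read up to n_rows records from a JSON-lines file into a DataFrame.

    Hypothetical helper: assumes one JSON object per line.
    """
    with open(path, "r", encoding="utf-8") as f:
        # Skip blank lines, then stop after n_rows records so the
        # whole file never has to fit in memory.
        non_empty = (line for line in f if line.strip())
        records = [json.loads(line) for line in islice(non_empty, n_rows)]
    return pd.DataFrame(records)


# Usage (assumed file name; adjust the path to your download):
# df = load_arxiv_sample("arxiv-metadata-oai-snapshot.json", n_rows=1000)
# print(df.columns)
```

Reading only the first few thousand lines keeps memory use low while you figure out which columns you want to turn into nodes and relationships. You can raise n_rows once the cleaning steps work on the sample.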

