Creating a graph using e-commerce data and make a RFM analysis
Relational databases are logical way to manage data, but on the other hand, alternative approaches such as graph database can be more useful in many cases. It’s known that huge companies in various industries such as eBay, Airbnb, Cisco and many others use the graph database. At this point, Neo4j shows itself as a graph database platform for managing the data.
In this article, I’ll try to explain how to create an example graph from the e-commerce data, using Neo4j and also touch on RFM Analysis.
First of all, you can find the e-commerce data I mentioned before here. At first glance, it’s clearly seen that the data consists of transactions. Therefore, the data includes a series of columns such as customer, purchased products, quantity and date of transaction.
It would be a step in the right direction to plan the schema before inserting data to Neo4j. The schema aimed to be builded in present study is as follows,
We can start with the customers now. Creating a constraint before creating nodes both prevents duplication and performs better because it uses MERGE
locks. You can create the constraint as follows,
CREATE CONSTRAINT ON (customer:Customer) ASSERT customer.customerID IS UNIQUE
Please notice that, having uniqueness for a property value is only useful in the graph if the property exists. Then you can create customer nodes as follows,
:auto
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'https://git.io/JkIjD'
AS line
WITH
toInteger(line.CustomerID) AS CustomerID,
line WHERE NOT line.CustomerID IS null
MERGE(customer:Customer {customerID: CustomerID})
ON CREATE SET customer.country = line.Country
After creating customer nodes, it’ll be even easier to create product and transaction nodes. Likewise, firstly it would be correct to create constraint for product nodes.
CREATE CONSTRAINT ON (product:Product) ASSERT product.stockCode IS UNIQUE
There is an important point in here, when you create a constraint, Neo4j will create an index. Cypher will use that index for lookups just like other indexes. Therefore, there’s no need to create a separate index. In fact, if you try to create a constraint when there’s already an index, you’ll get an error.
After taking into account all of these, you can create product nodes as follows,
:auto
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'https://git.io/JkIjD'
AS line
MERGE(product:Product {stockCode: line.StockCode})
ON CREATE SET product.description = line.Description
As you can see above, ON CREATE
statement is used when creating nodes. If the node needs to be created, merge a node and set the properties. Similarly, you can also use the statement ON MATCH
if the node already exists.
It’ll be nice to create transaction nodes just before start dealing with relationships as follows,
CREATE CONSTRAINT ON (transaction:Transaction) ASSERT transaction.transactionID IS UNIQUE;
:auto
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'https://git.io/JkIjD'
AS line
MERGE(transaction:Transaction {transactionID: line.InvoiceNo})
ON CREATE SET transaction.transactionDate = line.InvoiceDate
Looking at the Cypher statements above, you can see that semicolon is used to separate Cypher statements. In general, you don’t need to end a Cypher statement with a semi-colon, but if you want to execute multiple Cypher statements, you must separate them.
The nodes in the graph are ready, but these nodes have no connection with each other. The connections capture the semantic relationships and context of the nodes in the graph. As it’s known, 3 types of nodes are available in the graph: Customer, transaction and product. As I mentioned at the beginning of this section, having relationships between customer-transaction and transaction-product will make this graph much more logical. The customer MADE a transaction and the transaction CONTAINS products. Here is the Cypher statement to building MADE relationships,
:auto
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'https://git.io/JkIjD'
AS line
WITH toInteger(line.CustomerID) AS CustomerID,
line.InvoiceNo AS InvoiceNo
MATCH (customer:Customer {customerID: CustomerID})
MATCH (transaction:Transaction {transactionID: InvoiceNo})
MERGE (customer)-[:MADE]->(transaction)
Let’s finalize the graph by creating CONTAINS relationships,
:auto
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'https://git.io/JkIjD'
AS line
WITH toInteger(line.CustomerID) AS CustomerID, line.InvoiceNo AS InvoiceNo
MATCH (customer:Customer {customerID: CustomerID})
MATCH (transaction:Transaction {transactionID: InvoiceNo})
MERGE (customer)-[:MADE]->(transaction)
You can now check the schema of the graph with the statement CALL db.schema.visualization()
. The result is like that,
Keep in mind that you can create the graph in a different way. For instance, the transaction could have been a relationship, not a node, and we could call it BOUGHT. As you can imagine, which one you choose depends on your business problem. At this point, you should set the rules and build the structure accordingly.
RFM analysis is a behavior-based approach grouping customers into segments. It groups the customers on the basis of their previous purchase transactions. Here are the three dimensions of RFM,
Segmenting customers using RFM analysis is an important point for companies that sell in many industries. Because companies want to know the customers that are valuable to them and to ensure loyalty for all their customers.
After mentioning the dimensions of the RFM and the significance of the customer segmentation, we can get the recency, frequency and monetary value with following python code,
from py2neo import Graph
import pandas as pd
host = 'localhost'
port = 7687
user = ''
password = ''
graph = Graph(
host=host,
port=port,
user=user,
password=password
)
tx = graph.begin()
query = """
MATCH (c:Customer)-[r1:MADE]->(t:Transaction)-[r2:CONTAINS]->(:Product)
WITH SUM(r2.price) AS monetary,
COUNT(DISTINCT t) AS frequency,
c.customerID AS customer,
MIN(
duration.inDays(
date(datetime({epochmillis: apoc.date.parse(t.transactionDate, 'ms', 'MM/dd/yyyy')})),
date()
).days
) AS recency
RETURN customer, recency, frequency, monetary
"""
# create the dataframe
results = tx.run(query).data()
df = pd.DataFrame(results)
## edit the recency value
df['recency'] = df['recency'] - df['recency'].min()
Then, it would be a correct step to define the segments by creating percentiles for dimensions. Please keep in mind that the segmentation here can be taken to a much more advanced level and is often not that simple. Real world problems can be more complex.
## three quantiles to rfm values
df['r_val'] = pd.qcut(df['recency'], q=3, labels=range(3, 0, -1))
df['f_val'] = pd.qcut(df['frequency'], q=3, labels=range(1, 4))
df['m_val'] = pd.qcut(df['monetary'], q=3, labels=range(1, 4))
## create the segment value
df['rfm_val'] = (
df['r_val'].astype(str) +
df['f_val'].astype(str) +
df['m_val'].astype(str)
)
## example names for segments
mapping = {
'Best customers': '333',
'No purchases recently': '133',
'Low loyalty': '111',
'New customers': '311'
}
## print the results
for k, v in mapping.items():
print(k + ',')
print(df[df.rfm_val == v].drop('customer', axis=1).describe().T)
print()
#machine-learning #neo4j #python #developer