Information Extraction is the process of extracting information in a more structured way, i.e., information that is machine-understandable. It consists of several subtasks, none of which is easy to solve. One approach to storing extracted data in a structured manner is a Knowledge Graph: a set of three-item sets called triples, where each triple combines a subject, a predicate and an object. In this article, we will discuss how to build a knowledge graph using Python and spaCy.

Let’s get started.

Code Implementation

Import all the libraries required for this project, and load the spaCy language model that will be used for dependency parsing.

import spacy
from spacy.lang.en import English
import networkx as nx
import matplotlib.pyplot as plt

# Load the pretrained English pipeline
# (install it first with: python -m spacy download en_core_web_sm)
nlp_model = spacy.load('en_core_web_sm')

The nodes of our graph will be the entities present in the text, such as those found in Wikipedia articles. Edges are the relationships connecting these entities to each other. We will extract these components in an unsupervised manner, i.e., we will use the grammar of the sentences.

The main idea is to go through a sentence and extract the subject and the object as they are encountered. First, we pass the text to the function; the text is broken down and each token or word is placed into a grammatical category. Once we have reached the end of a sentence, we clean up any whitespace that may have remained, and we are done: we have obtained a triple. For example, the statement “Bhubaneswar is categorised as a Tier-2 city” gives a triple focusing on the main subject: (Bhubaneswar, categorised, Tier-2 city).
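A quick way to see these grammatical categories is to print the dependency label that the parser assigns to each token. A minimal sketch, reusing the nlp_model loaded above (the exact labels depend on the model version, but typically the subject carries a label containing "subj" and the object one containing "obj", which is why the extraction functions below match on those substrings):

doc = nlp_model("Bhubaneswar is categorised as a Tier-2 city")
for token in doc:
    # e.g. "Bhubaneswar -> nsubjpass", "categorised -> ROOT", "city -> pobj"
    print(token.text, "->", token.dep_)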

Below, we define the functions that extract triples and use them to build the knowledge graph.

def getSentences(text):
    nlp = English()
    nlp.add_pipe('sentencizer')  # lightweight rule-based sentence splitter
    document = nlp(text)
    return [sent.text.strip() for sent in document.sents]

def printToken(token):
    print(token.text, "->", token.dep_)

def appendChunk(original, chunk):
    return original + ' ' + chunk

def isRelationCandidate(token):
    # Dependency labels that mark a token as part of the relation
    deps = ["ROOT", "adj", "attr", "agent", "amod"]
    return any(subs in token.dep_ for subs in deps)

def isConstructionCandidate(token):
    # Dependency labels that extend a subject/object chunk
    deps = ["compound", "prep", "conj", "mod"]
    return any(subs in token.dep_ for subs in deps)
def processSubjectObjectPairs(tokens):
    subject = ''
    obj = ''  # renamed to avoid shadowing the built-in `object`
    relation = ''
    subjectConstruction = ''
    objectConstruction = ''
    for token in tokens:
        printToken(token)
        if "punct" in token.dep_:
            continue
        if isRelationCandidate(token):
            relation = appendChunk(relation, token.lemma_)
        if isConstructionCandidate(token):
            if subjectConstruction:
                subjectConstruction = appendChunk(subjectConstruction, token.text)
            if objectConstruction:
                objectConstruction = appendChunk(objectConstruction, token.text)
        if "subj" in token.dep_:
            # Prepend any accumulated construction chunk to the subject
            subject = appendChunk(subject, token.text)
            subject = appendChunk(subjectConstruction, subject)
            subjectConstruction = ''
        if "obj" in token.dep_:
            obj = appendChunk(obj, token.text)
            obj = appendChunk(objectConstruction, obj)
            objectConstruction = ''
    print(subject.strip(), ",", relation.strip(), ",", obj.strip())
    return (subject.strip(), relation.strip(), obj.strip())

def processSentence(sentence):
    tokens = nlp_model(sentence)  # full pipeline: tagger + dependency parser
    return processSubjectObjectPairs(tokens)
def printGraph(triples):
    G = nx.Graph()
    for triple in triples:
        # The relation itself becomes an intermediate node
        # sitting between the subject and the object
        G.add_node(triple[0])
        G.add_node(triple[1])
        G.add_node(triple[2])
        G.add_edge(triple[0], triple[1])
        G.add_edge(triple[1], triple[2])
    pos = nx.spring_layout(G)
    plt.figure(figsize=(12, 8))
    nx.draw(G, pos, edge_color='black', width=1, linewidths=1,
            node_size=500, node_color='skyblue', alpha=0.9,
            labels={node: node for node in G.nodes()})
    plt.axis('off')
    plt.show()
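
The functions above are never actually called, so here is a minimal driver showing how they fit together. The sample text is an illustrative choice of our own, not from the original article:

if __name__ == "__main__":
    # Hypothetical sample text; substitute any paragraph you like
    text = ("Bhubaneswar is the capital of Odisha. "
            "Bhubaneswar is categorised as a Tier-2 city.")
    sentences = getSentences(text)
    triples = [processSentence(sentence) for sentence in sentences]
    printGraph(triples)

Running this prints each token with its dependency label, prints the extracted triples, and then opens a matplotlib window showing the resulting knowledge graph.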
