In this article, we will create a structured document database from the Institute for the Study of War (ISW) production library. ISW creates informational products that help diplomatic and intelligence professionals gain a deeper understanding of conflicts around the world.
To see the original code and notebook associated with this article, follow this link. To access the final structured dataset, hosted on Kaggle, follow this link.
This article will be an exercise in web extraction, natural language processing (NLP), and named entity recognition (NER). For the NLP, we will primarily be using the open-source Python libraries **NLTK** and **spaCy**. This article is intended to be a demonstration of a use case for web extraction and NLP, not a comprehensive beginner's tutorial on either technique. If you are new to NLP or web extraction, I would urge you to follow a different guide or look through the spaCy, BeautifulSoup, and NLTK documentation pages.
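To make the web-extraction step concrete before we set up the environment, here is a minimal sketch of pulling a headline and paragraphs out of HTML with BeautifulSoup. The HTML snippet and the names `title` and `paragraphs` are illustrative, not taken from the ISW site.

```python
from bs4 import BeautifulSoup

# A toy stand-in for a scraped ISW page (hypothetical markup).
html = """<html><body>
<h1>Russian Offensive Campaign Assessment</h1>
<p>Key takeaway one.</p>
<p>Key takeaway two.</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text()                      # headline text
paragraphs = [p.get_text() for p in soup.find_all("p")]  # body paragraphs
print(title)       # Russian Offensive Campaign Assessment
print(paragraphs)  # ['Key takeaway one.', 'Key takeaway two.']
```

On a real page you would fetch the HTML with `requests.get(url).text` first; the parsing pattern stays the same.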
#Import libraries
import requests
import nltk
import math
import re
import spacy
import pandas as pd
import numpy as np
import statistics as stats
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import json
from bs4 import BeautifulSoup
from nltk import *
#You will need to download some resources from NLTK.
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
#In most environments, you will need to install NER-D.
!pip install ner-d
from nerd import ner
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans