In this article, we will create a structured document database from the Institute for the Study of War (ISW) production library. ISW creates informational products that help diplomatic and intelligence professionals gain a deeper understanding of conflicts around the world.
To see the original code and notebook associated with this article, follow this link. To access the final structured dataset, hosted on Kaggle, follow this link.
This article will be an exercise in web extraction, natural language processing (NLP), and named entity recognition (NER). For the NLP, we will primarily be using the open-source Python libraries **NLTK** and **spaCy**. This article is intended to be a demonstration of a use case for web extraction and NLP, not a comprehensive beginner's tutorial on either technique. If you are new to NLP or web extraction, I would urge you to follow a different guide or look through the spaCy, BeautifulSoup, and NLTK documentation pages.
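To make the web-extraction step concrete before we set up the environment, here is a minimal sketch of pulling a headline and paragraphs out of HTML with BeautifulSoup. The HTML snippet and the names `title` and `paragraphs` are illustrative, not taken from the ISW site.

```python
from bs4 import BeautifulSoup

# A toy stand-in for a scraped ISW page (hypothetical markup).
html = """<html><body>
<h1>Russian Offensive Campaign Assessment</h1>
<p>Key takeaway one.</p>
<p>Key takeaway two.</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text()                      # headline text
paragraphs = [p.get_text() for p in soup.find_all("p")]  # body paragraphs
print(title)       # Russian Offensive Campaign Assessment
print(paragraphs)  # ['Key takeaway one.', 'Key takeaway two.']
```

On a real page you would fetch the HTML with `requests.get(url).text` first; the parsing pattern stays the same.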
#Import libraries
import requests
import nltk
import math
import re
import spacy
import pandas as pd
import numpy as np
import statistics as stats
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import json
from bs4 import BeautifulSoup
from nltk import *
#You will need to download some resources from NLTK.
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
#In most environments, you will need to install NER-D.
!pip install ner-d
from nerd import ner
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans