The most important step of any data-driven project is obtaining quality data. Without proper cleaning and preprocessing, the results of a project can easily be biased or misinterpreted. Here, we will focus on cleaning data that consists of scraped web pages.

Obtaining the data

There are many tools for scraping the web. If you are looking for something quick and simple, Python's built-in URL handling module, urllib, might do the trick. Otherwise, I recommend the Scrapy framework (whose spiders can be deployed with scrapyd) for its customizability and robustness.
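
As a minimal sketch of the quick-and-simple route, a few lines of urllib are enough to download a page and save it for later processing. The URL and output path below are placeholders, and the example assumes the ./scraped_pages directory already exists.

from urllib.request import urlopen

## Download a single page and save it locally (URL and path are placeholders)
url = 'https://example.com/some-article.html'
with urlopen(url) as response:
    html = response.read().decode('utf-8', errors='ignore')

## Assumes the ./scraped_pages directory already exists
with open('./scraped_pages/example.html', 'w', encoding='utf-8') as fw:
    fw.write(html)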

It is important to ensure that the pages you are scraping contain rich text data that is suitable for your use case.

From HTML to text

Once we have obtained our scraped web pages, we begin by extracting the text from each page. Web pages are full of tags that carry no useful information for NLP, such as <script> and <button>. Thankfully, there is a Python module called boilerpy3 that makes text extraction easy.

We use the ArticleExtractor to extract the text. This extractor is tuned for news articles and works well for most HTML pages. You can try the other extractors listed in the boilerpy3 documentation and see which works best for your dataset.
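
A quick way to compare extractors is to run a few of them on the same file and inspect the output. The file path below is a placeholder, and the extractor names should be checked against the documentation of your installed boilerpy3 version.

from boilerpy3 import extractors

## Compare a few extractors on the same file to see which keeps the most useful text
sample_file = './scraped_pages/example.html'  ## placeholder path
for extractor_cls in [extractors.ArticleExtractor, extractors.DefaultExtractor, extractors.KeepEverythingExtractor]:
    content = extractor_cls().get_content_from_file(sample_file)
    print(extractor_cls.__name__, len(content), content[:100])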

Next, we condense all repeated newline characters (\n and \r) into a single \n character. This way, when we later split the text into sentences on \n and periods, we don't end up with empty sentences.

import os
import re
from boilerpy3 import extractors

## Condenses all repeating newline characters into one single newline character
def condense_newline(text):
    return '\n'.join([p for p in re.split('\n|\r', text) if len(p) > 0])

## Returns the text from an HTML file
def parse_html(html_path):
    ## Text extraction with boilerpy3
    html_extractor = extractors.ArticleExtractor()
    return condense_newline(html_extractor.get_content_from_file(html_path))

## Extracts the text from all HTML files in a specified directory
def html_to_text(folder):
    parsed_texts = []
    filepaths = os.listdir(folder)

    for filepath in filepaths:
        filepath_full = os.path.join(folder, filepath)
        if filepath_full.endswith(".html"):
            parsed_texts.append(parse_html(filepath_full))
    return parsed_texts

## Directory containing the scraped web pages
scraped_dir = './scraped_pages'
parsed_texts = html_to_text(scraped_dir)
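
The splitting step mentioned above is not part of this block; a minimal sketch of it, using a hypothetical split_sentences helper and the re module already imported above, could look like this:

## Split the condensed text into sentences on newlines and periods,
## dropping empty fragments (split_sentences is a hypothetical helper)
def split_sentences(text):
    return [s.strip() for s in re.split(r'[\n.]', text) if s.strip()]

sentences = [s for text in parsed_texts for s in split_sentences(text)]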

If the extractors from boilerpy3 are not working for your web pages, you can use BeautifulSoup to build your own custom text extractor. Below is an example replacement for the parse_html function.

from bs4 import BeautifulSoup

## Returns the text from an HTML file based on specified tags
def parse_html(html_path):
    with open(html_path, 'r') as fr:
        html_content = fr.read()
        soup = BeautifulSoup(html_content, 'html.parser')

        ## Check that the file is valid HTML
        if not soup.find():
            raise ValueError("File is not a valid HTML file")

        ## Check the language of the file
        tag_meta_language = soup.head.find("meta", attrs={"http-equiv": "content-language"})
        if tag_meta_language:
            document_language = tag_meta_language["content"]
            if document_language and document_language not in ["en", "en-us", "en-US"]:
                raise ValueError("Language {} is not English".format(document_language))

        ## Get text from the specified tags. Add more tags if necessary.
        TAGS = ['p']
        return ' '.join([condense_newline(tag.text) for tag in soup.find_all(TAGS)])
