Estimates state that 70%–85% of the world’s data is text (unstructured data). Most English and EU business data is stored as byte text, MS Word, or Adobe PDF. [1]

Organizations commonly publish Adobe **Portable Document Format** (PDF) documents on the web. [2]

In this blog, I detail the following:

  1. Create a file path from a web file name or a local file name;
  2. Change a byte-encoded Project Gutenberg file into a text corpus;
  3. Change a PDF document into a text corpus;
  4. Segment continuous text into a corpus of word text.

Converting Popular Document Formats into Text

1. Create a local file path from the web filename or local filename

The following function takes either a local file name or a remote file URL and returns an open binary file object.

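#in file_to_text.py
--------------------------------------------
from io import BytesIO
from typing import BinaryIO
import urllib.request

def file_or_url(pathfilename: str) -> BinaryIO:
    """
    Return a binary file object given a local file name or a URL.
    Args:
        pathfilename: local file path or remote file URL

    Returns:
        file object instance opened for reading bytes

    """
    try:
        # Try the name as a local file first.
        fp = open(pathfilename, mode="rb")
    except OSError:
        # Not a readable local file: treat the name as a URL
        # and buffer the downloaded bytes in memory.
        url_bytes = urllib.request.urlopen(pathfilename).read()
        fp = BytesIO(url_bytes)
    return fp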
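
As a usage illustration, here is a minimal, self-contained sketch of the same idea: check for a local file first, and otherwise treat the name as a URL. The helper name `open_local_or_remote` and the temporary sample file are assumptions made for this example; they are not part of file_to_text.py. Only the local branch is exercised, so the sketch runs without network access.

```python
import os
import tempfile
from io import BytesIO
from typing import BinaryIO
from urllib.request import urlopen


def open_local_or_remote(path_or_url: str) -> BinaryIO:
    """Return a binary file-like object for a local path or a URL (hypothetical helper)."""
    if os.path.exists(path_or_url):
        return open(path_or_url, mode="rb")
    # Not a local file: assume the name is a URL and buffer the download in memory.
    with urlopen(path_or_url) as response:
        return BytesIO(response.read())


# Write a throwaway local file so the example needs no network.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(b"Tom Sawyer")
    sample_path = tmp.name

# Either branch returns an object with the same read() interface.
with open_local_or_remote(sample_path) as fp:
    first_bytes = fp.read(3)

os.remove(sample_path)
print(first_bytes)
```

Because both branches return a binary file-like object, later steps can call `read()` without caring where the bytes came from.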

2. Change a UTF-8 byte-encoded file into a Python Unicode string

You will often encounter text downloads encoded as UTF-8 (8-bit Unicode), especially for text in the Romance languages. You need to decode these bytes into Python Unicode strings.
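
To make the decoding step concrete, here is a small sketch of what `bytes.decode` does with the `"replace"` error handler. The sample string is made up for illustration. Accented characters occupy two bytes each in UTF-8, and a byte sequence cut off mid-character is replaced with U+FFFD rather than raising an error.

```python
# Bytes as they might arrive from a download (sample text is an assumption).
raw = "Müller café".encode("utf-8")

# Normal case: every byte sequence is valid UTF-8.
text = raw.decode("utf-8", "replace")
print(len(raw), len(text))

# Damaged case: slice in the middle of the two-byte "ü".
damaged = raw[:2]

# Strict decoding refuses the incomplete sequence.
try:
    damaged.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD for the undecodable byte and keeps going.
repaired = damaged.decode("utf-8", "replace")
print(strict_failed, repr(repaired))
```

The trade-off is that `"replace"` never fails on a bad download, but it can silently hide corruption, so counting occurrences of `"\ufffd"` in the result is a cheap sanity check.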

#in file_to_text.py
--------------------------------------------
def unicode_8_to_text(text: bytes) -> str:
    # Decode UTF-8 bytes; undecodable bytes become U+FFFD instead of raising.
    return text.decode("utf-8", "replace")

import urllib.request
from file_to_text import unicode_8_to_text
text_l = 250
text_url = r'http://www.gutenberg.org/files/74/74-0.txt'
# read() returns the raw bytes of the download.
gutenberg_text = urllib.request.urlopen(text_url).read()
# %time is an IPython/Jupyter magic; drop it when running as a plain script.
%time gutenberg_text = unicode_8_to_text(gutenberg_text)
print('{}: size: {:g} \n {} \n'.format(0, len(gutenberg_text), gutenberg_text[:text_l]))
output =>

Natural Language Processing in Production: Converting PDF and Gutenberg