Estimates state that 70%–85% of the world’s data is text (unstructured data). Most English and EU business data is stored as plain-text bytes, MS Word, or Adobe PDF. [1]
Organizations commonly display documents on the web as Adobe **Portable Document Format** (PDF) files. [2]
In this blog, I detail the following:
The following function takes either a local file name or a remote file URL and returns a file object opened for reading bytes.
#in file_to_text.py
--------------------------------------------
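from io import StringIO, BytesIO
from typing import BinaryIO
import urllib.request

def file_or_url(pathfilename: str) -> BinaryIO:
    """
    Return a binary file object given a local file name or a URL.
    Args:
        pathfilename: local file path or remote file URL
    Returns:
        binary file object instance
    """
    try:
        # First try to open the argument as a local file.
        fp = open(pathfilename, mode="rb")
    except OSError:
        # Not a readable local file, so treat the argument as a URL.
        url_bytes = urllib.request.urlopen(pathfilename).read()
        fp = BytesIO(url_bytes)
    return fp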
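As a variation on the function above, here is a minimal sketch that decides between "local file" and "URL" by looking at the URL scheme instead of waiting for open() to fail, and that closes the handle once the bytes are read. The name read_bytes and the list of accepted schemes are my own assumptions for illustration; they are not part of file_to_text.py.

```python
import os
import tempfile
import urllib.request
from urllib.parse import urlparse


def read_bytes(path_or_url: str) -> bytes:
    """Hypothetical helper: return the raw bytes of a local file or a URL."""
    scheme = urlparse(path_or_url).scheme
    if scheme in ("http", "https", "ftp"):  # assumed list of remote schemes
        # Remote resource: download it fully into memory.
        with urllib.request.urlopen(path_or_url) as response:
            return response.read()
    # Anything else is treated as a local path.
    with open(path_or_url, mode="rb") as fp:
        return fp.read()


# Usage with a throwaway local file (no network access needed).
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write("café".encode("utf-8"))
    tmp_path = tmp.name

local_bytes = read_bytes(tmp_path)
os.remove(tmp_path)  # clean up the temporary file
print(local_bytes)   # b'caf\xc3\xa9'
```

Checking the scheme up front avoids a bare except and makes it explicit which inputs trigger a download. The trade-off is that a URL with a scheme outside the list is treated as a local path and fails in open().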
You will often encounter downloaded text blobs encoded as UTF-8 bytes (8-bit Unicode), which is common for text in the Romance languages. You need to decode these UTF-8 bytes into Python Unicode strings (str).
#in file_to_text.py
--------------------------------------------
def unicode_8_to_text(text: bytes) -> str:
    """Decode UTF-8 bytes to str; undecodable bytes become U+FFFD."""
    return text.decode("utf-8", "replace")
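To make the effect of the "replace" argument concrete, here is a small sketch comparing it with strict decoding on a deliberately truncated UTF-8 byte string. The helper bytes_to_text is an illustrative stand-in I define here so the snippet runs on its own. It mirrors the decode call above but is not the function from file_to_text.py.

```python
def bytes_to_text(raw: bytes, errors: str = "replace") -> str:
    """Illustrative stand-in: decode UTF-8 bytes into a Python str."""
    return raw.decode("utf-8", errors)


good = "café".encode("utf-8")  # valid UTF-8, the é takes two bytes
bad = good[:-1]                # truncated: the last character is cut in half

decoded_good = bytes_to_text(good)
decoded_bad = bytes_to_text(bad)  # invalid tail becomes U+FFFD
print(decoded_good)
print(decoded_bad)

# With errors="strict" the same input raises instead of substituting.
try:
    bytes_to_text(bad, errors="strict")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)
```

With "replace" the conversion never raises, which is convenient for bulk downloads, but damaged bytes are silently turned into the U+FFFD replacement character. If you need to detect bad input, strict decoding is the safer choice.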
#in a notebook cell (%time is an IPython magic)
import urllib.request
from file_to_text import unicode_8_to_text

text_l = 250  # number of characters to preview
text_url = r'http://www.gutenberg.org/files/74/74-0.txt'
gutenberg_text = urllib.request.urlopen(text_url).read()  # raw bytes
%time gutenberg_text = unicode_8_to_text(gutenberg_text)
print('{}: size: {:g} \n {} \n'.format(0, len(gutenberg_text), gutenberg_text[:text_l]))
output =>