One Benchmark Framework for Your Natural Language Processing Project

In this article, I will show how to retrieve close to one million public text or PDF documents. Some of these documents are raw text, some are clean text, and some include categorical labelling. I will also introduce **KILT**, a benchmark framework for natural language models.


Thousands of PDF, Word, and Text Documents to Download for your NLP Project. Source: Unsplash

List of Lists of Public NLP Datasets.

The following is a non-exhaustive list of lists of NLP datasets:

Raw text

Awesome-Public-Datasets;
Project Gutenberg: File Repository;
Project Gutenberg: Top 100 EBooks as of 8/15/2020;
Google Books API for Python;
Google Books Ngram Viewer;
Google datasets;
textacy datasets;
Kaggle datasets;
fast.ai datasets;
UCI Machine Learning Repository datasets;
pyquora: A Python module to fetch and parse data from Quora;
Zillow: Real Estate and Mortgage Data;
readthedocs.org;
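
As a rough illustration of what "raw text" from sources like these can involve, here is a minimal sketch that trims the licence header and footer from a Project Gutenberg plain-text file. It assumes the file fences the book body with `*** START OF ... PROJECT GUTENBERG EBOOK ...` and `*** END OF ...` marker lines; the exact wording varies between files, so the patterns are deliberately loose. The function name `strip_gutenberg_boilerplate` and the sample string are hypothetical and exist only for this example.

```python
import re

# Assumed marker lines that fence the book body in Project Gutenberg
# plain-text files. Wording differs between files ("THE" vs "THIS"),
# so the patterns accept both and ignore case.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(raw_text: str) -> str:
    """Return the text between the START and END markers.

    If a marker is missing, fall back to the start or end of the input,
    so text without markers is returned unchanged (apart from trimming).
    """
    start = START_RE.search(raw_text)
    end = END_RE.search(raw_text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw_text)
    if finish < begin:
        # Markers out of order: keep everything after the START marker.
        finish = len(raw_text)
    return raw_text[begin:finish].strip()


# Hypothetical sample standing in for a downloaded file.
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License footer\n"
)
cleaned = strip_gutenberg_boilerplate(sample)
print(cleaned)
```

Working on a string rather than a URL keeps the sketch independent of any particular download method; the text could come from any of the sources listed above.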

