In this article, I will show how to retrieve close to one million public text or PDF documents. Some of these documents are raw text, some are clean text, and some include categorical labelling. I will also introduce **KILT**, a benchmark framework for natural language models.

Thousands of PDF, Word, and Text Documents to Download for your NLP Project. Source: Unsplash

List of Lists of Public NLP Datasets.

The following are non-exhaustive lists of lists of NLP datasets:

Raw text

  1. Awesome-Public-Datasets;
  2. Project Gutenberg: File Repository;
  3. Project Gutenberg: Top 100 EBooks as of 8/15/2020;
  4. Google Books API for Python;
  5. Google Books Ngram Viewer;
  6. Google datasets;
  7. textacy datasets;
  8. Kaggle datasets;
  9. fast.ai datasets;
  10. UCI Machine Learning Repository datasets;
  11. pyquora: A Python module to fetch and parse data from Quora;
  12. Zillow: Real Estate and Mortgage Data;
  13. readthedocs.org;
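
As a rough illustration of what retrieving raw text from one of these sources can look like, here is a minimal sketch for Project Gutenberg. It rests on assumptions that are not part of this article: the URL pattern in `GUTENBERG_TXT_URL` and the exact wording of the START/END marker lines may differ between ebooks or change over time, and the function names `fetch_book` and `strip_gutenberg_boilerplate` are made up for this example. Check the site's terms of use and robot access policy before downloading in bulk.

```python
import re
import urllib.request

# Assumed URL pattern for plain-text ebooks; verify it before relying on it.
GUTENBERG_TXT_URL = "https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"

# Marker lines that commonly wrap the body of a Project Gutenberg text file.
# The wording is an assumption and varies between files.
_START = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
_END = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the START and END markers.

    If a marker is missing, fall back to the beginning or end of the input.
    """
    start = _START.search(raw)
    end = _END.search(raw)
    begin = start.end() if start else 0
    stop = end.start() if end and end.start() >= begin else len(raw)
    return raw[begin:stop].strip()


def fetch_book(book_id: int, timeout: float = 30.0) -> str:
    """Download one ebook as plain text and strip the license wrapper."""
    url = GUTENBERG_TXT_URL.format(book_id=book_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read().decode("utf-8", errors="replace")
    return strip_gutenberg_boilerplate(raw)
```

Calling `fetch_book(book_id)` with an ebook number taken from the site's catalog should return the body text without the license header and footer, provided the assumed URL pattern and markers hold for that file. Keeping the stripping step in its own function means it can also be applied to files that were downloaded by other means.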

One Benchmark Framework for Your Natural Language Processing Project