A comprehensive list of data repositories for every type of problem.
Given the nature of my job, I work on new projects every week, each solving a different problem. My work requires me to parse through many different kinds of datasets to design and develop instructions for Data Science aspirants.
This blog lists a few useful datasets and data repositories, categorized by class of problem and by industry.
Data Repositories on the web:
Google Dataset Portal
- Google Dataset Search — a search engine for researchers to locate online data.
- datasetlist — offers a list of the biggest machine learning datasets from across the web.
- UCI — the UCI Machine Learning Repository, one of the oldest repositories, with datasets classified by problem type, attribute type, data type, field of study, and more.
- fastai-datasets — datasets for image classification, NLP, and image localization.
- NLP-datasets — Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing
- Bifrost — for visual datasets classified by task, application, class, label, and format.
Open image datasets:
- ImageNet — an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds to thousands of images.
- CT Medical Images — designed to allow different methods to be tested for examining trends in CT image data associated with the use of contrast and with patient age. The data consists of a small subset of images from The Cancer Imaging Archive (TCIA).
- Flickr-faces — Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative adversarial networks (GANs).
- objectnet — A new kind of vision dataset borrowing the idea of controls from other areas of science.
- CelebFaces — the Large-scale CelebFaces Attributes (CelebA) dataset of celebrity face images with attribute annotations.
- Animal Faces-HQ dataset (AFHQ) — a dataset of animal faces, consisting of 15,000 high-quality images at 512×512 resolution.
Text and NLP datasets:
- nlp-datasets — an alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP).
- 1 trillion n-grams — from the Linguistic Data Consortium. This data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
- litbank — LitBank is an annotated dataset of 100 works of English-language fiction to support tasks in natural language processing and the computational humanities.
- BookCorpus — scripts for reproducing the BookCorpus dataset yourself.
- rasa-nlu-training-data — Crowd-sourced training data for the development and testing of Rasa NLU models.
- Google Books Ngram — an online search engine that charts the frequencies of any set of search strings, using a yearly count of n-grams found in sources printed between 1500 and 2019 in Google’s text corpora in English, Chinese, French, German, Hebrew, Italian, Russian, or Spanish.
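
To make the n-gram entries above a little more concrete, here is a minimal Python sketch of what a "yearly count of n-grams" looks like. The toy `corpus` and the helper names `ngrams` and `yearly_ngram_counts` are made up for illustration, and the tokenization is a naive whitespace split. This is not how Google or the Linguistic Data Consortium build their data; it only shows the basic idea.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_ngram_counts(corpus, n):
    """Count n-grams per publication year.

    corpus: iterable of (year, text) pairs (a hypothetical toy format).
    Returns {year: Counter({ngram_tuple: count})}.
    """
    counts = defaultdict(Counter)
    for year, text in corpus:
        tokens = text.lower().split()  # naive whitespace tokenization
        counts[year].update(ngrams(tokens, n))
    return dict(counts)


# Toy data, invented for this example only.
corpus = [
    (1900, "the old house on the hill"),
    (1900, "the old man and the sea"),
    (2000, "the new house on the hill"),
]

bigrams = yearly_ngram_counts(corpus, 2)
print(bigrams[1900][("the", "old")])  # 2
print(bigrams[2000][("the", "old")])  # 0
```

Dividing each count by the total number of n-grams seen in that year would give a relative frequency, which is the kind of value you would plot over time to compare how often a phrase appears across years.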