A New Question Classification Dataset

**The Problem.** One of the biggest challenges during large pandemics such as the current COVID-19 pandemic is keeping people up to date with the latest and most relevant information. Even though reputable sources like the CDC and FDA maintain FAQ websites for COVID-19, users might still struggle to find their questions, and many common questions will remain unanswered.

I’ve worked with other researchers to compile COVID-Q [Dataset Link], a dataset of COVID-19 questions, in the hopes that this dataset will be useful to other researchers. Our full paper can be found here.


**The Dataset: General Overview.** COVID-Q is a dataset of 1,690 questions about COVID-19 from thirteen online sources. The dataset is annotated by classifying questions into 15 question categories and by grouping questions that ask the same thing into 207 question classes.

COVID-Q can be used for several question understanding tasks:

- The question categories can be used as a standard text classification task to determine the general category of information that a question is asking about.
- The question classes can be used for retrieval question answering. In this task, a system has a database of questions and answers. Given a new question, the system must find the question in the database that asks the same thing as the given question and return the corresponding answer.
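
To make the retrieval question answering task concrete, here is a minimal sketch of what such a system could look like. It is not the method used in our paper: it assumes a simple bag-of-words cosine similarity as the matching function, and the `database` entries, the placeholder answers, and the helper names (`tokenize`, `cosine`, `retrieve_answer`) are all hypothetical and made up for illustration, not taken from COVID-Q.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_answer(query, database):
    """Return the (question, answer) pair whose question best matches the query."""
    query_vec = Counter(tokenize(query))
    return max(
        database,
        key=lambda pair: cosine(query_vec, Counter(tokenize(pair[0]))),
    )


# Hypothetical toy database; these entries are illustrative, not taken from COVID-Q.
database = [
    ("How long does the virus survive on surfaces?", "Placeholder answer about surfaces."),
    ("Can pets catch the virus?", "Placeholder answer about pets."),
    ("What are the symptoms?", "Placeholder answer about symptoms."),
]

question, answer = retrieve_answer("Can my pets get the virus?", database)
print(question)
print(answer)
```

Word overlap is a weak signal for whether two questions ask the same thing, so a real system would likely swap `cosine` for a learned sentence representation. The overall shape of the task stays the same: score the new question against every stored question and return the answer attached to the best match.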


Distribution of questions in COVID-Q by source. The reported number of questions excludes unrelated, vague, and nonsensical questions that have been removed. A * denotes sources for which questions came from FAQ pages.

**Data Collection and Processing.** To collect the data, thirteen sources were scraped to gather questions about COVID-19; seven of these sources were official FAQ websites from reputable organizations like the CDC and FDA, and six were crowd-based (e.g., Quora, Yahoo Answers).
