**The Problem.** One of the biggest challenges during large pandemics such as the current COVID-19 pandemic is being able to keep people up-to-date with the latest and most relevant information. Even though reputable sources like the CDC and FDA maintain FAQ websites for COVID-19, users might still struggle to find their questions, and many common questions will remain unanswered.

I’ve worked with other researchers to compile COVID-Q [Dataset Link], a dataset of COVID-19 questions, in the hopes that this dataset will be useful to other researchers. Our full paper can be found here.

**The Dataset: General Overview.** COVID-Q is a dataset of 1,690 questions about COVID-19 from thirteen online sources. The dataset is annotated by classifying questions into 15 question categories and by grouping questions that ask the same thing into 207 question classes.

COVID-Q can be used for several question understanding tasks:

  • The question categories can be used as a standard text classification task to determine the general category of information that a question is asking about.
  • The question classes can be used for retrieval question answering. In this task, a system has a database of questions and answers. Given a new question, the system must find the question in the database that asks the same thing as the given question and return the corresponding answer.
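To make the retrieval task concrete, here is a minimal sketch in Python. It is not the method from our paper: it assumes a simple bag-of-words cosine similarity as the matching function, and the database entries, the placeholder answers, and the names `FAQ_DATABASE`, `tokenize`, `cosine_similarity`, and `retrieve_answer` are all hypothetical and only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy database of (question, answer) pairs.
# These entries are illustrative and are NOT taken from COVID-Q.
FAQ_DATABASE = [
    ("How does the virus spread?", "placeholder answer about transmission"),
    ("What are the symptoms of COVID-19?", "placeholder answer about symptoms"),
    ("Is there a vaccine available?", "placeholder answer about vaccines"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in set(a) & set(b))
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_answer(question, database):
    """Return the (question, answer) pair whose question is most similar
    to the new question under bag-of-words cosine similarity."""
    query = Counter(tokenize(question))
    return max(
        database,
        key=lambda pair: cosine_similarity(query, Counter(tokenize(pair[0]))),
    )


if __name__ == "__main__":
    matched_question, answer = retrieve_answer(
        "How is the virus spread between people?", FAQ_DATABASE
    )
    print(matched_question)
    print(answer)
```

Word overlap is only a stand-in here. A stronger system would likely swap in a learned sentence representation, but the structure of the task stays the same: score the new question against every stored question and return the answer attached to the best match.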

Distribution of questions in COVID-Q by source. The reported number of questions excludes unrelated, vague, and nonsensical questions that have been removed. A * denotes sources for which questions came from FAQ pages.

**Data Collection and Processing.** To collect the data, thirteen sources were scraped to gather questions about COVID-19; seven of these sources were official FAQ websites from reputable organizations like the CDC and FDA, and six sources were crowd-based (e.g., Quora, Yahoo Answers).

