**The Problem.** One of the biggest challenges during a large pandemic such as the current COVID-19 pandemic is keeping people up to date with the latest and most relevant information. Even though reputable sources like the CDC and FDA maintain FAQ websites for COVID-19, users might still struggle to find their questions there, and many common questions remain unanswered.
I’ve worked with other researchers to compile COVID-Q [Dataset Link], a dataset of COVID-19 questions, in the hope that it will be useful to other researchers. Our full paper can be found here.
**The Dataset: General Overview.** COVID-Q is a dataset of 1,690 questions about COVID-19 from thirteen online sources. The dataset is annotated in two ways: questions are classified into 15 question categories, and questions that ask the same thing are grouped into 207 question classes.
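To make the two annotation layers concrete, here is a minimal Python sketch of how an annotated question could be represented and summarized. Everything in it is an assumption for illustration: the `AnnotatedQuestion` type, its field names, the label strings, and the four sample rows are invented placeholders and are not taken from COVID-Q. It also assumes each question carries exactly one category and one class.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedQuestion:
    """Hypothetical record layout; the real dataset's columns may differ."""
    text: str
    source: str
    category: str        # broad topic label (COVID-Q has 15 of these)
    question_class: str  # shared by questions that ask the same thing


# Invented placeholder rows for illustration only, NOT real COVID-Q data.
rows = [
    AnnotatedQuestion("How does the virus spread?", "faq-site-a", "Transmission", "how-spread"),
    AnnotatedQuestion("How is COVID-19 transmitted?", "forum-b", "Transmission", "how-spread"),
    AnnotatedQuestion("Can pets carry the virus?", "forum-b", "Transmission", "pets"),
    AnnotatedQuestion("What are the symptoms?", "faq-site-a", "Symptoms", "symptom-list"),
]


def count_by_category(questions):
    """Number of questions per category."""
    return Counter(q.category for q in questions)


def group_by_class(questions):
    """Map each question class to the texts of the questions in it."""
    groups = defaultdict(list)
    for q in questions:
        groups[q.question_class].append(q.text)
    return dict(groups)


category_counts = count_by_category(rows)
class_groups = group_by_class(rows)
```

In this toy sample, the first two questions are worded differently but land in the same class, which is the kind of grouping the class annotation captures; the category label is coarser and covers several classes.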
COVID-Q can be used for several question understanding tasks:
Distribution of questions in COVID-Q by source. The reported number of questions excludes unrelated, vague, and nonsensical questions that have been removed. A * denotes sources for which questions came from FAQ pages.
**Data Collection and Processing.** To collect the data, thirteen sources were scraped to gather questions about COVID-19; seven of these sources were official FAQ websites from reputable organizations like the CDC and FDA, and six sources were crowd-based (e.g., Quora, Yahoo Answers).