In December 2019, during the earliest days of the COVID-19 outbreak, my wife and I were in a cozy cocoon awaiting the birth of our son. After his birth, it was clear that the outbreak was taking hold of the world. I got to thinking more about my own birth at the end of 1985, a few months before the Chernobyl disaster in April of 1986. It seems that in an ever-evolving world, new life and new challenges will always go hand in hand. So whenever my son slept (not as much as I would have liked), I quietly picked up my computer and began to wade, then swim, and finally dive into natural language processing (NLP) in Python.

In March of 2020, the White House Office of Science and Technology Policy released the CORD-19 dataset along with a call to action:

“a call to action to the Nation’s artificial intelligence experts to develop new text and data mining techniques that can help the science community answer high-priority scientific questions related to COVID-19”


Photo by Caleb Perez on Unsplash

CORD-19 was the perfect opportunity to develop code to find relevant and timely information on the new coronavirus. The number of NLP packages and techniques available was overwhelming (e.g. RoBERTa, which is also the name of my mother-in-law, who heralded the news of the new virus to us), and the list is still expanding. In this article, I will demonstrate how I put some of these NLP packages together to build an extractive summarization code, called CORD crusher. I will zoom in on the components of my NLP code, explain their function, and show how they fit together. The five main steps were:

1. Divide data into time ranges by publication year

2. Extract keywords and group papers according to a broad subject

3. Build topics from keywords for each subject

4. Refine keywords into more specific topic phrases

5. Search CORD-19 text and rank by similarity
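To make the pipeline concrete, here is a minimal pure-Python sketch of the first and last steps: bucketing papers by publication year and ranking abstracts against a query with bag-of-words cosine similarity. The paper records and the similarity measure are simplified stand-ins for illustration, not the actual CORD crusher implementation.

```python
from collections import Counter, defaultdict
import math

def bucket_by_year(papers):
    """Step 1: group paper records into buckets by publication year."""
    buckets = defaultdict(list)
    for paper in papers:
        buckets[paper["year"]].append(paper)
    return buckets

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_by_similarity(query, papers):
    """Step 5: rank paper abstracts by similarity to a query string."""
    q = Counter(query.lower().split())
    scored = [(cosine_similarity(q, Counter(p["abstract"].lower().split())),
               p["title"]) for p in papers]
    return sorted(scored, reverse=True)

# Toy records standing in for CORD-19 metadata rows
papers = [
    {"title": "Spike protein binding", "year": 2020,
     "abstract": "coronavirus spike protein receptor binding"},
    {"title": "Influenza vaccines", "year": 2019,
     "abstract": "seasonal influenza vaccine efficacy"},
]
by_year = bucket_by_year(papers)
ranking = rank_by_similarity("coronavirus spike protein", papers)
```

In the real pipeline, the bag-of-words counts would be replaced by richer representations (keywords, topic phrases, transformer embeddings), but the bucket-then-rank structure is the same.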


CORD Crusher: Slicing the CORD 19 Data into Summaries