Download and parse TREC-COVID data

This is the first in a series of blog posts that will show you how to improve a text search application, from downloading data to fine-tuning BERT models.

You can also run the steps contained here from Google Colab.

The team behind vespa.ai has built and open-sourced a CORD-19 search engine. Thanks to advanced Vespa features such as Approximate Nearest Neighbor Search and Transformers support via ONNX, it applies some of the most advanced NLP methods currently available for search.

Our first step is to download relevance judgments so that we can evaluate the query models currently deployed in the application and train better ones to replace them.

Download the data

The files used in this section can be found at https://ir.nist.gov/covidSubmit/data.html. We will download both the topics and the relevance judgments data. Do not worry about what they are just yet; we will explore them soon.

!wget https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml
!wget https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt
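As a rough preview of the parsing step, here is a minimal sketch that uses only the Python standard library. It assumes that the topics file is an XML document whose `topic` elements carry a `number` attribute and child elements such as `query`, `question` and `narrative`, and that each line of the relevance judgments file holds four whitespace-separated fields: topic id, round id, document id and relevance label. The helper names `parse_topics` and `parse_qrels`, the dictionary keys, and the inline sample values are made up for illustration, so check them against the downloaded files before relying on them.

```python
import xml.etree.ElementTree as ET

# Made-up samples that mimic the ASSUMED layout of the two files.
# They are placeholders, not real TREC-COVID data.
SAMPLE_TOPICS = """<topics>
  <topic number="1">
    <query>example query</query>
    <question>example question?</question>
    <narrative>example narrative</narrative>
  </topic>
</topics>"""

SAMPLE_QRELS = """1 0.5 docid001 2
1 0.5 docid002 0
"""


def parse_topics(xml_text):
    """Map each topic number to a dict of its child elements (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    topics = {}
    for topic in root.findall("topic"):
        number = topic.get("number")
        # Keep every child tag (query, question, narrative, ...) as plain text.
        topics[number] = {child.tag: (child.text or "").strip() for child in topic}
    return topics


def parse_qrels(text):
    """Turn the assumed four-column qrels text into a list of dicts (hypothetical helper)."""
    judgments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        topic_id, round_id, doc_id, relevance = fields
        judgments.append(
            {
                "topic_id": topic_id,
                "round_id": round_id,
                "doc_id": doc_id,
                "relevance": int(relevance),
            }
        )
    return judgments


topics = parse_topics(SAMPLE_TOPICS)
qrels = parse_qrels(SAMPLE_QRELS)
```

To try the same helpers on the downloaded files, read each file into a string first, for example `parse_topics(open("topics-rnd5.xml").read())` and `parse_qrels(open("qrels-covid_d5_j0.5-5.txt").read())`.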
