Collecting, transforming and cleaning JSTOR metadata in Python

A simple guide to parsing metadata from the JSTOR Data for Research database using ElementTree XML.

JSTOR is one of the leading sources of research articles, covering more than 50 disciplines. Through its Data for Research section, researchers can access datasets, for use in research and teaching, about the articles and books held in the library. Data available through the service include metadata, n-grams, and word counts for most articles, book chapters, research reports, and pamphlets on JSTOR. However, the output of a data request is not a simple .csv or .txt document but a set of XML files that need some processing and cleaning before the data can be used effectively. In R, the jstor package, released in mid-2020, made the whole process far simpler.

To make it easier for data scientists and researchers to access larger volumes of data, in this article I show the Python code for parsing the XML outputs, explain how to collect the data from the JSTOR Data for Research database, and show a nice application of this type of data.
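To give a rough idea of what parsing such a file involves, here is a minimal sketch using ElementTree from the standard library. The sample XML is made up for illustration, and its tag names (journal-title, article-title, contrib, pub-date) are an assumption about a JATS-style layout; real Data for Research files may be structured differently. The helper parse_article is a hypothetical name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Made-up sample record. The tag names are an assumption about a
# JATS-style layout and may not match real JSTOR output exactly.
SAMPLE_XML = """<article>
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Example Journal</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
      <contrib-group>
        <contrib>
          <string-name>
            <given-names>Ada</given-names>
            <surname>Lovelace</surname>
          </string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>1999</year>
      </pub-date>
    </article-meta>
  </front>
</article>"""


def parse_article(xml_text):
    """Hypothetical helper: pull a few metadata fields out of one XML record."""
    root = ET.fromstring(xml_text)

    def text_of(path):
        # Return the stripped text under the first match, or None if absent.
        node = root.find(path)
        if node is None:
            return None
        return "".join(node.itertext()).strip() or None

    # Build "Given Surname" strings, skipping contributors with no name parts.
    authors = []
    for contrib in root.iter("contrib"):
        given = contrib.findtext(".//given-names", default="").strip()
        surname = contrib.findtext(".//surname", default="").strip()
        name = " ".join(part for part in (given, surname) if part)
        if name:
            authors.append(name)

    return {
        "journal": text_of(".//journal-title"),
        "title": text_of(".//article-title"),
        "year": text_of(".//pub-date/year"),
        "authors": authors,
    }


record = parse_article(SAMPLE_XML)
print(record)
```

Missing fields come back as None (or an empty author list) rather than raising an error, which is convenient when looping over many files of uneven quality.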

Collecting data

Data transformation

Data cleaning

towardsdatascience.com
