One worry that I always have when downloading data sets off the internet is their impermanence. Links die, data changes, ashes to ashes, dust to dust.

That's why I've been incorporating the Wayback Machine into my workflow. But even then, it's hard to stay consistent about whether I'm downloading data from an archived website or a live one, and hard to reconstruct later what I actually did.

Through my work with the Survey of Consumer Finances (SCF), I've implemented a system that simultaneously archives and logs the data I use. Below is a summary of that setup, but if you just want to see the code, scroll to the bottom of this post for a gist of the functions I've written for the SCF.

Building Off of WaybackPy

The main thing I wanted to accomplish with this project was to use recent Wayback archives as much as possible when downloading data. And for anything that had no archive yet, I wanted to make sure it got archived on the Wayback Machine for future use.

The best package I found for this was WaybackPy, although I needed to make a few changes to make it work for my purposes.

First, I needed to implement attributes that expose the age of the latest archive. That way I can check whether a new archive is needed given a provided archive_age_limit (see the sketch after the snippet below).

By way of illustration, len() on a WaybackPy Url object returns the number of days since the most recent archive. Construct one with www.google.com as the URL and you'd get 0, since Google is archived many times a day.

import waybackpy

url = "https://www.google.com/"
waybackpy_url_obj = waybackpy.Url(url)

# len() reports the number of days since the most recent Wayback archive of the URL
print(len(waybackpy_url_obj))
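
Putting that together with the archive_age_limit idea above, here's a minimal sketch of the check-then-save-then-log pattern. The function name, the CSV log format, and the default age limit are hypothetical, not my actual SCF functions (those are in the gist at the bottom); it also assumes waybackpy's Url.save(), Url.newest(), and archive_url behave as documented.

import csv
import datetime

import waybackpy

ARCHIVE_AGE_LIMIT = 365  # days; hypothetical threshold, tune it to your data

def archive_and_log(url, log_path="data_log.csv", archive_age_limit=ARCHIVE_AGE_LIMIT):
    # Wrap the URL; len() gives the number of days since the newest Wayback archive
    wayback = waybackpy.Url(url)

    if len(wayback) > archive_age_limit:
        # No sufficiently recent snapshot, so ask the Wayback Machine to capture one now
        archive = wayback.save()
    else:
        # A recent snapshot exists, so reuse the newest one
        archive = wayback.newest()

    # Record what was used and when, so the download can be traced (and repeated) later
    with open(log_path, "a", newline="") as log_file:
        csv.writer(log_file).writerow(
            [url, archive.archive_url, datetime.date.today().isoformat()]
        )

    return archive.archive_url

# Example usage:
# archive_url = archive_and_log("https://www.google.com/")

Keeping the log as a plain CSV means the record of which URL was used, which snapshot it came from, and when, lives alongside the project and can be versioned with the code.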
