Building datasets for language model training and fine-tuning can be tedious. I learned this the hard way while gathering a conversational text dataset and a niche song lyrics dataset, both for training a single GPT-2 model. Only after several hours of semi-automated scraping and manual cleaning did I come across Genius and its API, which was a godsend. The API is easy to set up and use, and Genius covers plenty of artists who aren’t household names, which makes it a good source for both mainstream and niche song lyrics. The process gets even easier with lyricsgenius, a Python package created by John Miller that wraps the Genius API and simplifies retrieving and filtering the data it returns.
This writeup focuses on one use case: constructing a training dataset for a generative language model, like GPT. To be clear, it does not cover building the model itself. We’ll walk through setting up the API client and then writing a function that fetches the lyrics of k songs and saves them to a .txt file.
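As a rough preview of where this is headed, here is a minimal sketch of the two steps, assuming lyricsgenius is installed and you have a Genius access token. The helper names `make_client` and `save_lyrics`, the blank-line separator between songs, and the choice to pull the k songs from a single artist are my own assumptions for illustration, not anything prescribed by the package.

```python
from pathlib import Path


def make_client(token):
    """Build a Genius API client (assumes `pip install lyricsgenius`)."""
    import lyricsgenius  # imported lazily so save_lyrics has no hard dependency

    return lyricsgenius.Genius(token)


def save_lyrics(client, artist_name, k, out_path):
    """Fetch up to k songs by artist_name and write their lyrics to out_path.

    Returns the number of songs written (0 if the artist was not found).
    """
    artist = client.search_artist(artist_name, max_songs=k)
    if artist is None:  # the search came back empty
        return 0

    # Keep only songs that actually came back with lyrics.
    lyrics = [song.lyrics.strip() for song in artist.songs if song.lyrics]

    # Assumption: songs are separated by one blank line in the output file.
    Path(out_path).write_text("\n\n".join(lyrics) + "\n", encoding="utf-8")
    return len(lyrics)


# Hypothetical usage (needs a real token and network access):
# genius = make_client("YOUR_ACCESS_TOKEN")
# save_lyrics(genius, "Some Artist", k=5, out_path="lyrics.txt")
```

Passing the client in as an argument keeps the fetching logic easy to test with a stand-in object, and it means the same function works however you choose to configure the real client.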
#dataset #music #song-lyrics #python