Building datasets for language model training and fine-tuning can be very tedious. I learned this the hard way while trying to gather a conversational text dataset and a niche song lyrics dataset, both for training a single GPT-2 model. Only after several hours of semi-automated scraping and manual cleaning did I come across Genius and its API. It really was a godsend. Not only is the API easy to set up and use, but Genius also covers a large number of artists who aren’t necessarily household names. This makes it a great option for creating datasets of both mainstream and niche song lyrics. The process becomes even easier when paired with lyricsgenius, a package created by John Miller that significantly simplifies the task of filtering data retrieved from the Genius API.

This writeup focuses on the use case of constructing a training dataset for a generative language model, like GPT. To be clear, it does not include steps to actually build a model. We’ll walk through the process of setting up the API client and then writing a function to fetch the lyrics of k songs and save them to a .txt file.

Setting up the API Client

  1. Review the API documentation page.
  2. Review the API Terms of Service.
  3. From the documentation page, click API Client management page to navigate to the Sign-up/Log-in page.
  4. Sign up, or log in if you already have an account, using the method of your choice and click Create Account. This should either take you to your API Clients page or route you back to the home page. If you land on the home page, scroll down to the page footer and click Developers.
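
Once an API client exists and a client access token has been generated for it, the fetch-and-save function mentioned earlier could look roughly like the sketch below. This is a minimal illustration rather than the definitive implementation: the function names `fetch_lyrics`, `save_lyrics`, and `main`, the `GENIUS_ACCESS_TOKEN` environment variable, the blank-line separator between songs, and the placeholder artist name are all assumptions made for the example. It relies on lyricsgenius's `Genius.search_artist` method and its `max_songs` and `sort` arguments.

```python
import os
from pathlib import Path


def fetch_lyrics(genius, artist_name, k=5):
    """Return the lyrics of up to k songs by artist_name as a list of strings.

    `genius` is expected to be a lyricsgenius.Genius client (or any object
    exposing a compatible search_artist method).
    """
    artist = genius.search_artist(artist_name, max_songs=k, sort="popularity")
    if artist is None:  # no matching artist was found
        return []
    # Skip songs whose lyrics are missing or empty.
    return [song.lyrics.strip() for song in artist.songs if song.lyrics]


def save_lyrics(lyrics, path, separator="\n\n"):
    """Write all lyrics to one UTF-8 text file, joined by `separator`.

    The blank-line separator is an assumption; pick whatever delimiter
    your training pipeline expects.
    """
    out = Path(path)
    out.write_text(separator.join(lyrics), encoding="utf-8")
    return out


def main():
    # Imported here so the helpers above work without the package installed.
    import lyricsgenius

    # Hypothetical variable name: store your client access token there.
    genius = lyricsgenius.Genius(os.environ["GENIUS_ACCESS_TOKEN"])
    lyrics = fetch_lyrics(genius, "Artist Name", k=3)  # placeholder artist
    save_lyrics(lyrics, "lyrics.txt")
```

Calling `main()` with the token set in the environment should write the collected lyrics to `lyrics.txt`. Passing the client into `fetch_lyrics` as an argument, rather than creating it inside the function, keeps the token handling in one place and makes the function easy to exercise with a stand-in client.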

How to Collect Song Lyrics with Python