The 3rd week of the Metis Data Science Bootcamp is behind us now, and so is the 2nd project of the bootcamp. Below I detail the project I did, which was to scrape data from MyAnimeList (MAL), and then use linear regression models to predict user scores based on the features of the anime.
MAL states that it is “the world’s largest anime and manga database and community”. Most if not all of the information on the website is user-generated, from the scores and reviews given to an anime to the synopsis and list of characters in said anime.
Example of a listing on MAL
The user-generated score is a good proxy for the quality and popularity of a given anime. Hence I wondered if I could use the data that MAL had on the anime to predict the kind of score it would receive.
This would be valuable information for an anime production company deciding on its next production, or for a TV network deciding which anime to commission or license for broadcast.
I used BeautifulSoup to scrape the data for 16,706 anime entries between 23 and 24 April 2020.
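To give a rough idea of what this kind of scraping looks like, here is a minimal sketch that parses listing-style HTML with BeautifulSoup. It is not my actual scraper: the HTML snippet, the class names (`anime`, `title`, `score`, `episodes`) and the `parse_entries` helper are all made up for illustration, and the real MAL markup is different.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified HTML standing in for scraped listing pages.
# The tag and class names here are assumptions, not MAL's real markup.
SAMPLE_HTML = """
<div class="anime">
  <h1 class="title">Example Anime</h1>
  <span class="score">8.12</span>
  <span class="episodes">24</span>
</div>
<div class="anime">
  <h1 class="title">Unscored Anime</h1>
  <span class="score">N/A</span>
  <span class="episodes">12</span>
</div>
"""


def parse_entries(html):
    """Turn listing HTML into a list of dicts, one per anime entry."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for block in soup.find_all("div", class_="anime"):
        title = block.find("h1", class_="title").get_text(strip=True)
        raw_score = block.find("span", class_="score").get_text(strip=True)
        episodes = int(block.find("span", class_="episodes").get_text(strip=True))
        # Entries without a usable score are kept, with the score set to None,
        # so they can be filtered out later.
        try:
            score = float(raw_score)
        except ValueError:
            score = None
        entries.append({"title": title, "score": score, "episodes": episodes})
    return entries


entries = parse_entries(SAMPLE_HTML)
for entry in entries:
    print(entry)
```

In a real run the HTML would come from fetched pages rather than a string literal, and keeping unscored entries as `None` makes it easy to drop them before modelling.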
How I partitioned the MAL data
Out of these entries, 5,165 were left out of the final testing and training dataset because they had no score (i.e. no target value) attached to them. This left me with 11,541 entries, or about 69% (roughly two-thirds) of the original dataset. I then split off about one-fifth (2,279) of the remaining entries into a testing dataset, and kept the other 9,262 entries for training and validation.
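The partition arithmetic can be checked with a short sketch. The counts come from this post; the `split_entries` helper, the integer stand-in IDs and the seed are assumptions for illustration, not the code I actually ran. In practice, scikit-learn's `train_test_split` is a common choice for this step.

```python
import random

# Counts taken from the post.
N_SCRAPED = 16_706
N_UNSCORED = 5_165
N_TEST = 2_279


def split_entries(entries, n_test, seed=42):
    """Shuffle a copy of the entries and hold out n_test of them for testing."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in IDs for the scored entries; real rows would carry the features.
scored_ids = list(range(N_SCRAPED - N_UNSCORED))
train_val, test = split_entries(scored_ids, N_TEST)

print("scored entries:", len(scored_ids))
print("train/validation:", len(train_val))
print("test:", len(test))
print("scored share of scraped:", round(len(scored_ids) / N_SCRAPED, 3))
print("test share of scored:", round(len(test) / len(scored_ids), 3))
```

Fixing the seed keeps the held-out test set the same across runs, so results from different models stay comparable.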
Below is the list of 19 features that were ultimately considered.
Types of associated properties
Features that were scraped but not used include the producers, licensors and studios, as well as broadcast timing in Japan. Features that were not scraped include the characters, voice actors and other staff involved in the anime. All these features can be included in any further investigation of MAL data.