The 3rd week of the Metis Data Science Bootcamp is behind us now, and so is the 2nd project of the bootcamp. Below I detail the project I did, which was to scrape data from MyAnimeList (MAL), and then use linear regression models to predict user scores based on the features of the anime.

TL;DR:

  • Scraped and cleaned 19 features from 11,541 anime entries
  • Transformed features through dummy variables; clipping outliers; applying log10 and sqrt; and multiplying/dividing features
  • Selected 11 features through statsmodels’ OLS; scikit-learn’s LASSO; metrics such as R² and MSE; and intuition and domain knowledge
  • **LASSO** edged out plain linear regression and Ridge in performance, but choice of linear regression model ultimately didn’t make a difference
  • The final model can predict whether an anime has high or middling scores, but has **negative skew** and a relatively wide spread
  • Different types of anime also had very different distributions for episodes and duration
  • Next steps would be to apply linear regression models separately to different anime types and to investigate more sophisticated models
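
As a rough illustration of the kinds of transformations listed above, here is a minimal pandas sketch. The column names, the sample rows and the clipping ceiling are all made up for the example; they are not the actual features or thresholds used in the project.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names are assumptions, not the project's real feature names.
df = pd.DataFrame({
    "type": ["TV", "Movie", "OVA", "TV"],
    "episodes": [12, 1, 2, 1000],
    "members": [100, 10_000, 1_000, 1_000_000],
    "duration_min": [24, 100, 30, 4],
})

# Dummy variables: one 0/1 column per category, dropping the first to avoid collinearity.
dummies = pd.get_dummies(df["type"], prefix="type", drop_first=True)

# Clip outliers: cap episode counts at an arbitrary, illustrative ceiling.
df["episodes_clipped"] = df["episodes"].clip(upper=100)

# log10 and sqrt compress long right tails.
df["log_members"] = np.log10(df["members"])
df["sqrt_duration"] = np.sqrt(df["duration_min"])

# Multiplying features: total runtime as episodes x minutes per episode.
df["total_minutes"] = df["episodes_clipped"] * df["duration_min"]

features = pd.concat([df.drop(columns="type"), dummies], axis=1)
print(features.head())
```

Which transformation suits which feature depends on that feature's distribution, so the choices here are only placeholders.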

I. Background

MAL states that it is “the world’s largest anime and manga database and community”. Most if not all of the information on the website is user-generated, from the scores and reviews given to an anime to the synopsis and list of characters in said anime.

[Figure: Example of a listing on MAL]

The user-generated score is a good proxy for the quality and popularity of a given anime. Hence I wondered if I could use the data that MAL had on the anime to predict the kind of score it would receive.

This would be valuable information to have for an anime production company deciding on its next production, or a TV network deciding what anime to commission or license for broadcast.

II. Gathering the data

I used BeautifulSoup to scrape the data for 16,706 anime entries between 23 and 24 April 2020.
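
To give a feel for the parsing step, here is a minimal BeautifulSoup sketch. It runs against an inline HTML snippet instead of the live site, and the tag and class names are hypothetical: MAL's real page structure is not reproduced in this post, so treat the selectors as placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these tags and class names are illustrative assumptions,
# not MAL's actual page structure.
SAMPLE_HTML = """
<div class="anime-entry">
  <h1 class="title">Example Anime</h1>
  <span class="score">8.12</span>
  <span class="episodes">24</span>
</div>
"""

def parse_entry(html):
    """Extract a few fields from one (hypothetical) anime page."""
    soup = BeautifulSoup(html, "html.parser")
    score_tag = soup.find("span", class_="score")
    return {
        "title": soup.find("h1", class_="title").get_text(strip=True),
        # Entries with no score are kept as None so they can be filtered out later.
        "score": float(score_tag.get_text(strip=True)) if score_tag else None,
        "episodes": int(soup.find("span", class_="episodes").get_text(strip=True)),
    }

entry = parse_entry(SAMPLE_HTML)
print(entry)
```

In a real scrape, the HTML string would come from an HTTP request per entry, with a delay between requests to stay polite to the site.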

[Figure: How I partitioned the MAL data]

Out of these entries, 5,165 were left out of the final training and testing dataset because they had no score (the target value) attached to them. This left me with 11,541 entries, or about 69% of the original dataset. I then split off roughly 1/5th (2,279) of the remaining entries into a testing dataset, and kept the other 9,262 entries for training and validation.
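
The partition arithmetic can be checked with a short sketch. It uses randomly generated placeholder scores, not the real data, and scikit-learn's `train_test_split` with an integer `test_size` to reproduce the counts above. The random seed and the use of `train_test_split` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

N_SCRAPED = 16_706   # entries scraped
N_UNSCORED = 5_165   # entries with no score
N_TEST = 2_279       # entries held out for testing

rng = np.random.default_rng(0)

# Placeholder scores standing in for the scraped data.
scores = rng.uniform(1, 10, size=N_SCRAPED)
# Mark a random subset as unscored (NaN), mimicking entries without a target value.
scores[rng.choice(N_SCRAPED, size=N_UNSCORED, replace=False)] = np.nan

# Keep only the entries that have a score.
scored_idx = np.flatnonzero(~np.isnan(scores))

# An integer test_size holds out exactly that many entries.
train_idx, test_idx = train_test_split(scored_idx, test_size=N_TEST, random_state=0)

print(len(scored_idx), len(train_idx), len(test_idx))  # 11541 9262 2279
```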

Below is the list of 19 features that were ultimately considered.

[Figure: List of the 19 features]

[Figure: Types of associated properties]

Features that were scraped but not used include the producers, licensors and studios, as well as broadcast timing in Japan. Features that were not scraped include the characters, voice actors and other staff involved in the anime. All of these could be included in a future investigation of MAL data.

MyAnimeList user scores: Fun with web scraping and linear regression