This study recommends anime to users based on MyAnimeList user ratings. The recommendation method is frequent pattern mining, and the tool is Apache Spark. For data preprocessing, the Python pandas library is used in a Jupyter notebook. Anime is not as popular as TV series or movies, so finding good recommendations is more difficult. I hope my findings can help someone :)

Data Selection

First, rating.csv, which contains MyAnimeList user scores, is selected. Its columns are:

· user_id: a non-identifiable, randomly generated user id.

· anime_id: the anime that the user has rated.

· rating: the rating out of 10 that the user has assigned (-1 if the user watched the anime but didn't assign a rating).

Second, anime.csv, which comes from the same source as rating.csv, is selected to get each anime's type, such as TV, Movie, or OVA. Finally, the related column is selected from AnimeList.csv. Since all of the data comes from MyAnimeList, the files can be joined to each other on the anime_id attribute.
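
The join on anime_id can be sketched in pandas roughly as follows. The tiny DataFrames are made-up stand-ins for rating.csv and anime.csv, and the column name `type` is my assumption about how anime.csv labels the anime type.

```python
import pandas as pd

# Made-up stand-ins for rating.csv and anime.csv (illustrative values only).
ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "anime_id": [530, 740, 530],
    "rating": [8, -1, 10],
})
anime = pd.DataFrame({
    "anime_id": [530, 740],
    "type": ["TV", "TV"],  # assumed column name for the anime type
})

# Attach each anime's type to its ratings through the shared anime_id key.
ratings_with_type = ratings.merge(
    anime[["anime_id", "type"]], on="anime_id", how="left"
)
print(ratings_with_type)
```

A left join keeps every rating row even if an anime is missing from the type table; an inner join would silently drop those rows instead.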

Data Preprocessing

In this part, the data is prepared for the rule mining algorithm: lower ratings and unimportant types are dropped, and season data is merged.

A. Binning

First of all, low ratings should not be used for making recommendations.

Fig. 1. Histogram of the rating distribution

To decide on a cutoff, the histogram above was drawn. The value -1 is used when the user chose not to rate an anime. That does not mean the user disliked it, because people tend to rate an anime only when they like it a lot. So the -1 values are kept. Ignoring -1, the mean rating is 7.80. With ratings skewed this high, scores of 0–5 can be ignored. To summarize, scores of 6–10 and -1 are kept, and scores of 0–5 are counted as dislikes.
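
Here is a minimal pandas sketch of this binning rule. The sample ratings are invented for illustration, so the mean it prints is not the 7.80 from the real data.

```python
import pandas as pd

# Invented sample rows in the shape of rating.csv.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "anime_id": [10, 11, 10, 12, 11],
    "rating": [9, -1, 4, 7, 6],
})

# Mean over explicit ratings only (-1 means "watched, not rated").
mean_rating = ratings.loc[ratings["rating"] != -1, "rating"].mean()

# Keep scores of 6-10 plus the unrated (-1) rows; 0-5 counts as dislike.
liked = ratings[(ratings["rating"] >= 6) | (ratings["rating"] == -1)]

print(mean_rating)
print(liked)
```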

B. Type Filtering

Some anime types consist of a few episodes that tell side stories of a main anime. They should not be considered for rule mining, so OVA, ONA, Music, and Special entries are dropped. The type data and the rating data are joined, and everything except TV and Movie anime is removed. It also turned out that most of the dropped anime were unrated. As a result, the unrated entries that remain are more valuable than before.
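
The type filter could look something like this in pandas. The input is an invented example of ratings that already carry a type, and the column name `type` is again an assumption.

```python
import pandas as pd

# Invented example of ratings already joined with anime types.
ratings_with_type = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "anime_id": [10, 20, 30, 40],
    "rating": [8, -1, 9, -1],
    "type": ["TV", "OVA", "Movie", "Special"],  # assumed column name
})

# Keep only TV and Movie; OVA, ONA, Music and Special rows are dropped.
keep_types = ["TV", "Movie"]
filtered = ratings_with_type[ratings_with_type["type"].isin(keep_types)]

print(filtered)
```

Filtering with an allow-list of types rather than a block-list also drops any rows whose type is missing or unexpected.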

C. Season Data Merging

Each season of an anime has its own anime_id. For example, Sailor Moon has five seasons, each with its own anime_id:

  • Sailor Moon: 530
  • Sailor Moon R: 740
  • Sailor Moon S: 532
  • Sailor Moon Super S: 1239
  • Sailor Moon Sailor Stars: 996
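
One way to merge these seasons, sketched under my own assumptions, is to map every season id to a single representative id and then keep one row per user and series. The mapping below is written by hand from the Sailor Moon ids in the list above; picking 530 as the representative id and the column name `series_id` are choices made only for this illustration.

```python
import pandas as pd

# Sailor Moon season ids, all mapped to the first season's id.
# Using 530 as the representative id is an assumption for this sketch.
season_to_series = {530: 530, 740: 530, 532: 530, 1239: 530, 996: 530}

# Invented sample rows in the shape of rating.csv.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "anime_id": [530, 740, 996, 20],
    "rating": [8, 9, -1, 7],
})

# Replace season ids with the representative id; ids not in the map stay as-is.
ratings["series_id"] = (
    ratings["anime_id"].map(season_to_series).fillna(ratings["anime_id"]).astype(int)
)

# One row per (user, series), so a series appears once in each user's item set.
merged = ratings.drop_duplicates(subset=["user_id", "series_id"])[
    ["user_id", "series_id"]
]
print(merged)
```

Dropping the duplicates matters for the mining step, since Spark's FPGrowth expects the items within a transaction to be unique.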

