Rusty Shanahan


MyAnimeList user scores: Fun with web scraping and linear regression

The 3rd week of the Metis Data Science Bootcamp is behind us now, and so is the 2nd project of the bootcamp. Below I detail the project I did, which was to scrape data from MyAnimeList (MAL), and then use linear regression models to predict user scores based on the features of the anime.


  • Scraped and cleaned 19 features from 11,541 anime entries
  • Transformed features by creating dummy variables, clipping outliers, applying log10 and square-root transforms, and multiplying or dividing features
  • Selected 11 features using statsmodels' OLS, scikit-learn's LASSO, metrics such as R² and MSE, and intuition and domain knowledge
  • **LASSO** edged out plain linear regression and Ridge in performance, but the choice of linear regression model ultimately didn't make a difference
  • The final model can predict whether an anime has high or middling scores, but has **negative skew** and a relatively wide spread
  • Different types of anime also had very different distributions for episodes and duration
  • Next steps would be to apply linear regression models separately to different anime types and to investigate more sophisticated models
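
The kinds of transformations listed above can be illustrated with a small sketch that uses only the Python standard library. The field names (`episodes`, `members`, `duration_min`, `type`), the sample values and the clipping bound are made up for illustration and are not taken from the project.

```python
import math

def clip(value, lower, upper):
    """Limit a value to the closed interval [lower, upper]."""
    return max(lower, min(value, upper))

def one_hot(value, categories):
    """Encode a category as dummy variables, dropping the first category as the baseline."""
    return {f"type_{c}": int(value == c) for c in categories[1:]}

def transform(row, categories, max_episodes=100):
    """Apply clipping, log10, sqrt, a feature product and dummy variables to one entry."""
    episodes = clip(row["episodes"], 1, max_episodes)  # clip outliers (assumed bound)
    features = {
        "episodes_clipped": episodes,
        "log10_members": math.log10(row["members"] + 1),  # +1 avoids log10(0)
        "sqrt_duration": math.sqrt(row["duration_min"]),
        "total_minutes": episodes * row["duration_min"],  # multiplying two features
    }
    features.update(one_hot(row["type"], categories))
    return features

# Hypothetical entry, not real MAL data.
categories = ["TV", "Movie", "OVA"]
row = {"episodes": 500, "members": 9999, "duration_min": 25, "type": "Movie"}
features = transform(row, categories)
```

Dropping the first category in `one_hot` is one common way to avoid perfectly collinear dummy columns in a linear model; whether the project did the same is not stated.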

I. Background

MAL states that it is “the world’s largest anime and manga database and community”. Most if not all of the information on the website is user-generated, from the scores and reviews given to an anime to the synopsis and list of characters in said anime.

[Image: Example of a listing on MAL]

The user-generated score is a good proxy for the quality and popularity of a given anime. Hence I wondered if I could use the data that MAL had on the anime to predict the kind of score it would receive.

This would be valuable information for an anime production company deciding on its next production, or a TV network deciding which anime to commission or license for broadcast.

II. Gathering the data

I used BeautifulSoup to scrape the data for 16,706 anime entries between 23 and 24 April 2020.
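
As a rough illustration of the parsing step, here is a minimal sketch. The project used BeautifulSoup; this sketch swaps in the standard library's `html.parser` so that it runs without third-party packages. The HTML snippet and the class names (`entry`, `title`, `score`) are invented for the example and do not reflect MAL's real markup.

```python
from html.parser import HTMLParser

# Invented markup for illustration only; MAL's real pages look different.
SAMPLE_HTML = """
<div class="entry"><span class="title">Example Anime A</span><span class="score">8.12</span></div>
<div class="entry"><span class="title">Example Anime B</span><span class="score">N/A</span></div>
"""

class EntryParser(HTMLParser):
    """Collect the title and score text of every <div class="entry">."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._field = None  # name of the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "entry":
            self.entries.append({})
        elif tag == "span" and css_class in ("title", "score"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.entries:
            self.entries[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

def parse_score(text):
    """Return the score as a float, or None when the entry has no score."""
    try:
        return float(text)
    except ValueError:
        return None

parser = EntryParser()
parser.feed(SAMPLE_HTML)
entries = [{"title": e["title"], "score": parse_score(e["score"])} for e in parser.entries]
```

Mapping a missing score to `None` makes it easy to filter out entries without a target value later on.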

[Image: How I partitioned the MAL data]

Of these entries, 5,165 were excluded from the training and testing datasets because they had no score (the target value). This left 11,541 entries, or about two-thirds of the original dataset. I then split off roughly one-fifth (2,279) of the remaining entries into a testing dataset and kept the other 9,262 entries for training and validation.
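
The arithmetic behind this partition, together with one possible way to perform such a split, can be sketched as follows. The counts come from the text above; the `split_entries` helper, the seed and the toy data are hypothetical and are not the project's actual code.

```python
import random

# Counts quoted in the text above.
TOTAL_SCRAPED = 16_706
UNSCORED = 5_165
TEST_SIZE = 2_279

scored = TOTAL_SCRAPED - UNSCORED    # entries that have a target value
train_val_size = scored - TEST_SIZE  # entries left for training and validation

def split_entries(entries, test_size, seed=42):
    """Drop entries without a score, shuffle, and hold out `test_size` entries for testing."""
    scored_entries = [e for e in entries if e["score"] is not None]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(scored_entries)
    return scored_entries[test_size:], scored_entries[:test_size]

# Toy data: every third entry has no score.
toy = [{"id": i, "score": None if i % 3 == 0 else 7.0} for i in range(30)]
train_val, test = split_entries(toy, test_size=4)
```

Holding out the test set before any feature selection keeps it untouched until the final evaluation.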

Below is the list of 19 features that were ultimately considered.

[Image: list of the 19 features]

[Image: Types of associated properties]

Features that were scraped but not used include the producers, licensors and studios, as well as broadcast timing in Japan. Features that were not scraped include the characters, voice actors and other staff involved in the anime. All these features can be included in any further investigation of MAL data.

#data-science #metis #anime #bootcamp #data analysis

Autumn Blick


What's the Link Between Web Automation and Web Proxies?

Web automation and web scraping are popular because they let people collect the information they want from the internet, which is one of the largest sources of information available. Used wisely, scraping can surface a lot of useful facts, but getting the most out of it takes the right methods. That's where proxies come into play.

How Can Proxies Help You With Web Scraping?

Scraping the internet means working through a large amount of information, which is never easy. Even with tools that automate the task, it still takes a lot of time.

With proxies, you can crawl multiple websites faster, for example by spreading requests across several IP addresses. This also makes crawling more reliable, so you have less reason to worry about the results you get.

Proxies also let you appear to be browsing from different geographical locations, because requests sent through a proxy seem to originate from the proxy's region. This is useful if you need location-specific information. For example, many retailers and business owners use this method to better understand local competition and their local customer base.
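
To make the idea concrete, here is a minimal sketch of routing requests through a proxy using only Python's standard library. The proxy address is a placeholder from a documentation-only IP range, and `build_proxy_opener` is a hypothetical helper name; a real setup would use the address and credentials supplied by a proxy provider.

```python
import urllib.request

# Hypothetical proxy address (203.0.113.0/24 is reserved for documentation).
PROXY_URL = "http://203.0.113.10:8080"

def build_proxy_opener(proxy_url):
    """Return an opener that sends both HTTP and HTTPS requests through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

opener = build_proxy_opener(PROXY_URL)

# Nothing is sent until the opener is used, for example:
# response = opener.open("https://example.com", timeout=10)
```

Rotating between several such openers, each built with a different proxy address, is one way to spread requests across locations.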

If you want to try out web automation, you can start with a free web proxy. Once you see the benefits, you may be motivated to take your automation projects to the next level.

#automation #web #proxy #web-automation #web-scraping #using-proxies #website-scraping #website-scraping-tools

A Deep Dive into Linear Regression

Let’s begin our journey with the truth — machines never learn. What a typical machine learning algorithm does is find a mathematical equation that, when applied to a given set of training data, produces a prediction that is very close to the actual output.

Why is this not learning? Because if you change the training data or environment even slightly, the algorithm will go haywire! That is not how learning works in humans. If you learned to play a video game by looking straight at the screen, you would still be a good player if someone tilted the screen slightly, which would not be the case for ML algorithms.

However, most of the algorithms are so complex and intimidating that they give our mere human intelligence the feel of actual learning, effectively hiding the underlying math. There is a dictum that if you can implement the algorithm, you know the algorithm. This saying is lost in the dense jungle of libraries and built-in modules that programming languages provide, reducing us to regular programmers calling an API and further strengthening the notion of a black box. Our quest will be to unravel the mysteries of this so-called ‘black box’, which magically produces accurate predictions, detects objects, diagnoses diseases and claims it will surpass human intelligence one day.

We will start with one of the less complex and easier-to-visualize algorithms in the ML paradigm: Linear Regression. The article is divided into the following sections:

  1. Need for Linear Regression

  2. Visualizing Linear Regression

  3. Deriving the formula for weight matrix W

  4. Using the formula and performing linear regression on a real world data set

Note: Knowledge of linear algebra, matrices and a little calculus is a prerequisite for understanding this article.

A basic understanding of Python, NumPy, and Matplotlib is also a must.

1) Need for Linear regression

Regression means predicting a real-valued number from a given set of input variables, e.g. predicting temperature based on month of the year, humidity, altitude above sea level, etc. Linear regression therefore means predicting a real-valued number that follows a linear trend. Linear regression is the first line of attack for discovering correlations in our data.

Now, the first thing that comes to mind when we hear the word linear is a line.

Yes! In linear regression, we try to fit a line that best generalizes all the data points in the data set. By generalizing, we mean we try to fit a line that passes very close to all the data points.

But how do we ensure that this happens? To understand this, let’s visualize a 1-D linear regression, also called Simple Linear Regression.
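
As a small illustration of the 1-D case, the sketch below fits a line using the standard closed-form least-squares estimates (slope = covariance of x and y divided by variance of x). It uses plain Python rather than NumPy so that it runs without extra packages, and the data points are made up; it is not the derivation the article goes on to present.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
```

Because the sample points lie exactly on a line, the fit recovers that line; with noisy data the same formulas give the line that minimizes the sum of squared vertical distances.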

#calculus #machine-learning #linear-regression-math #linear-regression #linear-regression-python #python

Sival Alethea


Beautiful Soup Tutorial - Web Scraping in Python

The Beautiful Soup module is used for web scraping in Python. Learn how to use the Beautiful Soup and Requests modules in this tutorial. After watching, you will be able to start scraping the web on your own.

#web scraping #python #beautiful soup #beautiful soup tutorial #web scraping in python #beautiful soup tutorial - web scraping in python

5 Regression algorithms: Explanation & Implementation in Python

Take your understanding of machine learning algorithms to the next level with this article. What is regression analysis in simple words? How is it applied in practice to real-world problems? And what Python code snippets can you use to implement regression algorithms for various objectives? Let’s forget about boring learning stuff and talk about science and the way it works.

#linear-regression-python #linear-regression #multivariate-regression #regression #python-programming

Angela Dickens


Regression: Linear Regression

Machine learning algorithms are not the regular algorithms we may be used to, because they are often described by a combination of complex statistics and mathematics. Since it is very important to understand the background of any algorithm you want to implement, this can pose a challenge to people with a non-mathematical background, as the maths can sap your motivation by slowing you down.

In this article, we will discuss linear and logistic regression and some regression techniques, assuming we have all heard of or even learnt about the linear model in high school mathematics. Hopefully, by the end of the article, the concept will be clearer.

**Regression Analysis** is a statistical process for estimating the relationship between a dependent variable (say Y) and one or more independent variables or predictors (X). It explains the changes in the dependent variable with respect to changes in select predictors. Some major uses for regression analysis are determining the strength of predictors, forecasting an effect, and trend forecasting. It finds the significant relationships between variables and the impact of predictors on the dependent variable. In regression, we fit a curve/line (the regression or best-fit line) to the data points such that the distances between the data points and the curve/line are minimized.
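
One common way to make "distance from the line" concrete is the squared vertical residual between each observed value and the fitted value. The sketch below computes two standard summaries of those residuals, the mean squared error (MSE) and R². The function names and the sample numbers are illustrative assumptions, not taken from the article.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared residual between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

# Made-up observations and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]

mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
```

A lower MSE and an R² closer to 1 both indicate that the fitted line passes closer to the data points.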

#regression #machine-learning #beginner #logistic-regression #linear-regression #deep learning