How I used basic NLP to look at how gender correlates with certain subjects on Netflix.

A few visuals from this article. All visuals in this article were created by the author.

A couple of years ago, when I was first exposed to data science, I was amazed by a data article called “She Giggles, He Gallops,” which analyzed gender in screen direction across thousands of screenplays. It specifically identified all the verbs following “he” and “she” to investigate gender tropes. Now that I’ve learned more about data science, I thought I’d try to apply a similar analysis to a different dataset.

Streaming services have become a dominant force in the movie and TV industry. They are a primary medium for many viewers, and these platforms shape our culture through the titles they choose to feature. Netflix, as one of the larger streaming services, has considerable influence in this domain, so I wanted to take a look at gender representation in its selection.

In this article, I’ll explain the intuition behind my analysis and show the skeleton of the corresponding code. If you want the full code or the datasets used, you can find them in my GitHub repository for this project.

The Data

Flixable, a third-party Netflix search engine, released a dataset that lists all the movies and TV shows on Netflix as of 2019. It provides several attributes for each title, including its cast and description. With these two attributes, we can look at trends in the gender of the cast and in the content of the movie or TV show. For my analysis, I chose to look at the gender of the leading actor or actress.

The Flixable dataset, however, does not include the genders of the cast. I wasn’t going to label all 6000+ entries by hand, so I got some help from IMDb, which has several public datasets that are updated daily. One of these datasets lists the names of actors, actresses, writers, and so on, along with their professions. Given the cast of a title, I found that I could match the cast member names to the names in the IMDb dataset, check whether they are labeled as an ‘actor’ or ‘actress’, and automate the gender labeling that way.
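To make the matching idea concrete, here is a minimal sketch using only the Python standard library. It is not the exact code from my repository. It assumes the IMDb names file has `primaryName` and `primaryProfession` columns (the latter a comma-separated list), that the Netflix data has `title` and `cast` columns, and that the first name in `cast` is the lead. The rows below are made up for illustration, and `build_gender_lookup` and `lead_gender` are hypothetical helper names.

```python
import csv
import io

# Tiny in-memory stand-ins for the two files (made-up rows, illustration only).
IMDB_NAMES_TSV = (
    "nconst\tprimaryName\tprimaryProfession\n"
    "nm0000001\tJane Example\tactress,producer\n"
    "nm0000002\tJohn Sample\tactor,writer\n"
    "nm0000003\tPat Placeholder\tdirector\n"
)
NETFLIX_CSV = (
    "title,cast\n"
    'Show A,"Jane Example, John Sample"\n'
    'Show B,"John Sample"\n'
    'Show C,"Unknown Person, Jane Example"\n'
    "Show D,\n"
)


def build_gender_lookup(tsv_text):
    """Map a name to 'female' or 'male' based on an 'actress'/'actor' profession."""
    lookup = {}
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        professions = row["primaryProfession"].split(",")
        if "actress" in professions:
            gender = "female"
        elif "actor" in professions:
            gender = "male"
        else:
            continue  # writers, directors, etc. carry no gender signal here
        # Keep the first match only; people who share a name are ambiguous.
        lookup.setdefault(row["primaryName"], gender)
    return lookup


def lead_gender(cast, lookup):
    """Return the looked-up gender of the first-listed cast member, or None."""
    lead = cast.split(",")[0].strip()  # assumption: first listed name is the lead
    return lookup.get(lead)


lookup = build_gender_lookup(IMDB_NAMES_TSV)
labels = {
    row["title"]: lead_gender(row["cast"], lookup)
    for row in csv.DictReader(io.StringIO(NETFLIX_CSV))
}
print(labels)
```

Titles whose lead is missing or cannot be matched come back as `None`, so they can be counted and either dropped or labeled by hand. Exact name matching is a simplification: it mislabels people who share a name and misses spelling variants between the two sources.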

She Treats, He Recruits: Analyzing Gender in Netflix’s Catalog