Day 4 and 5 of 100 Days of Data Science
Welcome back to my 100 Days of Data Science challenge. On days 4 and 5, I worked on the TMDB Box Office Prediction dataset, which is available on Kaggle.
I'll start by importing the libraries we need for this task.
import pandas as pd
## for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('dark_background')
Once you have downloaded the data from Kaggle, you will have three files. Because this is a prediction competition, you get train, test, and sample_submission files. For this project, my goal is only to analyze and visualize the data, so I am going to ignore the test.csv and sample_submission.csv files.
%time train = pd.read_csv('./data/tmdb-box-office-prediction/train.csv')
## output
CPU times: user 258 ms, sys: 132 ms, total: 389 ms
Wall time: 403 ms
id: Integer unique id of each movie
belongs_to_collection: Contains the TMDB id, name, poster, and backdrop URL of the movie's collection in JSON format.
budget: Budget of a movie in dollars. Some rows contain 0, which means the budget is unknown.
genres: Contains the names and TMDB ids of all the movie's genres in JSON format.
homepage: Contains the official URL of a movie.
imdb_id: IMDB id of a movie (string).
original_language: Two-letter code of the original language in which the movie was made.
original_title: The original title of a movie in original_language.
overview: Brief description of the movie.
popularity: Popularity of the movie.
poster_path: Poster path of a movie. You can see the full poster image by appending this path to the base URL → https://image.tmdb.org/t/p/original/
production_companies: Names and TMDB ids of all the production companies of a movie in JSON format.
production_countries: Two-letter code and full name of each production country in JSON format.
release_date: The release date of a movie in mm/dd/yy format.
runtime: Total runtime of a movie in minutes (Integer).
spoken_languages: Two-letter code and full name of each spoken language.
status: Is the movie released or rumored?
tagline: Tagline of a movie
title: English title of a movie
Keywords: TMDB Id and name of all the keywords in JSON format.
cast: TMDB id, name, character name, and gender (1 = Female, 2 = Male) of every cast member in JSON format.
crew: Name, TMDB id, and profile path of crew members in various jobs, such as Director, Writer, Art, Sound, etc.
revenue: Total revenue earned by a movie in dollars.
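Several of the column descriptions above imply a cleaning step before any analysis. Here is a minimal sketch of how that might look. The names `parse_records`, `clean_movies`, `genre_names`, `poster_url`, and the `cutoff` parameter are my own, not part of the dataset. I am also assuming the JSON-style columns are stored as stringified lists of dicts that `ast.literal_eval` can read, and that any release date later than `cutoff` is really a two-digit year from the previous century. Check both assumptions against your copy of train.csv.

```python
import ast

import pandas as pd

# Base URL taken from the poster_path description above.
POSTER_BASE_URL = "https://image.tmdb.org/t/p/original/"


def parse_records(cell):
    """Turn a stringified list of dicts into a real list (empty list if missing)."""
    if isinstance(cell, str) and cell.strip():
        return ast.literal_eval(cell)
    return []


def clean_movies(df, cutoff=None):
    """Return a copy of df with a few columns converted for analysis.

    cutoff is a hypothetical parameter: dates after it are assumed to be
    mis-parsed two-digit years. It defaults to today's date.
    """
    out = df.copy()
    cutoff = pd.Timestamp.today() if cutoff is None else pd.Timestamp(cutoff)

    # budget: 0 means "unknown", so mark it as missing instead of a real value.
    out["budget"] = out["budget"].mask(out["budget"] == 0)

    # genres: keep only the genre names from each list of records.
    out["genre_names"] = out["genres"].apply(
        lambda cell: [g["name"] for g in parse_records(cell)]
    )

    # release_date: mm/dd/yy. %y maps years 00-68 to 20xx, so an older movie
    # can land in the future; shift those dates back by a century.
    dates = pd.to_datetime(out["release_date"], format="%m/%d/%y", errors="coerce")
    in_future = dates > cutoff
    out["release_date"] = dates.where(~in_future, dates - pd.DateOffset(years=100))

    # poster_path: build the full image URL from the base URL and the path.
    out["poster_url"] = POSTER_BASE_URL + out["poster_path"].str.lstrip("/")
    return out
```

With the data loaded as above, `train = clean_movies(train)` would apply all four steps at once. The same `parse_records` helper should also work for the other JSON-style columns, such as cast, crew, and Keywords, if they are stored the same way.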
Many a time, I have seen beginners in data science skip exploratory data analysis (EDA) and jump straight into building a hypothesis function or model. In my opinion, this should not be the case.