I’ve worked with SQL for several years before I started using Pandas in Python. I eventually got used to it but I struggled with the syntax differences between them when I started.
As you learn pandas, you eventually wonder how you can implement join in pandas like in SQL. I’m going to write down about it in this article.
import pandas as pd
df_team = [
[1, "Arizona Diamondbacks"]
, [2, "Colorado Rockies"]
, [3, "Los Angeles Dodgers"]
, [4, "San Diego Padres"]
, [5, "San Francisco Giants"]
]
team_columns = ["Id", "Name"]
teams = pd.DataFrame(data=df_team, columns=team_columns)
teams
df_player = [
[1, "Cody Bellinger", 3, "OF"]
, [2, "Mookie Betts", 3, "OF"]
, [3, "Nolan Arenado", 2, "3B"]
, [4, "Trevor Story", 2, "SS"]
, [5, "Fernando Tatis Jr.", 4, "SS"]
, [6, "Buster Posey", 5, "C"]
, [7, "Starling Marte", 1, "OF"]
, [8, "Manny Machado", 4, "3B"]
]
player_columns = ["Id", "Name", "TeamId", "Position"]
players = pd.DataFrame(data=df_player, columns=player_columns)
players
#To make example easier, only 3 players
df_homerun = [
[2, 16]
, [5, 15]
, [8, 16]
]
homerun_columns = ["PlayerId", "Homerun"]
homeruns = pd.DataFrame(data=df_homerun, columns=homerun_columns)
homeruns
JOIN is SQL syntax which merges 2+ datasets (tables) into a single dataset. For example, as above, how do you define the players who belong to the Dodgers?
Firstly, you search in teams dataset and find Dodgers of Id is 3. And then, you search in players dataset with TeamId: 3.
This is not hard, but how about the home runs of the players on the Dodgers? You find all of the players in dodgers and search in the homeruns dataset with all of PlayerIds.
JOIN syntax makes this process skipped. You can display team name, player name and home run all at once.
JOIN has broadly two patterns: INNER JOIN and OUTER JOIN. Let’s see how to write each JOIN pattern in Pandas.
#python #pandas #data-science #sql