I’ve worked with SQL for several years before I started using Pandas in Python. I eventually got used to it but I struggled with the syntax differences between them when I started.

As you learn pandas, you eventually wonder how you can implement join in pandas like in SQL. I’m going to write down about it in this article.


import pandas as pd

df_team = [
  [1, "Arizona Diamondbacks"]
  , [2, "Colorado Rockies"]
  , [3, "Los Angeles Dodgers"]
  , [4, "San Diego Padres"]
  , [5, "San Francisco Giants"]
team_columns = ["Id", "Name"]
teams = pd.DataFrame(data=df_team, columns=team_columns)

Image for post


df_player = [
  [1, "Cody Bellinger", 3, "OF"]
  , [2, "Mookie Betts", 3, "OF"]
  , [3, "Nolan Arenado", 2, "3B"]
  , [4, "Trevor Story", 2, "SS"]
  , [5, "Fernando Tatis Jr.", 4, "SS"]
  , [6, "Buster Posey", 5, "C"]
  , [7, "Starling Marte", 1, "OF"]
  , [8, "Manny Machado", 4, "3B"]
player_columns = ["Id", "Name", "TeamId", "Position"]
players = pd.DataFrame(data=df_player, columns=player_columns)

Image for post


#To make example easier, only 3 players
df_homerun = [
  [2, 16]
  , [5, 15]
  , [8, 16]
homerun_columns = ["PlayerId", "Homerun"]
homeruns = pd.DataFrame(data=df_homerun, columns=homerun_columns)

Image for post


What is JOIN?

JOIN is SQL syntax which merges 2+ datasets (tables) into a single dataset. For example, as above, how do you define the players who belong to the Dodgers?

Firstly, you search in teams dataset and find Dodgers of Id is 3. And then, you search in players dataset with TeamId: 3.

This is not hard, but how about the home runs of the players on the Dodgers? You find all of the players in dodgers and search in the homeruns dataset with all of PlayerIds.

JOIN syntax makes this process skipped. You can display team name, player name and home run all at once.

JOIN has broadly two patterns: INNER JOIN and OUTER JOIN. Let’s see how to write each JOIN pattern in Pandas.

#python #pandas #data-science #sql

How to Implement JOIN in Pandas
1.10 GEEK