Discover five effective methods to merge two dataframes using Pandas, including merge()
, join()
, concat()
, append()
, combine_first()
, and update()
, a powerful tool for data manipulation and analysis in Python.
A pandas dataframe is a two-dimensional data structure represented with the help of rows and columns. Data can be stored in a dataframe in various ways, be it by importing from an Excel file, a CSV file, or a database. While importing data from such sources, we can have multiple dataframes, but wouldn't it be tedious to perform similar operations on different dataframes? Now, what do we do in such a case? Merge all of them into one, and instead of performing the same operation multiple times, we could just do it at once and get over with our task. Hence the concept of merging dataframes in Pandas is important. Let's look at all those powerful functions that would make your task easier.
The merge() is very similar to SQL join function. There are various parameters to be considered while using this function. While using the merge() function we write it as:-
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
This function is used when we have to join columns or indexes in every possible combination, be it column to column or index to column, etc. Let us discuss these parameters in more detail.
The return type for this function is a dataframe created after two objects/ dataframes are merged.
Code Example 1:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'employee': ['Gauri', 'Keshav', 'Harshit', 'Komal'],
'group': ['Chemical', 'Engineering', 'Chemical', 'HR']})
df2 = pd.DataFrame({'employee': ['Gauri', 'Keshav', 'Harshit', 'Komal'],
'hire_date': [2004, 2008, 2012, 2014]})
df1.head()
Output:-
employee group
0 Gauri Chemical
1 Keshav Engineering
2 Harshit Chemical
3 Komal HR
df2.head()
Output:-
employee hire_date
0 Gauri 2004
1 Keshav 2008
2 Harshit 2012
3 Komal 2014
df3 = pd.merge(df1, df2)
print(df3)
Output:-
employee group hire_data
0 Gauri Chemical 2004
1 Keshav Engineering 2008
2 Harshit Chemical 2012
3 Komal HR 2014
Code Example 2:
import pandas as pd
import numpy as np
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Smriti', 'Rahul', 'Atul']})
df5 = pd.merge(df3,df4)
print(df5)
Output:-
employee group hire_date supervisor
0 Keshav Engineering 2008 Rahul
1 Komal HR 2014 Atul
Code Example 3:
import pandas as pd
import numpy as np
df6 = pd.DataFrame({'group': ['Chemical', 'Chemical',
'Engineering', 'Engineering', 'HR', 'HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux',
'spreadsheets', 'organization']})
df7= pd.merge(df1, df6)
print(df7)
Output:-
employee group skills
0 Gauri Chemical math
1 Gauri Chemical spreadsheets
2 Harshit Chemical math
3 Harshit Chemical spreadsheets
4 Keshav Engineering coding
5 Keshav Engineering linux
6 Komal HR spreadsheets
7 Komal HR organization
Code Example 4:
import pandas as pd
import numpy as np
df8=pd.merge(df1, df2, on='employee')
print(df8)
Output:-
employee group hire_date
0 Gauri Chemical 2004
1 Keshav Engineering 2008
2 Harshit Chemical 2012
3 Komal HR 2014
Code Example 5:
import pandas as pd
import numpy as np
df3 = pd.DataFrame({'name': ['Gauri', 'Keshav', 'Harshit', 'Komal'],
'salary': [70000, 80000, 120000, 90000]})
df9=pd.merge(df1, df3, left_on="employee", right_on="name")
print(df9)
Output:-
employee group name salary
0 Gauri Chemical Gauri 70000
1 Keshav Engineering Keshav 80000
2 Harshit Chemical Harshit 120000
3 Komal HR Komal 90000
Code Example 6:
import pandas as pd
import numpy as np
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
print(df1a)
print(df2a)
Output:-
group
employee
Gauri Chemical
Keshav Engineering
Harshit Chemical
Komal HR
hire_date
employee
Gauri 2004
Keshav 2008
Harshit 2012
Komal 2014
Code Example 7:
import pandas as pd
import numpy as np
df3a = pd.merge(df1a, df2a, left_index=True, right_index=True)
print(df3a)
Output:-
group hire_date
employee
Gauri Chemical 2004
Keshav Engineering 2008
Harshit Chemical 2012
Komal HR 2014
Code Example 8:
import pandas as pd
import numpy as np
df10 = pd.merge(df1,df2,indicator=True)
print(df10)
Output:-
employee group hire_date _merge
0 Gauri Chemical 2004 both
1 Keshav Engineering 2008 both
2 Harshit Chemical 2012 both
3 Komal HR 2014 both
Code Example 9:
import pandas as pd
import numpy as np
df10 = pd.merge(df1,df2,how='inner')
print(df10)
Output:-
employee group hire_date
0 Gauri Chemical 2004
1 Keshav Engineering 2008
2 Harshit Chemical 2012
3 Komal HR 2014
Code Example 10:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': ['s', 't', 'u', 's'],
'V': [10, 20, 30, 50]})
df2 = pd.DataFrame({'B': ['s', 't', 'u', 's'],
'V': [50, 60, 70, 80]})
df10=pd.merge(df1,df2, left_on='A', right_on='B',suffixes=('_l', '_r'))
print(df10)
Output:-
A V_l B V_r
0 s 10 s 50
1 s 10 s 80
2 s 50 s 50
3 s 50 s 80
4 t 20 t 60
5 u 30 u 70
Code Example 11:
import pandas as pd
import numpy as np
df10 = pd.merge(df1,df2,sort=True)
print(df10)
Output:-
employee group hire_date
0 Gauri Chemical 2004
1 Harshit Chemical 2012
2 Keshav Engineering 2008
3 Komal HR 2014
This method is used to join columns with other dataframe either on a column or index. While using the join() function we write it as:-
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False).
The merge() and join() functions perform similar operations, but the parameters taken into consideration are very few as compared to the merge() function. Let us look into each parameter in detail, similar to how we did for the merge() function.
This function's return type is a dataframe with columns from both data frames used.
Code Example 1:
The df1 and df2 are the same as mentioned in the previous examples.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'employee': ['Gauri', 'Keshav', 'Harshit', 'Komal'],
'group': ['Chemical', 'Engineering', 'Chemical', 'HR']})
df2 = pd.DataFrame({'employee': ['Gauri', 'Keshav', 'Harshit', 'Komal'],
'hire_date': [2004, 2008, 2012, 2014]})
print(df1.join(df2, lsuffix='_1', rsuffix='_2'))
Output:-
employee_1 group employee_2 hire_date
0 Gauri Chemical Gauri 2004
1 Keshav Engineering Keshav 2008
2 Harshit Chemical Harshit 2012
3 Komal HR Komal 2014
Code Example 2:
import pandas as pd
import numpy as np
print(df1.join(df2.set_index('employee'), on='employee'))
Output:-
employee group hire_date
0 Gauri Chemical 2004
1 Keshav Engineering 2008
2 Harshit Chemical 2012
3 Komal HR 2014
Code Example 3:
import pandas as pd
import numpy as np
print(df1.join(df2.set_index('employee'), on='employee', how='inner'))
Output:-
employee group hire_date
0 Gauri Chemical 2004
1 Keshav Engineer 2008
2 Harshit Chemical 2012
3 Komal HR 2014
Code Example 4:
import pandas as pd
import numpy as np
print(df1.join(df2.set_index('employee'), on='employee', how='inner', sort=True))
Output:-
employee group hire_date
0 Gauri Chemical 2004
2 Harshit Chemical 2012
1 Keshav Engineering 2008
3 Komal HR 2014
This function is used to append rows of other data frames to the given dataframe. In contrast, only those columns not already present in the calling dataframe are added in appending columns. It isn’t used after version 1.4.0 instead, we use the concat() function in the advanced versions of pandas.
Code Example 1:
import pandas as pd
import numpy as np
print(df1.append(df2))
Output:-
employee group hire_date
0 Gauri Chemical NaN
1 Keshav Engineering NaN
2 Harshit Chemical NaN
3 Komal HR NaN
0 Gauri NaN 2004.0
1 Keshav NaN 2008.0
2 Harshit NaN 2012.0
3 Komal NaN 2014.0
Code Example 2:
import pandas as pd
import numpy as np
print(df1.append(df2, ignore_index=True))
Output:-
employee group hire_date
0 Gauri Chemical NaN
1 Keshav Engineering NaN
2 Harshit Chemical NaN
3 Komal HR NaN
4 Gauri NaN 2004.0
5 Keshav NaN 2008.0
6 Harshit NaN 2012.0
7 Komal NaN 2014.0
Code Example 3:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'), index=['x', 'y'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'), index=['x', 'y'])
print(df.append(df2))
Output:-
A B
x 1 2
y 3 4
x 5 6
y 7 8
Code Example 4:
import pandas as pd
import numpy as np
print(df.append(df2, ignore_index=True))
Output:-
A B
0 1 2
1 3 4
2 5 6
3 7 8
Code Example 5:
import pandas as pd
import numpy as np
print(df.append(df2, ignore_index=True, verify_integrity=True))
Output:-
A B
0 1 2
1 3 4
2 5 6
3 7 8
This function is used to merge two dataframes across rows or columns. Among the various available parameters, we have an axis whose value can be 0 or 1. When the axis = 0 the data frames are concatenated on the basis of rows, whereas when the axis = 1, they are concatenated on the basis of columns. While using the concat() function we write it as:-
pandas.concat(objs, axis=0, join='outer',ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
Let us see how all these parameters contribute to the Concat function.
The return type for this function is a dataframe when the concatenation is done along the columns. Also, if the concatenation happens between a series and a dataframe, a dataframe is returned.
Code Example 1:
import pandas as pd
import numpy as np
s = pd.Series([1,2,3])
df = pd.DataFrame()
df = pd.concat([df, s], axis=1)
print(df)
Output:-
0 0 1 1 2 2 3
Code Example 2:
import pandas as pd
import numpy as np
d1 = pd.Series(['1', '2'])
d2 = pd.Series(['4', '3'])
print(pd.concat([d1, d2]))
Output:-
0 1
1 2
0 4
1 3
dtype: object
Code Example 3:
import pandas as pd
import numpy as np
print(pd.concat([d1, d2], sort=True))
Output:-
0 1
1 2
0 4
1 3
dtype: object
Code Example 4:
import pandas as pd
import numpy as np
print(pd.concat([d1, d2], ignore_index=True))
Output:-
0 1
1 2
2 4
3 3
dtype: object
Code Example 5:
import pandas as pd
import numpy as np
pd.concat([d1, d2], keys=['d1', 'd2'])
Output:-
d1 0 1
1 2
d2 0 4
1 3
dtype: object
Code Example 6:
import pandas as pd
import numpy as np
pd.concat([d1, d2], keys=['d1', 'd2'],
names=['Series', 'Row'])
Output:-
Series Row
d1 0 1
1 2
d2 0 c
1 d
dtype: object
Code Example 7:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'X': [2, 0], 'Y': [4, 4]})
df2 = pd.DataFrame({'X': [1, 1], 'Y': [3, 3]})
print(pd.concat([df1,df2]))
Output:-
X Y
0 2 4
1 0 4
0 1 3
1 1 3
This function can help update a dataframe with null values and non-null values taken from other dataframe. It is nothing but the union of two data frames provided. The return type for this function is a dataframe.
The update() function is used when data modification in a dataframe is required from another dataframe.
DataFrame.update(other, join='left', overwrite=True, filter_func=None, errors='ignore')
We will now see the parameters that are used with the update function.
The return type is None, as it directly updates the dataframe provided rather than creating and returning a new one.
Code Example 1:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'P': [None, 0], 'Y': [None, 4]})
df2 = pd.DataFrame({'P': [4, 1], 'Y': [5, 3]})
df1.combine_first(df2)
Output:-
P Y
0 4.0 5.0
1 0.0 4.0
Code Example 2:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X': [10, 20, 30],
'Y': [123, 456, 789]})
n_df = pd.DataFrame({'X': [41, 52, 63],
'Y': [47, 38, 98]})
df.update(n_df)
print(df)
Output:-
X Y
0 41 47
1 52 38
2 63 98
We can see that the update function modifies the first dataframe df with the values in the other dataframe n_df and returns the df dataframe instead of creating another one, so the return type is None for the update() function.
This article dealt with various powerful functions on how to merge two dataframes in pandas. We saw various functions like merge(), join(), concat(), append(), combine_first(), and update(). Now, how do we decide when to use which function()?
In the end, it is all about experimenting with your data. Try hands-on with all these functions and see if you get an error or the desired result. The better you know your data, the better results you will achieve. Keep Experimenting, Keep Learning.
#pandas #python