5 Ways to Merge Two Dataframes in Pandas

Discover five effective methods to merge two dataframes using Pandas, including merge(), join(), concat(), append(), combine_first(), and update(), a powerful tool for data manipulation and analysis in Python.

Introduction to Merge Two Dataframes Pandas

A pandas dataframe is a two-dimensional data structure represented with the help of rows and columns. Data can be stored in a dataframe in various ways, be it by importing from an Excel file, a CSV file, or a database. While importing data from such sources, we can have multiple dataframes, but wouldn't it be tedious to perform similar operations on different dataframes? Now, what do we do in such a case? Merge all of them into one, and instead of performing the same operation multiple times, we could just do it at once and get over with our task. Hence the concept of merging dataframes in Pandas is important. Let's look at all those powerful functions that would make your task easier.

Method 1: Using merge()

The merge() is very similar to SQL join function. There are various parameters to be considered while using this function. While using the merge() function we write it as:-

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None) 

This function is used when we have to join columns or indexes in every possible combination, be it column to column or index to column, etc. Let us discuss these parameters in more detail.

  • right:- This is the object we will merge with.
  • how:- This contains five different parameters to choose from the right, left, inner, and outer cross. It has the default value “inner.” These parameters denote the type of merge to be performed, for example, left join, right join, inner join, etc. It is similar to the joins we have studied in SQL.
  • on:- This specifies the column or index on which the merge is supposed to happen. If the value for on is None, the dataframe will be merged based on columns in both available dataframes.
  • left_on:- When this parameter is selected columns or indexes are merged in the left dataframe.
  • right_on:-When this parameter is selected columns or indexes are merged in the right dataframe.
  • left_index:- It takes in bool values i.e True/False. We use the index from the left dataframe as the merge keys.
  • right_index:- It is similar to left_index. The only difference is that we use the index from the right dataframe as the merge key.
  • sort:- It also takes in bool values and the main function is to sort the merge keys lexicographically in the output dataframe. If the parameter value is False the order of the merge keys depends on what type of merge is being performed.
  • Suffixes:- It is the sequence of length two. The values are of string datatype and indicate the suffix to be added to the overlapping column names on the left and right respectively after the dataframes are merged. Its default value is (“_x”, “_y”).
  • copy:- It takes in bool values. The default value is True. Then, it creates a copy of the resulting dataframe.
  • indicator:- It takes either bool values or string values. If the parameter value is True, it adds a column to the resultant dataframe named “_merge” that contains information on the source of each row. We can change the name of the column by providing a different name as a string value to the parameter.
  • validate:- If the value is None, it checks what type of merge operation has been performed. For example, one-to-one, one-to-many, many-to-one, or many-to-many. In the one-to-one merge, the merge keys are unique in both datasets, and so on.

The return type for this function is a dataframe created after two objects/ dataframes are merged.

Code Example 1:


import pandas as pd
import numpy as np

df1 = pd.DataFrame({'employee': ['Gauri', 'Keshav', 'Harshit', 'Komal'],
                    'group': ['Chemical', 'Engineering', 'Chemical', 'HR']})
                    
df2 = pd.DataFrame({'employee': ['Gauri', 'Keshav', 'Harshit', 'Komal'],
                    'hire_date': [2004, 2008, 2012, 2014]})

df1.head()

Output:-

	employee	group
0	Gauri	Chemical
1	Keshav	Engineering
2	Harshit	Chemical
3	Komal	HR

df2.head()

Output:-
	employee hire_date
0	Gauri	    2004
1	Keshav	    2008
2	Harshit	    2012
3	Komal	    2014
              
df3 = pd.merge(df1, df2)
print(df3) 

Output:-


     employee        group	     hire_data
0	Gauri	    Chemical	        2004
1	Keshav	   Engineering         2008
2	Harshit	    Chemical	        2012
3	Komal	        HR	        2014 

Code Example 2:

import pandas as pd
import numpy as np

df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Smriti', 'Rahul', 'Atul']})
df5 = pd.merge(df3,df4)
print(df5)

Output:-


      employee	        group	     hire_date	    supervisor
0	Keshav	    Engineering       2008	        Rahul
1	Komal	        HR	       2014	        Atul

Code Example 3:

import pandas as pd
import numpy as np

df6 = pd.DataFrame({'group': ['Chemical', 'Chemical',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
df7= pd.merge(df1, df6)
print(df7)

Output:-

	employee group	       skills
0	Gauri	Chemical	math
1	Gauri	Chemical	spreadsheets
2	Harshit	Chemical	math
3	Harshit	Chemical	spreadsheets
4	Keshav	Engineering	coding
5	Keshav	Engineering	linux
6	Komal	HR	    spreadsheets
7	Komal	HR	    organization

Code Example 4:

import pandas as pd
import numpy as np

df8=pd.merge(df1, df2, on='employee')
print(df8)

Output:-

	employee	group	    hire_date
0	Gauri	    Chemical	    2004
1	Keshav	    Engineering    2008
2	Harshit	    Chemical	    2012
3	Komal	    HR	            2014

Code Example 5:

import pandas as pd
import numpy as np

df3 = pd.DataFrame({'name': ['Gauri', 'Keshav', 'Harshit', 'Komal'],
                    'salary': [70000, 80000, 120000, 90000]})
df9=pd.merge(df1, df3, left_on="employee", right_on="name")
print(df9)

Output:-


    employee	 group	        name	    salary
0	Gauri	 Chemical	Gauri	    70000
1	Keshav	 Engineering	Keshav	    80000
2	Harshit	 Chemical	Harshit	    120000
3	Komal	 HR	        Komal       90000

Code Example 6:

import pandas as pd
import numpy as np

df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
print(df1a)
print(df2a)

Output:-

              group
employee           
Gauri        Chemical
Keshav    Engineering
Harshit      Chemical
Komal              HR
          
              hire_date
employee           
Gauri          2004
Keshav         2008
Harshit        2012
Komal          2014

Code Example 7:

import pandas as pd
import numpy as np

df3a = pd.merge(df1a, df2a, left_index=True, right_index=True)
print(df3a)

Output:-

	        group	hire_date
employee		
Gauri	    Chemical	    2004
Keshav	    Engineering    2008
Harshit	    Chemical	    2012
Komal	    HR	            2014

Code Example 8:

import pandas as pd
import numpy as np

df10 = pd.merge(df1,df2,indicator=True)
print(df10)

Output:-


    employee	group	    hire_date	_merge
0	Gauri	Chemical	2004	both
1	Keshav	Engineering	2008	both
2	Harshit	Chemical	2012	both
3	Komal	HR	        2014	both

Code Example 9:

import pandas as pd
import numpy as np

df10 = pd.merge(df1,df2,how='inner')
print(df10)

Output:-

	employee group	hire_date
0	Gauri	Chemical	2004
1	Keshav	Engineering	2008
2	Harshit	Chemical	2012
3	Komal	HR	        2014

Code Example 10:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': ['s', 't', 'u', 's'],
                    'V': [10, 20, 30, 50]})
df2 = pd.DataFrame({'B': ['s', 't', 'u', 's'],
                    'V': [50, 60, 70, 80]})

df10=pd.merge(df1,df2, left_on='A', right_on='B',suffixes=('_l', '_r'))
print(df10)

Output:-

   A   V_l B   V_r
0  s   10  s   50
1  s   10  s   80
2  s   50  s   50
3  s   50  s   80
4  t   20  t   60
5  u   30  u   70

Code Example 11:

import pandas as pd
import numpy as np

df10 = pd.merge(df1,df2,sort=True)
print(df10)

Output:-


      employee	group	    hire_date
0	Gauri	Chemical	2004
1	Harshit	Chemical	2012
2	Keshav	Engineering	2008
3	Komal	HR	        2014


Method 2: Using join()

This method is used to join columns with other dataframe either on a column or index. While using the join() function we write it as:-

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False).

The merge() and join() functions perform similar operations, but the parameters taken into consideration are very few as compared to the merge() function. Let us look into each parameter in detail, similar to how we did for the merge() function.

  • other:- The index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame.
  • on:- It takes in a string or a list of string values containing the column or index name on which the join is supposed to happen with respect to the other dataframe. If no value is provided, index-to-index join takes place.
  • how:- This contains four different parameters to choose from right, left, inner, and outer. It has the default value “left”. For example, while using the left value we use the index of the calling dataframe whereas while using the inner, we form the intersection of indexes of both dataframes without altering the order of the calling dataframe.
  • lsuffix:- It takes in string value by default. It provides the suffix value to use with the left dataframe’s overlapping column.
  • rsuffix:- It provides the suffix value for using the right dataframe’s overlapping column.
  • sort:- It takes in a bool value, the default value being False. It helps sort the resultant dataframe in a lexicographical manner using the join key. If the parameter value is False, the order of the join keys depends on what type of join is being performed.

This function's return type is a dataframe with columns from both data frames used.

Code Example 1:

The df1 and df2 are the same as mentioned in the previous examples.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'employee': ['Gauri', 'Keshav', 'Harshit', 'Komal'],
                    'group': ['Chemical', 'Engineering', 'Chemical', 'HR']})
                    
df2 = pd.DataFrame({'employee': ['Gauri', 'Keshav', 'Harshit', 'Komal'],
                    'hire_date': [2004, 2008, 2012, 2014]})
print(df1.join(df2, lsuffix='_1', rsuffix='_2'))

Output:-

 employee_1        group     employee_2  hire_date
0      Gauri     Chemical      Gauri       2004
1     Keshav  Engineering     Keshav       2008
2    Harshit     Chemical    Harshit       2012
3      Komal           HR      Komal       2014

Code Example 2:

import pandas as pd
import numpy as np

print(df1.join(df2.set_index('employee'), on='employee'))

Output:-

	employee     group	    hire_date
0	Gauri	    Chemical	    2004
1	Keshav	    Engineering    2008
2	Harshit	    Chemical	    2012
3	Komal	      HR	    2014

Code Example 3:

import pandas as pd
import numpy as np

print(df1.join(df2.set_index('employee'), on='employee', how='inner'))

Output:-

	employee    group	   hire_date
0	Gauri	    Chemical	    2004
1	Keshav	    Engineer       2008
2	Harshit	    Chemical	    2012
3	Komal	    HR	            2014

Code Example 4:

import pandas as pd
import numpy as np

print(df1.join(df2.set_index('employee'), on='employee', how='inner', sort=True))

Output:-

  employee        group  hire_date
0    Gauri     Chemical       2004
2  Harshit     Chemical       2012
1   Keshav  Engineering       2008
3    Komal           HR       2014

Method 3: Using append()

This function is used to append rows of other data frames to the given dataframe. In contrast, only those columns not already present in the calling dataframe are added in appending columns. It isn’t used after version 1.4.0 instead, we use the concat() function in the advanced versions of pandas.

Code Example 1:

import pandas as pd
import numpy as np

print(df1.append(df2))

Output:-

      employee   group     hire_date
0    Gauri     Chemical        NaN
1   Keshav  Engineering        NaN
2  Harshit     Chemical        NaN
3    Komal           HR        NaN
0    Gauri          NaN     2004.0
1   Keshav          NaN     2008.0
2  Harshit          NaN     2012.0
3    Komal          NaN     2014.0

Code Example 2:

import pandas as pd
import numpy as np

print(df1.append(df2, ignore_index=True))

Output:-

  employee        group  hire_date
0    Gauri     Chemical        NaN
1   Keshav  Engineering        NaN
2  Harshit     Chemical        NaN
3    Komal           HR        NaN
4    Gauri          NaN     2004.0
5   Keshav          NaN     2008.0
6  Harshit          NaN     2012.0
7    Komal          NaN     2014.0

Code Example 3:

import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'), index=['x', 'y'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'), index=['x', 'y'])
print(df.append(df2))

Output:-

   A  B
x  1  2
y  3  4
x  5  6
y  7  8

Code Example 4:

import pandas as pd
import numpy as np

print(df.append(df2, ignore_index=True))

Output:-

   A  B
0  1  2
1  3  4
2  5  6
3  7  8

Code Example 5:

import pandas as pd
import numpy as np

print(df.append(df2, ignore_index=True, verify_integrity=True))

Output:-

  A  B
0  1  2
1  3  4
2  5  6
3  7  8

Method 4: Using concat()

This function is used to merge two dataframes across rows or columns. Among the various available parameters, we have an axis whose value can be 0 or 1. When the axis = 0 the data frames are concatenated on the basis of rows, whereas when the axis = 1, they are concatenated on the basis of columns. While using the concat() function we write it as:-

pandas.concat(objs, axis=0, join='outer',ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

Let us see how all these parameters contribute to the Concat function.

  • objs:- It takes in a sequence or mapping of Series or DataFrame objects as the parameter values.
  • axis:- The default value for this is 0, where axis = 0 means index based and axis = 1 means column based.
  • join:- This defines how the concatenation will occur, for example, outer or inner join. The default value is “outer”.
  • ignore_index:- It takes in bool values. If the value is True we tend to ignore the index values along the axis it is concatenated. The default value is False.
  • keys:- This helps us construct hierarchical indexing at the outermost level using the sequence of keys passed as the parameter values.
  • names:- It contains names of the levels in the hierarchical index obtained.
  • verify_integrity:- It takes in bool values and checks whether the new concatenated axis contains redundant values or not. The default value for this parameter is False.
  • sort:- It takes in bool values, the default value being False. During the outer join, it is responsible for sorting the non-concatenated axis if they are not aligned, whereas it has no role during the inner join as the axes are already sorted.
  • copy:- It creates the copy of the resultant object. If the value is set to False, no data is copied.

The return type for this function is a dataframe when the concatenation is done along the columns. Also, if the concatenation happens between a series and a dataframe, a dataframe is returned.

Code Example 1:

import pandas as pd
import numpy as np

s = pd.Series([1,2,3])
df = pd.DataFrame()
df = pd.concat([df, s], axis=1)
print(df)

Output:-

   0 0  1 1  2 2  3

Code Example 2:

import pandas as pd
import numpy as np

d1 = pd.Series(['1', '2'])
d2 = pd.Series(['4', '3'])
print(pd.concat([d1, d2]))

Output:-
0    1
1    2
0    4
1    3
dtype: object

Code Example 3:

import pandas as pd
import numpy as np

print(pd.concat([d1, d2], sort=True))

Output:-
0    1
1    2
0    4
1    3
dtype: object

Code Example 4:

import pandas as pd
import numpy as np

print(pd.concat([d1, d2], ignore_index=True))

Output:-
0    1
1    2
2    4
3    3
dtype: object

Code Example 5:

import pandas as pd
import numpy as np

pd.concat([d1, d2], keys=['d1', 'd2'])

Output:-
d1  0    1
    1    2
d2  0    4
    1    3
dtype: object

Code Example 6:

import pandas as pd
import numpy as np

pd.concat([d1, d2], keys=['d1', 'd2'],
          names=['Series', 'Row'])

Output:-
Series  Row
d1      0      1
        1      2
d2      0      c
        1      d
dtype: object

Code Example 7:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'X': [2, 0], 'Y': [4, 4]})
df2 = pd.DataFrame({'X': [1, 1], 'Y': [3, 3]})
print(pd.concat([df1,df2]))

Output:-

        X      Y
0	2	4
1	0	4
0	1	3
1	1	3

Method 5: Using Combine_first() and Update():-

This function can help update a dataframe with null values and non-null values taken from other dataframe. It is nothing but the union of two data frames provided. The return type for this function is a dataframe.

The update() function is used when data modification in a dataframe is required from another dataframe.

DataFrame.update(other, join='left', overwrite=True, filter_func=None, errors='ignore')

We will now see the parameters that are used with the update function.

  • other:- The update() functions only when at least one common row or column label is present concerning the original DataFrame.
  • join:- Only the left join can be implemented in this function.
  • overwrite:- It takes boolean values. When the value is set to True, it will overwrite not null values with the help of the other dataframe, whereas when the value is set to False, only NA values are updated. The default value for this is True.
  • filter_func:- This function can choose if it wants to replace any value other than NA. All those values that are to be updated will return True for this function.
  • errors:- It can have two values raised or ignored. The default is “ignore”. If the value is set to “raise” it will raise a “Value Error” if the NA value location overlaps in both dataframes.

The return type is None, as it directly updates the dataframe provided rather than creating and returning a new one.

Code Example 1:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'P': [None, 0], 'Y': [None, 4]})
df2 = pd.DataFrame({'P': [4, 1], 'Y': [5, 3]})
df1.combine_first(df2)

Output:-

         P	     Y
0	4.0	    5.0
1	0.0	    4.0

Code Example 2:

import pandas as pd
import numpy as np

df = pd.DataFrame({'X': [10, 20, 30],
                   'Y': [123, 456, 789]})
n_df = pd.DataFrame({'X': [41, 52, 63],
                       'Y': [47, 38, 98]})
df.update(n_df)
print(df)

Output:-


   X    Y
0  41  47
1  52  38
2  63  98

We can see that the update function modifies the first dataframe df with the values in the other dataframe n_df and returns the df dataframe instead of creating another one, so the return type is None for the update() function.

Conclusion

This article dealt with various powerful functions on how to merge two dataframes in pandas. We saw various functions like merge(), join(), concat(), append(), combine_first(), and update(). Now, how do we decide when to use which function()?

  • merge()- This function can be used in almost all cases, but if a certain condition is given, we might have to bring in some other function in use.
  • use the concat() function to merge two data frames row-wise.
  • Found a missing value in your dataframe? Use combine_first() or update() function.
  • Planning to add a new row in your dataframe while merging? We have the append() function.

In the end, it is all about experimenting with your data. Try hands-on with all these functions and see if you get an error or the desired result. The better you know your data, the better results you will achieve. Keep Experimenting, Keep Learning.

#pandas #python 

5 Ways to Merge Two Dataframes in Pandas
1.05 GEEK