Image for post

Photo by Bryson Hammer on Unsplash

Data in the real world exists as a collection of distinct information. More often than not this information is organised as tables. Each table collecting data on a different aspect of the ecosystem. For example, a school might organise its data into the following tables

  1. Teacher information
  2. Student attendance
  3. Student marks
  4. Salary information, etc…

We can clearly see that tables 2 and 3 will have something in common (student IDs for one), just like tables 1 and 4 (teacher IDs for instance).

Such overlap of information across tables is quite common and becomes the basis for a join. **A join **lets you combine information from different tables. In R we have 6 types of joins. Let’s explore each of them with an example. But first let’s look at the data we are going to be working with.

Dummy Data

We will visit the world of Hogwarts and pick up some data.

library(dplyr)
citizen_df <- tribble(
  ~Person, ~Citizenship,
  "Harry",   "UK",
  "Harry",   "USA",
  "Ron",   "India",
  "Ron",   "Pakistan",
  "Hermoine",   "UK",
  "Hermoine",   "USA",
  "Hermoine",   "Russia",
  "Dumbeldore", "Poland",
  "Dumbeldore", "Singapore"
)
age_df <- tribble(
  ~Wizard, ~Age,
  "Harry",   21,
  "Ron",   25,
  "Hermoine",   31,
  "Hagrid", 45
)

Age Data

Let’s get some wizards’ and witches’ age data.

Image for post

Image for post

Citizenship Data

Let us also figure out their citizenship status.

Image for post

Image for post

Joins

We will explore the following joins

  1. Inner Join
  2. Left Join
  3. Right Join
  4. Full Join
  5. Anti Join
  6. Semi Join

When we join a table X with a table Y there needs to be a column that is common to both. This is called the **key **by which the two tables are joined. Based on this **key **column we join the two tables. For example in the wizard data above both tables contain a column that contains the names of the wizards. In the age table it is called wizard column and in the citizenship table it is called the person column. Let’s see how to join.

Inner Joins

Inner join is the most commonly used join. It keeps those rows which contain data that is common to both the tables. e.g. if we perform an inner join on the above two tables we will loose the rows containing **Hagrid **and Dumbeldore.

Diagrammatically, it looks like this. Note that the final table contains 3 columns instead of two but preserves only the common rows.

Image for post

Image for post

Inner Join

After the join we can easily identify both the age and citizenship of our wizards and witches from the same table.

Let’s code this up. Note the use of **by(), **which is doneto explicitly specify the column that is common to both the tables. This is essential if the common column has different names in the two tables.

inner_join_df <- inner_join(age_df, citizen_df, 
                             by = c("Wizard" =  "Person"))

Image for post

Image for post

Outer Joins

There are 3 types of outer joins —** left, right and full join**. An inner join keeps information that is common to both the tables, but an outer join retains information that appears in at least one of the two tables. In that sense, they are akin to intersection and union respectively.

Left Join

The left join keeps all the rows of the left table. If it is unable to find a corresponding row in the right table then it fills it as NA.

Image for post

Image for post

left_join_df <- left_join(age_df, citizen_df, 
                           by = c("Wizard" = "Person"))

Image for post

Image for post

Notice how Hagrid is left without any citizenship information as the **right table **has no row corresponding to Hagrid.

Right Join

Similarly, the right join keeps all the rows of the right table. If it is unable to find a corresponding row in the left table then it fills it as NA.

Image for post

Image for post

#data-science #web-development #programming #technology #coding

Beginner’s Guide to Using Joins in R
1.15 GEEK