As data scientists, we spend most of our time on feature engineering. To build features that cover the target from multiple dimensions, we usually need to extract information from several data sources, and I find table joining to be a particularly useful technique for merging those features into a single wide table. Table joining is not difficult to understand, and I notice that I always follow a similar thinking process. Here, I will use a dummy example to illustrate my typical routine for writing a table join and running a quick quality check on the results.

1. A Quick Look At All The Tables

In this example, we will predict whether a list of suspicious credit card users are fraudulent. For easier visualisation, I will use a tiny dataset: say we have 5 credit card users in total, and 2 of them are flagged as suspicious. We need to extract information about the suspicious users 10001 and 10003 from the profile table and the transaction table, and generate a feature table for the downstream prediction model.
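Since the actual tables are not shown here, the following is a minimal sketch of how such toy tables could be set up in PySpark. The column names, ages, and transaction amounts are all placeholders I made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()

# Profile table: all 5 credit card users with their ages (values invented).
profile_table = spark.createDataFrame(
    [(10001, 34), (10002, 51), (10003, 27), (10004, 45), (10005, 62)],
    ["userid", "age"],
)

# Transaction table: one row per transaction (amounts invented).
transaction_table = spark.createDataFrame(
    [(10001, 120.0), (10001, 43.5), (10002, 9.9), (10003, 300.0), (10005, 75.2)],
    ["userid", "amount"],
)

# The 2 suspicious users we want to build features for.
suspicious_user_list = spark.createDataFrame(
    [(10001,), (10003,)],
    ["userid"],
)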

2. Joining On The Profile Table

The first important decision we need to make when designing a table join is which joining method to use; the choices include left join, right join, inner join, and outer join. Here, the profile table holds the age information of all 5 credit card users, but we are only interested in getting the age values for the two suspicious users. We can think of the suspicious user list as our base table, since we only need to extract the relevant information for the two users on that list. Thus, we can use either suspicious_user_list.join(profile_table, 'userid', 'left') or profile_table.join(suspicious_user_list, 'userid', 'right') (in PySpark, the second argument is the key column and the third is the join type). Both ways ensure that the suspicious user list acts as the base table, and the profile information of users 10002, 10004, and 10005 is discarded from the output feature table.
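Using the toy tables sketched above, the two equivalent joins would look like this; the expected output is the two suspicious users with their ages attached.

# Left join: the suspicious list on the left acts as the base table.
feature_table = suspicious_user_list.join(profile_table, on="userid", how="left")

# Equivalent right join: profile table on the left, but the right-hand
# suspicious list still determines which rows survive.
feature_table_alt = profile_table.join(suspicious_user_list, on="userid", how="right")

feature_table.show()
# +------+---+
# |userid|age|
# +------+---+
# | 10001| 34|
# | 10003| 27|
# +------+---+
# (row order may vary)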

Another decision we need to make when designing a table join is the key column. It is quite straightforward here: we should use the userid column, as it uniquely identifies a user in both the suspicious user list and the profile table. However, things are not always so straightforward. Sometimes, multiple columns need to be used together as the key so that a row can be uniquely identified. For example, if we observe a user on different dates, then the userid together with the observation date uniquely identifies a row.
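To make the composite-key case concrete, here is a hypothetical sketch where users are observed daily; the obs_date column name and the dates are my own assumptions, not part of the original example.

# One row per user per observation date (data invented for illustration).
daily_profile = spark.createDataFrame(
    [(10001, "2023-01-01", 34), (10001, "2023-01-02", 34)],
    ["userid", "obs_date", "age"],
)
daily_suspicious = spark.createDataFrame(
    [(10001, "2023-01-01")],
    ["userid", "obs_date"],
)

# Pass a list of column names to join on a composite key.
daily_features = daily_suspicious.join(
    daily_profile, on=["userid", "obs_date"], how="left"
)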

When we write a table join, we should always run a quality check on the results. Before the join, we count the suspicious user list; the row count is 2. After the join, we count the output feature table. If we do not get the same row count of 2, something must have gone wrong in the join. For example, if the resulting row count is 5, it is highly likely that the profile table was used as the base table, so all 5 users were kept. A count larger than the base table can also mean the key column is not unique in the joined table, which duplicates rows.
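In code, this check can be as simple as comparing counts before and after; the sketch below assumes the feature_table built earlier.

# The base table's row count should survive the join unchanged,
# provided userid is unique in profile_table.
n_before = suspicious_user_list.count()   # 2
n_after = feature_table.count()

assert n_after == n_before, (
    f"Join changed the row count: {n_before} -> {n_after}; "
    "check the base table and the uniqueness of the join key."
)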

#sql #feature-engineering #spark #python #data-science
