Combining different data sources is a time suck!

Combining data from different sources can be a big time suck for data scientists. d6tjoin is a Python library that lets you join pandas DataFrames quickly and efficiently.

Coauthored with Haijing Li, Data Analyst in Financial Services, MS Business Analytics@Columbia University.

Example

I have made up this example to illustrate what d6tjoin is capable of.

Suppose several companies’ stocks have held my attention for a while, and I have come up with a strategy to score those companies’ performance on a 1-5 point scale. Backtesting on historical data will help me evaluate whether stock prices really reflect those scores and decide how I want to trade on them. The information I need for backtesting is contained in the following two datasets: df_price contains historical stock prices for 2019, and df_score contains scores that I update regularly.

df_price

df_score

To prepare for backtesting, I need to merge the “score” column into df_price. Obviously, ticker name and date should be the merge keys. But there are two problems: 1. Values in the “ticker” column of df_price and of df_score are not identical; 2. Scores were recorded on a monthly basis, and I want each row in df_price to be assigned the most recent score, on the assumption that the next score is not available until the next update date.
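To make the two matching rules concrete, here is a minimal sketch in plain Python, using only the standard library rather than d6tjoin itself. The rows, ticker formats, and helper names (`match_ticker`, `latest_score`) are made up for illustration and are not taken from the actual datasets. It pairs each price ticker with its most similar score ticker, then picks the latest score dated on or before the price date.

```python
import bisect
import difflib
from datetime import date

# Hypothetical rows; tickers, dates, and values are made up for illustration.
price_rows = [
    ("AAPL-US", date(2019, 1, 15), 153.0),
    ("AAPL-US", date(2019, 2, 20), 172.0),
    ("MSFT-US", date(2019, 2, 20), 107.0),
]
score_rows = [
    ("AAPL", date(2019, 1, 1), 3),
    ("AAPL", date(2019, 2, 1), 4),
    ("MSFT", date(2019, 2, 1), 5),
]


def match_ticker(ticker, candidates, cutoff=0.6):
    """Return the most similar candidate ticker, or None if nothing is close."""
    hits = difflib.get_close_matches(ticker, candidates, n=1, cutoff=cutoff)
    return hits[0] if hits else None


def latest_score(history, as_of):
    """Return the score from the latest update on or before as_of, else None."""
    dates = [d for d, _ in history]
    idx = bisect.bisect_right(dates, as_of)
    return history[idx - 1][1] if idx else None


# Group score updates by ticker, sorted by update date.
score_history = {}
for ticker, updated, score in sorted(score_rows, key=lambda r: (r[0], r[1])):
    score_history.setdefault(ticker, []).append((updated, score))

# Problem 1: fuzzy-match the ticker. Problem 2: take the most recent score.
merged = []
for ticker, day, price in price_rows:
    matched = match_ticker(ticker, list(score_history))
    score = latest_score(score_history[matched], day) if matched else None
    merged.append((ticker, day, price, matched, score))
```

Each tuple in `merged` holds the original price row plus the matched score ticker and the score in effect on that day. A price dated before the first score update gets `None`, which mirrors the assumption that a score is not known until its update date.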

