Recently I came across a dataset where I needed to analyze the sales records of digital products. The dataset had almost 572,000 rows and 12 columns, and I was excited to work on such big data. With great enthusiasm, I took a quick look at the data and found the same names appearing repeatedly in different rows. (Ah! Lots of time for data cleaning!)
In the product column, some rows contained iphone while others read Iphone, or iphone 7 + versus iphone 7 Plus; in the regular-customer list, some rows had Pushpa Yadav while others had Puspa Yadav (the name is a pseudonym).
Sneak Peek View of the Problem (Not From Original Dataset), Image by Author.
These are the same product and customer names, but they were recorded in different forms; that is, we have to deal with different versions of the same name. These sorts of problems are common scenarios for data scientists to tackle during data analysis. This scenario is called **data matching**, **fuzzy matching** (probabilistic data matching), or simply **data deduplication** or string/name matching.
Common reasons include typos, abbreviations, inconsistent naming conventions, and manual data entry.
Whatever the cause, as a data scientist or analyst it is our responsibility to match those records and create a master record for further analysis.
So, I jotted out the action to solve the problem:
1. Manually check and solve.
2. Look for useful libraries/ resources that the great mind of the community has shared.
The first choice was truly cumbersome for such a large dataset (572,000 × 12), so I started looking into different libraries and tried three approaches: fuzzymatcher (not available on conda at the moment), fuzzywuzzy (which at its core uses Levenshtein distance), and difflib.
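As a rough illustration of the pairwise approach (not the author's exact code), the standard-library difflib module can score the similarity of two strings like this:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("iphone 7 plus", "Iphone 7 +"))  # high score: likely the same product
print(similarity("Pushpa Yadav", "Puspa Yadav"))  # high score despite the typo
```

The catch is scale: comparing every pair among 572,000 rows means on the order of 1.6 × 10^11 comparisons, which is why these pairwise methods become painfully slow on big data.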
However, on using them, I found that they were far too time-consuming for such big data. Thus, I needed a faster and more effective method.
After many days of frustration, I finally came to know about the solution shared by Chris van den Berg.
Our goal here is not just to match the strings, but to match them quickly. This is where the concepts of n-grams and TF-IDF with cosine similarity come into play.
Before going into the working demo (code-work), let’s understand the basics.
N-grams are extensively used in text mining and natural language processing. They are sets of co-occurring words within a given sentence or document. To find the n-grams, we slide a window forward one word at a time (the step size can be changed as needed).
For example, for the room type “Standard Room Ocean View Waikiki Tower”
If N=3 (known as trigrams), then the n-grams (3-grams) would be: "Standard Room Ocean", "Room Ocean View", "Ocean View Waikiki", "View Waikiki Tower".
In general, a sentence with X words contains X - N + 1 n-grams. (Formula for n-grams, Image by Author.)
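As a quick sketch (assuming word-level n-grams; the function name is my own), the sliding window can be written as:

```python
def word_ngrams(text: str, n: int = 3) -> list[str]:
    """Slide an n-word window over the text: X words yield X - n + 1 n-grams."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("Standard Room Ocean View Waikiki Tower"))
# ['Standard Room Ocean', 'Room Ocean View', 'Ocean View Waikiki', 'View Waikiki Tower']
```

With 6 words and N=3, we get 6 - 3 + 1 = 4 trigrams, matching the formula above.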
You can explore more about n-grams in the research paper.
Why n=3?
The question might arise: why n = 3? Couldn't we take n = 1 (unigrams) or n = 2 (bigrams)?
The intuition here is that bi-grams and tri-grams can capture contextual information compared to just unigrams. For example, “Room Ocean View” carries more meaning than only “Room,” “Ocean,” and “View” when observed independently.
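To make the idea concrete, here is a minimal pure-Python sketch of TF-IDF over character trigrams combined with cosine similarity. This is not Chris van den Berg's implementation (which relies on scikit-learn's TfidfVectorizer and fast sparse matrix multiplication); it only shows the mechanics, with a simple log(N/df) weighting as an assumption:

```python
import math
import re
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Character n-grams of a lowercased string with whitespace removed."""
    text = re.sub(r"\s+", "", text.lower())
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(strings: list[str], n: int = 3) -> list[dict]:
    """TF-IDF vectors (as sparse dicts) over character n-grams."""
    docs = [Counter(char_ngrams(s, n)) for s in strings]
    df = Counter()                       # document frequency of each n-gram
    for doc in docs:
        df.update(doc.keys())
    total = len(docs)
    return [{g: tf * math.log(total / df[g]) for g, tf in doc.items()}
            for doc in docs]

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

names = ["iphone 7 plus", "Iphone 7 +", "Pushpa Yadav"]
vecs = tfidf_vectors(names)
# The two iPhone variants share many trigrams, so they score far higher
# with each other than either does with the customer name.
```

The speed advantage of the real approach comes from vectorizing all strings once and computing similarities as one sparse matrix product, instead of re-comparing every pair of raw strings.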