Recently I came across a dataset where I needed to analyze the sales records of digital products. The dataset had almost 572,000 rows and 12 columns, and I was excited to work on such big data. With great enthusiasm, I took a quick look at the data, and I found the same names repeatedly taking different rows. (Ah! So much time for data cleaning!)

In the product column, some rows contained iphone while others said Iphone, or iphone 7 + and iphone 7 Plus; and in the regular-customer list, some rows had Pushpa Yadav while others had Puspa Yadav (the name is a pseudonym).

Sneak Peek View of the Problem (Not From Original Dataset), Image by Author.

These are the same product and customer names, just recorded in different forms, i.e., we are dealing with different versions of the same name. These sorts of problems are common scenarios for data scientists to tackle during data analysis. This scenario has a name: **data matching** or **fuzzy matching** (probabilistic data matching), or simply **data deduplication** or string/name matching.

Why might there be “different but similar data”?

Common reasons might be:

  • Typing errors during data entry.
  • Use of abbreviations.
  • Data-entry systems that are not well validated to catch such errors.
  • Others.

Whatever the reason, as a data scientist or analyst, it is our responsibility to match those records and create a master record for further analysis.

So, I jotted down the actions to solve the problem:

1. Manually check and solve.

2. Look for useful libraries/resources that the great minds of the community have shared.

The first choice was truly cumbersome for such a large dataset (572,000 × 12), so I started looking into different libraries and tried three approaches: fuzzymatcher (not currently available on conda), fuzzywuzzy (which at its core uses Levenshtein distance), and difflib.

However, on using them, I found that for such big data they were too time-consuming. Thus, I needed to look for a faster and more effective method.
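To make the baseline concrete, here is a minimal sketch using the standard library's difflib (fuzzywuzzy works on the same idea via Levenshtein distance); the strings are the example variants from above, and the helper name `similarity` is my own:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Lower-casing first handles "iphone" vs "Iphone"
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("Iphone", "iphone"))          # 1.0 after lower-casing
print(similarity("iphone 7 +", "iphone 7 Plus"))  # high, but below 1.0

# The catch: deduplicating a column this way means pairwise comparisons,
# which is O(n^2). For ~572,000 rows that is on the order of 1.6e11
# comparisons -- far too slow in practice.
```

This is why a pairwise approach does not scale, and why the rest of the article looks for something faster.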

After many days of frustration, I finally came to know about the solution shared by Chris van den Berg.

We will cover

  • ngram
  • Vectorization
  • TF-IDF
  • Cosine similarity with sparse_dot_topn: the secret sauce
  • Working demo ( Code-work)

Our goal here is not just to match the strings but to match them fast. That is where the concepts of n-grams and TF-IDF with cosine similarity come into play.

Before going into the working demo (code-work), let’s understand the basics.

N-grams

N-grams are extensively used in text mining and natural language processing. They are sets of co-occurring words within a given sentence (or document). To find the n-grams, we move forward one word at a time (we can move any number of steps as needed).

For example, for the room type “Standard Room Ocean View Waikiki Tower”

If N=3 (known as Trigrams), then the n-grams (3-grams) would be:

  • Standard Room Ocean
  • Room Ocean View
  • Ocean View Waikiki
  • View Waikiki Tower
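The sliding-window idea above can be sketched in a few lines of Python (the function name `ngrams` is my own choice, not from any library):

```python
def ngrams(text: str, n: int = 3):
    """Return word-level n-grams by sliding a window one word at a time."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("Standard Room Ocean View Waikiki Tower"))
# → ['Standard Room Ocean', 'Room Ocean View',
#    'Ocean View Waikiki', 'View Waikiki Tower']
```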

The number of n-grams in a sentence of X words is X − (N − 1). (Formula for n-grams, Image by Author.)

Explore more about n-grams in the research paper.

Why n=3?

The question might arise: why n = 3? Couldn't we take n = 1 (unigrams) or n = 2 (bigrams)?

The intuition here is that bigrams and trigrams can capture contextual information that unigrams alone cannot. For example, “Room Ocean View” carries more meaning than “Room,” “Ocean,” and “View” observed independently.
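A quick way to see this: two phrases built from the same words in a different order share all of their unigrams but none of their trigrams. A self-contained sketch (the `ngrams` helper is my own):

```python
def ngrams(text: str, n: int):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

a = "Standard Room Ocean View"
b = "Ocean View Standard Room"

# Same bag of words, so unigrams cannot tell the phrases apart...
print(set(ngrams(a, 1)) == set(ngrams(b, 1)))   # True
# ...but no trigram is shared, because word order (context) differs.
print(set(ngrams(a, 3)) & set(ngrams(b, 3)))    # set()
```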


Surprisingly Effective Way To Name Matching In Python