When working with external data, a common identifier such as a numerical key often does not exist. In place of a unique identifier, a person’s full name can be used as part of a universal or composite key to link data, but this is not a fail-safe solution.

Take, for example, the name Alan Turing: disparate data sources could have recorded the nickname Al Turing. Data entry may innocently record _Alan_, _Allan_, or _Allen_, or worse, introduce undetected typos (Alam Turing) into a database. Enterprise document scanning solutions (OCR) are also rife with misreadings.


A human agent could intuitively assign these variations to the same entity, Alan Turing, by applying soft logic to approximate the spelling and **phonetic** (sound) characteristics. Shortened **hypocorisms** often lack these characteristics and are instead part of the agent’s learned associations, e.g. Charles → Chip.
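To make the two signals concrete, here is a minimal sketch that scores spelling similarity with the standard library's `difflib.SequenceMatcher` and compares sound with a small American Soundex implementation. Both are stand-ins chosen for illustration, not the features used later in this study, and the helper names `spelling_similarity` and `soundex` are hypothetical.

```python
from difflib import SequenceMatcher

# Soundex digit for each consonant group; vowels, H, W and Y have no digit.
_SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def spelling_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] based on matching blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def soundex(name: str) -> str:
    """Four-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same digit
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated digit is kept after it
    return (encoded + "000")[:4]


for variant in ["Allan", "Allen", "Alam", "Al"]:
    print(variant, round(spelling_similarity("Alan", variant), 2), soundex(variant))

# A hypocorism shares neither spelling nor sound with its root name.
print(soundex("Charles"), soundex("Chip"))
```

Alan, Allan, Allen and even the typo Alam collapse to the same Soundex code, while Charles and Chip do not, which is why nickname pairs have to be learned from examples rather than derived from spelling or sound alone.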

What follows is a study of applying machine learning to achieve a semblance of human-like logic and semantics for alternative name identification.

Data Collection

I scraped multiple lists of common alternative spellings for first names, yielding around 17,500 pairings. The names are restricted to **ASCII** and include many Unicode-decoded cross-cultural examples to avoid over-fitting to Western naming conventions.
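The exact decoding step is not described here. One common way to reduce names to ASCII, shown below as an assumption rather than the procedure actually used, is to decompose accented characters with `unicodedata.normalize` and drop the combining marks. The function name `to_ascii` is hypothetical.

```python
import unicodedata


def to_ascii(name: str) -> str:
    """Strip diacritics by decomposing characters and discarding non-ASCII parts."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Characters with no ASCII decomposition (e.g. 'Ł') are dropped entirely,
    # so a real pipeline would likely need a transliteration table as well.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("José"), to_ascii("Renée"), to_ascii("Zoë"))
```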

The intuition behind using first names as the core data for our model is to integrate ensemble methods on name components, requiring exact matching on surnames to ensure greater precision (fewer false positives) at the cost of some recall.
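As a rough sketch of that component-wise idea, the function below requires the surnames to match exactly and only applies a fuzzy score to the first names. The scorer is a simple `difflib` ratio standing in for a trained model, and `same_person`, `first_name_score` and the 0.8 threshold are hypothetical choices for illustration.

```python
from difflib import SequenceMatcher


def first_name_score(a: str, b: str) -> float:
    """Placeholder for a learned first-name model: plain string similarity."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def same_person(name_a: str, name_b: str, threshold: float = 0.8) -> bool:
    """Exact match on the last token (surname), fuzzy match on the rest."""
    first_a, _, last_a = name_a.strip().rpartition(" ")
    first_b, _, last_b = name_b.strip().rpartition(" ")
    if last_a.casefold() != last_b.casefold():
        return False  # a surname mismatch is never rescued: precision over recall
    return first_name_score(first_a, first_b) >= threshold


print(same_person("Alan Turing", "Allan Turing"))  # close spelling, same surname
print(same_person("Alan Turing", "Alan Turin"))    # surname differs by one letter
print(same_person("Al Turing", "Alan Turing"))     # nickname: needs a better scorer
```

The second call shows the recall cost: a single typo in the surname is enough to reject the pair, even though the first names are identical.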

I decided to make the classes imbalanced (1:4), as under-sampling the negative class led to a noticeable artificial bias towards the positive class. It is difficult to approximate the a priori probabilities for each class, but it is assumed that the classes are imbalanced in favor of the negative class.
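A minimal sketch of how a 1:4 labelled set might be assembled, assuming negatives are drawn by randomly pairing names that are not listed as variants of each other. The sampling procedure is not spelled out here, so `build_pairs` and its parameters are hypothetical.

```python
import random


def build_pairs(positive_pairs, negatives_per_positive=4, seed=0):
    """Return shuffled (name_a, name_b, label) rows with a 1:N class ratio."""
    rng = random.Random(seed)
    positives = {frozenset(pair) for pair in positive_pairs}
    names = sorted({name for pair in positive_pairs for name in pair})

    target = negatives_per_positive * len(positive_pairs)
    available = len(names) * (len(names) - 1) // 2 - len(positives)
    if target > available:
        raise ValueError("not enough distinct names to sample negatives")

    # Simplification: two variants of the same root (e.g. Al and Allan) can be
    # sampled as a negative unless that pair is itself listed as a positive.
    negatives = set()
    while len(negatives) < target:
        candidate = frozenset(rng.sample(names, 2))
        if candidate not in positives:
            negatives.add(candidate)

    labelled = [(a, b, 1) for a, b in positive_pairs]
    labelled += [(*sorted(pair), 0) for pair in sorted(negatives, key=sorted)]
    rng.shuffle(labelled)
    return labelled


pairs = [("Alan", "Allan"), ("Charles", "Chip"), ("Robert", "Bob"),
         ("William", "Bill"), ("Margaret", "Peggy")]
dataset = build_pairs(pairs)
print(sum(label for _, _, label in dataset), "positives of", len(dataset), "rows")
```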


Fuzzy Name Matching with Machine Learning