Sometimes it can be really hard to tell whether two things are actually one thing. Is ABC Co. the same thing as ABC Company? How about CBR Construction and Construction CBR Inc.? AAD International and Aerial Automated Defense International? Sometimes you just have to make a judgment call. Easy enough, but now try making thousands of judgment calls per day. Sounds like a job for a computer. Where a computer falls short, however, is with examples like the AAD one: there’s just enough similarity for a computer to flag it as a possible match, but not enough to resolve it automatically. This is where an occasional human judgment call, used in combination with automated matching, becomes useful.

I work at MaRS Discovery District in Toronto, and to create our analytics products, we pull in data about Canadian ventures from a whole pile of different datasources. My team then analyzes that data to get a comprehensive picture of the Canadian venture ecosystem. The issue with our data is that ventures go by many names, and there is no great way to easily and automatically match ventures from one datasource with another.

Take for example a fake venture that exists in multiple datasources under different names. In Datasource 1, it’s called ‘Cletus Potatoes’. In Datasource 2, it’s called ‘Cletus’ Potatoes Inc.’. In Datasource 3, it’s called ‘Potatopocalypse’ (actually the name of one of their products, being used in place of the business name). In our database, the name is just ‘Cletus Potatoes’.

We can match the venture from Datasource 1 with the venture in our database with a fair amount of confidence, since the string is an exact match (Cletus Potatoes = Cletus Potatoes). The venture in Datasource 2 has an extra apostrophe and the suffix ‘Inc.’ Okay, we can easily enough strip punctuation and remove common stopwords like ‘Inc.’ and ‘Incorporated’ when we do our matching, so that Cletus Potatoes = Cletus’ Potatoes Inc. However, the venture name from Datasource 3 doesn’t even resemble the name in our DB, so we’ll have to rely completely on metadata to make that match, and chances are that we’ll have to manually review it to make sure it’s right. Maybe there’s a rough match on the URL, with the Datasource 3 URL being www.cletuspots.com/potatopocalypse and our DB URL being www.cletuspots.com; that’s enough for these two to be flagged as a possible match.
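To make the three cases concrete, here is a minimal Python sketch of that triage. It is an illustration under stated assumptions, not our production matcher: the suffix list, the function names (`normalize_name`, `host`, `classify`), and the record layout (dicts with `name` and `url` keys) are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical stopword list; a real one would be longer and tuned to the data.
LEGAL_SUFFIXES = {"inc", "incorporated", "co", "company", "ltd", "limited", "corp", "corporation"}


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    # Remove apostrophes (straight and curly) and periods outright,
    # so "Cletus'" becomes "cletus" and "Inc." becomes "inc".
    cleaned = re.sub(r"['’.]", "", name.lower())
    # Any other run of non-alphanumeric characters becomes a single space.
    tokens = re.sub(r"[^a-z0-9]+", " ", cleaned).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def host(url: str) -> str:
    """Return the host part of a URL, without scheme or a leading 'www.'."""
    if "://" not in url:
        url = "//" + url  # lets urlparse treat the first segment as the host
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def classify(candidate: dict, record: dict) -> str:
    """Return 'match', 'review', or 'no_match' for a candidate/record pair."""
    c_name = normalize_name(candidate["name"])
    r_name = normalize_name(record["name"])
    if c_name and c_name == r_name:
        return "match"  # names agree after normalization: accept automatically
    c_host = host(candidate.get("url", ""))
    r_host = host(record.get("url", ""))
    if c_host and c_host == r_host:
        return "review"  # only the metadata agrees: flag for a human
    return "no_match"


db_record = {"name": "Cletus Potatoes", "url": "www.cletuspots.com"}
print(classify({"name": "Cletus Potatoes"}, db_record))        # match
print(classify({"name": "Cletus’ Potatoes Inc."}, db_record))  # match
print(classify({"name": "Potatopocalypse",
                "url": "www.cletuspots.com/potatopocalypse"}, db_record))  # review
```

The point of the three-way result is that only the middle bucket needs a human: exact matches after normalization go straight through, and pairs with nothing in common are dropped.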

That venture might be just one in a batch of tens of thousands of companies, and in a batch that big the number of inexact matches requiring manual review can easily climb into the thousands. One person (me!) manually reviewing thousands of matches isn’t feasible on a reasonable timeline, and unfortunately, settling for almost-accurate venture matching isn’t an option for us. Some ventures have a disproportionate impact on the venture ecosystem, and if we miss a match for a big player, then we could end up multiplying their impact by treating them as multiple companies when they should be treated as one. So, these matches must be reviewed.

So how do we get highly accurate matches when I don’t have the time to manually review them all?


Using Slack to Optimize Manual Entity Resolution