How neural networks can help convert molecular structure diagrams into their corresponding International Chemical Identifier (InChI) text strings.

Background and Motivation

As digital media and publications become the norm, it is increasingly important to reconcile older conventions with new ones. In chemistry, it has been common practice for decades to represent compounds by their structural form in what is known as the skeletal formula. Past publications are full of these diagrams, but as we grow more reliant on computers to parse documents and interpret molecular images correctly, it is essential to convert these structures into a form that a machine can readily understand and manipulate.[2] The first step toward this goal came in the early 2000s, when scientists from around the world developed the International Chemical Identifier (InChI), a label that represents any compound's composition and bond organization as a machine-readable string.[3]
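To make that format concrete, here is a small sketch in plain Python (no chemistry libraries) that splits the standard InChI string for ethanol into its layers; the layer meanings noted in the comments follow the InChI convention:

```python
# The standard InChI for ethanol (CH3CH2OH).
inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

# Strip the "InChI=" prefix and split on "/" to expose the layers.
# After the version ("1S" marks a standard InChI) comes the molecular
# formula; each later layer starts with a one-letter prefix:
# "c" = atom connectivity, "h" = hydrogen positions.
version, formula, *layers = inchi.removeprefix("InChI=").split("/")

print(version)  # 1S
print(formula)  # C2H6O
print(layers)   # ['c1-2-3', 'h3H,2H2,1H3']
```

Every layer is deterministic, which is what makes InChI strings directly comparable between databases, unlike a drawing of the same molecule.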

The creation of InChI was a monumental achievement that moved the field toward a more digital-friendly form, but the new labeling system also left decades of older publications, with their skeletal structure diagrams, forgotten and underutilized. To expand access to previously published chemical research, we need a way to accurately and efficiently analyze scientific documents and identify the known compounds they contain. A model that can scan old papers and recognize known chemical compounds would accelerate innovation by helping researchers avoid re-exploring or republishing already documented chemistry.[2]

A primary barrier to such a solution is that most public datasets pairing skeletal formulas with their corresponding InChI labels are too small for contemporary machine learning models, and even the best models currently reach accuracies of only roughly 90% under ideal image conditions, which is rarely the case for scans of old scientific publications.[2] Progress had stagnated, but thanks to the team at Bristol-Myers Squibb, a dataset containing over 4 million structural images and their corresponding InChI labels is now available. This is the dataset we will use to train our model, and we hope it showcases a niche but vital application of machine learning techniques. Before expanding on how we plan to solve this problem with deep learning, however, we will first present some initial exploratory analysis of the data we have been provided.
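As a small preview of that exploratory step, the sketch below tallies InChI label lengths from a labels file. It assumes the labels ship as a CSV with `image_id` and `InChI` columns (an assumption about the file layout); the two-row sample here is illustrative stand-in data, not taken from the real dataset:

```python
import csv
import io

# Illustrative stand-in for the labels file. The image_id values are
# made up, and the (image_id, InChI) column layout is an assumption.
# InChI strings contain commas, so the InChI field must be quoted.
sample = io.StringIO(
    'image_id,InChI\n'
    '000011a64c74,"InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"\n'
    '000019cc0cd2,"InChI=1S/CH4/h1H4"\n'
)

rows = list(csv.DictReader(sample))

# Label length is a first useful statistic: it bounds the output
# sequence length a sequence-generating model must handle.
lengths = [len(row["InChI"]) for row in rows]
print(min(lengths), max(lengths))  # 17 33
```

On the full 4-million-image dataset the same loop (pointed at the real file instead of the in-memory sample) gives the label-length distribution we will look at in the exploratory analysis.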

#data-science #machine-learning #artificial-intelligence #deep-learning

Deep Learning for Molecular Translation