Microsoft has always been active in the AI space with initiatives for developing innovative, interactive, and immersive applications. Whether it is Azure Machine Learning services, Azure Databricks, Azure Cognitive Services, or ML.NET, Microsoft strives to package AI as ready-to-use services in the form of APIs and SDKs across a variety of programming languages and platforms.

Image caption generators are models built on computer vision and natural language processing: they discern the important features of an image and produce a caption that captures its context in a human-like way. The concepts most commonly used for this problem are convolutional neural networks (CNNs) and LSTMs. The LSTM (Long Short-Term Memory) network frequently crops up in caption generators; it is a type of recurrent neural network that predicts the next word based on the words that came before it.
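To make the CNN + LSTM idea concrete, here is a minimal sketch in PyTorch of a decoder that takes CNN image features and the words generated so far, and scores the next word. The class name, dimensions, and random tensors are illustrative assumptions, not Pix2Story's actual implementation.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Minimal LSTM decoder: given image features and the previous words,
    predict the next word of the caption."""
    def __init__(self, feature_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feature_dim, hidden_dim)   # image features -> initial hidden state
        self.init_c = nn.Linear(feature_dim, hidden_dim)   # image features -> initial cell state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)        # scores over the vocabulary

    def forward(self, image_features, caption_tokens):
        # image_features: (batch, feature_dim) from a pretrained CNN, e.g. a ResNet
        # caption_tokens: (batch, seq_len) word indices of the caption so far
        h0 = self.init_h(image_features).unsqueeze(0)      # (1, batch, hidden_dim)
        c0 = self.init_c(image_features).unsqueeze(0)
        embeddings = self.embed(caption_tokens)            # (batch, seq_len, embed_dim)
        outputs, _ = self.lstm(embeddings, (h0, c0))
        return self.fc(outputs)                            # (batch, seq_len, vocab_size)

# Toy usage with random tensors standing in for real data.
decoder = CaptionDecoder()
fake_features = torch.randn(2, 2048)             # CNN features for 2 images
fake_captions = torch.randint(0, 10000, (2, 7))  # 7 previous words per caption
next_word_scores = decoder(fake_features, fake_captions)
print(next_word_scores.shape)  # torch.Size([2, 7, 10000])
```

The CNN condenses the image into a feature vector, that vector initialises the LSTM's state, and the LSTM then emits a probability distribution over the vocabulary for each next word.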

So what technologies exactly go into an AI system that produces a cohesive narrative?


Microsoft’s Pix2Story in action

The Pix2Story white paper explains in detail the technologies used to build this pioneering caption-generator bot. I have attempted to simplify and present two of the key drivers behind Pix2Story.

Natural Language Processing (NLP) is obviously a major part of it, but because the input is an image, something that extracts the context of that image is just as important.

The basic idea is to get contextual information about what the picture depicts and then use this context to build a meaningful narrative. In this article, I will go through two of the technological bases used to create Pix2Story. The first is skip-thought vectors, an encoder-decoder model, and the second is ‘Show, Attend and Tell’, a caption generator that uses visual attention.

What are skip-thought vectors?

In the simplest terms, it is an unsupervised learning method that uses neural networks to encode a given sentence into a fixed-length vector. Breaking this down, unsupervised learning means the model is trained without labelled ‘right’ or ‘wrong’ answers; the training signal comes from the text itself, as the model learns to predict the sentences surrounding each sentence, and the model is refined as training proceeds.
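As a rough illustration of that encoder-decoder setup, here is a simplified PyTorch sketch. Note the assumptions: the original skip-thought model feeds the thought vector to the decoders at every step, whereas here it only initialises their hidden state, and all names, dimensions, and inputs are made up for the example.

```python
import torch
import torch.nn as nn

class SkipThoughts(nn.Module):
    """Minimal skip-thought sketch: a GRU encoder turns a sentence into a
    vector, and two GRU decoders try to reconstruct the previous and the
    next sentence from that vector. The reconstruction error is the only
    training signal -- no labels are needed."""
    def __init__(self, vocab_size=20000, embed_dim=300, hidden_dim=600):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.prev_decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.next_decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def encode(self, sentence_tokens):
        # sentence_tokens: (batch, seq_len) word indices of the middle sentence
        _, h = self.encoder(self.embed(sentence_tokens))
        return h  # (1, batch, hidden_dim): the skip-thought vector

    def forward(self, sentence_tokens, prev_tokens, next_tokens):
        thought = self.encode(sentence_tokens)
        # Both decoders are conditioned on the same thought vector.
        prev_out, _ = self.prev_decoder(self.embed(prev_tokens), thought)
        next_out, _ = self.next_decoder(self.embed(next_tokens), thought)
        return self.out(prev_out), self.out(next_out)

# Toy usage: three consecutive "sentences" as random word indices.
model = SkipThoughts()
middle = torch.randint(0, 20000, (4, 12))
before = torch.randint(0, 20000, (4, 10))
after  = torch.randint(0, 20000, (4, 11))
prev_scores, next_scores = model(middle, before, after)
print(prev_scores.shape, next_scores.shape)  # per-word vocabulary scores
```

After training, the decoders are thrown away and only the encoder is kept: the vector it produces for a sentence is its skip-thought representation.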
