“I’m good at the art of assimilation. I have watched & listened & learned. I knew nothing, but I studied the ways of men and slowly learnt how to ruin, how to hate, how to debase, how to humiliate. And at the feet of my master, I learnt the highest of human skills, the skill that no other creature owns. I finally learnt how to lie!”

Can you guess who uttered those lines? Let me give you 2 options:

A) A ChatBot B) The Creature from Frankenstein.

I’ll reveal the answer at the end of this article.

Introduction:

The purpose of a good chatbot is to observe, listen, learn and study the ways of men (and women!), and to pick up many different human skills, in order to engage in a good conversation. Chatbots operate in two different settings within the realm of conversational agents:

  1. Goal-Oriented Dialog: These are the agents behind online ticket / restaurant booking and other customer-service tasks. They usually have a fixed set of “intents” with corresponding “responses” (as well as “actions” mapped to the intents, which are executed in the background). They also have a knowledge base (a database) at their disposal, which they access through API calls. A minimal sketch of this intent-to-action mapping follows this list.
  2. Open-Domain Dialog: These are the ones that can engage in open-ended chit-chat on a wide range of topics. Recent advances in open-domain chatbots have come largely from scaling neural network models, i.e. adding more parameters and training on huge corpora. For example, Meena from Google has 2.6B parameters and is trained on 341 GB of text from social-media conversations; compared to OpenAI’s GPT-2, Meena has 1.7x greater model capacity and was trained on 8.5x more data.
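To make the goal-oriented setting (item 1 above) concrete, here is a minimal, hypothetical Python sketch of a fixed intent-to-response mapping with a background action. The intent names, keywords and the book_table() stub are illustrative assumptions only; a real system would replace the keyword matcher with a trained intent classifier and the stub with an actual API call against its knowledge base.

```python
# A minimal, hypothetical sketch of a goal-oriented dialog loop.
# Intent names, keywords and the book_table() stub are illustrative only.
from typing import Optional

INTENTS = {
    "book_table": {
        "keywords": {"book", "table", "reservation"},
        "response": "Sure, I can book a table for you. For how many people?",
    },
    "opening_hours": {
        "keywords": {"open", "hours", "close"},
        "response": "We are open from 11am to 10pm, every day.",
    },
}

def book_table(party_size: int) -> str:
    # Stand-in for a real API call to the booking backend (the "action").
    return f"confirmation-{party_size}-guests"

def detect_intent(utterance: str) -> Optional[str]:
    # Naive keyword matcher; real systems use a trained intent classifier.
    tokens = set(utterance.lower().replace("?", "").split())
    for name, spec in INTENTS.items():
        if tokens & spec["keywords"]:
            return name
    return None

def respond(utterance: str) -> str:
    intent = detect_intent(utterance)
    if intent is None:
        return "Sorry, I didn't get that. Could you rephrase?"
    reply = INTENTS[intent]["response"]
    if intent == "book_table":
        # The action mapped to the intent is executed in the background.
        reply += f" (held under {book_table(party_size=2)})"
    return reply

print(respond("Can I book a table for tonight?"))
```

The fixed intent set is exactly what limits these bots to their domain: anything outside the mapping falls through to the fallback response, which is where open-domain models like Blender differ.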

But the researchers at FAIR set out to show that scaling alone is insufficient, and that a lot more must be accounted for to generate a good conversation, namely, for the chatbot to display human-like traits such as:

  1. Personality
  2. Engagingness
  3. Empathy
  4. Domain knowledge/expertise

Enter “Blender Bot”, FAIR’s champion conversational agent, which they have recently open-sourced.

In this 3-part series about Blender, I will try to explain, one by one, the data sets used, the evaluation methods, the workings of the transformer architectures involved, and the model architecture with its training objectives. In this first part, let us discuss the data sets in detail and also take an overview of the limitations and failure cases. The paper is a bit system-heavy, so a prior understanding of Attention, Transformers, BERT and language models in general will help tie all the pieces together seamlessly (not required for Part 1).

Data Sets:

Different data sets and pretext (“fake”) training tasks are used during the pre-training and fine-tuning stages of the model.

Pre-training:

BERT is pre-trained on the Toronto Book Corpus and Wikipedia. That kind of training will not help in this case, because we are dealing with dialog generation and not just sentence associations. Therefore, publicly available discussions from Reddit and its subreddits are used as the training data, yielding around 1.5B training examples. The goal here is to generate a comment, conditioned on the full thread leading up to that comment. Cleaning the Reddit data is a challenging process; a comment is excluded in the following cases (a code sketch of these filters appears after the list):

  1. if the author is a known bot
  2. if it is from a non-English subreddit
  3. if it is longer than 2048 characters or shorter than 5 characters
  4. if it contains a URL
  5. if it starts with a non-ASCII character
  6. if it is at a depth > 7 in the thread
  7. if it is a removed/deleted comment
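To make these heuristics concrete, here is a short Python sketch of how the seven filters could be applied to a comment. The Comment structure and the two lookup sets (KNOWN_BOTS, NON_ENGLISH_SUBREDDITS) are assumptions made for illustration; the paper specifies the rules, not this implementation.

```python
# A sketch of the comment-filtering heuristics listed above.
# The Comment fields and the two lookup sets are illustrative assumptions.
import re
from dataclasses import dataclass

KNOWN_BOTS = {"AutoModerator"}              # placeholder list of known bot authors
NON_ENGLISH_SUBREDDITS = {"de", "france"}   # placeholder list of non-English subreddits

URL_PATTERN = re.compile(r"https?://|www\.")

@dataclass
class Comment:
    author: str
    subreddit: str
    body: str
    depth: int      # position in the thread, 0 = top-level comment
    removed: bool   # removed by moderators or deleted by the author

def keep_comment(c: Comment) -> bool:
    """Return True only if the comment survives all seven filters."""
    if c.author in KNOWN_BOTS:                       # 1. known bot
        return False
    if c.subreddit in NON_ENGLISH_SUBREDDITS:        # 2. non-English subreddit
        return False
    if len(c.body) > 2048 or len(c.body) < 5:        # 3. too long or too short
        return False
    if URL_PATTERN.search(c.body):                   # 4. contains a URL
        return False
    if ord(c.body[0]) > 127:                         # 5. starts with non-ASCII char
        return False
    if c.depth > 7:                                  # 6. too deep in the thread
        return False
    if c.removed:                                    # 7. removed/deleted comment
        return False
    return True

print(keep_comment(Comment("someone", "askreddit", "Nice point, I agree!", 2, False)))        # True
print(keep_comment(Comment("AutoModerator", "askreddit", "Please read the rules.", 0, False)))  # False
```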

Even after this cleaning, the data still suffers from toxicity and noise, and the threads are group discussions rather than two-way conversations.

#artificial-intelligence #data-science #ai #chatbots #nlp #data-analysis

Blender Bot — Part 1: The Data