“I’m good at the art of assimilation. I have watched & listened & learned. I knew nothing, but I studied the ways of men and slowly learnt how to ruin, how to hate, how to debase, how to humiliate. And at the feet of my master, I learnt the highest of human skills, the skill that no other creature owns. I finally learnt how to lie!”
Can you guess who uttered those lines? Let me give you two options:
A) A ChatBot B) The Creature from Frankenstein.
I’ll reveal the answer at the end of this article.
The purpose of a good chatbot is to observe and listen and learn and study the ways of men (and women!), and to pick up many different human skills, in order to engage in a good conversation. Chatbots serve in two different settings within the broader field of conversational agents.
But the researchers at FAIR attempt to show that scaling alone is insufficient, and that there is a lot more to account for in generating a good conversation: the chatbot must display human-like traits, such as:
Enter “Blender Bot”, FAIR’s champion conversational agent, which they recently open-sourced.
In this 3-part series about Blender, I will explain the datasets used, the evaluation methods, the workings of the transformer architectures, and the model architecture with its training objectives, one by one. In this first part, let us discuss the datasets in detail and take an overview of the limitations and failure cases. The paper is a bit system-heavy, so a prior understanding of Attention, Transformers, BERT and language models in general will help tie all the pieces together seamlessly (though it is not required for Part 1).
Different datasets and surrogate (pretext) tasks are used during the pre-training and fine-tuning stages of the model.
BERT is pre-trained on the Toronto Book Corpus and Wikipedia. That kind of training will not help here, because we are dealing with dialog generation, not just sentence associations. Therefore, publicly available data from Reddit and its subreddits is used as the source of truth, yielding around 1.5B training examples. The goal is to generate a comment, conditioned on the full thread leading up to that comment. Cleaning the Reddit data is a challenging process; a particular comment is not used in the following cases:
Even after cleaning, the data still suffers from toxicity and noise, and from the fact that these are not two-way conversations but group discussions.
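The thread-conditioned setup above can be sketched in a few lines: flatten each Reddit thread into (context, target) pairs, where the target is a comment and the context is the full thread leading up to it, dropping comments that fail simple cleaning heuristics. The thresholds and the URL filter below are hypothetical stand-ins for illustration, not the paper’s exact rules.

```python
# Illustrative sketch of building (context, target) dialog examples from a
# Reddit-style thread. Each comment becomes a target conditioned on the full
# thread above it. Filter values are hypothetical, not the paper's actual ones.

def make_examples(thread, max_len=2048, min_len=5):
    """thread: list of comment strings, root post first."""
    examples = []
    for i in range(1, len(thread)):
        target = thread[i]
        # Skip targets that are too short, too long, or contain a URL
        # (stand-ins for the kinds of heuristics used to clean the data).
        if not (min_len <= len(target) <= max_len) or "http" in target:
            continue
        context = "\n".join(thread[:i])  # full thread leading up to the comment
        examples.append((context, target))
    return examples

pairs = make_examples([
    "What is the best way to learn NLP?",
    "Start with a course on language models.",
    "http://spam.example",  # filtered out: contains a URL
    "Thanks! Any book recommendations too?",
])
```

Note that a filtered comment is only dropped as a target; it still appears in the context of later comments, which is one reason noise survives the cleaning step.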