Back Translation in Text Augmentation by nlpaug

English is one of the languages with plenty of training data for translation, while some languages may not have enough data to train a machine translation model. Sennrich et al. used the back-translation method to generate more training data and improve translation model performance.

Suppose we want to train a model for translating English (source language) → Cantonese (target language), but there is not enough training data for Cantonese. Back-translation means translating target-language text back into the source language and mixing the original source sentences with the back-translated sentences to train a model. This way, the amount of training data from the source language to the target language can be increased.

In the previous story, the back-translation method was mentioned as a way to generate synthetic data for NLP tasks. With it, we can obtain more data for model training, especially for low-resource NLP tasks and languages.

This story will cover how the Facebook AI Research (FAIR) team trained a model for translation and how we can leverage the pre-trained model to generate more training data for our own models. By leveraging subword models, large-scale back-translation, and model ensembling, Ng et al. (2019) won the WMT'19 news translation task. They worked on two language pairs and four translation directions: English ↔ German (EN↔DE) and English ↔ Russian (EN↔RU). They demonstrated how to use back-translation to boost model performance. After that, I will show how we can write a few lines of code to generate synthetic data using back-translation. Here are some details about data processing, data augmentation, and the translation model.

Data Processing

Subword

In the earlier stages of NLP, word-level and character-level tokens were used to train models. In state-of-the-art NLP models, the subword (in between word and character level) is the standard unit in the tokenization stage. For example, "translation" may be represented as "trans" and "lation" because of occurrence frequency. You may have a look at 3 different subword algorithms from here. Ng et al. picked byte pair encoding (BPE) with 32K and 24K split operations for EN↔DE and EN↔RU tokenization respectively.
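To make "split operations" concrete, here is a toy sketch of the BPE learning loop: repeatedly find the most frequent adjacent symbol pair and merge it. It is simplified on purpose (real implementations such as subword-nmt track word boundaries and end-of-word markers, which this plain substring replace ignores), and the corpus below is invented for illustration.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its merged symbol."""
    merged, new_symbol = " ".join(pair), "".join(pair)
    return {word.replace(merged, new_symbol): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Learn `num_merges` split operations (merge rules) from a vocabulary."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Words are pre-split into characters; frequencies come from the corpus.
corpus = {"t r a n s l a t i o n": 5, "t r a n s f e r": 3, "s t a t i o n": 2}
merges, vocab = learn_bpe(corpus, 6)
print(merges)  # the learned split operations, most frequent pair first
```

After a few merges the frequent prefix "trans" emerges as a single subword, which is how a model ends up seeing "trans" + "lation" instead of eleven characters.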

Data Filtering

To make sure only sentence pairs in the correct languages are kept, Ng et al. use langid (Lui et al., 2012) to filter out invalid data. langid is a language identification tool that tells you which language a text belongs to.

If a sentence pair contains more than 250 tokens, or the length ratio between source and target exceeds 1.5, it is excluded from model training. I suspect such pairs may introduce too much noisy information to the model.
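These two length-based filters can be sketched as a simple predicate. The function name and argument defaults here are my own; the 250-token and 1.5-ratio thresholds come from the paper.

```python
def keep_pair(src_tokens, tgt_tokens, max_tokens=250, max_ratio=1.5):
    """Return True if a sentence pair passes the length-based filters."""
    n_src, n_tgt = len(src_tokens), len(tgt_tokens)
    if n_src > max_tokens or n_tgt > max_tokens:
        return False  # overly long sentences are dropped
    if n_src == 0 or n_tgt == 0:
        return False  # empty sides carry no signal
    ratio = max(n_src, n_tgt) / min(n_src, n_tgt)
    return ratio <= max_ratio  # drop badly length-mismatched pairs

# A well-matched pair is kept; a badly mismatched one is not.
print(keep_pair("the cat sat".split(), "die Katze sass".split()))
```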

The data size for different filtering methods (Ng et al., 2019)

The third filtering method targets monolingual data. To keep high-quality monolingual data, Ng et al. adopt the Moore-Lewis method (2010) to remove noisy data from the larger corpus. In short, Moore and Lewis score text by the difference between the cross-entropy under a language model of the in-domain source data and the cross-entropy under a language model of the larger corpus. After picking a high-quality corpus, the back-translation model is used to generate pairs of training data for the translation model.
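Here is a toy sketch of that cross-entropy-difference idea, with add-one-smoothed unigram models standing in for the real language models (the actual method trains proper in-domain and general-domain LMs; the corpora and function names below are invented for illustration):

```python
import math
from collections import Counter

def make_logprob(corpus_tokens, vocab_size):
    """Add-one-smoothed unigram log-probability over a token list."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    return lambda tok: math.log((counts[tok] + 1) / (total + vocab_size))

def moore_lewis_score(sentence, in_domain_lp, general_lp):
    """Cross-entropy difference; lower means closer to the in-domain data."""
    tokens = sentence.split()
    h_in = -sum(in_domain_lp(t) for t in tokens) / len(tokens)
    h_gen = -sum(general_lp(t) for t in tokens) / len(tokens)
    return h_in - h_gen

in_domain = "the cat sat on the mat".split()
general = "stock prices rose sharply today the".split()
vocab_size = len(set(in_domain) | set(general))
in_lp = make_logprob(in_domain, vocab_size)
gen_lp = make_logprob(general, vocab_size)

score_close = moore_lewis_score("the cat", in_lp, gen_lp)   # in-domain-like
score_far = moore_lewis_score("stock prices", in_lp, gen_lp)  # out-of-domain
```

Sentences are ranked by this score and only the best-scoring (lowest) portion of the large corpus is kept for back-translation.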
