Facebook Announces TransCoder AI to Translate Code

Facebook AI Research has announced TransCoder, a system that uses unsupervised deep learning to convert code from one programming language to another. TransCoder was trained on source code from more than 2.8 million open-source projects and outperforms existing code-translation systems that use rule-based methods.

The team described the system in a paper published on arXiv. TransCoder is inspired by neural machine translation (NMT) systems that use deep learning to translate text from one natural language to another, and it is likewise trained only on monolingual source data. To measure the model’s performance, the Facebook team collected a validation set of 852 functions, with associated unit tests, in each of the system’s target languages: Java, Python, and C++. TransCoder outperformed existing commercial solutions on this validation set, by up to 33 percentage points compared with j2py, a Java-to-Python translator. Although the team restricted its work to those three languages, they claim the approach can “easily be extended to most programming languages.”
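
The unit tests make the evaluation concrete: a translated function can be scored by whether it produces the same outputs as a reference implementation on the test inputs. Below is a minimal sketch of that idea in Python; the functions and test cases are hypothetical stand-ins, not taken from the paper’s validation set.

```python
# Hypothetical sketch of unit-test-based scoring; the function names and
# test inputs are made up for illustration, not from the paper.

def reference_gcd(a: int, b: int) -> int:
    """Ground-truth implementation in the target language (here, Python)."""
    while b:
        a, b = b, a % b
    return a

def translated_gcd(a: int, b: int) -> int:
    """Stand-in for a function produced by the translation model."""
    return a if b == 0 else translated_gcd(b, a % b)

def passes_unit_tests(candidate, reference, test_inputs):
    """Count a translation as correct only if it matches the reference
    on every test input."""
    return all(candidate(*args) == reference(*args) for args in test_inputs)

tests = [(48, 18), (7, 3), (100, 10), (5, 0)]
print(passes_unit_tests(translated_gcd, reference_gcd, tests))  # True
```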

Automated tools for translating source code from one language to another, also known as source-to-source compilers, transcompilers, or transpilers, have existed since the 1970s. Most of these tools work similarly to a standard compiler: they parse the source code into an abstract syntax tree (AST), which is then converted back into source code in a different language, usually by applying rewrite rules. Transpilers are useful in several scenarios. For example, some languages, such as CoffeeScript and TypeScript, are intentionally designed to be transpiled from a more developer-friendly language into a more broadly supported one. Sometimes it is helpful to transpile entire codebases from source languages that are obsolete or deprecated; for example, the 2to3 tool is used to port Python code from the deprecated version 2 to version 3. However, transpilers are far from perfect, and creating one requires significant development effort and often per-project customization.
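
As a toy illustration of the parse-and-rewrite pipeline, the sketch below uses Python’s built-in ast module (Python 3.9+ for ast.unparse) to apply a single hypothetical rewrite rule. Both the source and target here are Python for brevity; a real transpiler would regenerate code in a different language.

```python
# Toy parse -> rewrite -> regenerate pipeline, the same shape a
# rule-based transpiler uses. Requires Python 3.9+ for ast.unparse.
import ast

class DivisionRewriter(ast.NodeTransformer):
    """Hypothetical rewrite rule: replace true division `a / b`
    with floor division `a // b`."""
    def visit_BinOp(self, node: ast.BinOp) -> ast.BinOp:
        self.generic_visit(node)  # rewrite nested expressions first
        if isinstance(node.op, ast.Div):
            node.op = ast.FloorDiv()
        return node

source = "result = total / count"
tree = ast.parse(source)               # 1. parse source into an AST
tree = DivisionRewriter().visit(tree)  # 2. apply rewrite rules to the tree
ast.fix_missing_locations(tree)
print(ast.unparse(tree))               # 3. regenerate: result = total // count
```

Real transpilers chain many such rules, and the inevitable gaps in those rules are a large part of why transpiled output usually needs manual cleanup.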

TransCoder builds on advances in natural-language processing (NLP), in particular unsupervised NMT. The model uses a Transformer-based sequence-to-sequence architecture consisting of an attention-based encoder and decoder. Since obtaining a dataset for supervised learning would be difficult (it would require many pairs of equivalent code samples in both the source and target languages), the team opted to use monolingual datasets for unsupervised learning, combining three strategies. First, the model is trained on input sequences that have random tokens masked; it must learn to predict the correct values for the masked tokens. Next, the model is trained on sequences that have been corrupted by randomly masking, shuffling, or removing tokens; it must learn to output the corrected sequence. Finally, two versions of the model are trained in parallel to perform back-translation: one learns to translate from the source language to the target, and the other learns to translate back to the source.
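
The sketch below illustrates these three training signals as data-preparation steps on a token sequence. The masking rate, shuffle window, and helper names are illustrative assumptions rather than the paper’s exact setup, and the real system operates on subword (BPE) tokens rather than whitespace splits.

```python
# Illustrative data preparation for the three unsupervised objectives;
# all parameters and names here are assumptions for the sake of example.
import random

MASK = "<MASK>"

def mask_tokens(tokens, p=0.15):
    """Objective 1 (masked prediction): hide random tokens; the model
    is trained to recover the originals."""
    return [MASK if random.random() < p else t for t in tokens]

def corrupt(tokens, p_drop=0.1, window=3):
    """Objective 2 (denoising): mask, drop, and locally shuffle tokens;
    the model must output the clean sequence."""
    kept = [t for t in mask_tokens(tokens) if random.random() > p_drop]
    keyed = [(i + random.uniform(0, window), t) for i, t in enumerate(kept)]
    return [t for _, t in sorted(keyed)]

def back_translation_step(src_batch, translate_ab, train_ba):
    """Objective 3 (back-translation): model A's translations become
    pseudo-parallel training pairs for model B, and vice versa."""
    pseudo = [translate_ab(s) for s in src_batch]
    train_ba(pseudo, src_batch)  # learn to map translations back to the source

tokens = "def gcd ( a , b ) :".split()
print(mask_tokens(tokens))
print(corrupt(tokens))
back_translation_step(
    [tokens],
    translate_ab=lambda s: list(reversed(s)),                  # stub "translator"
    train_ba=lambda xs, ys: print(f"{len(xs)} pseudo-parallel pairs"),
)
```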
