DeLighT is a deep and light-weight Transformer that allocates parameters more efficiently across its transformer blocks or layers.

The Transformer and its numerous variants achieve excellent performance across a range of machine learning applications, including sequence-to-sequence modeling, language modeling, and computer vision.

