Why are there so many tokenization methods in HF Transformers?

https://youtu.be/bWLvGGJLzF8

HuggingFace's transformers library is the de facto standard for NLP. Used by practitioners worldwide, it's powerful, flexible, and easy to use. It achieves this through a fairly large (and complex) codebase, which raises the question:

"Why are there so many tokenization methods in HuggingFace transformers?"

Tokenization is the process of encoding a string of text into transformer-readable token ID integers. In this video, we cover five different methods for doing this. Do they all produce the same output, or are there differences between them?
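
To make the idea concrete, here's a minimal sketch of what "text in, token IDs out" means, and how two different routes to those IDs can disagree. This is a toy, not the transformers implementation: the ToyTokenizer class and its five-entry vocabulary are made up for illustration, and it just splits on whitespace. The method names (tokenize, convert_tokens_to_ids, encode) are borrowed from the tokenizer API in transformers so the shape should look familiar.

```python
# Toy illustration only: a hypothetical whitespace tokenizer,
# NOT how HuggingFace tokenizers are implemented.
class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab                # maps token string -> integer ID
        self.unk_id = vocab["[UNK]"]      # ID used for out-of-vocabulary tokens

    def tokenize(self, text):
        """Split text into lowercase string tokens (no IDs yet)."""
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        """Look up each token's ID, falling back to the [UNK] ID."""
        return [self.vocab.get(tok, self.unk_id) for tok in tokens]

    def encode(self, text, add_special_tokens=True):
        """Tokenize and convert in one call, optionally wrapping in [CLS] ... [SEP]."""
        ids = self.convert_tokens_to_ids(self.tokenize(text))
        if add_special_tokens:
            ids = [self.vocab["[CLS]"]] + ids + [self.vocab["[SEP]"]]
        return ids


# Made-up vocabulary, just big enough for the example.
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}
tokenizer = ToyTokenizer(vocab)

text = "Hello world again"

# Route 1: two explicit steps (string tokens, then IDs).
two_step_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

# Route 2: a single encode call, which also adds special tokens.
one_step_ids = tokenizer.encode(text)

print(two_step_ids)  # [3, 4, 0]
print(one_step_ids)  # [1, 3, 4, 0, 2]
```

In this toy, the two routes agree on the IDs for the words themselves and differ only in the special tokens wrapped around them. With a real tokenizer the exact IDs and special tokens depend on the model you load, so treat the numbers here as placeholders.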

📙 Check out the Medium article, or if you don't have a Medium membership, here's a free access link!

I also made an NLP with Transformers course. Here's 70% off if you're interested!