By reading this piece, you will learn how to add your own custom words to Jieba in order to improve the quality of its tokenization. When dealing with domain-specific Natural Language Processing (NLP) tasks, it is essential to have control over the tokenization process, since such domains tend to contain many custom terms, often nouns, that are missing from the default dictionary. The problem is even more pronounced for Chinese, which is not a space-delimited language, so the result may not be what you expect. Let’s have a look at the following example.

于吉大招叫什么

From the perspective of a human reader, this sentence (roughly, “What is Yu Ji’s ultimate ability called?”) should be tokenized as follows:

['于吉', '大招', '叫', '什么']

However, that is not the case when you tokenize it with Jieba’s default dictionary; you will end up with the following result instead:

['于', '吉大招', '叫', '什么']
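
You can reproduce this behavior with a few lines of Python. The snippet below is a minimal sketch, assuming Jieba has been installed (for example via pip install jieba); the exact split may vary slightly between Jieba versions and dictionaries.

import jieba

sentence = "于吉大招叫什么"

# Tokenize with Jieba's default dictionary.
# On a stock installation this typically splits the name 于吉 apart,
# producing something like ['于', '吉大招', '叫', '什么'].
print(jieba.lcut(sentence))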

Let’s proceed to the next section to see how we can resolve this problem.
