By reading this piece, you will learn how to add your own custom words to Jieba in order to improve the quality of its tokenization. When dealing with domain-specific Natural Language Processing (NLP) tasks, it is essential to have control over the tokenization process, since such domains tend to contain many custom terms, often nouns, that are missing from the default dictionary. The problem is even more pronounced for Chinese, which is not a space-delimited language, so the result may not be what you expect. Let’s have a look at the following example.

于吉大招叫什么

From the perspective of a human reader, this sentence (roughly, “What is Yu Ji’s ultimate ability called?”) should be tokenized as follows:

['于吉', '大招', '叫', '什么']

However, that is not the case when you tokenize it with Jieba’s default dictionary; you will end up with the following result instead:

['于', '吉大招', '叫', '什么']
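
You can reproduce this behavior with a few lines of Python. The snippet below is a minimal sketch, assuming Jieba has been installed (for example via pip install jieba); the exact split may vary slightly between Jieba versions and dictionaries.

import jieba

sentence = "于吉大招叫什么"

# Tokenize with Jieba's default dictionary.
# On a stock installation this typically splits the name 于吉 apart,
# producing something like ['于', '吉大招', '叫', '什么'].
print(jieba.lcut(sentence))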

Let’s proceed to the next section to see how we can resolve this problem.
