By reading this piece, you will learn how to add your own custom words to Jieba in order to improve the quality of tokenization. When dealing with domain-specific Natural Language Processing (NLP) tasks, it is essential to have control over the tokenization process, as such domains contain many custom words, typically proper nouns, that a general-purpose dictionary does not cover. This problem is even worse for Chinese, which is not a space-delimited language, so the tokenizer may produce a result quite different from what you expect. Let's have a look at the following example.

于吉大招叫什么

From the perspective of a human reader, this sentence should be tokenized as follows:

['于吉', '大招', '叫', '什么']

However, that is not the case when you tokenize it with Jieba; you will end up with the following result:

['于', '吉大招', '叫', '什么']

Let’s proceed to the next section to see how we can resolve this problem.


How to Improve the Performance of Chinese Text Tokenization in Python with Jieba