Bite-Sized Tips To Make Chinese Full-Text Search

Today I’d like to highlight the challenges of implementing search for Chinese text. In this article, we will go through the main difficulties of implementing full-text search for CJK languages and how to overcome them with the help of Manticore Search.

Chinese search difficulties

Chinese belongs to the so-called CJK group of languages (Chinese, Japanese, and Korean). These are probably the most complicated languages to implement full-text search for, because word meanings depend heavily on the many character variations and their sequences, and because the characters are not split up into words.

Chinese language specifics:

Chinese characters have no upper or lower case. Each character has a single form, regardless of context.
The characters carry no additional marks (diacritics), unlike Arabic script, for example.
There are no spaces between words in sentences.
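
The last point is easy to see in a few lines of Python. This is only an illustrative sketch: the sample sentences and the helper name `whitespace_tokens` are made up for the demo and are not part of Manticore Search.

```python
# Illustration only: naive whitespace tokenization works for English
# but returns a whole Chinese sentence as a single "word".

def whitespace_tokens(text: str) -> list[str]:
    """Split text on whitespace, the simplest possible tokenizer."""
    return text.split()

english = "full text search is fun"
chinese = "我喜欢全文搜索"  # roughly: "I like full-text search"

print(whitespace_tokens(english))  # five separate tokens
print(whitespace_tokens(chinese))  # one token: the entire sentence

# Case folding changes nothing, because the characters have no case.
print(chinese.lower() == chinese == chinese.upper())
```

With such a tokenizer, a query for 搜索 ("search") would never match the sentence above, because the index would contain only the full sentence as one token.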

So what’s the point? To find an exact match in a full-text search, we first have to deal with tokenization, whose main task is to break the text down into low-level units (tokens) that the user can search for. Since Chinese text has no spaces, word boundaries cannot simply be read off the text.
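
One common way to tokenize text without spaces is to index overlapping character N-grams. The sketch below is a minimal, assumption-laden illustration of that idea, not Manticore Search's implementation: the function names `is_cjk` and `cjk_ngrams` are hypothetical, and the CJK check covers only the basic CJK Unified Ideographs block for brevity.

```python
# A minimal sketch of character N-gram tokenization for mixed CJK/Latin text.
# Assumption: "CJK" here means only the basic Unified Ideographs block.

def is_cjk(ch: str) -> bool:
    """Rough check for a CJK ideograph (U+4E00..U+9FFF only)."""
    return "\u4e00" <= ch <= "\u9fff"

def cjk_ngrams(text: str, n: int = 2) -> list[str]:
    """Emit overlapping n-grams for CJK runs and lowercase words otherwise."""
    tokens: list[str] = []
    run = ""   # current stretch of consecutive CJK characters
    word = ""  # current stretch of non-CJK letters or digits

    def flush() -> None:
        nonlocal run, word
        if run:
            if len(run) < n:
                tokens.append(run)  # run too short to form a full n-gram
            else:
                tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
            run = ""
        if word:
            tokens.append(word.lower())
            word = ""

    for ch in text:
        if is_cjk(ch):
            if word:
                flush()  # switching from a Latin word to a CJK run
            run += ch
        elif ch.isalnum():
            if run:
                flush()  # switching from a CJK run to a Latin word
            word += ch
        else:
            flush()      # whitespace or punctuation ends any token
    flush()
    return tokens

doc_tokens = cjk_ngrams("我喜欢全文搜索", n=2)
query_tokens = cjk_ngrams("搜索", n=2)

print(doc_tokens)
print(all(t in doc_tokens for t in query_tokens))  # the query now matches
```

N-grams are simple and need no dictionary, but they also produce tokens that are not real words (for example 欢全 above), which can lower precision. Dictionary-based or statistical word segmentation is the usual alternative when that matters.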

hackernoon.com
