Today I’d like to highlight the challenge of implementing search in Chinese. In this article, we will go through the main difficulties of implementing full-text search for CJK languages and how to overcome them with the help of Manticore Search.

Chinese search difficulties

Chinese belongs to the so-called CJK group of languages (Chinese, Japanese, and Korean). They are probably the most complicated languages to implement full-text search for: word meanings depend heavily on the many character variations and their sequences, and the characters are not split up into words.

Specifics of the Chinese language:

  • Chinese characters have no upper or lower case. Each character has a single form, regardless of the context.
  • There are no additional marks attached to the characters, unlike the diacritics in Arabic, for example.
  • There are no spaces between words in sentences.

So what’s the point? To find an exact match in a full-text search, we have to face the challenge of tokenization, whose main task is to break the text down into the smallest units that a user can search for. Because Chinese has no spaces between words, simply splitting on whitespace does not produce those units.
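To make the problem concrete, here is a minimal Python sketch. It is an illustration only, not how Manticore Search works internally: the function names (`whitespace_tokens`, `cjk_ngrams`, `is_cjk`) are hypothetical, and the sketch assumes that treating only the CJK Unified Ideographs block (U+4E00 to U+9FFF) as Chinese is good enough for a demo. It compares naive whitespace splitting with a simple character N-gram split, one common way to tokenize text that has no word boundaries.

```python
def is_cjk(ch):
    """Assumption: only the CJK Unified Ideographs block counts as Chinese here."""
    return "\u4e00" <= ch <= "\u9fff"


def whitespace_tokens(text):
    """Naive tokenizer: split on whitespace, as one might for English."""
    return text.split()


def cjk_ngrams(text, n=1):
    """Split each run of consecutive CJK characters into overlapping n-grams.

    Non-CJK characters only act as separators in this sketch; they are
    not emitted as tokens. A run shorter than n is emitted whole.
    """
    tokens = []
    run = []
    for ch in text + " ":  # trailing space flushes the last run
        if is_cjk(ch):
            run.append(ch)
            continue
        if run:
            if len(run) < n:
                tokens.append("".join(run))
            else:
                for i in range(len(run) - n + 1):
                    tokens.append("".join(run[i:i + n]))
            run = []
    return tokens


sentence = "我喜欢全文搜索"  # "I like full-text search"

# Whitespace splitting returns the whole sentence as a single token,
# so a query for "搜索" ("search") could never match it exactly.
print(whitespace_tokens(sentence))

# Unigrams: every character becomes its own searchable unit.
print(cjk_ngrams(sentence, n=1))

# Bigrams: overlapping pairs, which include "全文" and "搜索".
print(cjk_ngrams(sentence, n=2))
```

N-grams are simple and need no dictionary, but they ignore real word boundaries, so they can match character sequences that span two words (for example "欢全" above). Dictionary-based or statistical segmentation avoids that at the cost of more machinery.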

