Text Analysis Within a Full-Text Search Engine

This article takes a look at text analysis within a full-text search engine.

Full-Text Search refers to techniques for searching the text content of a document or a collection of documents. A Full-Text Search engine examines all the textual content within documents as it tries to match one or more search terms, and text analysis is a pivotal component of that process.

You’ve probably heard of the most well-known Full-Text Search engine: Lucene, with Elasticsearch built on top of it. Couchbase’s Full-Text Search (FTS) engine is powered by Bleve, and this article showcases the various ways to analyze text within this engine.

Bleve is an open-source text indexing and search library implemented in Go, developed in-house at Couchbase.

Couchbase’s FTS engine supports indexes that subscribe to data residing within a Couchbase Server and index the data they ingest from the server. It’s a distributed system: it can partition data across multiple nodes in a cluster, and a search involves scattering the request to all nodes within the cluster and gathering their responses before responding to the application.

The FTS engine distributes the documents ingested for an index across a configurable number of partitions, and these partitions can reside on multiple nodes within a cluster. Each partition follows the same set of rules that the FTS index is configured with to analyze and index text into the full-text search database.
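To make the partition and scatter-gather idea concrete, here is a minimal sketch in plain Go. It is not Couchbase's implementation: the hash-based placement in `partitionFor`, the `partition` type, and the substring match standing in for a real index lookup are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
	"sync"
)

// partition holds one slice of the documents, keyed by document ID.
type partition map[string]string

// partitionFor picks a partition by hashing the document ID.
// This placement scheme is an assumption of the sketch, not Couchbase's logic.
func partitionFor(docID string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(docID))
	return int(h.Sum32() % uint32(numPartitions))
}

// distribute spreads the documents across numPartitions partitions.
func distribute(docs map[string]string, numPartitions int) []partition {
	parts := make([]partition, numPartitions)
	for i := range parts {
		parts[i] = partition{}
	}
	for id, text := range docs {
		parts[partitionFor(id, numPartitions)][id] = text
	}
	return parts
}

// search scatters the term to every partition concurrently, then gathers
// and sorts the IDs of the matching documents. A case-insensitive substring
// match stands in for a real index lookup.
func search(parts []partition, term string) []string {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		hits []string
	)
	term = strings.ToLower(term)
	for _, p := range parts {
		wg.Add(1)
		go func(p partition) {
			defer wg.Done()
			for id, text := range p {
				if strings.Contains(strings.ToLower(text), term) {
					mu.Lock()
					hits = append(hits, id)
					mu.Unlock()
				}
			}
		}(p)
	}
	wg.Wait()
	sort.Strings(hits)
	return hits
}

func main() {
	// Hypothetical documents used only for this sketch.
	docs := map[string]string{
		"doc1": "Couchbase full-text search",
		"doc2": "Bleve is written in Go",
		"doc3": "Search across partitions",
	}
	parts := distribute(docs, 4)
	fmt.Println(search(parts, "search")) // prints [doc1 doc3]
}
```

Every partition applies the same matching rule, which mirrors the point above that each partition follows the same index configuration.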

The text analysis component of a Full-Text search engine is responsible for breaking down the raw text into a list of words – which we’ll refer to as tokens. These tokens are more suitable for indexing in the database and searching.

Couchbase’s FTS Engine handles text indexing for JSON documents. It builds an index for the analyzed content and stores it in the database, along with all the relevant metadata needed to link the generated tokens to the original documents in which they reside.

An inverted index is the data structure used to index the tokens generated from text, which makes search queries faster. This index links every generated token to the documents that contain it.

For example, take the following documents ..

The inverted index for the tokens generated from the 2 documents above would resemble this..
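As a rough illustration of the structure, the sketch below builds an inverted index in plain Go. The two documents are hypothetical stand-ins rather than the article's own, the function name `buildInvertedIndex` is made up for this example, and tokens are produced by simple lowercasing and whitespace splitting instead of a real analyzer.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildInvertedIndex maps each lowercased, whitespace-separated token to the
// sorted list of IDs of the documents that contain it.
func buildInvertedIndex(docs map[string]string) map[string][]string {
	// Collect document IDs per token in a set to avoid duplicates.
	seen := make(map[string]map[string]bool)
	for id, text := range docs {
		for _, tok := range strings.Fields(strings.ToLower(text)) {
			if seen[tok] == nil {
				seen[tok] = make(map[string]bool)
			}
			seen[tok][id] = true
		}
	}
	// Flatten each set into a sorted posting list.
	index := make(map[string][]string, len(seen))
	for tok, ids := range seen {
		for id := range ids {
			index[tok] = append(index[tok], id)
		}
		sort.Strings(index[tok])
	}
	return index
}

func main() {
	// Hypothetical documents, not taken from the article.
	docs := map[string]string{
		"doc1": "Couchbase supports full text search",
		"doc2": "Bleve powers full text search",
	}
	index := buildInvertedIndex(docs)
	fmt.Println(index["search"]) // prints [doc1 doc2]
	fmt.Println(index["bleve"])  // prints [doc2]
}
```

A query for a token then becomes a single map lookup instead of a scan over every document, which is where the speedup comes from.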

Here’s a diagram highlighting the components of the full-text search engine ..

A Text Analyzer

The components of a text analyzer can broadly be classified into 2 categories:

  • Tokenizer
  • Filters

Couchbase’s engine further categorizes filters into:

  • Character filters
  • Token filters

Before we dive into the function of each of these components, here’s an overview of a text analyzer ..
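One way to picture how these pieces fit together is the sketch below, written in plain Go. The type names (`CharFilter`, `Tokenizer`, `TokenFilter`, `Analyzer`) and the helper functions are made up for this example and are not Bleve's API. The ordering used here (character filters on the raw text, then the tokenizer, then token filters on the token stream) is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical types for illustration; these are not Bleve's types.
type CharFilter func(string) string
type Tokenizer func(string) []string
type TokenFilter func([]string) []string

// Analyzer chains character filters, one tokenizer, and token filters.
type Analyzer struct {
	CharFilters  []CharFilter
	Tokenizer    Tokenizer
	TokenFilters []TokenFilter
}

// Analyze runs the text through each stage in order and returns the tokens.
func (a Analyzer) Analyze(text string) []string {
	for _, cf := range a.CharFilters {
		text = cf(text)
	}
	tokens := a.Tokenizer(text)
	for _, tf := range a.TokenFilters {
		tokens = tf(tokens)
	}
	return tokens
}

// stripCommas is a character filter that replaces commas with spaces.
func stripCommas(s string) string {
	return strings.ReplaceAll(s, ",", " ")
}

// lowercase is a token filter that lowercases every token.
func lowercase(tokens []string) []string {
	out := make([]string, len(tokens))
	for i, t := range tokens {
		out[i] = strings.ToLower(t)
	}
	return out
}

// newSketchAnalyzer wires the example stages together.
func newSketchAnalyzer() Analyzer {
	return Analyzer{
		CharFilters:  []CharFilter{stripCommas},
		Tokenizer:    strings.Fields, // split on whitespace
		TokenFilters: []TokenFilter{lowercase},
	}
}

func main() {
	a := newSketchAnalyzer()
	fmt.Println(a.Analyze("Hello, Full-Text World")) // prints [hello full-text world]
}
```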

Tokenizer

A tokenizer is the component that first turns a document's raw text into a list of tokens, as the name suggests. How the text is split depends on the rule set defined for the tokenizer.
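To show how the rule set changes the result, here is a minimal sketch in plain Go with two made-up tokenizers: `whitespaceTokenize` splits only on whitespace, while `letterTokenize` splits on anything that is not a letter or digit. Both names and rules are assumptions for illustration and do not reproduce any of Bleve's stock tokenizers.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// whitespaceTokenize splits text on runs of whitespace only, so punctuation
// stays attached to the neighbouring token.
func whitespaceTokenize(text string) []string {
	return strings.FieldsFunc(text, unicode.IsSpace)
}

// letterTokenize splits text on every rune that is not a letter or a digit,
// so punctuation and hyphens act as separators.
func letterTokenize(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	text := "full-text search, in Go!"
	fmt.Println(whitespaceTokenize(text)) // prints [full-text search, in Go!]
	fmt.Println(letterTokenize(text))     // prints [full text search in Go]
}
```

The same input yields four tokens under the first rule and five under the second, which is why the choice of tokenizer matters for what a query can later match.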

Stock tokenizers...

Take this sample text as an example: “this is my email ID: [email protected]”

A couple of configurable tokenizers...

  • Exception .. This tokenizer lets the user define exception patterns (regular expressions) that are applied on top of a stock tokenizer.
  • Regexp .. This tokenizer extracts the text that matches a pattern (a regular expression) as tokens.

For example:
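As a rough sketch of both ideas in plain Go, using only the standard `regexp` package: `regexpTokenize` emits every match of a pattern as a token, and `exceptionTokenize` keeps text matching an exception pattern intact while handing everything else to a base tokenizer. The function names, the patterns, and the sample strings are hypothetical and chosen for illustration; they are not Bleve's configuration or API.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode"
)

// Hypothetical patterns used by this sketch.
var (
	numberPattern = regexp.MustCompile(`[0-9]+`)
	urlPattern    = regexp.MustCompile(`https?://\S+`)
)

// regexpTokenize returns every substring that matches the pattern as a token.
func regexpTokenize(pattern *regexp.Regexp, text string) []string {
	return pattern.FindAllString(text, -1)
}

// letterTokenize is the base tokenizer: it splits on non-letter, non-digit runes.
func letterTokenize(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// exceptionTokenize emits each match of the exception pattern as one token
// and runs the base tokenizer over the text between matches.
func exceptionTokenize(exception *regexp.Regexp, base func(string) []string, text string) []string {
	var tokens []string
	last := 0
	for _, loc := range exception.FindAllStringIndex(text, -1) {
		tokens = append(tokens, base(text[last:loc[0]])...)
		tokens = append(tokens, text[loc[0]:loc[1]])
		last = loc[1]
	}
	return append(tokens, base(text[last:])...)
}

func main() {
	fmt.Println(regexpTokenize(numberPattern, "order 66 shipped in 2 days"))
	// prints [66 2]

	fmt.Println(exceptionTokenize(urlPattern, letterTokenize, "visit https://example.com/docs for details"))
	// prints [visit https://example.com/docs for details]
}
```

Without the exception pattern, the base tokenizer alone would break the URL into `https`, `example`, `com`, and `docs`.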
