In this article, we take a look at text analysis within a full-text search engine.
Full-text search refers to techniques for searching the textual content of a document or a collection of documents. A full-text search engine examines all the textual content within documents as it tries to match one or more search terms, and text analysis is a pivotal component of that process.
You’ve probably heard of the most well-known full-text search engine: Lucene, with Elasticsearch built on top of it. Couchbase’s Full-Text Search (FTS) engine is powered by Bleve, and this article will showcase the various ways to analyze text within this engine.
Bleve is an open-source text indexing and search library implemented in Go and developed in-house at Couchbase.
Couchbase’s FTS engine supports indexes that subscribe to data residing within a Couchbase Server, and it indexes the data that it ingests from the server. It’s a distributed system, meaning it can partition data across multiple nodes in a cluster; a search scatters the request to all nodes within the cluster and gathers their responses before replying to the application.
The FTS engine distributes the documents ingested for an index across a configurable number of partitions, and these partitions can reside on multiple nodes within a cluster. Each partition follows the same set of rules that the FTS index is configured with to analyze and index text into the full-text search database.
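The partitioning and scatter-gather behavior described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the hash-based placement, the `Partition` class, and the partition count of 4 are hypothetical and do not reflect how Couchbase actually assigns documents to partitions.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical value; the real number is configurable


def partition_for(doc_id, num_partitions=NUM_PARTITIONS):
    """Pick a partition from a stable hash of the document ID (assumed scheme)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_partitions


class Partition:
    """One index partition holding a small token -> document-ID map."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, doc_id, text):
        # Every partition applies the same analysis rule (here: lowercase + split).
        for token in text.lower().split():
            self.postings[token].add(doc_id)

    def search(self, term):
        # Return a copy so callers cannot mutate the partition's postings.
        return set(self.postings.get(term.lower(), set()))


def build(docs, num_partitions=NUM_PARTITIONS):
    """Distribute documents across partitions and index them."""
    partitions = [Partition() for _ in range(num_partitions)]
    for doc_id, text in docs.items():
        partitions[partition_for(doc_id, num_partitions)].index(doc_id, text)
    return partitions


def scatter_gather(partitions, term):
    """Send the term to every partition, then merge the partial results."""
    hits = set()
    for partition in partitions:          # scatter
        hits |= partition.search(term)    # gather
    return sorted(hits)
```

A search only returns complete results if every partition is consulted, because any partition may hold a matching document; that is why the request is scattered to all of them rather than routed to one.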
The text analysis component of a Full-Text search engine is responsible for breaking down the raw text into a list of words – which we’ll refer to as tokens. These tokens are more suitable for indexing in the database and searching.
Couchbase’s FTS engine handles text indexing for JSON documents. It builds an index for the analyzed content and stores in the database both the index and all the metadata needed to link the generated tokens back to the original documents in which they occur.
An inverted index is the data structure used to index the tokens generated from text and make search queries faster. This index maps every generated token to the documents that contain it.
For example, take the following documents ..
The inverted index for the tokens generated from the 2 documents above would resemble this..
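As a minimal sketch of that structure, the snippet below builds an inverted index in plain Python. The two documents are hypothetical stand-ins (not the ones from the example above), and the lowercase-and-split analysis is a deliberate simplification.

```python
from collections import defaultdict

# Hypothetical documents, used only for illustration.
docs = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy brown dog",
}


def build_inverted_index(documents):
    """Map each token to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for token in documents[doc_id].lower().split():
            if doc_id not in index[token]:  # record each document once per token
                index[token].append(doc_id)
    return dict(index)


inverted = build_inverted_index(docs)
for token in sorted(inverted):
    print(f"{token:<6} -> {inverted[token]}")
```

Looking up a search term is then a single dictionary access instead of a scan over every document: `inverted["brown"]` yields both document IDs, while `inverted["fox"]` yields only `doc1`.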
Here’s a diagram highlighting the components of the full-text search engine ..
The components of a text analyzer can broadly be classified into two categories: tokenizers and filters.
Couchbase’s engine further categorizes filters into character filters and token filters.
Before we dive into the function of each of these components, here’s an overview of a text analyzer ..
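As a rough code-level view of such an analyzer, here is a minimal sketch in plain Python: a tokenizer produces tokens, and a chain of filters then transforms them in order. The function names and the tiny stop-word list are hypothetical, and the sketch is not Bleve's API; it only models filters that operate on tokens.

```python
STOP_WORDS = {"a", "an", "is", "the", "this"}  # tiny illustrative list


def whitespace_tokenizer(text):
    """Split raw text into tokens on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Normalize case so that 'Bleve' and 'bleve' index to the same token."""
    return [token.lower() for token in tokens]


def stop_word_filter(tokens):
    """Drop very common words that add little value to a search."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text, tokenizer, filters):
    """Run the tokenizer, then apply each filter in the given order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("This is the Bleve library",
              whitespace_tokenizer,
              [lowercase_filter, stop_word_filter]))
# -> ['bleve', 'library']
```

Filter order matters in this sketch: if the stop-word filter ran before lowercasing, the capitalized "This" would not match the lowercase stop-word list and would survive as a token.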
A tokenizer is the first component to which documents are subjected. As the name suggests, it breaks the raw text into a list of tokens. This conversion depends on the rule-set defined for the tokenizer.
Take this sample text as an example: “this is my email ID: [email protected]”
A couple of configurable tokenizers...
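To show how the rule-set changes the outcome, here is a minimal sketch of two tokenization rules in plain Python. These are not Bleve's own tokenizers; the function names are hypothetical, and the email address is a made-up placeholder.

```python
import re

# Hypothetical sample; the address is a placeholder, not a real one.
SAMPLE = "this is my email ID: user@example.com"


def whitespace_tokenize(text):
    """Rule: a token is any run of non-whitespace characters."""
    return text.split()


def letter_tokenize(text):
    """Rule: a token is any run of ASCII letters; everything else is a boundary."""
    return re.findall(r"[A-Za-z]+", text)


print(whitespace_tokenize(SAMPLE))
# -> ['this', 'is', 'my', 'email', 'ID:', 'user@example.com']

print(letter_tokenize(SAMPLE))
# -> ['this', 'is', 'my', 'email', 'ID', 'user', 'example', 'com']
```

The whitespace rule keeps the address intact but also leaves the colon attached to "ID:", whereas the letter rule strips punctuation and splits the address into three separate tokens. Which behavior is preferable depends on what users are expected to search for.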