Toshi: A Full-text Search Engine in Rust

Toshi

What is a Toshi?

Toshi is a three year old Shiba Inu. He is a very good boy and is the official mascot of this project. Toshi personally reviews all code before it is committed to this repository and is dedicated to only accepting the highest quality contributions from his human. He will, though, accept treats for easier code reviews.

Please note that this is far from production ready, also Toshi is still under active development, I'm just slow.

Description

Toshi is meant to be a full-text search engine similar to Elasticsearch. Toshi strives to be to Elasticsearch what Tantivy is to Lucene.

Motivations

Toshi will always target stable Rust and will try our best to never make any use of unsafe Rust. While underlying libraries may make some use of unsafe, Toshi will make a concerted effort to vet these libraries in an effort to be completely free of unsafe Rust usage. The reason I chose this was because I felt that for this to actually become an attractive option for people to consider it would have to have be safe, stable and consistent. This was why stable Rust was chosen because of the guarantees and safety it provides. I did not want to go down the rabbit hole of using nightly features to then have issues with their stability later on. Since Toshi is not meant to be a library, I'm perfectly fine with having this requirement because people who would want to use this more than likely will take it off the shelf and not modify it. My motivation was to cater to that use case when building Toshi.

Build Requirements

At this current time Toshi should build and work fine on Windows, Mac OS X, and Linux. From dependency requirements you are going to need 1.39.0 and Cargo installed in order to build. You can get rust easily from rustup.

Configuration

There is a default configuration file in config/config.toml:

host = "127.0.0.1"
port = 8080
path = "data2/"
writer_memory = 200000000
log_level = "info"
json_parsing_threads = 4
bulk_buffer_size = 10000
auto_commit_duration = 10
experimental = false

[experimental_features]
master = true
nodes = [
    "127.0.0.1:8081"
]

[merge_policy]
kind = "log"
min_merge_size = 8
min_layer_size = 10_000
level_log_size = 0.75

Host

host = "localhost"

The hostname Toshi will bind to upon start.

Port

port = 8080

The port Toshi will bind to upon start.

Path

path = "data/"

The data path where Toshi will store its data and indices.

Writer Memory

writer_memory = 200000000

The amount of memory (in bytes) Toshi should allocate to commits for new documents.

Log Level

log_level = "info"

The detail level to use for Toshi's logging.

Json Parsing

json_parsing_threads = 4

When Toshi does a bulk ingest of documents it will spin up a number of threads to parse the document's json as it's received. This controls the number of threads spawned to handle this job.

Bulk Buffer

bulk_buffer_size = 10000

This will control the buffer size for parsing documents into an index. It will control the amount of memory a bulk ingest will take up by blocking when the message buffer is filled. If you want to go totally off the rails you can set this to 0 in order to make the buffer unbounded.

Auto Commit Duration

auto_commit_duration = 10

This controls how often an index will automatically commit documents if there are docs to be committed. Set this to 0 to disable this feature, but you will have to do commits yourself when you submit documents.

Merge Policy

[merge_policy]
kind = "log"

Tantivy will merge index segments according to the configuration outlined here. There are 2 options for this. "log" which is the default segment merge behavior. Log has 3 additional values to it as well. Any of these 3 values can be omitted to use Tantivy's default value. The default values are listed below.

min_merge_size = 8
min_layer_size = 10_000
level_log_size = 0.75

In addition there is the "nomerge" option, in which Tantivy will do no merging of segments.

Experimental Settings

experimental = false

[experimental_features]
master = true
nodes = [
    "127.0.0.1:8081"
]

In general these settings aren't ready for usage yet as they are very unstable or flat out broken. Right now the distribution of Toshi is behind this flag, so if experimental is set to false then all these settings are ignored.

Building and Running

Toshi can be built using cargo build --release. Once Toshi is built you can run ./target/release/toshi from the top level directory to start Toshi according to the configuration in config/config.toml

You should get a startup message like this.

  ______         __   _   ____                 __
 /_  __/__  ___ / /  (_) / __/__ ___ _________/ /
  / / / _ \(_-</ _ \/ / _\ \/ -_) _ `/ __/ __/ _ \
 /_/  \___/___/_//_/_/ /___/\__/\_,_/_/  \__/_//_/
 Such Relevance, Much Index, Many Search, Wow
 
 INFO  toshi::index > Indexes: []

You can verify Toshi is running with:

curl -X GET http://localhost:8080/

which should return:

{
  "name": "Toshi Search",
  "version": "0.1.1"
}

Once toshi is running it's best to check the requests.http file in the root of this project to see some more examples of usage.

Example Queries

Term Query

{ "query": {"term": {"test_text": "document" } }, "limit": 10 }

Fuzzy Term Query

{ "query": {"fuzzy": {"test_text": {"value": "document", "distance": 0, "transposition": false } } }, "limit": 10 }

Phrase Query

{ "query": {"phrase": {"test_text": {"terms": ["test","document"] } } }, "limit": 10 }

Range Query

{ "query": {"range": { "test_i64": { "gte": 2012, "lte": 2015 } } }, "limit": 10 }

Regex Query

{ "query": {"regex": { "test_text": "d[ou]{1}c[k]?ument" } }, "limit": 10 }

Boolean Query

{ "query": {"bool": {"must": [ { "term": { "test_text": "document" } } ], "must_not": [ {"range": {"test_i64": { "gt": 2017 } } } ] } }, "limit": 10 }

Usage

To try any of the above queries you can use the above example

curl -X POST http://localhost:8080/test_index -H 'Content-Type: application/json' -d '{ "query": {"term": {"test_text": "document" } }, "limit": 10 }'

Also, to note, limit is optional, 10 is the default value. It's only included here for completeness.

Running Tests

cargo test

Download Details:
Author: toshi-search
Source Code: https://github.com/toshi-search/Toshi
License: MIT License

#rust  #rustlang  #machinelearing #searchengine 

What is GEEK

Buddha Community

Toshi: A Full-text Search Engine in Rust

I am Developer

1597475640

Laravel 7 Full Text Search MySQL

Here, I will show you how to create full text search in laravel app. You just follow the below easy steps and create full text search with mysql db in laravel.

Laravel 7 Full Text Search Mysql

Let’s start laravel full-text search implementation in laravel 7, 6 versions:

  1. Step 1: Install Laravel New App
  2. Step 2: Configuration DB .evn file
  3. Step 3: Run Migration
  4. Step 4: Install Full Text Search Package
  5. Step 5: Add Fake Records in DB
  6. Step 6: Add Routes,
  7. Step 7: Create Controller
  8. Step 8: Create Blade View
  9. Step 9: Start Development Server

https://www.tutsmake.com/laravel-full-text-search-tutorial/

#laravel full text search mysql #laravel full text search query #mysql full text search in laravel #full text search in laravel 6 #full text search in laravel 7 #using full text search in laravel

Hollie  Ratke

Hollie Ratke

1597989600

Text Analysis Within a Full-Text Search Engine

Full-Text Search refers to techniques for searching text content within a document or a collection of documents that hold textual content. A Full-Text search engine examines all the textual content within documents as it tries to match a single search term or several terms, text analysis being a pivotal component.

You’ve probably heard of the most well-known Full-Text Search engine: Lucene with Elasticsearch built on top of it. Couchbase’s Full-Text Search (FTS) Engine is powered by Bleve, and this article will showcase the various ways to analyze text within this engine.

Bleve is an open-sourced text indexing and search library implemented in Go, developed in-house at Couchbase.

Couchbase’s FTS engine supports indexes that subscribe to data residing within a Couchbase Server and indexes data that it ingests from the server. It’s a distributed system – meaning it can partition data across multiple nodes in a cluster and searches involve scattering the request and gathering responses from across all nodes within the cluster before responding to the application.

The FTS engine distributes documents ingested for an index across a configurable number of partitions and these partitions could reside across multiple nodes within a cluster. Each partition follows the same set of rules that the FTS index is configured with – to analyze and index text into the full-text search database.

The text analysis component of a Full-Text search engine is responsible for breaking down the raw text into a list of words – which we’ll refer to as tokens. These tokens are more suitable for indexing in the database and searching.

Couchbase’s FTS Engine handles text indexing for JSON documents. It builds an index for the content that is analyzed and stores into the database – the index along with all the relevant metadata needed to link the tokens generated to the original documents within which they reside.

An Inverted index is the data structure chosen to index the tokens generated from text, to make search queries faster. This index links every token generated to documents that contain the token.

For example, take the following documents …

The inverted index for the tokens generated from the 2 documents above would resemble this…

Here’s a diagram highlighting the components of the full-text search engine …

A Text Analyzer

The components of a text analyzer can broadly be classified into 2 categories:

  • Tokenizer
  • Filters

Couchbase’s engine further categorizes filters into:

  • Character filters
  • Token filters

Before we dive into the function of each of these components, here’s an overview of a text analyzer …

Tokenizer

A tokenizer is the first component to which the documents are subjected to. As the name suggests, it breaks the raw text into a list of tokens. This conversion will depend on a rule-set defined for the tokenizer.

Stock tokenizers…

Take this sample text for an example: “_this is my email ID: _abhi123@cb.com

A couple of configurable tokenizers…

  • Exception … This tokenizer allows the user to enter exception patterns (regular expressions) over the stock tokenizers.
  • Regexp … This tokenizer extracts text that matches the pattern (a regular expression) as tokens.

For example:

#json #couchbase #search #go #text analysis #full-text search #bleve #full-text #full-text-indexing

Ahebwe  Oscar

Ahebwe Oscar

1624131000

Basic and Full-text Search with Django and Postgres

Unlike relational databases, full-text search is not standardized. There are a number of open-source options like ElasticSearch, Solr, and Xapian. ElasticSearch is probably the most popular solution; however, it’s complicated to set up and maintain. Further, if you’re not taking advantage of some of the advanced features that ElasticSearch offers, you should stick with the full-text search capabilities that many relational (like Postgres, MySQL, SQLite) and non-relational databases (like MongoDB and CouchDB) offer. Postgres in particular is well-suited for full-text search. Django supports it out-of-the-box as well.

For the vast majority of your Django apps, you should, at the very least, start out with leveraging full-text search from Postgres before looking to a more powerful solution like ElasticSearch or Solr.

In this article, we’ll add basic and full-text search to a Django app with Postgres.

Objectives

By the end of this article, you should be able to:

  1. Set up basic search functionality in a Django app with the Q object
  2. Add full-text search to a Django app
  3. Sort full-text search results by relevance using stemming, ranking, and weighting

#basic and full-text search with django and postgres #django #search lookup #postgres #full-text search #postgres full text search

Navigating Between DOM Nodes in JavaScript

In the previous chapters you've learnt how to select individual elements on a web page. But there are many occasions where you need to access a child, parent or ancestor element. See the JavaScript DOM nodes chapter to understand the logical relationships between the nodes in a DOM tree.

DOM node provides several properties and methods that allow you to navigate or traverse through the tree structure of the DOM and make changes very easily. In the following section we will learn how to navigate up, down, and sideways in the DOM tree using JavaScript.

Accessing the Child Nodes

You can use the firstChild and lastChild properties of the DOM node to access the first and last direct child node of a node, respectively. If the node doesn't have any child element, it returns null.

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");
console.log(main.firstChild.nodeName); // Prints: #text

var hint = document.getElementById("hint");
console.log(hint.firstChild.nodeName); // Prints: SPAN
</script>

Note: The nodeName is a read-only property that returns the name of the current node as a string. For example, it returns the tag name for element node, #text for text node, #comment for comment node, #document for document node, and so on.

If you notice the above example, the nodeName of the first-child node of the main DIV element returns #text instead of H1. Because, whitespace such as spaces, tabs, newlines, etc. are valid characters and they form #text nodes and become a part of the DOM tree. Therefore, since the <div> tag contains a newline before the <h1> tag, so it will create a #text node.

To avoid the issue with firstChild and lastChild returning #text or #comment nodes, you could alternatively use the firstElementChild and lastElementChild properties to return only the first and last element node, respectively. But, it will not work in IE 9 and earlier.

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");
alert(main.firstElementChild.nodeName); // Outputs: H1
main.firstElementChild.style.color = "red";

var hint = document.getElementById("hint");
alert(hint.firstElementChild.nodeName); // Outputs: SPAN
hint.firstElementChild.style.color = "blue";
</script>

Similarly, you can use the childNodes property to access all child nodes of a given element, where the first child node is assigned index 0. Here's an example:

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");

// First check that the element has child nodes 
if(main.hasChildNodes()) {
    var nodes = main.childNodes;
    
    // Loop through node list and display node name
    for(var i = 0; i < nodes.length; i++) {
        alert(nodes[i].nodeName);
    }
}
</script>

The childNodes returns all child nodes, including non-element nodes like text and comment nodes. To get a collection of only elements, use children property instead.

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");

// First check that the element has child nodes 
if(main.hasChildNodes()) {
    var nodes = main.children;
    
    // Loop through node list and display node name
    for(var i = 0; i < nodes.length; i++) {
        alert(nodes[i].nodeName);
    }
}
</script>

#javascript 

Ray  Patel

Ray Patel

1623892558

Search Emails From A Domain Through Search Engines for Python

EmailFinder

Search emails from a domain through search engines for python

Installation:

> pip3 install emailfinder

Upgrades are also available using:

> pip3 install emailfinder --upgrade

Search Engines

  • google: Ok (note cookies policy and Captcha!).
  • bing: OK.
  • baidu: OK (few requests).
  • bing: Hunting Robots very fast.

#email #python #search emails #search emails through search engines #search emails from a domain through search engines for python #domain