Alfie Mellor

Machine Learning Projects with DVC Extension in VS Code

Machine Learning Experimentation in VS Code with DVC Extension

Learn how to manage your machine learning projects and make them reproducible with the open-source tool DVC and its new VS Code extension. We will see how to track datasets and models, and how to run, compare, visualize, and track machine learning experiments right in VS Code.

Chapters:
00:00 VS Code Livestream begins
01:36 Welcome and intro
04:17 Today's example ML problem
06:59 What is DVC and why use it
10:19 Intro model 
11:25 Demo DVC Extension for VS Code
19:02 Q&A
36:22 Learn more

Resources:
Code https://aka.ms/VSCode/DVCExt/Code 
DVC extension https://morioh.com/p/ac711e5a9834 
Git Repo https://github.com/devcontainers/cli 
IterativeAI https://learn.iterative.ai/ 

#vscode #machinelearning

sam kirubakar

The 5 Phases of Natural Language Processing

Natural Language Processing – Overview

  • Natural language processing (NLP) concerns the interactions between computers and human language: how to program computers to process and analyse large amounts of natural language data.
  • The technology can accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
  • NLP makes computers capable of “understanding” the contents of documents, including the contextual nuances of the language within them.
  • Most higher-level NLP applications involve aspects that emulate intelligent behavior and apparent comprehension of natural language.
  • Many different classes of machine-learning algorithms have been applied to natural-language-processing tasks.
  • These algorithms take as input a large set of “features” that are generated from the input data.

Natural Language Processing – Market Size

The Natural Language Processing market was valued at USD 11.02 Billion in 2020 and is projected to reach USD 45.79 Billion by 2028, growing at a CAGR of 19.49% from 2021 to 2028.

Natural Language Processing – Segmentation Analysis

Natural Language Processing – Advantages

  1. Better data analysis
  2. Streamlined processes
  3. Cost-effective
  4. Empowered employees
  5. Enhanced customer experience

Natural Language Processing – 5 Phases

  1. Phase 1 – Lexical Analysis
  2. Phase 2 – Syntactic Analysis
  3. Phase 3 – Semantic Analysis
  4. Phase 4 – Discourse Analysis
  5. Phase 5 – Pragmatic Analysis

Phase 1 – Lexical Analysis

  • Lexical analysis is the process of converting a sequence of characters into a sequence of tokens.
  • A lexer is generally combined with a parser, and together they analyze the syntax of programming languages, web pages, and so forth.
  • Lexers and parsers are most often used for compilers but can be used for other computer language tools, such as pretty printers or linters.
  • Lexical analysis is also an important early stage of natural language processing, where text or sound waves are segmented into words and other units (a minimal sketch follows below).
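
As a toy illustration (not tied to any particular NLP library; the token categories are invented for this example), a lexical analysis step can be as simple as a regular-expression tokenizer:

import re

# Toy lexer: turn raw text into (token_type, value) pairs.
TOKEN_RE = re.compile(r"(?P<WORD>[A-Za-z]+)|(?P<NUMBER>\d+)|(?P<PUNCT>[^\w\s])")

def tokenize(text):
    for match in TOKEN_RE.finditer(text):
        yield match.lastgroup, match.group()

print(list(tokenize("NLP was worth 11.02 billion USD in 2020!")))
# [('WORD', 'NLP'), ('WORD', 'was'), ('WORD', 'worth'), ('NUMBER', '11'), ('PUNCT', '.'), ...]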

Phase 2 – Syntactic Analysis

  • Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, whether in natural language, computer languages, or data structures, for conformance to the rules of a formal grammar.
  • It is used in the analysis of computer languages, referring to the syntactic analysis of the input code into its component parts to facilitate the writing of compilers and interpreters.
  • Grammatical rules are applied to categories and groups of words, not individual words. Syntactic analysis basically assigns a semantic structure to text.
  • Syntactic analysis is a very important part of NLP that helps in understanding the grammatical structure of any sentence (a small sketch follows below).
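
As a brief sketch of syntactic analysis in code (assuming the nltk package is installed; the toy grammar below is invented for this example):

import nltk

# Toy context-free grammar; real systems use far richer grammars or
# statistical/neural parsers.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'parser' | 'sentence'
V -> 'analyzes'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the parser analyzes a sentence".split()):
    print(tree)
# (S (NP (Det the) (N parser)) (VP (V analyzes) (NP (Det a) (N sentence))))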

Phase 3 – Semantic Analysis

  • Semantic Analysis attempts to understand the meaning of Natural Language.
  • Semantic Analysis of Natural Language captures the meaning of the given text while considering context, logical structuring of sentences, and grammar roles.
  • The 2 parts of Semantic Analysis are (a) Lexical Semantic Analysis and (b) Compositional Semantic Analysis.
  • Semantic analysis can begin with the relationship between individual words, as illustrated below.
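
A short sketch of the lexical side of semantic analysis using WordNet (assumes nltk is installed; the WordNet corpus needs a one-time download):

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time corpus download

# Lexical semantics: word senses and relations between individual words.
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())

# Similarity of two senses based on their positions in the hypernym tree.
dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
print(dog.path_similarity(cat))  # value in (0, 1]; higher means more similar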

Phase 4 – Discourse Analysis

  • Researchers use Discourse analysis to uncover the motivation behind a text.
  • It is useful for studying the underlying meaning of a spoken or written text as it considers the social and historical contexts.
  • Discourse analysis is a process of performing text or language analysis, involving text interpretation, and understanding the social interactions.

Phase 5 – Pragmatic Analysis

  • Pragmatic Analysis is part of the process of extracting information from text.
  • It focuses on taking a structured set of text and figuring out the actual meaning of the text.
  • It also focuses on the meaning of words in their time and context.
  • Effects on interpretation can be measured using pragmatic analysis by understanding the communicative and social content.

#machinelearning #artificial-intelligence #nlp #usa #uk

Aida Stamm

Linear Algebraic Tools in Machine Learning and Data Science

Math and Architectures of Deep Learning sets out the foundations of DL in a way that’s both useful and accessible to working practitioners. Each chapter explores a new fundamental DL concept or architectural pattern, explaining the underpinning mathematics and demonstrating how they work in practice with well-annotated Python code. You’ll start with a primer of basic algebra, calculus, and statistics, working your way up to state-of-the-art DL paradigms taken from the latest research.

Check out Krishnendu Chaudhury's book 📖 Math and Architectures of Deep Learning | http://mng.bz/K2ag 📖. For 40% off this book, use the ⭐ DISCOUNT CODE: twitchaud40 ⭐. Learn the most important tools in the repertoire of a data scientist and machine learning practitioner - Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Latent Semantic Analysis (LSA) - with the help of Krishnendu Chaudhury, a deep learning and computer vision expert with decade-long stints at both Google and Adobe Systems.

#machinelearning #datascience #math #linearalgebraic #mathematics

Awesome Rust

M2cgen: Transform ML Models into Native Code

m2cgen

m2cgen (Model 2 Code Generator) is a lightweight library which provides an easy way to transpile trained statistical models into native code (Python, C, Java, Go, JavaScript, Visual Basic, C#, PowerShell, R, PHP, Dart, Haskell, Ruby, F#, Rust, Elixir).

Installation

Supported Python version is >= 3.7.

pip install m2cgen

Supported Languages

  • C
  • C#
  • Dart
  • F#
  • Go
  • Haskell
  • Java
  • JavaScript
  • PHP
  • PowerShell
  • Python
  • R
  • Ruby
  • Rust
  • Visual Basic (VBA-compatible)
  • Elixir

Supported Models

Linear

Classification:
  • scikit-learn
    • LogisticRegression
    • LogisticRegressionCV
    • PassiveAggressiveClassifier
    • Perceptron
    • RidgeClassifier
    • RidgeClassifierCV
    • SGDClassifier
  • lightning
    • AdaGradClassifier
    • CDClassifier
    • FistaClassifier
    • SAGAClassifier
    • SAGClassifier
    • SDCAClassifier
    • SGDClassifier

Regression:

  • scikit-learn
    • ARDRegression
    • BayesianRidge
    • ElasticNet
    • ElasticNetCV
    • GammaRegressor
    • HuberRegressor
    • Lars
    • LarsCV
    • Lasso
    • LassoCV
    • LassoLars
    • LassoLarsCV
    • LassoLarsIC
    • LinearRegression
    • OrthogonalMatchingPursuit
    • OrthogonalMatchingPursuitCV
    • PassiveAggressiveRegressor
    • PoissonRegressor
    • RANSACRegressor (only supported regression estimators can be used as a base estimator)
    • Ridge
    • RidgeCV
    • SGDRegressor
    • TheilSenRegressor
    • TweedieRegressor
  • StatsModels
    • Generalized Least Squares (GLS)
    • Generalized Least Squares with AR Errors (GLSAR)
    • Generalized Linear Models (GLM)
    • Ordinary Least Squares (OLS)
    • [Gaussian] Process Regression Using Maximum Likelihood-based Estimation (ProcessMLE)
    • Quantile Regression (QuantReg)
    • Weighted Least Squares (WLS)
  • lightning
    • AdaGradRegressor
    • CDRegressor
    • FistaRegressor
    • SAGARegressor
    • SAGRegressor
    • SDCARegressor
    • SGDRegressor
SVM

Classification:
  • scikit-learn
    • LinearSVC
    • NuSVC
    • OneClassSVM
    • SVC
  • lightning
    • KernelSVC
    • LinearSVC

Regression:

  • scikit-learn
    • LinearSVR
    • NuSVR
    • SVR
  • lightning
    • LinearSVR
Tree

Classification:

  • DecisionTreeClassifier
  • ExtraTreeClassifier

Regression:

  • DecisionTreeRegressor
  • ExtraTreeRegressor

Random Forest

Classification:

  • ExtraTreesClassifier
  • LGBMClassifier (rf booster only)
  • RandomForestClassifier
  • XGBRFClassifier

Regression:

  • ExtraTreesRegressor
  • LGBMRegressor (rf booster only)
  • RandomForestRegressor
  • XGBRFRegressor

Boosting

Classification:

  • LGBMClassifier (gbdt/dart/goss booster only)
  • XGBClassifier (gbtree (including boosted forests) / gblinear booster only)

Regression:

  • LGBMRegressor (gbdt/dart/goss booster only)
  • XGBRegressor (gbtree (including boosted forests) / gblinear booster only)

You can find the versions of packages with which compatibility is guaranteed by CI tests here. Other versions may also work, but they are untested.

Classification Output

Linear / Linear SVM / Kernel SVM

Binary

Scalar value; signed distance of the sample to the hyperplane for the second class.

Multiclass

Vector value; signed distance of the sample to the hyperplane per each class.

Comment

The output is consistent with the output of LinearClassifierMixin.decision_function.

SVM

Outlier detection

Scalar value; signed distance of the sample to the separating hyperplane: positive for an inlier and negative for an outlier.

Binary

Scalar value; signed distance of the sample to the hyperplane for the second class.

Multiclass

Vector value; one-vs-one score for each class, shape (n_samples, n_classes * (n_classes-1) / 2).

Comment

The output is consistent with the output of BaseSVC.decision_function when the decision_function_shape is set to ovo.

Tree / Random Forest / Boosting

Binary

Vector value; class probabilities.

Multiclass

Vector value; class probabilities.

Comment

The output is consistent with the output of the predict_proba method of DecisionTreeClassifier / ExtraTreeClassifier / ExtraTreesClassifier / RandomForestClassifier / XGBRFClassifier / XGBClassifier / LGBMClassifier.
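
As a quick sanity check of this consistency, you can export a model to the Python target and compare it against predict_proba. This is a sketch, not an official example; it assumes export_to_python (Python is among the supported target languages) and that the generated code defines a score(input) function analogous to the Java example below:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import m2cgen as m2c

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

namespace = {}
exec(m2c.export_to_python(clf), namespace)  # generated code defines score(input)

print(namespace["score"](list(X[0])))  # vector of class probabilities
print(clf.predict_proba(X[:1])[0])     # should match up to float precision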

Usage

Here's a simple example of how a linear model trained in a Python environment can be represented in Java code:

from sklearn.datasets import load_diabetes
from sklearn import linear_model
import m2cgen as m2c

X, y = load_diabetes(return_X_y=True)

estimator = linear_model.LinearRegression()
estimator.fit(X, y)

code = m2c.export_to_java(estimator)

Generated Java code:

public class Model {
    public static double score(double[] input) {
        return ((((((((((152.1334841628965) + ((input[0]) * (-10.012197817470472))) + ((input[1]) * (-239.81908936565458))) + ((input[2]) * (519.8397867901342))) + ((input[3]) * (324.39042768937657))) + ((input[4]) * (-792.1841616283054))) + ((input[5]) * (476.74583782366153))) + ((input[6]) * (101.04457032134408))) + ((input[7]) * (177.06417623225025))) + ((input[8]) * (751.2793210873945))) + ((input[9]) * (67.62538639104406));
    }
}

You can find more examples of generated code for different models/languages here.

CLI

m2cgen can be used as a CLI tool to generate code using serialized model objects (pickle protocol):

$ m2cgen <pickle_file> --language <language> [--indent <indent>] [--function_name <function_name>]
         [--class_name <class_name>] [--module_name <module_name>] [--package_name <package_name>]
         [--namespace <namespace>] [--recursion-limit <recursion_limit>]

Don't forget that to unpickle serialized model objects, their classes must be defined in the top level of an importable module in the unpickling environment, as in the sketch below.
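
For example, a scikit-learn model can be serialized for the CLI like this (a minimal sketch; the file name model.pickle is arbitrary):

import pickle

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
estimator = LinearRegression().fit(X, y)

# LinearRegression lives at the top level of an importable module
# (sklearn.linear_model), so the CLI can unpickle it.
with open("model.pickle", "wb") as f:
    pickle.dump(estimator, f)

After which something like m2cgen model.pickle --language java should produce the generated code.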

Piping is also supported:

$ cat <pickle_file> | m2cgen --language <language>

FAQ

Q: Generation fails with RecursionError: maximum recursion depth exceeded error.

A: If this error occurs while generating code for an ensemble model, try reducing the number of trained estimators within that model. Alternatively, you can increase the maximum recursion depth with sys.setrecursionlimit(<new_depth>).

Q: Generation fails with an ImportError: No module named <module_name_here> error while transpiling a model from a serialized model object.

A: This error indicates that the pickle protocol cannot deserialize the model object. For unpickling serialized model objects, their classes must be defined in the top level of an importable module in the unpickling environment. Installing the package that provides the model's class definition should solve the problem.

Q: Code generated by m2cgen produces different results for some inputs compared to the original Python model from which the code was obtained.

A: Some models force input data to a particular type during the prediction phase in their native Python libraries. Currently, m2cgen works only with the float64 (double) data type, so you can try casting your input data to it manually and checking the results again. Also, small differences can occur due to the specific implementation of floating-point arithmetic in the target language.
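
For instance, with NumPy the cast looks like this (a sketch; the input row is made up, and the call into the generated code is hypothetical):

import numpy as np

row = [1, 2, 3]                            # example input with integer dtype
row64 = np.asarray(row, dtype=np.float64)  # cast to the expected float64 (double)
# score(list(row64))                       # hypothetical call into generated code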

Download Details:
Author: BayesWitnesses
Source Code: https://github.com/BayesWitnesses/m2cgen
License: MIT License

#rust #rustlang #machinelearning


The Rust CUDA Project

Goal

The Rust CUDA Project aims to make Rust a tier-1 language for extremely fast GPU computing using the CUDA Toolkit. It provides tools for compiling Rust to extremely fast PTX code as well as libraries for using existing CUDA libraries with it.

Background

Historically, general purpose high performance GPU computing has been done using the CUDA toolkit. The CUDA toolkit primarily provides a way to use Fortran/C/C++ code for GPU computing in tandem with CPU code with a single source. It also provides many libraries, tools, forums, and documentation to supplement the single-source CPU/GPU code.

CUDA is an NVIDIA-only toolkit. Many tools have been proposed for cross-platform GPU computing, such as OpenCL, Vulkan Computing, and HIP. However, CUDA remains the most used toolkit for such tasks by far. This is why it is imperative to make Rust a viable option for use with the CUDA toolkit.

However, CUDA with Rust has historically been a very rocky road. Until now, the only viable option has been the LLVM PTX backend, which does not always work and generates invalid PTX for many common Rust operations. In recent years it has been shown time and time again that a specialized solution is needed for Rust on the GPU, with the advent of projects such as rust-gpu (for Rust -> SPIR-V).

Our hope is that with this project we can push the Rust GPU computing industry forward and make Rust an excellent language for such tasks. Rust offers plenty of benefits, such as __restrict__ performance benefits for every kernel, an excellent module/crate system, delimiting of unsafe areas of CPU/GPU code with unsafe, high-level wrappers for low-level CUDA libraries, etc.

Structure

The scope of the Rust CUDA Project is quite broad: it spans the entirety of the CUDA ecosystem, with libraries and tools to make it usable from Rust. Therefore, the project contains many crates for all corners of the CUDA ecosystem.

The current line-up of libraries is the following:

  • rustc_codegen_nvvm, a rustc backend that targets NVVM IR (a subset of LLVM IR) for the libnvvm library.
    • Generates highly optimized PTX code which can be loaded by the CUDA Driver API to execute on the GPU.
    • For the near future it will be CUDA-only, but it may be used to target amdgpu in the future.
  • cuda_std for GPU-side functions and utilities, such as thread index queries, memory allocation, warp intrinsics, etc.
    • Not a low level library, provides many utility functions to make it easier to write cleaner and more reliable GPU kernels.
    • Closely tied to rustc_codegen_nvvm which exposes GPU features through it internally.
  • cudnn for a collection of GPU-accelerated primitives for deep neural networks.
  • cust for CPU-side CUDA features such as launching GPU kernels, GPU memory allocation, device queries, etc.
    • High level with features such as RAII and Rust Results that make it easier and cleaner to manage the interface to the GPU.
    • A high level wrapper for the CUDA Driver API, the lower level version of the more common CUDA Runtime API used from C++.
    • Provides much more fine grained control over things like kernel concurrency and module loading than the C++ Runtime API.
  • gpu_rand for GPU-friendly random number generation, currently only implements xoroshiro RNGs from rand_xoshiro.
  • optix for CPU-side hardware raytracing and denoising using the CUDA OptiX library.

In addition, there are many "glue" crates for things such as high-level wrappers for certain smaller CUDA libraries.

Related Projects

Other projects related to using Rust on the GPU:

  • 2016: glassful Subset of Rust that compiles to GLSL.
  • 2017: inspirv-rust Experimental Rust MIR -> SPIR-V Compiler.
  • 2018: nvptx Rust to PTX compiler using the nvptx target for rustc (using the LLVM PTX backend).
  • 2020: accel Higher level library that relied on the same mechanism that nvptx does.
  • 2020: rlsl Experimental Rust -> SPIR-V compiler (predecessor to rust-gpu)
  • 2020: rust-gpu Rustc codegen backend to compile Rust to SPIR-V for use in shaders, similar mechanism as our project.

Download Details:
Author: Rust-GPU
Source Code: https://github.com/Rust-GPU/Rust-CUDA
License: View license

#rust #rustlang #machinelearning #cuda #python


Quickwit: Cloud Native and Cost-efficient Search Engine for Log Management

Quickwit is the next-gen search & analytics engine built for logs. It is a highly reliable & cost-efficient alternative to Elasticsearch. 

💡 Features

  • Index data persisted on object storage
  • Ingest JSON documents with or without a strict schema
  • Ingest & Aggregation API Elasticsearch compatible
  • Lightweight Embedded UI
  • Runs on a fraction of the resources: written in Rust, powered by the mighty tantivy
  • Works out of the box with sensible defaults
  • Optimized for multi-tenancy. Add and scale tenants with no overhead costs
  • Distributed search
  • Cloud-native: Kubernetes ready
  • Add and remove nodes in seconds
  • Decoupled compute & storage
  • Sleep like a log: all your indexed data is safely stored on object storage (AWS S3...)
  • Ingest your documents with exactly-once semantics
  • Kafka-native ingestion
  • Search stream API that notably unlocks full-text search in ClickHouse

🔮 Upcoming Features

  • Ingest your logs from your object storage
  • Distributed indexing
  • Support for tracing
  • Native support for OpenTelemetry

Uses & Limitations

✅ When to use:

  • Your documents are immutable: application logs, system logs, access logs, user actions logs, audit trail, etc.
  • Your data has a time component. Quickwit includes optimizations and design choices specifically related to time.
  • You want a full-text search in a multi-tenant environment.
  • You want to index directly from Kafka.
  • You want to add full-text search to your ClickHouse cluster.
  • You ingest a tremendous amount of logs and don't want to pay huge bills.
  • You ingest a tremendous amount of data and you don't want to waste your precious time babysitting your cluster.

❌ When not to use:

  • Your documents are mutable.
  • You need a low-latency search for e-commerce websites.
  • You provide a public-facing search with high QPS.
  • You want to re-score documents at query time.

⚡ Getting Started

Let's download and install Quickwit.

curl -L https://install.quickwit.io | sh

You can now move this executable wherever makes sense for your environment and possibly add it to your PATH environment variable. You can also install it via other means.

Take a look at our Quick Start to do amazing things, like Creating your first index or Adding some documents, or take a glance at our full Installation guide!

📚 Tutorials

💬 Community

🙋 FAQ

How is Quickwit different from traditional search engines like Elasticsearch or Solr?

The core difference and advantage of Quickwit is its architecture, which is built from the ground up for cloud and logs. Optimized IO paths make search on object storage sub-second, and thanks to truly decoupled compute and storage, search instances are stateless, so it is possible to add or remove search nodes within seconds. Last but not least, we implemented highly-reliable distributed search and exactly-once semantics during indexing so that all engineers can sleep at night.

How does Quickwit compare to Elastic in terms of cost?

We estimate that Quickwit can be up to 10x cheaper on average than Elastic. To understand how, check out our blog post about searching the web on AWS S3.

What license does Quickwit use?

Quickwit is open-source under the GNU Affero General Public License Version 3 - AGPLv3. Fundamentally, this means that you are free to use Quickwit for your project, as long as you don't modify Quickwit. If you do, you have to make the modifications public. We also provide a commercial license for enterprises to provide support and a voice on our roadmap.

What is Quickwit's business model?

Our business model relies on our commercial license. There is no plan to become SaaS in the near future.

Download Details:
Author: quickwit-oss
Source Code: https://github.com/quickwit-oss/quickwit
License: View license

#rust #rustlang #machinelearning


ZomboDB: Text-search & Analytics Features for Postgres, Written in Rust

Making Postgres and Elasticsearch work together like it's 2022

Readme

ZomboDB brings powerful text-search and analytics features to Postgres by using Elasticsearch as an index type. Its comprehensive query language and SQL functions enable new and creative ways to query your relational data.

From a technical perspective, ZomboDB is a 100% native Postgres extension that implements Postgres' Index Access Method API. As a native Postgres index type, ZomboDB allows you to CREATE INDEX ... USING zombodb on your existing Postgres tables. At that point, ZomboDB takes over and fully manages the remote Elasticsearch index and guarantees transactionally-correct text-search query results.

ZomboDB is fully compatible with all of Postgres' query plan types and most SQL commands such as CREATE INDEX, COPY, INSERT, UPDATE, DELETE, SELECT, ALTER, DROP, REINDEX, (auto)VACUUM, etc.

It doesn’t matter if you’re using an Elasticsearch cloud provider or managing your own cluster -- ZomboDB communicates with Elasticsearch via its RESTful APIs so you’re covered either way.

ZomboDB allows you to use the power and scalability of Elasticsearch directly from Postgres. You don’t have to manage transactions between Postgres and Elasticsearch, asynchronous indexing pipelines, complex reindexing processes, or multiple data-access code paths -- ZomboDB does it all for you.

Quick Links

Features

Current Limitations

  • Only one ZomboDB index per table
  • ZomboDB indexes with predicates (i.e., partial indexes) are not supported
  • CREATE INDEX CONCURRENTLY is not supported

These limitations may be addressed in future versions of ZomboDB.

System Requirements

Product         Version
Postgres        10.x, 11.x, 12.x, 13.x
Elasticsearch   7.x

Sponsorship and Downloads

Please see https://github.com/sponsors/eeeebbbbrrrr for sponsorship details. Your sponsorship at any tier is greatly appreciated and helps keep ZomboDB moving forward.

Note that ZomboDB is only available in binary form for certain sponsor tiers.

When you become a sponsor at a tier that provides binary downloads, please request a download key from https://www.zombodb.com/services/. Please do the same if you sponsor a tier that provides access to ZomboDB's private Discord server.

Quick Overview

Note that this is just a quick overview. Please read the getting started tutorial for more details.

Create the extension:

CREATE EXTENSION zombodb;

Create a table:

CREATE TABLE products (
    id SERIAL8 NOT NULL PRIMARY KEY,
    name text NOT NULL,
    keywords varchar(64)[],
    short_summary text,
    long_description zdb.fulltext, 
    price bigint,
    inventory_count integer,
    discontinued boolean default false,
    availability_date date
);

-- insert some data

Create a ZomboDB index:

CREATE INDEX idxproducts 
          ON products 
       USING zombodb ((products.*)) 
        WITH (url='localhost:9200/');

Query it:

SELECT * 
  FROM products 
 WHERE products ==> '(keywords:(sports OR box) OR long_description:"wooden away"~5) AND price:[1000 TO 20000]';
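
Because ZomboDB queries are plain SQL, they can be issued from any Postgres client. A sketch using Python's psycopg2 (the connection parameters are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder credentials
cur = conn.cursor()

# The ==> operator hands the right-hand side to ZomboDB's query language.
cur.execute(
    "SELECT id, name FROM products WHERE products ==> %s",
    ("keywords:(sports OR box) AND price:[1000 TO 20000]",),
)
for row in cur.fetchall():
    print(row)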

Contact Information

History

The name is an homage to zombo.com and its long history of continuous self-affirmation.

ZomboDB began in 2013 at Technology Concepts & Design, Inc as a closed-source effort to provide transaction-safe text-search on top of Postgres tables. While Postgres' "tsearch" features are useful, they're not necessarily adequate for 200-column-wide tables with 100M rows, each containing large text content.

Initially designed on top of Postgres' Foreign Data Wrapper API, ZomboDB quickly evolved into an index type so that queries are MVCC-safe and standard SQL can be used to query and manage indices.

Elasticsearch was chosen as the backing search index because of its horizontal scaling abilities, performance, and general ease of use.

ZomboDB was open-sourced in July 2015 and has since been used in numerous production systems of various sizes and complexity.

Download Details:
Author: zombodb
Source Code: https://github.com/zombodb/zombodb
License: View license

#rust #rustlang #machinelearning


Hora: Approximate Nearest Neighbor Search Algorithm Library

Hora

Hora Search Everywhere!

Hora is an approximate nearest neighbor search algorithm (wiki) library. We implement all code in Rust🦀 for reliability, high level abstraction and high speeds comparable to C++.

Hora, 「ほら」 in Japanese, sounds like [hōlə], and means Wow, You see! or Look at that!. The name is inspired by a famous Japanese song 「小さな恋のうた」.

Demos

👩 Face-Match [online demo], have a try!

🍷 Dream wine comments search [online demo], have a try!

Features

Performant ⚡️

  • SIMD-Accelerated (packed_simd)
  • Stable algorithm implementation
  • Multiple threads design

Supports Multiple Languages ☄️

  • Python
  • Javascript
  • Java
  • Go (WIP)
  • Ruby (WIP)
  • Swift (WIP)
  • R (WIP)
  • Julia (WIP)
  • Can also be used as a service

Supports Multiple Indexes 🚀

  • Hierarchical Navigable Small World Graph Index (HNSWIndex) (details)
  • Satellite System Graph (SSGIndex) (details)
  • Product Quantization Inverted File(PQIVFIndex) (details)
  • Random Projection Tree(RPTIndex) (LSH, WIP)
  • BruteForce (BruteForceIndex) (naive implementation with SIMD)

Portable 💼

  • Supports WebAssembly
  • Supports Windows, Linux and OS X
  • Supports IOS and Android (WIP)
  • Supports no_std (WIP, partial)
  • No heavy dependencies, such as BLAS

Reliability 🔒

  • Rust compiler secures all code
  • Memory managed by Rust for all language libraries such as Python's
  • Broad testing coverage

Supports Multiple Distances 🧮

  • Dot Product Distance
    • D(x, y) = Σᵢ xᵢ yᵢ
  • Euclidean Distance
    • D(x, y) = √(Σᵢ (xᵢ − yᵢ)²)
  • Manhattan Distance
    • D(x, y) = Σᵢ |xᵢ − yᵢ|
  • Cosine Similarity
    • D(x, y) = (x · y) / (‖x‖ ‖y‖)

Productive

  • Well documented
  • Elegant, simple and easy to learn API

Installation

Rust

in Cargo.toml

[dependencies]
hora = "0.1.1"

Python

$ pip install horapy

Javascript (WebAssembly)

$ npm i horajs

Building from source

$ git clone https://github.com/hora-search/hora
$ cargo build

Benchmarks

Benchmarked on an AWS t2.medium (CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz); see the repository for more information.

Examples

Rust example [more info]

use hora::core::ann_index::ANNIndex;
use rand::{thread_rng, Rng};
use rand_distr::{Distribution, Normal};

pub fn demo() {
    let n = 1000;
    let dimension = 64;

    // make sample points
    let mut samples = Vec::with_capacity(n);
    let normal = Normal::new(0.0, 10.0).unwrap();
    for _i in 0..n {
        let mut sample = Vec::with_capacity(dimension);
        for _j in 0..dimension {
            sample.push(normal.sample(&mut rand::thread_rng()));
        }
        samples.push(sample);
    }

    // init index
    let mut index = hora::index::hnsw_idx::HNSWIndex::<f32, usize>::new(
        dimension,
        &hora::index::hnsw_params::HNSWParams::<f32>::default(),
    );
    for (i, sample) in samples.iter().enumerate().take(n) {
        // add point
        index.add(sample, i).unwrap();
    }
    index.build(hora::core::metrics::Metric::Euclidean).unwrap();

    let mut rng = thread_rng();
    let target: usize = rng.gen_range(0..n);
    // 523 has neighbors: [523, 762, 364, 268, 561, 231, 380, 817, 331, 246]
    println!(
        "{:?} has neighbors: {:?}",
        target,
        index.search(&samples[target], 10) // search for k nearest neighbors
    );
}

Thanks to @vaaaaanquish for this complete pure Rust 🦀 image search example. For more information about this example, see 「Pure Rust な近似最近傍探索ライブラリ hora を用いた画像検索を実装する」 (Implementing image search with hora, a pure-Rust approximate nearest neighbor library).

Python example [more info]

import numpy as np
from horapy import HNSWIndex

dimension = 50
n = 1000

# init index instance
index = HNSWIndex(dimension, "usize")

samples = np.float32(np.random.rand(n, dimension))
for i in range(0, len(samples)):
    # add node
    index.add(np.float32(samples[i]), i)

index.build("euclidean")  # build index

target = np.random.randint(0, n)
# 410 in Hora ANNIndex <HNSWIndexUsize> (dimension: 50, dtype: usize, max_item: 1000000, n_neigh: 32, n_neigh0: 64, ef_build: 20, ef_search: 500, has_deletion: False)
# has neighbors: [410, 736, 65, 36, 631, 83, 111, 254, 990, 161]
print("{} in {} \nhas neighbors: {}".format(
    target, index, index.search(samples[target], 10)))  # search

JavaScript example [more info]

import * as horajs from "horajs";

const demo = () => {
    const dimension = 50;
    var bf_idx = horajs.BruteForceIndexUsize.new(dimension);
    // var hnsw_idx = horajs.HNSWIndexUsize.new(dimension, 1000000, 32, 64, 20, 500, 16, false);
    for (var i = 0; i < 1000; i++) {
        var feature = [];
        for (var j = 0; j < dimension; j++) {
            feature.push(Math.random());
        }
        bf_idx.add(feature, i); // add point
    }
    bf_idx.build("euclidean"); // build index
    var feature = [];
    for (var j = 0; j < dimension; j++) {
        feature.push(Math.random());
    }
    console.log("bf result", bf_idx.search(feature, 10)); //bf result Uint32Array(10) [704, 113, 358, 835, 408, 379, 117, 414, 808, 826]
}

(async () => {
    await horajs.default();
    await horajs.init_env();
    demo();
})();

Java example [more info]

public void demo() {
    final int dimension = 2;
    final float variance = 2.0f;
    Random fRandom = new Random();

    BruteForceIndex bruteforce_idx = new BruteForceIndex(dimension); // init index instance

    List<float[]> tmp = new ArrayList<>();
    for (int i = 0; i < 5; i++) {
        for (int p = 0; p < 10; p++) {
            float[] features = new float[dimension];
            for (int j = 0; j < dimension; j++) {
                features[j] = getGaussian(fRandom, (float) (i * 10), variance);
            }
            bruteforce_idx.add("bf", features, i * 10 + p); // add point
            tmp.add(features);
          }
    }
    bruteforce_idx.build("bf", "euclidean"); // build index

    int search_index = fRandom.nextInt(tmp.size());
    // nearest neighbor search
    int[] result = bruteforce_idx.search("bf", 10, tmp.get(search_index));
    // [main] INFO com.hora.app.ANNIndexTest  - demo bruteforce_idx[7, 8, 0, 5, 3, 9, 1, 6, 4, 2]
    log.info("demo bruteforce_idx" + Arrays.toString(result));
}

private static float getGaussian(Random fRandom, float aMean, float variance) {
    float r = (float) fRandom.nextGaussian();
    return aMean + r * variance;
}

Roadmap

  •  Full test coverage
  •  Implement EFANNA algorithm to achieve faster KNN graph building
  •  Swift support and iOS/macOS deployment example
  •  Support R
  •  support mmap

Related Projects and Comparison

Faiss, Annoy, ScaNN:

  • Hora's implementation is strongly inspired by these libraries.
  • Faiss focuses more on the GPU scenario, and Hora is lighter than Faiss (no heavy dependencies).
  • Hora expects to support more languages, and everything related to performance will be implemented by Rust🦀.
  • Annoy only supports the LSH (Random Projection) algorithm.
  • ScaNN and Faiss are less user-friendly (e.g. lack of documentation).
  • Hora is ALL IN RUST 🦀.

Milvus, Vald, Jina AI

  • Milvus and Vald also support multiple languages, but they serve as a service instead of a library.
  • Milvus is built upon libraries such as Faiss, while Hora is a library with all of its algorithms implemented itself.

Contribute

We appreciate your participation!

We are glad to have you participate; any contributions are welcome, including documentation and tests. You can create a Pull Request or Issue on GitHub, and we will review it as soon as possible.

We use GitHub issues for tracking suggestions and bugs.

Clone the repo

git clone https://github.com/hora-search/hora

Build

cargo build

Test

cargo test --lib

Try the changes

cd examples
cargo run

Download Details:
Author: hora-search
Source Code: https://github.com/hora-search/hora
License: Apache-2.0 License

#rust #rustlang #machinelearning


lnx: Insanely Fast, Feature-rich Searching Written in Rust

An ultra-fast, adaptable deployment of the tantivy search engine via REST.

🌟 Standing On The Shoulders of Giants

lnx is built not to re-invent the wheel: it stands on top of the tokio-rs work-stealing runtime and the hyper web framework, combined with the raw compute power of the tantivy search engine.

Together these allow lnx to offer millisecond indexing on tens of thousands of document inserts at once (no more waiting around for things to get indexed!), per-index transactions, and the ability to process searches like just another lookup on a hashtable 😲

✨ Features

lnx, although very new, offers a wide range of features thanks to the ecosystem it stands on.

  • 🤓 Complex Query Parser.
  • ❤️ Typo tolerant fuzzy queries.
  • ⚡️ Typo tolerant fast-fuzzy queries. (pre-computed spell correction)
  • 🔥 More-Like-This queries.
  • Order by fields.
  • Fast indexing.
  • Fast Searching.
  • Several Options for fine grain performance tuning.
  • Multiple storage backends available for testing and developing.
  • Permissions based authorization access tokens.

Demo video

Here you can see lnx doing search-as-you-type on a 27 million document dataset, coming in at a reasonable 18GB once indexed, run on my i7-8700k using ~3GB of RAM with our fast-fuzzy system. Got a bigger dataset for us to try? Open an issue!

Performance

lnx provides the ability to fine-tune the system to your particular use case: you can customise the async runtime threads, the concurrency thread pool, and the threads per reader and writer, all per index.

This gives you the ability to control in detail where your computing resources are going. Got a large dataset but a lower amount of concurrent reads? Bump the reader threads in exchange for lower max concurrency.

The figures below were taken by our lnx-cli on the small movies.json dataset. We didn't try anything higher, as Meilisearch takes an incredibly long time to index millions of docs, although the new Meilisearch engine has improved this somewhat.

💔 Limitations

As much as lnx provides a wide range of features, it cannot do it all, being such a young system. Naturally, it has some limitations:

  • lnx is not distributed (yet) so this really does just scale vertically.
  • Simple but not too simple, lnx can't offer the same level of ease of use compared to MeiliSearch due to its schema-full nature and wide range of tuning options. With more tuning comes more settings, unfortunately.
  • No metrics (yet)

Download Details:
Author: lnx-search
Source Code: https://github.com/lnx-search/lnx
License: MIT License

#rust #rustlang #machinelearning


Weggli: A Fast and Robust Semantic Search Tool for C & C++ Codebases

weggli

Introduction

weggli is a fast and robust semantic search tool for C and C++ codebases. It is designed to help security researchers identify interesting functionality in large codebases.

weggli performs pattern matching on Abstract Syntax Trees based on user provided queries. Its query language resembles C and C++ code, making it easy to turn interesting code patterns into queries.

weggli is inspired by great tools like Semgrep, Coccinelle, joern and CodeQL, but makes some different design decisions:

  • C++ support: weggli has first class support for modern C++ constructs, such as lambda expressions, range-based for loops and constexprs.
  • Minimal setup: weggli should work out-of-the box against most software you will encounter. weggli does not require the ability to build the software and can work with incomplete sources or missing dependencies.
  • Interactive: weggli is designed for interactive usage and fast query performance. Most of the time, a weggli query will be faster than a grep search. The goal is to enable an interactive workflow where quick switching between code review and query creation/improvement is possible.
  • Greedy: weggli's pattern matching is designed to find as many (useful) matches as possible for a specific query. While this increases the risk of false positives it simplifies query creation. For example, the query $x = 10; will match both assignment expressions (foo = 10;) and declarations (int bar = 10;).

Usage

Use -h for short descriptions and --help for more details.

 Homepage: https://github.com/googleprojectzero/weggli

 USAGE: weggli [OPTIONS] <PATTERN> <PATH>

 ARGS:
     <PATTERN>
            A weggli search pattern. weggli's query language closely resembles
             C and C++ with a small number of extra features.

             For example, the pattern '{_ $buf[_]; memcpy($buf,_,_);}' will
             find all calls to memcpy that directly write into a stack buffer.

             Besides normal C and C++ constructs, weggli's query language
             supports the following features:

             _        Wildcard. Will match on any AST node.

             $var     Variables. Can be used to write queries that are independent
                      of identifiers. Variables match on identifiers, types,
                      field names or namespaces. The --unique option
                      optionally enforces that $x != $y != $z. The --regex option can
                      enforce that the variable has to match (or not match) a
                      regular expression.

             _(..)    Subexpressions. The _(..) wildcard matches on arbitrary
                      sub expressions. This can be helpful if you are looking for some
                      operation involving a variable, but don't know more about it.
                      For example, _(test) will match on expressions like test+10,
                      buf[test->size] or f(g(&test));

             not:     Negative sub queries. Only show results that do not match the
                      following sub query. For example, '{not: $fv==NULL; not: $fv!=NULL *$v;}'
                      would find pointer dereferences that are not preceded by a NULL check.

            strict:   Enable stricter matching. This turns off statement unwrapping 
                      and greedy function name matching. For example 'strict: func();' 
                      will not match on 'if (func() == 1)..' or 'a->func()' anymore.

             weggli automatically unwraps expression statements in the query source
             to search for the inner expression instead. This means that the query `{func($x);}`
             will match on `func(a);`, but also on `if (func(a)) {..}` or  `return func(a)`.
             Matching on `func(a)` will also match on `func(a,b,c)` or `func(z,a)`.
             Similarly, `void func($t $param)` will also match function definitions
             with multiple parameters.

             Additional patterns can be specified using the --pattern (-p) option. This makes
             it possible to search across functions or type definitions.

    <PATH>
            Input directory or file to search. By default, weggli will search inside
             .c and .h files for the default C mode or .cc, .cpp, .cxx, .h and .hpp files when
             executing in C++ mode (using the --cpp option).
             Alternative file endings can be specified using the --extensions (-e) option.

             When combining weggli with other tools or preprocessing steps,
             files can also be specified via STDIN by setting the directory to '-'
             and piping a list of filenames.


 OPTIONS:
     -A, --after <after>
            Lines to print after a match. Default = 5.

    -B, --before <before>
            Lines to print before a match. Default = 5.

    -C, --color
            Force enable color output.

    -X, --cpp
            Enable C++ mode.

        --exclude <exclude>...
            Exclude files that match the given regex.

    -e, --extensions <extensions>...
            File extensions to include in the search.

    -f, --force
            Force a search even if the queries contains syntax errors.

    -h, --help
            Prints help information.

        --include <include>...
            Only search files that match the given regex.

    -l, --limit
            Only show the first match in each function.

    -p, --pattern <p>...
            Specify additional search patterns.

    -R, --regex <regex>...
            Filter variable matches based on a regular expression.
             This feature uses the Rust regex crate, so most Perl-style
             regular expression features are supported.
             (see https://docs.rs/regex/1.5.4/regex/#syntax)

             Examples:

             Find calls to functions starting with the string 'mem':
             weggli -R 'func=^mem' '$func(_);'

             Find memcpy calls where the last argument is NOT named 'size':
             weggli -R 's!=^size$' 'memcpy(_,_,$s);'

    -u, --unique
            Enforce uniqueness of variable matches.
             By default, two variables such as $a and $b can match on identical values.
             For example, the query '$x=malloc($a); memcpy($x, _, $b);' would
             match on both

             void *buf = malloc(size);
             memcpy(buf, src, size);

             and

             void *buf = malloc(some_constant);
             memcpy(buf, src, size);

             Using the unique flag would filter out the first match as $a==$b.

    -v, --verbose
            Sets the level of verbosity.

    -V, --version
            Prints version information.

Examples

Calls to memcpy that write into a stack-buffer:

weggli '{
    _ $buf[_];
    memcpy($buf,_,_);
}' ./target/src

Calls to foo that don't check the return value:

weggli '{
   strict: foo(_);
}' ./target/src

Potentially vulnerable snprintf() users:

weggli '{
    $ret = snprintf($b,_,_);
    $b[$ret] = _;
}' ./target/src

Potentially uninitialized pointers:

weggli '{ _* $p;
NOT: $p = _;
$func(&$p);
}' ./target/src

Potentially insecure WeakPtr usage:

weggli --cpp '{
$x = _.GetWeakPtr(); 
DCHECK($x); 
$x->_;}' ./target/src

Debug only iterator validation:

weggli -X 'DCHECK(_!=_.end());' ./target/src

Functions that perform writes into a stack-buffer based on a function argument.

weggli '_ $fn(_ $limit) {
    _ $buf[_];
    for (_; $i<$limit; _) {
        $buf[$i]=_;
    }
}' ./target/src

Functions with the string decode in their name

weggli -R func=decode '_ $func(_) {_;}'

Encoding/Conversion functions

weggli '_ $func($t *$input, $t2 *$output) {
    for (_($i);_;_) {
        $input[$i]=_($output);
    }
}' ./target/src

Install

$ cargo install weggli

Build Instruction

# optional: install rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh 

git clone https://github.com/googleprojectzero/weggli.git
cd weggli; cargo build --release
./target/release/weggli

Implementation details

Weggli is built on top of the tree-sitter parsing library and its C and C++ grammars. Search queries are first parsed using an extended version of the corresponding grammar, and the resulting AST is transformed into a set of tree-sitter queries in builder.rs. The actual query matching is implemented in query.rs, which is a relatively small wrapper around tree-sitter's query engine to add weggli specific features.

Contributing

See CONTRIBUTING.md for details.

weggli example

Download Details:
Author: googleprojectzero
Source Code: https://github.com/googleprojectzero/weggli
License: Apache-2.0 License

#rust #rustlang #machinelearning


ERDOS: A Platform for Developing Self-driving Cars & Robotics Apps

ERDOS

ERDOS is a platform for developing self-driving cars and robotics applications.

Getting started

Local installation

System requirements

ERDOS is known to work on Ubuntu 18.04 and 20.04.

Rust installation

To develop an ERDOS application in Rust, simply include ERDOS in Cargo.toml. The latest ERDOS release is published on Crates.io and documentation is available on Docs.rs.

If you'd like to contribute to ERDOS, first install Rust. Then run the following to clone the repository and build ERDOS:

git clone https://github.com/erdos-project/erdos.git && cd erdos
cargo build

Python Installation

To develop an ERDOS application in Python, simply run pip install erdos. Documentation is available on Read the Docs.

If you'd like to contribute to ERDOS, first install Rust. Within a virtual environment, run the following to clone the repository and build ERDOS:

git clone https://github.com/erdos-project/erdos.git && cd erdos/python
pip3 install maturin
maturin develop

The Python-Rust bridge interface is developed in the python crate, which also contains user-facing python files under the python/erdos directory.

If you'd like to build ERDOS for release (better performance, but longer build times), run maturin develop --release.

Running an example

python3 python/examples/simple_pipeline.py

Writing Applications

ERDOS provides Python and Rust interfaces for developing applications.

The Python interface provides easy integration with popular libraries such as tensorflow, but comes at the cost of performance (e.g. slower serialization and the lack of parallelism within a process).

The Rust interface provides more safety guarantees (e.g. compile-time type checking) and faster performance (e.g. multithreading and zero-copy message passing). High performance, safety critical applications such as self-driving car pipelines deployed in production should use the Rust API to take full advantage of ERDOS.

ERDOS Design

ERDOS is a streaming dataflow system designed for self-driving car pipelines and robotics applications.

Components of the pipelines are implemented as operators which are connected by data streams. The set of operators and streams forms the dataflow graph, the representation of the pipeline that ERDOS processes.

Applications define the dataflow graph by connecting operators to streams in the driver section of the program. Operators are typically implemented elsewhere.

ERDOS is designed for low latency. Self-driving car pipelines require end-to-end deadlines on the order of hundreds of milliseconds for safe driving. Similarly, self-driving cars typically process gigabytes per second of data on small clusters. Therefore, ERDOS is optimized to send small amounts of data (gigabytes as opposed to terabytes) as quickly as possible.

ERDOS provides determinism through watermarks. Low watermarks are a bound on the age of messages received, and operators will ignore any messages older than the most recent watermark received. By processing on watermarks, applications can avoid non-determinism from processing messages out of order, as the toy illustration below shows.
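
The idea can be sketched in a few lines of plain Python, entirely independent of ERDOS's actual API (the message format here is invented for the example):

# Toy illustration of low watermarks: drop messages older than the most
# recent watermark, so late arrivals cannot introduce non-determinism.
watermark = 0
results = []

def on_message(timestamp, data):
    if timestamp < watermark:
        return  # older than the low watermark: ignore
    results.append((timestamp, data))

def on_watermark(timestamp):
    global watermark
    watermark = max(watermark, timestamp)

on_message(1, "a")
on_watermark(3)     # bound: messages older than t=3 are now ignored
on_message(2, "b")  # arrives late and is dropped
on_message(4, "c")
print(results)      # [(1, 'a'), (4, 'c')]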

To read more about the ideas behind ERDOS, refer to our paper, D3: A Dynamic Deadline-Driven Approach for Building Autonomous Vehicles. If you find ERDOS useful to your work, please consider citing our paper:

@inproceedings{gog2022d3,
  title={D3: a dynamic deadline-driven approach for building autonomous vehicles},
  author={Gog, Ionel and Kalra, Sukrit and Schafhalter, Peter and Gonzalez, Joseph E and Stoica, Ion},
  booktitle={Proceedings of the Seventeenth European Conference on Computer Systems},
  pages={453--471},
  year={2022}
}

Pylot

We are actively developing an AV platform atop ERDOS! For more information, see the Pylot repository.

Getting involved

If you would like to contact us, you can:

  • Community on Slack: Join our community on Slack for discussions about development, questions about usage, and feature requests.
  • Github Issues: For reporting bugs.

We always welcome contributions to ERDOS. One way to get started is to pick one of the issues tagged with good first issue -- these are usually good issues that help you familiarize yourself with the ERDOS code base. Please submit contributions using pull requests.

Download Details:
Author: erdos-project
Source Code: https://github.com/erdos-project/erdos
License: Apache-2.0 License

#rust #rustlang #machinelearning #python


Search Through Millions Of Documents in Milliseconds Written in Rust

a concurrent indexer combined with fast and relevant search algorithms

Introduction

This repository contains the core engine used in Meilisearch.

It contains a library that can manage one and only one index; Meilisearch manages the multi-index itself. Milli is unable to store updates in a store: that is the job of something else above it, and this is why it is only able to process one update at a time.

This repository contains crates to quickly debug the engine:

  • There are benchmarks located in the benchmarks crate.
  • The http-ui crate is a simple HTTP dashboard to test the features for real!
  • The infos crate is used to dump the internal data-structure and ensure correctness.
  • The search crate is a simple command-line that helps run flamegraph on top of it.
  • The helpers crate is only used to modify the database inplace, sometimes.

Compile and run the HTTP debug server

You can specify the number of threads to use to index documents and many other settings too.

cd http-ui
cargo run --release -- --db my-database.mdb -vvv --indexing-jobs 8

Index your documents

It can index a massive amount of documents in very little time; I have already managed to index:

  • 115m songs (song and artist name) in ~48min, taking 81GiB on disk.
  • 12m cities (name, timezone and country ID) in ~4min, taking 6GiB on disk.

These measurements were taken on a MacBook Pro with the M1 processor.

You can feed the engine with your CSV (comma-separated, yes) data like this:

printf "id,name,age\n1,hello,32\n2,kiki,24\n" | http POST 127.0.0.1:9700/documents content-type:text/csv

Don't forget to specify the id of the documents. Also, note that it supports JSON and JSON streaming: you can send them to the engine by using the content-type:application/json and content-type:application/x-ndjson headers respectively.
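
The same ingestion can be scripted. A sketch using Python's requests library against the endpoint shown above:

import requests

documents = [
    {"id": 1, "name": "hello", "age": 32},
    {"id": 2, "name": "kiki", "age": 24},
]

# json= sets the content-type: application/json header automatically.
response = requests.post("http://127.0.0.1:9700/documents", json=documents)
print(response.status_code)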

Querying the engine via the website

You can query the engine by going to the HTML page itself.

Contributing

You can setup a git-hook to stop you from making a commit too fast. It'll stop you if:

  • Any of the workspaces does not build
  • Your code is not well-formatted

These two things are also checked in the CI, so ignoring the hook won't help you merge your code. But if you need to, you can still add --no-verify when creating your commit to ignore the hook.

To enable the hook, run the following command from the root of the project:

cp script/pre-commit .git/hooks/pre-commit

Download Details:
Author: meilisearch
Source Code: https://github.com/meilisearch/milli
License: MIT License

#rust #rustlang #machinelearning


Toshi: A Full-text Search Engine in Rust

Toshi

What is a Toshi?

Toshi is a three year old Shiba Inu. He is a very good boy and is the official mascot of this project. Toshi personally reviews all code before it is committed to this repository and is dedicated to only accepting the highest quality contributions from his human. He will, though, accept treats for easier code reviews.

Please note that this is far from production-ready; Toshi is still under active development, I'm just slow.

Description

Toshi is meant to be a full-text search engine similar to Elasticsearch. Toshi strives to be to Elasticsearch what Tantivy is to Lucene.

Motivations

Toshi will always target stable Rust and will try our best to never make any use of unsafe Rust. While underlying libraries may make some use of unsafe, Toshi will make a concerted effort to vet these libraries in an effort to be completely free of unsafe Rust usage. The reason I chose this is that I felt that for this to actually become an attractive option for people to consider, it would have to be safe, stable and consistent. This is why stable Rust was chosen: because of the guarantees and safety it provides. I did not want to go down the rabbit hole of using nightly features only to have issues with their stability later on. Since Toshi is not meant to be a library, I'm perfectly fine with having this requirement, because people who want to use this will more than likely take it off the shelf and not modify it. My motivation was to cater to that use case when building Toshi.

Build Requirements

At this time, Toshi should build and work fine on Windows, Mac OS X, and Linux. As for dependency requirements, you are going to need Rust 1.39.0 and Cargo installed in order to build. You can get Rust easily from rustup.

Configuration

There is a default configuration file in config/config.toml:

host = "127.0.0.1"
port = 8080
path = "data2/"
writer_memory = 200000000
log_level = "info"
json_parsing_threads = 4
bulk_buffer_size = 10000
auto_commit_duration = 10
experimental = false

[experimental_features]
master = true
nodes = [
    "127.0.0.1:8081"
]

[merge_policy]
kind = "log"
min_merge_size = 8
min_layer_size = 10_000
level_log_size = 0.75

Host

host = "localhost"

The hostname Toshi will bind to upon start.

Port

port = 8080

The port Toshi will bind to upon start.

Path

path = "data/"

The data path where Toshi will store its data and indices.

Writer Memory

writer_memory = 200000000

The amount of memory (in bytes) Toshi should allocate to commits for new documents.

Log Level

log_level = "info"

The detail level to use for Toshi's logging.

Json Parsing

json_parsing_threads = 4

When Toshi does a bulk ingest of documents, it will spin up a number of threads to parse the documents' JSON as it's received. This setting controls the number of threads spawned to handle this job.

Bulk Buffer

bulk_buffer_size = 10000

This will control the buffer size for parsing documents into an index. It will control the amount of memory a bulk ingest will take up by blocking when the message buffer is filled. If you want to go totally off the rails you can set this to 0 in order to make the buffer unbounded.

Auto Commit Duration

auto_commit_duration = 10

This controls how often an index will automatically commit documents if there are docs to be committed. Set this to 0 to disable this feature, but you will have to do commits yourself when you submit documents.

Merge Policy

[merge_policy]
kind = "log"

Tantivy will merge index segments according to the configuration outlined here. There are 2 options for this: "log", the default segment merge behavior, and "nomerge" (described below). Log has 3 additional values as well; any of these can be omitted to use Tantivy's default value. The default values are listed below.

min_merge_size = 8
min_layer_size = 10_000
level_log_size = 0.75

In addition there is the "nomerge" option, in which Tantivy will do no merging of segments.
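
To disable merging entirely, the policy section reduces to just the kind:

[merge_policy]
kind = "nomerge"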

Experimental Settings

experimental = false

[experimental_features]
master = true
nodes = [
    "127.0.0.1:8081"
]

In general, these settings aren't ready for use yet, as they are very unstable or flat-out broken. Right now the distributed mode of Toshi sits behind this flag, so if experimental is set to false all of these settings are ignored.

Building and Running

Toshi can be built using cargo build --release. Once Toshi is built, you can run ./target/release/toshi from the top-level directory to start Toshi according to the configuration in config/config.toml.
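
In short:

cargo build --release
./target/release/toshi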

You should get a startup message like this.

  ______         __   _   ____                 __
 /_  __/__  ___ / /  (_) / __/__ ___ _________/ /
  / / / _ \(_-</ _ \/ / _\ \/ -_) _ `/ __/ __/ _ \
 /_/  \___/___/_//_/_/ /___/\__/\_,_/_/  \__/_//_/
 Such Relevance, Much Index, Many Search, Wow
 
 INFO  toshi::index > Indexes: []

You can verify Toshi is running with:

curl -X GET http://localhost:8080/

which should return:

{
  "name": "Toshi Search",
  "version": "0.1.1"
}

Once Toshi is running, it's best to check the requests.http file in the root of this project for more examples of usage.

Example Queries

Term Query

{ "query": {"term": {"test_text": "document" } }, "limit": 10 }

Fuzzy Term Query

{ "query": {"fuzzy": {"test_text": {"value": "document", "distance": 0, "transposition": false } } }, "limit": 10 }

Phrase Query

{ "query": {"phrase": {"test_text": {"terms": ["test","document"] } } }, "limit": 10 }

Range Query

{ "query": {"range": { "test_i64": { "gte": 2012, "lte": 2015 } } }, "limit": 10 }

Regex Query

{ "query": {"regex": { "test_text": "d[ou]{1}c[k]?ument" } }, "limit": 10 }

Boolean Query

{ "query": {"bool": {"must": [ { "term": { "test_text": "document" } } ], "must_not": [ {"range": {"test_i64": { "gt": 2017 } } } ] } }, "limit": 10 }

Usage

To try any of the above queries, POST the query body to an index, for example:

curl -X POST http://localhost:8080/test_index -H 'Content-Type: application/json' -d '{ "query": {"term": {"test_text": "document" } }, "limit": 10 }'

Note that limit is optional; 10 is the default value. It's only included here for completeness.
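
For these queries to return anything, the index must of course contain documents. The exact request shapes live in requests.http; purely as an illustrative sketch, assuming documents are added with a PUT to the index route and an options block that requests a commit, indexing a document might look like:

curl -X PUT http://localhost:8080/test_index -H 'Content-Type: application/json' -d '{ "options": { "commit": true }, "document": { "test_text": "a test document", "test_i64": 2014 } }'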

Running Tests

cargo test

Download Details:
Author: toshi-search
Source Code: https://github.com/toshi-search/Toshi
License: MIT License

#rust  #rustlang  #machinelearning #searchengine 

Toshi: A Full-text Search Engine in Rust

Bayard: A Full-text Search & Indexing Server Written In Rust

Bayard

Bayard is a full-text search and indexing server written in Rust, built on top of Tantivy, that implements the Raft consensus algorithm and gRPC.
It achieves consensus across all the nodes, ensuring every change made to the system is made to a quorum of nodes.
Bayard makes it easy for programmers to develop search applications with advanced features and high availability.

Features

  • Full-text search/indexing
  • Index replication
  • Bringing up a cluster
  • Command-line interface

Source code repository

Docker container repository

Documents

Download Details:
Author: mosuka
Source Code: https://github.com/mosuka/bayard
License: MIT License

#rust  #rustlang  #machinelearning 

Bayard: A Full-text Search & Indexing Server Written In Rust

Qdrant: Vector Similarity Search Engine with Extended Filtering

Vector Similarity Search Engine with extended filtering support

Qdrant (read: quadrant) is a vector similarity search engine. It provides a production-ready service with a convenient API to store, search, and manage points (vectors with an additional payload). Qdrant is tailored to support extended filtering, which makes it useful for all sorts of neural-network or semantic-based matching, faceted search, and other applications.

Qdrant is written in Rust :crab:, which makes it reliable even under high load.

With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!

Demo Projects

Semantic Text Search :mag:

The neural search uses semantic embeddings instead of keywords and works best with short texts. With Qdrant and a pre-trained neural network, you can build and deploy semantic neural search on your data in minutes. Try it online!

Similar Image Search - Food Discovery :pizza:

There are multiple ways to discover things, text search is not the only one. In the case of food, people rely more on appearance than description and ingredients. So why not let people choose their next lunch by its appearance, even if they don’t know the name of the dish? Check it out!

Extreme classification - E-commerce Product Categorization :tv:

Extreme classification is a rapidly growing research area within machine learning, focusing on multi-class and multi-label problems that involve an extremely large number of labels, sometimes millions or even tens of millions of classes. The most promising way to tackle this is with similarity-learning models. We put together a demo example of how you could approach the problem with a pre-trained transformer model and Qdrant, so you can play with it online!

More solutions

  • Semantic Text Search
  • Similar Image Search
  • Recommendations
  • Chat Bots
  • Matching Engines

API

Online OpenAPI 3.0 documentation is available here. OpenAPI makes it easy to generate a client for virtually any framework or programming language.

You can also download raw OpenAPI definitions.
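
As one example (openapi-generator is a third-party tool, not part of Qdrant, and the file name here is hypothetical), a typed client could be generated from the downloaded definition:

openapi-generator-cli generate -i qdrant-openapi.json -g rust -o ./qdrant-client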

Features

Filtering

Qdrant supports key-value payloads associated with vectors. It not only stores the payload but also allows filtering results based on payload values. Filters can use any combination of should, must, and must_not conditions, and unlike Elasticsearch post-filtering, Qdrant guarantees that all relevant vectors are retrieved.
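
As a sketch of what that looks like in practice (the collection name and the city field are invented for illustration; check the OpenAPI documentation above for the exact request schema of your version), a filtered search might be:

curl -X POST http://localhost:6333/collections/my_collection/points/search \
    -H 'Content-Type: application/json' \
    -d '{ "vector": [0.2, 0.1, 0.9, 0.7], "filter": { "must": [ { "key": "city", "match": { "value": "London" } } ] }, "limit": 3 }'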

Rich data types

Vector payload supports a large variety of data types and query conditions, including string matching, numerical ranges, geo-locations, and more. Payload filtering conditions allow you to build almost any custom business logic that should work on top of similarity matching.

Query planning and payload indexes

Using the information about the stored key-value data, the query planner decides on the best way to execute a query. For example, if the search space limited by filters is small, it is more efficient to run a full brute-force scan than to use the index.

SIMD Hardware Acceleration

Qdrant can take advantage of modern x86-64 CPU architectures, allowing you to search even faster on modern hardware.

Write-ahead logging

Once the service has confirmed an update, it won't lose data even in the case of a power shutdown. All operations are stored in the update journal, and the latest database state can easily be reconstructed at any moment.

Stand-alone

Qdrant does not rely on any external database or orchestration controller, which makes it very easy to configure.

Usage

Docker :whale:

Build your own from source

docker build . --tag=qdrant/qdrant

Or use the latest pre-built image from DockerHub

docker pull qdrant/qdrant

To run the container, use:

docker run -p 6333:6333 \
    -v $(pwd)/path/to/data:/qdrant/storage \
    -v $(pwd)/path/to/custom_config.yaml:/qdrant/config/production.yaml \
    qdrant/qdrant
  • /qdrant/storage - the directory where Qdrant persists all your data. Make sure to mount it as a volume; otherwise Docker will drop it with the container.
  • /qdrant/config/production.yaml - the file with the engine configuration. You can override any value from the reference config; see the sketch below.
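
A minimal custom_config.yaml override might look like the following (the key names are assumptions based on common Qdrant setups; the reference config is authoritative):

log_level: INFO
storage:
  storage_path: /qdrant/storage
service:
  http_port: 6333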

Now Qdrant should be accessible at localhost:6333.
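
A quick sanity check (the exact response body varies by version):

curl http://localhost:6333

which should return a small JSON object with the engine name and version.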

Docs :notebook:

Contacts

Download Details:
Author: qdrant
Source Code: https://github.com/qdrant/qdrant
License: Apache-2.0 License

#rust  #rustlang  #machinelearning 

Qdrant: Vector Similarity Search Engine with Extended Filtering