How to Build a Fast “Most-Similar Words” Method in spaCy

In this article, I want to share how I sped up the spaCy library for use in a word similarity API. Unlike Gensim, spaCy does not provide an efficient most-similar method.

Recent developments in natural language processing (NLP) have introduced language models that must be used cautiously in production. For example, the spaCy large English model, en_core_web_lg, contains more than 600 thousand 300-d vectors, and the pre-trained word2vec-google-news-300 model contains 3 million 300-d vectors for words and phrases. When you compute a metric across this many high-dimensional vectors, the solution can easily be limited by the available computing power.

I recently published a word similarity API named Owl. This API lets you extract the words most similar to a target word using various word2vec models, including spaCy's. Given a word, the API returns a list of groups of words that are similar to the original word in predefined contexts such as News or General. The General context uses the spaCy large English model.
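
To put the vector counts quoted above in perspective, here is a rough back-of-the-envelope sketch of what one brute-force query costs. The helper name brute_force_cost is made up for illustration, and 4-byte (float32) storage is an assumption; actual models may store vectors differently.

```python
# Rough cost of scanning every stored vector once for a single query.
# Vector counts are the approximate figures quoted in the text;
# float32 (4 bytes per value) is an assumption.

def brute_force_cost(n_vectors: int, dim: int, bytes_per_value: int = 4):
    """Return (multiply-adds per query, vector memory in gigabytes)."""
    multiply_adds = n_vectors * dim  # one dot product per stored vector
    memory_gb = n_vectors * dim * bytes_per_value / 1e9
    return multiply_adds, memory_gb


if __name__ == "__main__":
    models = [
        ("spaCy large English model", 600_000),
        ("word2vec-google-news-300", 3_000_000),
    ]
    for name, n_vectors in models:
        ops, gb = brute_force_cost(n_vectors, 300)
        print(f"{name}: {ops:,} multiply-adds per query, about {gb:.2f} GB")
```

Under these assumptions, every query touches hundreds of millions of values, which is why pruning the candidate pool pays off before any other optimization.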

How to Extract the Most Similar Words Using spaCy?

There is no shortcut. I must compute the distance between the vector of the target word and a large number of vectors stored in the word2vec model. However, I can *refine the search space* with a filter and *speed up computation* with an optimized computing library.

I do not need to search all vectors. I can prune them (i.e., refine the search space) using, for example, the .prob attribute provided by spaCy. This attribute is a log-probability estimate of the word, so it lets me select the most frequent English words with w.prob >= -15. Here, you can use any filter that suits your problem. For example, you may want to filter the pool of vectors using the .sentiment attribute, which represents the positivity or negativity of a word. The code below shows how to build a most-similar method for spaCy. This code is not optimized to run fast, though.
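
As a minimal sketch of such an unoptimized method (not the original listing), assume each vocabulary entry is a (text, log-probability, vector) triple, as could be built from spaCy lexemes via w.text, w.prob, and w.vector. The names cosine, most_similar, and VOCAB, as well as the toy 3-d vectors and probabilities, are made up for illustration so the example runs without downloading a model.

```python
import math
from typing import List, Sequence, Tuple

# (text, log_probability, vector). With spaCy, these fields would come from
# lexemes in nlp.vocab (w.text, w.prob, w.vector); that mapping is an assumption.
Candidate = Tuple[str, float, Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(target: Candidate, candidates: Sequence[Candidate],
                 topn: int = 3, min_prob: float = -15.0) -> List[str]:
    """Rank frequent-enough candidates by cosine similarity to the target."""
    # Refine the search space: keep frequent words and drop the target itself.
    pool = [c for c in candidates if c[1] >= min_prob and c[0] != target[0]]
    # Brute force: one similarity computation per remaining candidate.
    ranked = sorted(pool, key=lambda c: cosine(target[2], c[2]), reverse=True)
    return [c[0] for c in ranked[:topn]]


# Toy vocabulary with invented values, standing in for a real model.
VOCAB: List[Candidate] = [
    ("king", -10.0, [0.90, 0.80, 0.10]),
    ("queen", -11.0, [0.85, 0.82, 0.15]),
    ("apple", -9.0, [0.10, 0.20, 0.90]),
    ("monarch", -16.5, [0.88, 0.79, 0.12]),  # below min_prob, so filtered out
    ("prince", -12.0, [0.80, 0.70, 0.20]),
]

if __name__ == "__main__":
    print(most_similar(VOCAB[0], VOCAB, topn=2))
```

In this toy data, "monarch" is very close to "king" but never reaches the ranking step because its log-probability falls below the -15 threshold. The pure-Python loop is deliberately naive; it is the part an optimized computing library would replace.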
