The internet facilitates the never-ending creation of large volumes of unstructured textual data. Luckily, modern computer systems can make sense of this kind of data using an underlying technology called natural language processing (NLP), which takes human language as input and processes and analyzes it.
Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
Significant implementations of NLP aren't far from us these days, as most of our devices integrate AI (artificial intelligence), ML (machine learning), and NLP to enhance human-to-machine communication. Common examples of NLP in action include voice assistants, search autocomplete, spellcheckers, and machine translation.
The use cases above show that AI, ML, and NLP are already being used heavily on the web. Since humans interact with websites using natural languages, we should build our websites with NLP capabilities.
Python is usually the go-to language for NLP (and for ML and AI) because of its wealth of language processing packages, such as the Natural Language Toolkit (NLTK). However, JavaScript is growing rapidly, and the existence of npm (Node Package Manager) gives its developers access to a large number of packages, including packages for performing NLP in different languages.
In this article, we will focus on getting started with NLP using Node. We will be using a JavaScript library called natural. By adding the natural library to our project, our code will be able to parse, interpret, manipulate, and understand natural languages from user input.
This article will barely scratch the surface of NLP. This post will be useful for developers who already use NLP with Python but want to transition to achieve the same results with Node. Complete newbies will also learn a lot about NLP as a technology and its usage with Node.
To code along with this article, create an index.js file, paste in the snippet you want to try, and run the file with Node.
Let’s begin.
We can install natural by running the following command:
npm install natural
The source code for each of the usage examples in the next section is available on GitHub. Feel free to clone it, fork it, or submit an issue.
Let’s learn how to perform some basic but important NLP tasks using natural.
Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
For example, consider the text string:

The quick brown fox jumps over the lazy dog

The string isn't implicitly segmented on spaces the way a natural language speaker would segment it. The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e., matching the string " " or the regular expression /\s{1}/).
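Before reaching for a library, the whitespace-splitting idea described above can be sketched in plain JavaScript with a regular expression (this is only an illustration, not the natural library):

```javascript
// Split a string on runs of whitespace to produce tokens.
const text = "The quick brown fox jumps over the lazy dog";
const tokens = text.split(/\s+/);

console.log(tokens.length); // → 9
console.log(tokens);        // → [ 'The', 'quick', 'brown', ..., 'dog' ]
```

Real tokenizers go further than this, handling punctuation, contractions, and language-specific rules, which is what natural's tokenizer classes provide.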
Natural ships with a number of smart tokenizer algorithms that can break text into arrays of tokens. Here’s a code snippet showing the usage of the Word tokenizer:
// index.js
var natural = require('natural');
var tokenizer = new natural.WordTokenizer();
console.log(tokenizer.tokenize("The quick brown fox jumps over the lazy dog"));
Running this with Node gives the following output:
[ 'The',
'quick',
'brown',
'fox',
'jumps',
'over',
'the',
'lazy',
'dog' ]
Stemming refers to the reduction of words to their word stem (also known as base or root form). For example, words such as cats, catlike, and catty will be stemmed down to the root word — cat.
Natural currently supports two stemming algorithms — Porter and Lancaster (Paice/Husk). Here’s a code snippet implementing stemming, using the Porter algorithm:
// index.js
var natural = require('natural');
natural.PorterStemmer.attach();
console.log("I can see that we are going to be friends".tokenizeAndStem());
This example uses the attach() method to patch stem() and tokenizeAndStem() onto String as a shortcut to PorterStemmer.stem(token). The text is broken down into single words, and an array of stemmed tokens is returned:
[ 'go', 'friend' ]
Note: In the result above, stop words have been removed by the algorithm. Stop words are words that are filtered out before natural language processing (for example, be, an, and to are all stop words).
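To make the stop-word idea concrete, here is a toy filter in plain JavaScript. The stop-word list below is a small hand-picked sample for illustration; it is not natural's internal list, and this sketch performs no stemming:

```javascript
// A toy stop-word filter: drop common function words before processing.
// The stop-word set here is a hand-picked sample, not natural's list.
const stopWords = new Set(['i', 'can', 'that', 'we', 'are', 'going', 'to', 'be']);

const tokens = 'I can see that we are going to be friends'
  .toLowerCase()
  .split(/\s+/);

const filtered = tokens.filter(token => !stopWords.has(token));
console.log(filtered); // → [ 'see', 'friends' ]
```

Stemming would then reduce the surviving tokens to their root forms, as natural's tokenizeAndStem() does in one step.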
Natural provides implementations of four algorithms for calculating string distance: Hamming distance, Jaro-Winkler, Levenshtein distance, and the Dice coefficient. Using these algorithms, we can tell whether two strings match. For this example, we will use Hamming distance.
Hamming distance measures the distance between two strings of equal length by counting the number of different characters. The third parameter indicates whether the case should be ignored. By default, the algorithm is case sensitive.
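Conceptually, the computation is a simple character-by-character comparison. A plain-JavaScript sketch of the behavior described above (returning -1 for unequal lengths, with an optional case-insensitive mode):

```javascript
// Count positions at which two equal-length strings differ.
// Returns -1 when the lengths differ, mirroring the behavior described above.
function hammingDistance(a, b, ignoreCase = false) {
  if (ignoreCase) {
    a = a.toLowerCase();
    b = b.toLowerCase();
  }
  if (a.length !== b.length) return -1; // defined only for equal lengths
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

console.log(hammingDistance('karolin', 'kathrin')); // → 3
console.log(hammingDistance('short string', 'longer string')); // → -1
```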
Here’s a code snippet showing the usage of the Hamming algorithm for calculating string distance:
// index.js
var natural = require('natural');
console.log(natural.HammingDistance("karolin", "kathrin", false));
console.log(natural.HammingDistance("karolin", "kerstin", false));
console.log(natural.HammingDistance("short string", "longer string", false));
The output:
3
3
-1
The first two comparisons return 3 because three letters differ. The last one returns -1 because the lengths of the strings being compared are different.
Text classification, also known as text tagging, is the process of classifying text into organized groups. That is, given a new, unknown statement, our processing system can decide which category it fits best based on its content.
Some of the most common use cases for automatic text classification include spam filtering, sentiment tagging, and topic labeling.
Natural currently supports two classifiers: Naive Bayes and logistic regression. The following example uses the BayesClassifier class:
// index.js
var natural = require('natural');
var classifier = new natural.BayesClassifier();
classifier.addDocument('i am long qqqq', 'buy');
classifier.addDocument('buy the q\'s', 'buy');
classifier.addDocument('short gold', 'sell');
classifier.addDocument('sell gold', 'sell');
classifier.train();
console.log(classifier.classify('i am short silver'));
console.log(classifier.classify('i am long copper'));
In the code above, we trained the classifier on sample text, using reasonable defaults to tokenize and stem the input. Based on the training samples, the console logs the following output:
sell
buy
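To see what a naive Bayes classifier is doing under the hood, here is a compact sketch: each category is scored by a log prior plus smoothed log likelihoods of the input's words, learned from the training documents. This is an illustration of the general technique (with add-one smoothing and plain whitespace tokenization), not natural's internal implementation, which also stems tokens:

```javascript
// Train: count per-label documents and word frequencies.
function trainNaiveBayes(examples) {
  const wordCounts = {}; // per-label word frequencies
  const docCounts = {};  // per-label document counts
  const vocab = new Set();
  for (const [text, label] of examples) {
    docCounts[label] = (docCounts[label] || 0) + 1;
    wordCounts[label] = wordCounts[label] || {};
    for (const word of text.toLowerCase().split(/\s+/)) {
      vocab.add(word);
      wordCounts[label][word] = (wordCounts[label][word] || 0) + 1;
    }
  }
  return { wordCounts, docCounts, vocab };
}

// Classify: pick the label with the highest log-probability score.
function classify(model, text) {
  const { wordCounts, docCounts, vocab } = model;
  const totalDocs = Object.values(docCounts).reduce((a, b) => a + b, 0);
  let best = null;
  let bestScore = -Infinity;
  for (const label of Object.keys(docCounts)) {
    const labelTotal = Object.values(wordCounts[label]).reduce((a, b) => a + b, 0);
    // log prior + sum of add-one-smoothed log likelihoods
    let score = Math.log(docCounts[label] / totalDocs);
    for (const word of text.toLowerCase().split(/\s+/)) {
      const count = wordCounts[label][word] || 0;
      score += Math.log((count + 1) / (labelTotal + vocab.size));
    }
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  }
  return best;
}

const model = trainNaiveBayes([
  ['i am long qqqq', 'buy'],
  ["buy the q's", 'buy'],
  ['short gold', 'sell'],
  ['sell gold', 'sell'],
]);

console.log(classify(model, 'i am short silver')); // → sell
console.log(classify(model, 'i am long copper'));  // → buy
```

Even this tiny model reproduces the article's results: the word "short" pulls the first statement toward the sell category, and "long" pulls the second toward buy.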
Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.
Natural supports algorithms that can calculate the sentiment of each piece of text by summing the polarity of each word and normalizing it with the length of the sentence. If a negation occurs the result is made negative.
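The scoring scheme (sum the polarity of each word, then normalize by the number of tokens) can be sketched in a few lines of plain JavaScript. The polarity values below are made up for illustration; they are not actual AFINN entries, and this sketch omits negation handling:

```javascript
// Toy polarity table: hypothetical values, not a real sentiment lexicon.
const polarity = { great: 3, love: 3, bad: -3, hate: -3 };

// Sum the polarity of each token, normalized by sentence length.
function sentiment(tokens) {
  const sum = tokens.reduce((acc, word) => acc + (polarity[word] || 0), 0);
  return sum / tokens.length;
}

console.log(sentiment(['i', 'love', 'this', 'library'])); // → 0.75 (3 / 4)
console.log(sentiment(['this', 'is', 'bad']));            // → -1 (-3 / 3)
```

Normalizing by length keeps long, mostly neutral sentences from outscoring short, strongly worded ones.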
Here’s an example of its usage:
// index.js
var natural = require('natural');
var Analyzer = natural.SentimentAnalyzer;
var stemmer = natural.PorterStemmer;
var analyzer = new Analyzer("English", stemmer, "afinn");
// getSentiment expects an array of strings
console.log(analyzer.getSentiment(["I", "don't", "want", "to", "play", "with", "you"]));
The constructor takes three parameters: the language, the stemmer (used to widen the coverage of the sentiment vocabulary), and the vocabulary type, for which "afinn", "senticon", or "pattern" are valid values.

Running the code above gives the following output:
0.42857142857142855 // a mildly positive score for this sentence
Using natural, we can compare two words that are spelled differently but sound similar using phonetic matching. Here’s an example using the metaphone.compare()
method:
// index.js
var natural = require('natural');
var metaphone = natural.Metaphone;
var soundEx = natural.SoundEx; // SoundEx is another available phonetic algorithm
var wordA = 'phonetics';
var wordB = 'fonetix';
if (metaphone.compare(wordA, wordB))
console.log('They sound alike!');
// We can also obtain the raw phonetics of a word using process()
console.log(metaphone.process('phonetics'));
We also obtain the raw phonetics of a word using process(). Running the code above gives the following output:
They sound alike!
FNTKS
Users may make typographical errors when supplying input to a web application through a search bar or an input field. Natural has a probabilistic spellchecker that can suggest corrections for misspelled words using an array of tokens from a text corpus.
Let’s explore an example using a two-word corpus for simplicity:
// index.js
var natural = require('natural');
var corpus = ['something', 'soothing'];
var spellcheck = new natural.Spellcheck(corpus);
console.log(spellcheck.getCorrections('soemthing', 1));
console.log(spellcheck.getCorrections('soemthing', 2));
It suggests corrections (sorted by probability in descending order) that are up to a maximum edit distance away from the input word. A maximum distance of 1 covers 80% to 95% of spelling mistakes; beyond a distance of 2, the lookup becomes very slow.
We get the following output from running the code:
[ 'something' ]
[ 'something', 'soothing' ]
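Notice that 'something' is found at distance 1 even though it takes two single-character substitutions to turn 'soemthing' into it: the swap of adjacent letters counts as one edit. One way to compute a distance with that property is the optimal string alignment variant of Levenshtein distance, sketched below as an illustration of the idea (this is not natural's internal implementation):

```javascript
// Optimal string alignment distance: Levenshtein edits (insert, delete,
// substitute) plus adjacent-letter transpositions, each costing 1.
function editDistance(a, b) {
  const d = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 0; i <= a.length; i++) d[i][0] = i;
  for (let j = 0; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution
      );
      // an adjacent transposition (swap of two neighboring letters) costs 1
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
      }
    }
  }
  return d[a.length][b.length];
}

console.log(editDistance('soemthing', 'something')); // → 1 (one swap)
console.log(editDistance('soemthing', 'soothing'));  // → 2
```

This matches the spellchecker's results above: 'something' is reachable within distance 1, while 'soothing' only appears once the maximum distance is raised to 2.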
Here’s a quick summary of what we’ve covered in this article: tokenization, stemming, string distance, text classification with Naive Bayes, sentiment analysis, phonetic matching, and spellchecking, all using the natural library in Node.
The source code for all of the usage examples above is available on GitHub. Feel free to clone it, fork it, or submit an issue.
#nodejs #node #javascript #machine-learning #data-science