The internet facilitates the never-ending creation of large volumes of unstructured textual data. Luckily, modern computer systems can make sense of this kind of data using an underlying technology called natural language processing (NLP), which takes human language as input and processes and analyzes it.
Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
Significant implementations of NLP aren't far from us these days, as most of our devices integrate AI (artificial intelligence), ML (machine learning), and NLP to enhance human-to-machine communication. Common examples of NLP in action include voice assistants, search autocomplete, spellcheckers, and machine translation.
The use cases above show that AI, ML, and NLP are already being used heavily on the web. Since humans interact with websites using natural languages, we should build our websites with NLP capabilities.
Python is usually the go-to language for NLP (and for ML and AI) because of its wealth of language processing packages, such as the Natural Language Toolkit (NLTK). However, JavaScript is growing rapidly, and the existence of npm (Node Package Manager) gives its developers access to a large number of packages, including packages for performing NLP in different languages.
In this article, we will focus on getting started with NLP using Node. We will be using a JavaScript library called natural. By adding the natural library to our project, our code will be able to parse, interpret, manipulate, and understand natural languages from user input.
This article will barely scratch the surface of NLP. This post will be useful for developers who already use NLP with Python but want to transition to achieve the same results with Node. Complete newbies will also learn a lot about NLP as a technology and its usage with Node.
To code along with this article, create an index.js file, paste in the snippet you want to try, and run the file with Node.
Let’s begin.
We can install natural by running the following command:
npm install natural
The source code for each of the usage examples in the next section is available on GitHub. Feel free to clone it, fork it, or submit an issue.
Let’s learn how to perform some basic but important NLP tasks using natural.
Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
For example, consider the text string:

The quick brown fox jumps over the lazy dog

The string isn't implicitly segmented on spaces the way a natural language speaker would segment it. The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e., matching the string " " or the regular expression /\s{1}/).
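Before reaching for a library, the whitespace-splitting idea described above can be sketched in plain JavaScript with a regular expression (this is only an illustration, not the natural library):

```javascript
// Split a string on runs of whitespace to produce tokens.
const text = "The quick brown fox jumps over the lazy dog";
const tokens = text.split(/\s+/);

console.log(tokens.length); // → 9
console.log(tokens);        // → [ 'The', 'quick', 'brown', ..., 'dog' ]
```

Real tokenizers go further than this, handling punctuation, contractions, and language-specific rules, which is what natural's tokenizer classes provide.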
Natural ships with a number of smart tokenizer algorithms that can break text into arrays of tokens. Here’s a code snippet showing the usage of the Word tokenizer:
// index.js
var natural = require('natural');
var tokenizer = new natural.WordTokenizer();
console.log(tokenizer.tokenize("The quick brown fox jumps over the lazy dog"));
Running this with Node gives the following output:
[ 'The',
'quick',
'brown',
'fox',
'jumps',
'over',
'the',
'lazy',
'dog' ]
Stemming refers to the reduction of words to their word stem (also known as base or root form). For example, words such as cats, catlike, and catty will be stemmed down to the root word — cat.
Natural currently supports two stemming algorithms — Porter and Lancaster (Paice/Husk). Here’s a code snippet implementing stemming, using the Porter algorithm:
// index.js
var natural = require('natural');
natural.PorterStemmer.attach();
console.log("I can see that we are going to be friends".tokenizeAndStem());
This example uses the attach() method to patch stem() and tokenizeAndStem() onto String as a shortcut to PorterStemmer.stem(token). The text is broken down into single words, and an array of stemmed tokens is returned:
[ 'go', 'friend' ]
Note: In the result above, stop words have been removed by the algorithm. Stop words are words that are filtered out before natural language processing (for example, be, an, and to are all stop words).
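To make the stop-word idea concrete, here is a toy filter in plain JavaScript. The stop-word list below is a small hand-picked sample for illustration; it is not natural's internal list, and this sketch performs no stemming:

```javascript
// A toy stop-word filter: drop common function words before processing.
// The stop-word set here is a hand-picked sample, not natural's list.
const stopWords = new Set(['i', 'can', 'that', 'we', 'are', 'going', 'to', 'be']);

const tokens = 'I can see that we are going to be friends'
  .toLowerCase()
  .split(/\s+/);

const filtered = tokens.filter(token => !stopWords.has(token));
console.log(filtered); // → [ 'see', 'friends' ]
```

Stemming would then reduce the surviving tokens to their root forms, as natural's tokenizeAndStem() does in one step.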
Natural provides implementations of four algorithms for calculating string distance: Hamming distance, Jaro-Winkler, Levenshtein distance, and the Dice coefficient. Using these algorithms, we can tell whether two strings match. For this example, we will use Hamming distance.
Hamming distance measures the distance between two strings of equal length by counting the number of different characters. The third parameter indicates whether the case should be ignored. By default, the algorithm is case sensitive.
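Conceptually, the computation is a simple character-by-character comparison. A plain-JavaScript sketch of the behavior described above (returning -1 for unequal lengths, with an optional case-insensitive mode):

```javascript
// Count positions at which two equal-length strings differ.
// Returns -1 when the lengths differ, mirroring the behavior described above.
function hammingDistance(a, b, ignoreCase = false) {
  if (ignoreCase) {
    a = a.toLowerCase();
    b = b.toLowerCase();
  }
  if (a.length !== b.length) return -1; // defined only for equal lengths
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

console.log(hammingDistance('karolin', 'kathrin')); // → 3
console.log(hammingDistance('short string', 'longer string')); // → -1
```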
Here’s a code snippet showing the usage of the Hamming algorithm for calculating string distance:
// index.js
var natural = require('natural');
console.log(natural.HammingDistance("karolin", "kathrin", false));
console.log(natural.HammingDistance("karolin", "kerstin", false));
console.log(natural.HammingDistance("short string", "longer string", false));
The output:
3
3
-1
The first two comparisons return 3 because three letters differ. The last one returns -1 because the lengths of the strings being compared are different.
Text classification, also known as text tagging, is the process of classifying text into organized groups. That is, given a new, unknown statement, our processing system can decide which category it fits best based on its content.
Some of the most common use cases for automatic text classification include spam filtering, sentiment tagging, and topic labeling.
Natural currently supports two classifiers: Naive Bayes and logistic regression. The following example uses the BayesClassifier class:
// index.js
var natural = require('natural');
var classifier = new natural.BayesClassifier();
classifier.addDocument('i am long qqqq', 'buy');
classifier.addDocument('buy the q\'s', 'buy');
classifier.addDocument('short gold', 'sell');
classifier.addDocument('sell gold', 'sell');
classifier.train();
console.log(classifier.classify('i am short silver'));
console.log(classifier.classify('i am long copper'));
In the code above, we trained the classifier on sample text, using reasonable defaults to tokenize and stem the input. Based on the training samples, the console logs the following output:
sell
buy
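To see what a naive Bayes classifier is doing under the hood, here is a compact sketch: each category is scored by a log prior plus smoothed log likelihoods of the input's words, learned from the training documents. This is an illustration of the general technique (with add-one smoothing and plain whitespace tokenization), not natural's internal implementation, which also stems tokens:

```javascript
// Train: count per-label documents and word frequencies.
function trainNaiveBayes(examples) {
  const wordCounts = {}; // per-label word frequencies
  const docCounts = {};  // per-label document counts
  const vocab = new Set();
  for (const [text, label] of examples) {
    docCounts[label] = (docCounts[label] || 0) + 1;
    wordCounts[label] = wordCounts[label] || {};
    for (const word of text.toLowerCase().split(/\s+/)) {
      vocab.add(word);
      wordCounts[label][word] = (wordCounts[label][word] || 0) + 1;
    }
  }
  return { wordCounts, docCounts, vocab };
}

// Classify: pick the label with the highest log-probability score.
function classify(model, text) {
  const { wordCounts, docCounts, vocab } = model;
  const totalDocs = Object.values(docCounts).reduce((a, b) => a + b, 0);
  let best = null;
  let bestScore = -Infinity;
  for (const label of Object.keys(docCounts)) {
    const labelTotal = Object.values(wordCounts[label]).reduce((a, b) => a + b, 0);
    // log prior + sum of add-one-smoothed log likelihoods
    let score = Math.log(docCounts[label] / totalDocs);
    for (const word of text.toLowerCase().split(/\s+/)) {
      const count = wordCounts[label][word] || 0;
      score += Math.log((count + 1) / (labelTotal + vocab.size));
    }
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  }
  return best;
}

const model = trainNaiveBayes([
  ['i am long qqqq', 'buy'],
  ["buy the q's", 'buy'],
  ['short gold', 'sell'],
  ['sell gold', 'sell'],
]);

console.log(classify(model, 'i am short silver')); // → sell
console.log(classify(model, 'i am long copper'));  // → buy
```

Even this tiny model reproduces the article's results: the word "short" pulls the first statement toward the sell category, and "long" pulls the second toward buy.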
Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.
Natural supports algorithms that can calculate the sentiment of each piece of text by summing the polarity of each word and normalizing it with the length of the sentence. If a negation occurs the result is made negative.
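The scoring scheme (sum the polarity of each word, then normalize by the number of tokens) can be sketched in a few lines of plain JavaScript. The polarity values below are made up for illustration; they are not actual AFINN entries, and this sketch omits negation handling:

```javascript
// Toy polarity table: hypothetical values, not a real sentiment lexicon.
const polarity = { great: 3, love: 3, bad: -3, hate: -3 };

// Sum the polarity of each token, normalized by sentence length.
function sentiment(tokens) {
  const sum = tokens.reduce((acc, word) => acc + (polarity[word] || 0), 0);
  return sum / tokens.length;
}

console.log(sentiment(['i', 'love', 'this', 'library'])); // → 0.75 (3 / 4)
console.log(sentiment(['this', 'is', 'bad']));            // → -1 (-3 / 3)
```

Normalizing by length keeps long, mostly neutral sentences from outscoring short, strongly worded ones.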
Here’s an example of its usage:
// index.js
var natural = require('natural');
var Analyzer = natural.SentimentAnalyzer;
var stemmer = natural.PorterStemmer;
var analyzer = new Analyzer("English", stemmer, "afinn");
// getSentiment expects an array of strings
console.log(analyzer.getSentiment(["I", "don't", "want", "to", "play", "with", "you"]));
The constructor takes three parameters: the language, the stemmer (used to widen the coverage of the sentiment vocabulary), and the vocabulary type, for which "afinn", "senticon", or "pattern" are valid values.

Running the code above gives the following output:
0.42857142857142855 // a mildly positive score for this sentence
Using natural, we can compare two words that are spelled differently but sound similar using phonetic matching. Here’s an example using the metaphone.compare()
method:
// index.js
var natural = require('natural');
var metaphone = natural.Metaphone;
var soundEx = natural.SoundEx; // SoundEx is another available phonetic algorithm
var wordA = 'phonetics';
var wordB = 'fonetix';
if (metaphone.compare(wordA, wordB))
console.log('They sound alike!');
// We can also obtain the raw phonetics of a word using process()
console.log(metaphone.process('phonetics'));
We also obtain the raw phonetics of a word using process(). Running the code above gives the following output:
They sound alike!
FNTKS
Users may make typographical errors when supplying input to a web application through a search bar or an input field. Natural has a probabilistic spellchecker that can suggest corrections for misspelled words using an array of tokens from a text corpus.
Let’s explore an example using a two-word corpus for simplicity:
// index.js
var natural = require('natural');
var corpus = ['something', 'soothing'];
var spellcheck = new natural.Spellcheck(corpus);
console.log(spellcheck.getCorrections('soemthing', 1));
console.log(spellcheck.getCorrections('soemthing', 2));
It suggests corrections (sorted by probability in descending order) that are up to a maximum edit distance away from the input word. A maximum distance of 1 covers 80% to 95% of spelling mistakes; beyond a distance of 2, the lookup becomes very slow.
We get the following output from running the code:
[ 'something' ]
[ 'something', 'soothing' ]
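Notice that 'something' is found at distance 1 even though it takes two single-character substitutions to turn 'soemthing' into it: the swap of adjacent letters counts as one edit. One way to compute a distance with that property is the optimal string alignment variant of Levenshtein distance, sketched below as an illustration of the idea (this is not natural's internal implementation):

```javascript
// Optimal string alignment distance: Levenshtein edits (insert, delete,
// substitute) plus adjacent-letter transpositions, each costing 1.
function editDistance(a, b) {
  const d = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 0; i <= a.length; i++) d[i][0] = i;
  for (let j = 0; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution
      );
      // an adjacent transposition (swap of two neighboring letters) costs 1
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
      }
    }
  }
  return d[a.length][b.length];
}

console.log(editDistance('soemthing', 'something')); // → 1 (one swap)
console.log(editDistance('soemthing', 'soothing'));  // → 2
```

This matches the spellchecker's results above: 'something' is reachable within distance 1, while 'soothing' only appears once the maximum distance is raised to 2.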
Here’s a quick summary of what we’ve covered in this article: tokenization, stemming, string distance, text classification with Naive Bayes, sentiment analysis, phonetic matching, and spellchecking, all using the natural library in Node.
The source code for all of the usage examples above is available on GitHub. Feel free to clone it, fork it, or submit an issue.
#nodejs #node #javascript #machine-learning #data-science