A step-by-step guide to building a text classifier with CNNs implemented in PyTorch.
“Deep Learning is more than adding layers”
The objective of this blog is to develop, step by step, a text classifier by implementing convolutional neural networks. The blog is divided into the following sections:
So, let’s get started!
The text classification problem can be addressed with different approaches; for example, by considering the frequency of occurrence of words in a given text with respect to the occurrence of these words in the complete corpus.
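As a minimal illustration of this frequency-based approach (a sketch added here for clarity, not code from the original post), a TF-IDF representation can be fed to a simple linear classifier; the tiny corpus and labels below are hypothetical placeholders:
# Minimal sketch of a frequency-based (TF-IDF) text classifier using scikit-learn.
# The texts and labels are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["the movie was great", "terrible plot and acting", "a wonderful experience"]
labels = [1, 0, 1]  # 1 = positive, 0 = negative

# TF-IDF weighs each word by its frequency in a text relative to the whole corpus
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["what a great experience"]))  # likely [1] on this toy data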
On the other hand, there exist other approaches where the text is modeled as a sequence of words or characters; this type of approach mainly uses models based on Recurrent Neural Network architectures.
If you want to know more about text classification with LSTM recurrent neural networks, take a look at this blog: Text Classification with LSTMs in PyTorch
However, there is another approach where the text is modeled as a distribution of words in a given space. This is achieved through the use of Convolutional Neural Networks (CNNs).
So, we are going to follow this last approach: we will build a model to classify text, considering the spatial distribution of the set of words that make up the text, using an architecture based on CNNs.
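To make the idea concrete before the step-by-step walkthrough, here is a minimal sketch of a CNN-based text classifier in PyTorch; the vocabulary size, embedding dimension, filter sizes, and number of classes are illustrative assumptions, not the values used later in the post:
# Minimal sketch of a CNN text classifier in PyTorch (illustrative hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, num_filters=32,
                 kernel_sizes=(2, 3, 4), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One 1D convolution per kernel size, sliding over the word dimension
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):                      # x: (batch, seq_len) of token indexes
        e = self.embedding(x).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Convolve, apply ReLU, then max-pool over the sequence dimension
        pooled = [F.relu(conv(e)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # (batch, num_classes) logits

model = TextCNN()
dummy_batch = torch.randint(0, 5000, (8, 20))  # 8 sequences of 20 token indexes
print(model(dummy_batch).shape)                # torch.Size([8, 2])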
Let’s start!
#pytorch #cnn #text #neural-networks #python
In the previous chapters you've learnt how to select individual elements on a web page. But there are many occasions where you need to access a child, parent or ancestor element. See the JavaScript DOM nodes chapter to understand the logical relationships between the nodes in a DOM tree.
A DOM node provides several properties and methods that allow you to navigate or traverse the tree structure of the DOM and make changes very easily. In the following section we will learn how to navigate up, down, and sideways in the DOM tree using JavaScript.
You can use the firstChild and lastChild properties of a DOM node to access its first and last direct child node, respectively. If the node doesn't have any children, these properties return null.
<div id="main">
<h1 id="title">My Heading</h1>
<p id="hint"><span>This is some text.</span></p>
</div>
<script>
var main = document.getElementById("main");
console.log(main.firstChild.nodeName); // Prints: #text
var hint = document.getElementById("hint");
console.log(hint.firstChild.nodeName); // Prints: SPAN
</script>
Note: The nodeName property is read-only and returns the name of the current node as a string. For example, it returns the tag name for an element node, #text for a text node, #comment for a comment node, #document for the document node, and so on.
If you look at the above example, the nodeName of the first child node of the main DIV element returns #text instead of H1. This is because whitespace such as spaces, tabs, and newlines are valid characters; they form #text nodes and become part of the DOM tree. Since the <div> tag contains a newline before the <h1> tag, a #text node is created.
To avoid the issue of firstChild and lastChild returning #text or #comment nodes, you can alternatively use the firstElementChild and lastElementChild properties, which return only the first and last element node, respectively. However, they do not work in IE 8 and earlier.
<div id="main">
<h1 id="title">My Heading</h1>
<p id="hint"><span>This is some text.</span></p>
</div>
<script>
var main = document.getElementById("main");
alert(main.firstElementChild.nodeName); // Outputs: H1
main.firstElementChild.style.color = "red";
var hint = document.getElementById("hint");
alert(hint.firstElementChild.nodeName); // Outputs: SPAN
hint.firstElementChild.style.color = "blue";
</script>
Similarly, you can use the childNodes property to access all child nodes of a given element, where the first child node is assigned index 0. Here's an example:
<div id="main">
<h1 id="title">My Heading</h1>
<p id="hint"><span>This is some text.</span></p>
</div>
<script>
var main = document.getElementById("main");
// First check that the element has child nodes
if(main.hasChildNodes()) {
var nodes = main.childNodes;
// Loop through node list and display node name
for(var i = 0; i < nodes.length; i++) {
alert(nodes[i].nodeName);
}
}
</script>
The childNodes property returns all child nodes, including non-element nodes like text and comment nodes. To get a collection of element nodes only, use the children property instead.
<div id="main">
<h1 id="title">My Heading</h1>
<p id="hint"><span>This is some text.</span></p>
</div>
<script>
var main = document.getElementById("main");
// First check that the element has child nodes
if(main.hasChildNodes()) {
var nodes = main.children;
// Loop through node list and display node name
for(var i = 0; i < nodes.length; i++) {
alert(nodes[i].nodeName);
}
}
</script>
The question remains open: how do we learn semantics? What is semantics? Would DL-based models be capable of learning semantics?
The aim of this blog is to explain how to build a text classifier based on LSTMs and how to implement it using the PyTorch framework.
I would like to start with the following question: how do we classify a text? Several approaches have been proposed from different viewpoints and under different premises, but which is the most suitable one? It's interesting to pause for a moment and ask ourselves: how do we as humans classify a text? What do our brains take into account to classify a text? Such questions are hard to answer.
Currently, we have access to many different text types, such as emails, movie reviews, social media posts, books, etc. In this sense, the text classification problem is determined by what is intended to be classified (e.g., is it intended to classify the polarity of a given text? To classify a set of movie reviews by category? To classify a set of texts by topic?). In this regard, the problem of text classification is most of the time categorized under the following tasks:
In order to go deeper into this hot topic, I really recommend taking a look at this paper: Deep Learning Based Text Classification: A Comprehensive Review.
The two key components of this model are tokenization and recurrent neural nets. Tokenization refers to the process of splitting a text into a set of sentences or words (i.e. tokens). In this regard, tokenization techniques can be applied at the sequence level or the word level. To understand the basics of tokenization, you can take a look at: Introduction to Information Retrieval.
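As a small, hedged illustration of word-level tokenization (not taken from the post), tokens can be mapped to integer indexes with a vocabulary built from a toy corpus:
# Minimal sketch of word-level tokenization and index-based representation.
corpus = ["the movie was great", "the plot was weak"]

# Split each text into word tokens
tokenized = [text.lower().split() for text in corpus]

# Build a vocabulary that maps each token to an integer index (0 reserved for padding)
vocab = {"<pad>": 0}
for tokens in tokenized:
    for token in tokens:
        vocab.setdefault(token, len(vocab))

# Transform each token sequence into its index-based representation
indexed = [[vocab[token] for token in tokens] for tokens in tokenized]
print(indexed)  # [[1, 2, 3, 4], [1, 5, 3, 6]]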
On the other hand, RNNs (Recurrent Neural Networks) are a kind of neural network well known to work well on sequential data, such as text. In this case, a special kind of RNN is implemented: the LSTM (Long Short-Term Memory) network. LSTMs are one of the improved versions of RNNs; essentially, LSTMs have shown better performance when working with longer sentences. To go deeper into what RNNs and LSTMs are, you can take a look at: Understanding LSTM Networks.
Since the idea of this blog is to present a baseline model for text classification, the text preprocessing phase is based on the tokenization technique: each text sentence will be tokenized, and each token will be transformed into its index-based representation. Each index-based token sequence will then be passed through an embedding layer, which outputs an embedded representation of each token. These embeddings are passed through a two-layer stacked LSTM, and the last LSTM hidden state is passed through a two-linear-layer neural net that outputs a single value filtered by a sigmoid activation function. The following image describes the model architecture:
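A minimal PyTorch sketch of that pipeline (embedding layer, two stacked LSTM layers, two linear layers, sigmoid output) could look as follows; the vocabulary size and layer dimensions are illustrative assumptions, not the exact values from the post:
# Rough sketch: embedding -> 2-layer stacked LSTM -> 2 linear layers -> sigmoid.
# Vocabulary size and layer dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, x):                  # x: (batch, seq_len) of token indexes
        embedded = self.embedding(x)       # (batch, seq_len, embed_dim)
        output, _ = self.lstm(embedded)    # (batch, seq_len, hidden_dim)
        last_hidden = output[:, -1, :]     # last LSTM hidden state per sequence
        h = torch.relu(self.fc1(last_hidden))
        return torch.sigmoid(self.fc2(h))  # single value in (0, 1)

model = LSTMClassifier()
dummy_batch = torch.randint(0, 5000, (4, 10))  # 4 sequences of 10 token indexes
print(model(dummy_batch).shape)                # torch.Size([4, 1])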
#pytorch #text-mining #lstm #text-classification #python
Text processing mainly requires Natural Language Processing (NLP), which means processing the data in a useful way so that a machine can understand human language with the help of an application or product. Using NLP we can derive information from textual data, such as sentiment and polarity, which is useful for creating text-processing-based applications.
Python provides different open-source libraries and modules, some of them built on top of NLTK, that help with text processing using NLP functions. Different libraries have different functionalities that can be applied to data to obtain meaningful results. One such library is Pattern.
Pattern is an open-source Python library that performs different NLP tasks. It is mostly used for text processing due to the various functionalities it provides. Besides text processing, Pattern is used for data mining, i.e. we can extract data from various sources such as Twitter, Google, etc. using the data-mining functions it provides.
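As a quick, hedged taste (assuming Pattern is installed, e.g. via pip install pattern, and that its pattern.en module is available), a couple of its text-processing functions look like this:
# Minimal sketch of Pattern's text-processing functions (assumes the pattern package is installed).
from pattern.en import sentiment, parse

# sentiment() returns a (polarity, subjectivity) pair for the given text
print(sentiment("The product works really well and I love it."))

# parse() returns the text annotated with part-of-speech tags
print(parse("The cat sat on the mat."))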
In this article, we will try to cover the following points:
#developers corner #data mining #text analysis #text analytics #text classification #text dataset #text-based algorithm