Angela Dickens

Text Classification on Disaster Tweets with LSTM and Word Embedding

This was my first Kaggle notebook and I thought why not write it on Medium too?

Full code on my GitHub.

In this post, I will show how to use fastText and GloVe as word embeddings in an LSTM model for text classification. I became interested in word embeddings while working on my paper on Natural Language Generation (NLG). There, using a pretrained embedding matrix as the weights of the embedding layer improved the performance of the model. But since the task was NLG, the evaluation was subjective, and I only used fastText. So in this article, I want to see how each method (with fastText, with GloVe, and without pretrained embeddings) affects the predictions. In my GitHub code, I also compare the results with a CNN. The dataset I use here comes from a Kaggle competition: it consists of tweets, each labelled with whether it uses disaster words to report a real disaster or merely uses them metaphorically. Honestly, on first seeing this dataset, I immediately thought of BERT and its ability to understand language far better than the approach I propose in this article (further reading on BERT).
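To make that concrete, here is a minimal sketch of the core idea, written as a Keras snippet of my own rather than the exact code from my notebook: pretrained vectors fill the weight matrix of the Embedding layer. The helper names word_index (from a fitted Tokenizer) and embeddings_index (a word-to-vector dict loaded from GloVe or fastText) are illustrative assumptions.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

def build_model(word_index, embeddings_index, embedding_dim=300):
    # word_index: word -> integer index (e.g. from a fitted Keras Tokenizer)
    # embeddings_index: word -> pretrained GloVe/fastText vector (illustrative name)
    vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding

    # Row i of the matrix holds the pretrained vector of the word with index i;
    # words missing from the pretrained vocabulary stay as zero vectors.
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

    model = Sequential([
        Embedding(vocab_size, embedding_dim,
                  weights=[embedding_matrix],  # pretrained weights
                  trainable=False),            # keep the vectors frozen
        LSTM(64),
        Dense(1, activation="sigmoid"),        # binary output: real disaster or not
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model

The "without pretrained embeddings" configuration then amounts to dropping the weights argument and setting trainable=True.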

But anyway, in this article I will focus on fastText and GloVe.

Let’s go?


Data + Pre-Processing

The data consists of 7,613 tweets (column text) with a label (column target) indicating whether each tweet is about a real disaster or not: 3,271 rows report a real disaster and 4,342 rows do not. The data was shared for a Kaggle competition; if you want to learn more about it, you can read about it here.


Example of a disaster word used in a text about a real disaster:

“Forest fire near La Ronge Sask. Canada”

Example of a disaster word used metaphorically, not about a real disaster:

“These boxes are ready to explode Exploding Kittens finally arrived! gameofkittens #explodingkittens”

The data will be split into a training set (6,090 rows) and a test set (1,523 rows) before pre-processing. We will only use the text and target columns.
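As a rough sketch of that split (assuming scikit-learn, the competition's train.csv, and an arbitrary random_state), a test_size of 0.2 on 7,613 rows yields exactly the 6,090/1,523 division:

import pandas as pd
from sklearn.model_selection import train_test_split

# Keep only the two columns we need.
df = pd.read_csv("train.csv")[["text", "target"]]

train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["text"], df["target"], test_size=0.2, random_state=42)

print(len(train_texts), len(test_texts))  # 6090 1523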

#data-science #lstm #word-embeddings #nlp #text-classification #data-analysis


Navigating Between DOM Nodes in JavaScript

In the previous chapters you've learnt how to select individual elements on a web page. But there are many occasions where you need to access a child, parent or ancestor element. See the JavaScript DOM nodes chapter to understand the logical relationships between the nodes in a DOM tree.

A DOM node provides several properties and methods that let you navigate or traverse the tree structure of the DOM and make changes easily. In the following sections, we will learn how to navigate up, down, and sideways in the DOM tree using JavaScript.

Accessing the Child Nodes

You can use the firstChild and lastChild properties of a DOM node to access its first and last direct child node, respectively. If the node doesn't have any child nodes, these properties return null.

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");
console.log(main.firstChild.nodeName); // Prints: #text

var hint = document.getElementById("hint");
console.log(hint.firstChild.nodeName); // Prints: SPAN
</script>

Note: The nodeName is a read-only property that returns the name of the current node as a string. For example, it returns the tag name for an element node, #text for a text node, #comment for a comment node, #document for a document node, and so on.

If you look at the above example, the nodeName of the first child of the main DIV element returns #text instead of H1. This is because whitespace characters such as spaces, tabs, and newlines are valid characters; they form #text nodes that become part of the DOM tree. Since the <div> tag contains a newline before the <h1> tag, that newline creates a #text node.

To avoid the issue of firstChild and lastChild returning #text or #comment nodes, you can use the firstElementChild and lastElementChild properties instead, which return only the first and last element node, respectively. However, they do not work in IE 8 and earlier.

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");
alert(main.firstElementChild.nodeName); // Outputs: H1
main.firstElementChild.style.color = "red";

var hint = document.getElementById("hint");
alert(hint.firstElementChild.nodeName); // Outputs: SPAN
hint.firstElementChild.style.color = "blue";
</script>

Similarly, you can use the childNodes property to access all child nodes of a given element, where the first child node is assigned index 0. Here's an example:

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");

// First check that the element has child nodes 
if(main.hasChildNodes()) {
    var nodes = main.childNodes;
    
    // Loop through node list and display node name
    for(var i = 0; i < nodes.length; i++) {
        alert(nodes[i].nodeName);
    }
}
</script>

The childNodes property returns all child nodes, including non-element nodes like text and comment nodes. To get a collection of elements only, use the children property instead.

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");

// First check that the element has child nodes 
if(main.hasChildNodes()) {
    var nodes = main.children;
    
    // Loop through node list and display node name
    for(var i = 0; i < nodes.length; i++) {
        alert(nodes[i].nodeName);
    }
}
</script>

#javascript 

Text Classification Using Long Short Term Memory & GloVe Embeddings

Preparing textual data for machine learning is a little different from preparing tabular data. What makes text data different is that it is mostly in string form, so we have to find the best way to represent it numerically. In this piece, we'll see how to prepare textual data using TensorFlow. Eventually, we'll build a bidirectional long short-term memory (LSTM) model to classify text data.
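As a minimal sketch of that pipeline (tokenize, pad, then a bidirectional LSTM), assuming TensorFlow/Keras; the toy texts and all hyperparameters here are illustrative, not the article's:

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["Forest fire near La Ronge", "These boxes are ready to explode"]
labels = np.array([1, 0])

# Map words to integer ids and pad every sequence to the same length.
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
padded = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=32)

# A bidirectional LSTM reads each sequence forwards and backwards.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(padded, labels, epochs=2, verbose=0)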

#text-classification #text-preprocessing #lstm #machine-learning #heartbeat

Oleta Becker

Document embeddings and text classification without coding

Text is described by a sequence of characters. Since every machine learning algorithm needs numbers, we have to transform text into vectors of real numbers before we can continue with the analysis. There are various approaches to this. The best-known approach before the rise of deep learning was bag-of-words, which is still widely used because of its advantages. The recent boom in deep learning brought us new approaches, such as word and document embeddings. In this post, we explain what document embedding is and why it is useful, and show its usage in a classification example without coding. For the analysis, we will use the Orange open-source tool.

Word embedding and document embedding

Before we can understand document embeddings, we need to understand the concept of word embeddings. A word embedding is a representation of a word in a multidimensional space such that words with similar meanings have similar embeddings. Each word is mapped to a vector of real numbers that represents it. Embedding models are mostly based on neural networks.

A document embedding is usually computed from the word embeddings in two steps: first, each word in the document is embedded with the word embedding model; then the word embeddings are aggregated. The most common type of aggregation is the average over each dimension.
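A minimal sketch of those two steps, with a tiny hand-made 3-dimensional vocabulary standing in for a real pretrained embedding model:

import numpy as np

# Toy word embeddings; real models (e.g. fastText) use ~300 dimensions.
word_vectors = {
    "good":  np.array([0.8, 0.1, 0.0]),
    "movie": np.array([0.2, 0.7, 0.1]),
}

def embed_document(tokens, vectors, dim=3):
    # Step 1: embed each known word; step 2: average over each dimension.
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

print(embed_document(["good", "movie", "unseen"], word_vectors))
# -> [0.5  0.4  0.05]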

Why and when should we use embedders?

Compared to bag-of-words, which counts the number of appearances of each token (word) in a document, embeddings have three main advantages:

  • They do not suffer from the dimensionality problem. The result of bag-of-words is a table whose number of features equals the number of unique tokens across all documents in a corpus. Large corpora with long texts produce many unique tokens, resulting in huge tables that can exceed the computer's memory and increase the training and evaluation time of machine learning models. Embeddings have a constant vector dimensionality, which is 300 for the fastText embeddings that Orange uses (see the sketch after this list).
  • Most of the preprocessing is not required. With the bag-of-words approach, we tackle the dimensionality problem through text preprocessing, removing tokens (e.g. words) that seem less important for the analysis; this can also remove some important tokens. With embedders, we do not need to remove tokens, so we do not lose accuracy. Most of the basic preprocessing (such as normalization) can also be omitted in the case of fastText embeddings.
  • Embeddings can be pretrained on large corpora with billions of tokens. That way, they capture the significant characteristics of the language and produce high-quality embeddings. The pretrained models are then used to obtain embeddings of smaller datasets.
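The dimensionality contrast from the first point can be seen in a few lines (a sketch assuming scikit-learn for the bag-of-words side):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the fire spread fast", "the concert was fire"]

# Bag-of-words: one column per unique token, so the width grows with the corpus.
bow = CountVectorizer().fit_transform(docs)
print(bow.shape)  # (2, 6)

# A document-embedding table for the same corpus would be (2, 300) with the
# fastText vectors Orange uses, regardless of how many unique tokens appear.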

The shortcoming of embedders is that they are difficult to interpret. For example, with bag-of-words we can easily observe which tokens are important for classification, since the tokens themselves are the features. With document embeddings, the features are numbers that are not by themselves understandable to a human.

#embedding #data-science #text-analysis #machine-learning #text-embedding

Daron Moore

Hands-on Guide to Pattern - A Python Tool for Effective Text Processing and Data Mining

Text processing mainly requires Natural Language Processing (NLP), which means processing data in a way that lets a machine understand human language through an application or product. Using NLP, we can derive information from textual data, such as sentiment and polarity, which is useful for creating text-processing applications.

Python provides different open-source libraries and modules, built on top of NLTK, that help with text processing using NLP functions. Different libraries have different functionalities that can be applied to data to gain meaningful results. One such library is Pattern.

Pattern is an open-source Python library that performs different NLP tasks. It is mostly used for text processing due to the various functionalities it provides. Beyond text processing, Pattern can be used for data mining, i.e., extracting data from various sources such as Twitter, Google, etc., using the data mining functions it provides.
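As a small taste of both sides, here is a sketch based on Pattern's documented pattern.en and pattern.web modules (the query and example text are made up, and the Twitter search requires network access):

from pattern.en import sentiment

# sentiment() returns a (polarity, subjectivity) pair,
# with polarity in [-1, 1] and subjectivity in [0, 1].
print(sentiment("The movie attempts to be surreal and it fails miserably."))

from pattern.web import Twitter

# Data mining: fetch recent tweets that match a search query.
for tweet in Twitter().search("#disaster", count=3):
    print(tweet.text)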

In this article, we will try to cover the following points:

  • NLP Functionalities of Pattern
  • Data Mining Using Pattern

#developers-corner #data-mining #text-analysis #text-analytics #text-classification #text-dataset #text-based-algorithm