Higher accuracy and less process time in text classification

Implementing feature selection methods on text classification

The number of variables or features in a dataset is referred to as its dimensionality. In text classification, the number of features can grow very large. In this post, we transform text into TF-IDF feature vectors and then reduce their dimensionality using Linear Discriminant Analysis (LDA).

Our pathway in this study:

1. Preparing Dataset

2. Transforming text to feature vectors

3. Applying filter methods

4. Applying Linear Discriminant Analysis

5. Building a Random Forest Classifier

6. Result comparison
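
Since the post walks through only the data-preparation steps in detail, here is a minimal end-to-end sketch of steps 2-5 first. The variable names and the toy data below are our own illustration, not taken from the original notebook:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy data (assumption): replace with the prepared reviews and labels.
texts = ['great room and friendly staff', 'dirty bathroom and rude staff'] * 50
labels = [1, 0] * 50

# Step 2: transform text to TF-IDF feature vectors (dense, since LDA needs it).
X = TfidfVectorizer().fit_transform(texts).toarray()

# Step 3: filter method, dropping features with near-zero variance.
X = VarianceThreshold(threshold=1e-4).fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0)

# Step 4: LDA projects onto at most n_classes - 1 axes (one axis for two classes).
lda = LinearDiscriminantAnalysis(n_components=1)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

# Step 5: random forest on the reduced features.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))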

All source code and notebooks have been uploaded to this GitHub repository.

Problem Formulation

Enhancing accuracy and reducing processing time in text classification.

Data Exploration

We are using the “515K Hotel Reviews Data in Europe” dataset from Kaggle. The data was scraped from Booking.com and is already publicly available to everyone; it is originally owned by Booking.com, and you can download it through this profile on Kaggle. Each of the 515,000 rows holds the positive and the negative part of one review.

Import libraries

The most important libraries we use are scikit-learn and pandas.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.preprocessing import StandardScaler
from nltk.stem.snowball import SnowballStemmer
from string import punctuation
from textblob import TextBlob
import re

Preparing Dataset

We are going to work with just two categories: positive and negative. We read the first 5,000 rows and keep only the two review columns, which gives us 5,000 reviews per category in a pandas DataFrame. We used a Kaggle notebook for this project, so the dataset was already available as a local file; if you are using another tool or running this as a script, download the dataset first. Let’s take a glance at it:

fields = ['Positive_Review', 'Negative_Review']
df = pd.read_csv(
    '../input/515k-hotel-reviews-data-in-europe/Hotel_Reviews.csv',
    usecols=fields, nrows=5000)
df.head()

[Output of df.head(): the first five rows of the Positive_Review and Negative_Review columns]

Stemming words using NLTK SnowballStemmer:

stemmer = SnowballStemmer('english')

# Replace every word with its stem in both review columns.
df['Positive_Review'] = df['Positive_Review'].apply(
    lambda x: ' '.join([stemmer.stem(y) for y in x.split()]))

df['Negative_Review'] = df['Negative_Review'].apply(
    lambda x: ' '.join([stemmer.stem(y) for y in x.split()]))
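
A quick sanity check of what the stemmer does (it maps inflected forms to a common root; output shown in the comment):

# Uses the stemmer defined above.
print(stemmer.stem('rooms'), stemmer.stem('running'))  # room run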

Removing Stop-Words:

We exclude stop words using the English stop-word list from countwordsfree.com, a list comprehension, and pandas.DataFrame.apply.

url = "https://countwordsfree.com/stopwords/english/json"
# Fetch the English stop-word list; a set makes the membership test fast.
SW = set(requests.get(url).json()['words'])

df['Positive_Review'] = df['Positive_Review'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in SW]))
df['Negative_Review'] = df['Negative_Review'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in SW]))
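
The excerpt ends here, before the modeling steps. One plausible next step (our assumption; it is not shown in the original) is to stack the two review columns into a single labeled frame so that each review becomes one training example:

# Stack positives and negatives; label 1 = positive, 0 = negative (assumed).
data = pd.concat([
    pd.DataFrame({'review': df['Positive_Review'], 'label': 1}),
    pd.DataFrame({'review': df['Negative_Review'], 'label': 0}),
], ignore_index=True)

From here, data['review'] and data['label'] plug straight into the TF-IDF, filter, LDA, and random-forest steps sketched after the pathway list above.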

#feature-selection #lda #machine-learning #scikit-learn #text-classification #deep learning

Navigating Between DOM Nodes in JavaScript

In the previous chapters you've learnt how to select individual elements on a web page. But there are many occasions where you need to access a child, parent or ancestor element. See the JavaScript DOM nodes chapter to understand the logical relationships between the nodes in a DOM tree.

A DOM node provides several properties and methods that let you navigate or traverse the tree structure of the DOM and make changes easily. In the following sections we will learn how to navigate up, down, and sideways in the DOM tree using JavaScript.

Accessing the Child Nodes

You can use the firstChild and lastChild properties of a DOM node to access its first and last direct child node, respectively. If the node doesn't have any child nodes, these properties return null.

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");
console.log(main.firstChild.nodeName); // Prints: #text

var hint = document.getElementById("hint");
console.log(hint.firstChild.nodeName); // Prints: SPAN
</script>

Note: The nodeName is a read-only property that returns the name of the current node as a string. For example, it returns the tag name for an element node, #text for a text node, #comment for a comment node, #document for the document node, and so on.

If you look at the example above, the nodeName of the first child node of the main DIV element is #text instead of H1. That's because whitespace characters such as spaces, tabs, and newlines are valid characters that form #text nodes and become part of the DOM tree. Since the <div> tag contains a newline before the <h1> tag, that newline creates a #text node.

To avoid the issue of firstChild and lastChild returning #text or #comment nodes, you can alternatively use the firstElementChild and lastElementChild properties, which return only the first and last element node, respectively. Note, however, that they are not supported in IE 8 and earlier.

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");
alert(main.firstElementChild.nodeName); // Outputs: H1
main.firstElementChild.style.color = "red";

var hint = document.getElementById("hint");
alert(hint.firstElementChild.nodeName); // Outputs: SPAN
hint.firstElementChild.style.color = "blue";
</script>

Similarly, you can use the childNodes property to access all child nodes of a given element, where the first child node is assigned index 0. Here's an example:

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");

// First check that the element has child nodes 
if(main.hasChildNodes()) {
    var nodes = main.childNodes;
    
    // Loop through node list and display node name
    for(var i = 0; i < nodes.length; i++) {
        alert(nodes[i].nodeName);
    }
}
</script>

The childNodes property returns all child nodes, including non-element nodes like text and comment nodes. To get a collection of element nodes only, use the children property instead.

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");

// First check that the element has child nodes 
if(main.hasChildNodes()) {
    var nodes = main.children;
    
    // Loop through node list and display node name
    for(var i = 0; i < nodes.length; i++) {
        alert(nodes[i].nodeName);
    }
}
</script>

#javascript 

Daron Moore

Hands-on Guide to Pattern - A Python Tool for Effective Text Processing and Data Mining

Text processing mainly relies on Natural Language Processing (NLP), which means processing data in a useful way so that a machine can understand human language through an application or product. Using NLP we can derive information from textual data, such as sentiment and polarity, which is useful for creating text-processing applications.

Python provides different open-source libraries and modules, built on top of NLTK, that help with text processing using NLP functions. Different libraries offer different functionality that can be applied to data to obtain meaningful results. One such library is Pattern.

Pattern is an open-source Python library that performs various NLP tasks. It is mostly used for text processing thanks to the range of functions it provides. Beyond text processing, Pattern is also used for data mining: we can extract data from sources such as Twitter and Google using the data-mining functions it provides.

In this article, we will try and cover the following points:

  • NLP Functionalities of Pattern
  • Data Mining Using Pattern
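
As a taste of the NLP side, here is a minimal sketch using Pattern's sentiment function (this assumes Pattern is installed, e.g. via pip install pattern; the example sentence is ours, not from the article):

from pattern.en import sentiment

# sentiment() returns a (polarity, subjectivity) pair; polarity lies in [-1, 1].
print(sentiment('The hotel staff were wonderfully helpful.'))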

#developers corner #data mining #text analysis #text analytics #text classification #text dataset #text-based algorithm

Kasey Turcotte

One Line of Code for a Common Text Pre-Processing Step in Pandas

A quick look at splitting text columns for use in machine learning and data analysis

Sometimes you’ll want to do some processing to create new variables out of your existing data. This can be as simple as splitting up a “name” column into “first name” and “last name”.

Whatever the case may be, pandas lets you work effortlessly with text data through a variety of built-in methods. In this piece, we’ll look specifically at parsing text columns for the exact information you need, either for further data analysis or for use in a machine learning model.

If you’d like to follow along, go ahead and download the ‘train’ dataset here. Once you’ve done that, make sure it’s saved to the same directory as your notebook and then run the code below to read it in:

import pandas as pd
df = pd.read_csv('train.csv')

Let’s get to it!
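
As a preview of the kind of one-liner this piece is about, pandas' Series.str.split handles the name-splitting case mentioned earlier. The column name and sample values below are our own illustration, not from the train dataset:

import pandas as pd

df = pd.DataFrame({'name': ['Ada Lovelace', 'Alan Turing']})
# expand=True returns a DataFrame; n=1 splits only on the first space.
df[['first_name', 'last_name']] = df['name'].str.split(' ', n=1, expand=True)
print(df)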

#programming #python #one line of code for a common text pre-processing step in pandas #pandas #one line of code for a common text pre-processing #text pre-processing

iOS App Dev

Making Sense of Unbounded Data & Real-Time Processing Systems

Unbounded data refers to continuous, never-ending data streams with no beginning or end. The data is made available over time, and anyone who wishes to act upon it can do so without downloading it first.

As Martin Kleppmann stated in his famous book, unbounded data will never “complete” in any meaningful way.

“In reality, a lot of data is unbounded because it arrives gradually over time: your users produced data yesterday and today, and they will continue to produce more data tomorrow. Unless you go out of business, this process never ends, and so the dataset is never “complete” in any meaningful way.”

— Martin Kleppmann, Designing Data-Intensive Applications

Processing unbounded data requires an entirely different approach than its counterpart, batch processing. This article summarises the value of unbounded data and how you can build systems to harness the power of real-time data.

#stream-processing #software-architecture #event-driven-architecture #data-processing #data-analysis #big-data-processing #real-time-processing #data-storage