Text Summarization from scratch

During our school days, most of us would have encountered the reading comprehension section of our English paper. We would be given a paragraph or essay based on which we needed to answer several questions.

How do we as humans approach this task? We go through the entire text, make sense of the context in which the question is asked, and then we write our answers. Is there a way we can use AI and deep learning techniques to mimic this behavior?

Automatic text summarization is a common problem in machine learning and natural language processing (NLP). There are two approaches to this problem.

  1. Extractive Summarization - Extractive text summarization is done by picking the most important sentences from the original text and combining them to form the final summary. We do some kind of extractive text summarization to solve our simple reading comprehension exercises. TextRank is a very popular extractive and unsupervised text summarization technique (see the sketch after this list).

  2. Abstractive Summarization - Abstractive text summarization, on the other hand, is a technique in which the summary is generated by forming novel sentences, either by rephrasing the original text or by using new words, instead of simply extracting the important sentences. For example, some questions in a reading comprehension might not be straightforward; in such cases, we rephrase or use new words to answer them.
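
To make the extractive idea concrete, here is a minimal sketch in the spirit of TextRank. It is an illustration, not the exact TextRank formulation: TF-IDF cosine similarity stands in for the paper's word-overlap measure, and it assumes nltk, networkx, and scikit-learn are installed.

```python
# A TextRank-style extractive summarizer sketch: rank sentences by PageRank
# over a sentence-similarity graph, then keep the top few in original order.
import networkx as nx
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt", quiet=True)  # sentence tokenizer models


def extractive_summary(text: str, n_sentences: int = 2) -> str:
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= n_sentences:
        return text
    # Represent each sentence as a TF-IDF vector.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    # Build a graph whose edge weights are pairwise sentence similarities.
    graph = nx.from_numpy_array(cosine_similarity(tfidf))
    # PageRank scores each sentence by its centrality in that graph.
    scores = nx.pagerank(graph)
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:n_sentences])
    return " ".join(sentences[i] for i in top)
```

Note that the returned sentences are copied verbatim from the input, which is exactly what makes this approach extractive rather than abstractive.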

We humans can easily do both kinds of text summarization. In this blog, let us see how to implement abstractive text summarization using deep learning techniques.

Problem Statement

Given the text of a news article, we are going to summarize it and generate an appropriate headline.

Whenever a media account shares a news story on Twitter or any other social networking site, it provides a crisp headline, or clickbait, to make users click the link and read the article.

Media houses often provide sensational headlines that serve as clickbait, a technique employed to increase clicks to their site.

Our problem statement is to generate headlines given article text. For this, we are using the news_summary dataset. You can download the dataset [here].
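
As a quick peek at the data, here is a hedged loading sketch. The file name and the column names ctext (full article text) and headlines are assumptions based on the Kaggle copy of the dataset; adjust them to match your download.

```python
# Sketch: loading (article, headline) pairs from the news_summary CSV.
# Column names "ctext" and "headlines" are assumed; check your copy.
import pandas as pd

df = pd.read_csv("news_summary.csv", encoding="latin-1")
pairs = df[["ctext", "headlines"]].dropna()  # (article text, headline) pairs
print(pairs.shape)
print(pairs.iloc[0]["headlines"])
```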

A tweet from CNN with a headline for an article on COVID-19

Before we go through the rest of the code, let us learn some concepts needed for building an abstractive text summarizer.

Sequence to Sequence Model

Techniques like the multi-layer perceptron (MLP) work well when your input data is a vector, and convolutional neural networks (CNNs) work very well when your input data is an image.

What if my input x is a sequence? What if x is a sequence of words? In most languages, the order of words matters a lot, so we need to somehow preserve the sequence of words.

The core idea here is that if the output depends on a sequence of inputs, then we need a new type of neural network that gives importance to sequence information, one that somehow retains and leverages it.

Google Translate is a very good example of a Seq2Seq model application

We can build a Seq2Seq model for any problem that involves sequential information. In our case, the objective is to build a text summarizer where the input is a long sequence of words (the article body) and the output is a summary (which is a sequence as well). So we can model this as a many-to-many Seq2Seq problem.

A many-to-many Seq2Seq model has two building blocks: the **Encoder** and the **Decoder**. The Encoder-Decoder architecture is mainly used to solve sequence-to-sequence (Seq2Seq) problems where the input and output sequences are of different lengths.

Generally, variants of Recurrent Neural Networks (RNNs), such as the Gated Recurrent Unit (GRU) or Long Short-Term Memory (LSTM), are preferred as the encoder and decoder components, because they are capable of capturing long-term dependencies by overcoming the vanishing gradient problem.

Encoder-Decoder Architecture

Let us first see a high-level overview of the Encoder-Decoder architecture, and then look at its detailed working in the training and inference phases.

Intuitively this is what happens in our encoder-decoder network:

1. We feed our input (in our case, text from news articles) to the Encoder unit. The encoder reads the input sequence and summarizes the information into internal state vectors (in the case of an LSTM, these are called the hidden state and cell state vectors).

2. The encoder generates something called the context vector, which is passed to the decoder unit as input. The outputs generated by the encoder at each step are discarded, and only the context vector is passed over to the decoder.

3. The decoder unit generates an output sequence based on the context vector, as sketched in the code below.
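
A minimal Keras sketch of those three steps follows. The vocabulary sizes and latent dimension are placeholder values, the inputs are assumed to be integer-encoded, padded sequences, and there is no attention mechanism; it shows only the bare training-phase wiring.

```python
# Bare encoder-decoder wiring in Keras (training phase). Vocabulary sizes,
# dimensions, and preprocessing are assumed placeholders.
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense, Embedding, Input
from tensorflow.keras.models import Model

src_vocab, tgt_vocab, latent_dim = 20000, 8000, 256  # placeholder sizes

# 1. The encoder reads the article; we keep only its final states.
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(src_vocab, latent_dim)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# 2. The final hidden/cell states act as the context vector handed to the
#    decoder; the encoder's per-step outputs are discarded (the "_" above).
encoder_states = [state_h, state_c]

# 3. The decoder, initialized with the context, predicts the headline one
#    token at a time (teacher-forced with the true headline during training).
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, latent_dim)(decoder_inputs)
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(dec_emb,
                                                initial_state=encoder_states)
decoder_outputs = Dense(tgt_vocab, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
```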

We can set up the Encoder-Decoder in 2 phases:

  • Training phase
  • Inference phase

Training phase

A. Encoder

In the training phase, at every time step we feed the words of a sentence one by one, in sequence, into the encoder. For example, for the sentence “I am a good boy”, at time step t=1 the word “I” is fed, at time step t=2 the word “am” is fed, and so on.

Say, for example, we have a sequence x comprising the words x1, x2, x3, x4; then the encoder in the training phase looks like the figure below:

Training Phase

The initial state of the LSTM unit is a zero vector, or it is randomly initialized. Now (h1, c1) is the state of the LSTM unit at time step t=1, when the word x1 of the sequence x is fed as input.

Similarly, (h2, c2) is the state of the LSTM unit at time step t=2, when the word x2 of the sequence x is fed as input, and so on.

The hidden state and cell state of the last time step (h4 and c4 in our four-word example) are used to initialize the decoder.
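
To see these states concretely, here is a tiny sketch with toy dimensions (all sizes assumed): return_sequences exposes h1 through hT, while return_state exposes the final hT and cT that initialize the decoder.

```python
# Toy sketch: inspecting an LSTM's per-step and final states.
import numpy as np
import tensorflow as tf

lstm = tf.keras.layers.LSTM(4, return_sequences=True, return_state=True)
x = np.random.rand(1, 5, 8).astype("float32")  # batch=1, T=5 steps, 8 features

all_h, last_h, last_c = lstm(x)
print(all_h.shape)                 # (1, 5, 4): h1..h5, one per time step
print(last_h.shape, last_c.shape)  # (1, 4) each: hT and cT, used to
                                   # initialize the decoder
```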

#text-summarization #tensorflow #encoder-decoder #deep-learning #deep learning
