An essential task in Natural Language Processing is text classification: the process of assigning categories to a text in order to extract information from it. Extracting information this way is useful in a broad range of contexts, from short tweets, headlines, and comments to customer reviews, articles, legal contracts, and clinical texts. Common applications of text classification include sentiment analysis, language detection, topic labeling, and intent detection.

Use Case:

Electronic Health Records (EHRs) provide an excellent example: much of the valuable information about a patient's condition is embedded in free text, and the meaning of clinical entities is heavily affected by modifiers such as negation. A negation detection algorithm is therefore of interest: it can reduce the number of incorrect negation assignments for patients with positive findings, and thus improve the identification of patients with the target clinical findings in EHRs. However, most of the research in this area has been done for English. That is why I want to show you a practical application of the Conditional Random Fields algorithm for detecting negations in other languages, particularly Spanish.

[Figure: Electronic Health Record of a random individual.]

Conditional Random Fields:

CRF is a probabilistic discriminative model and a special case of a Markov Random Field. Assume a Markov Random Field partitioned into two sets of random variables, X and Y, such that when we condition the graph globally on X, all the random variables in Y satisfy the Markov property. In layman's terms, you can think of it as a probabilistic undirected graph that makes task-specific predictions.
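More concretely, for a linear-chain CRF (the variant typically used for sequence labeling tasks like this one), the model defines the conditional probability of a label sequence y given an observation sequence x as below. This is the standard textbook formulation, with feature functions f_k, learned weights λ_k, and a normalization term Z(x); it is not specific to this project:

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right)
```

The feature functions can look at the current token, its neighbors, its POS tag, and the previous label, which is exactly why the preprocessing described below matters so much.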

[Figure: Simple visual representation of the CRF.]

Data preprocessing:

First of all, we have to consider the labeled data we will use to train the model, usually called a corpus. A corpus is a collection of written material stored on a computer and used to study how language is used; in other words, it is a systematic, computerized collection of authentic language suitable for linguistic analysis.

Let's take some random tweets. They can be stored together with linguistic information in a corpus; an example of what that might look like is shown below.

[Figure: Tweet stored in XML format.]

Keeping the tweet example, the next task is to convert all the information embedded in the corpus into a new format that we can use as input for the CRF model. That means reading whatever format the corpus is written in (in this case, XML) and reshaping it into something useful.
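As a rough sketch of that reading step, here is how one might parse a single annotated tweet with Python's standard xml.etree.ElementTree module. The tag names (tweet, neg, scope) are hypothetical; the exact schema depends on the corpus at hand:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real corpus schema will differ.
raw = '<tweet id="1">hoy <neg>no</neg> <scope>tengo fiebre</scope></tweet>'

root = ET.fromstring(raw)

# Text outside any tag is unmarked (O); <neg> holds the negation
# cue (B) and <scope> holds the words it affects (I).
spans = []
if root.text and root.text.strip():
    spans.append((root.text.strip(), "O"))
for child in root:
    label = {"neg": "B", "scope": "I"}.get(child.tag, "O")
    spans.append((child.text.strip(), label))
    if child.tail and child.tail.strip():
        spans.append((child.tail.strip(), "O"))

print(spans)  # [('hoy', 'O'), ('no', 'B'), ('tengo fiebre', 'I')]
```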

The first step is to read every tweet as plain text, tokenize it, add the Part-Of-Speech (POS) tag for each token, and then attach the relevant information from the corpus. spaCy is a Python library that offers a tokenizer (segmenting text into words and punctuation marks).
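A minimal sketch of that step with spaCy could look like the following, assuming the small Spanish model es_core_news_sm has been installed (python -m spacy download es_core_news_sm):

```python
import spacy

# Load the small Spanish pipeline (tokenizer, POS tagger, etc.).
nlp = spacy.load("es_core_news_sm")

doc = nlp("Hoy no tengo fiebre.")

# One (token, POS tag) pair per token, ready to be merged with
# the BIO labels extracted from the corpus.
tokens = [(token.text, token.pos_) for token in doc]
print(tokens)
# e.g. [('Hoy', 'ADV'), ('no', 'ADV'), ('tengo', 'VERB'),
#       ('fiebre', 'NOUN'), ('.', 'PUNCT')]
```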

Putting all of the above together yields a three-column layout: the first column is the token of the tweet, the second is its POS tag, and the third is the corpus information expressed in a tagging scheme called BIO (Beginning, Inside, Outside). This last tag is the target variable, that is, the entity our CRF model will predict.

The negation expression is tagged as **B**, the scope as **I**, and the rest as **O**.
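To make this concrete, a made-up tweet such as "hoy no tengo fiebre" would be reshaped into something like the following, one token per line (token, POS tag, BIO label):

```
hoy     ADV    O
no      ADV    B
tengo   VERB   I
fiebre  NOUN   I
.       PUNCT  O
```

This per-token sequence is exactly what the CRF consumes, with the BIO column as the target to predict.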

