In this post we will use healthcare chart notes (doctors’ scribbled notes) to model the topics that exist in clinical notes. Keep in mind that these notes are written with no fixed structure.

In a later story, we will summarize these notes.

NLP tasks that will be covered over four articles:

  1. Pre-processing and Cleaning
  2. Text Summarization
  3. Topic Modeling using Latent Dirichlet allocation (LDA)
  4. Clustering

_If you want to try the entire code yourself or follow along, go to my published Jupyter notebook on GitHub:_ https://github.com/gaurikatyagi/Natural-Language-Processing/blob/master/Introdution%20to%20NLP-Clustering%20Text.ipynb

DATA:

Source: https://mimic.physionet.org/about/mimic/

Doctors take notes on their computers, and 80% of what they capture is unstructured, which makes processing the information even more difficult. Let’s not forget that interpreting healthcare jargon is not an easy task either: it requires a lot of context. Let’s see what we have:

Image by Author: Text Data as Input

Things we immediately notice:

  1. This is plain text with no markup. If it did have markup, we could have used a library such as Beautiful Soup.
  2. The lines are artificially wrapped with newlines (wherever you see a single \n).
  3. No typos… woohoo! But there are too many acronyms and capital letters.
  4. There’s punctuation such as commas, apostrophes, quotes, and question marks, plus hyphenated terms like “FOLLOW-UP”.
  5. There is a lot of sequenced data, hence the appearance of ‘1.’, ‘2.’ and so on. But notice that there is a single line break even before these numbers (as in point 2).
  6. The de-identified names are all replaced with the likes of ‘Last Name’, ‘First Name3’, or ‘Hospital Ward Name’. The good thing is that these are all in square brackets and easy to identify. They are also always followed by a parenthesis. Yay! Something we can remove right at the beginning!
  7. Dates might need handling (if they are not all in the same format).
  8. Notice some formatting, such as \n\n*******\n\n or n???\tT. We will need to handle all of this.
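To make a few of these observations concrete, here is a minimal sketch of a first cleaning pass using Python’s `re` module. It is not the code from the notebook. The sample note is invented, the `clean_note` helper is a hypothetical name, and the `[** ... **]` shape of the de-identified placeholders is an assumption about how the bracketed tokens look. It covers only points 2, 5, 6, and 8: unwrapping single newlines, keeping numbered items on their own lines, dropping bracketed placeholders, and removing separator lines.

```python
import re

# Invented sample that imitates the quirks listed above (not real patient data).
sample = (
    "Patient seen by Dr. [**Last Name (STitle) 123**] for\n"
    "FOLLOW-UP of chest pain.\n\n*******\n\n"
    "1. Continue aspirin.\n"
    "2. Return to [**Hospital Ward Name 45**] if symptoms\nworsen."
)

# Assumption: de-identified placeholders are wrapped as [** ... **].
DEID = re.compile(r"\[\*\*.*?\*\*\]")
# A line made up only of separator characters, e.g. "*******".
SEPARATOR = re.compile(r"^[ \t]*[*_=-]{3,}[ \t]*$", re.MULTILINE)
# A single newline that is not part of a blank line and does not
# precede a numbered item such as "2. ".
SINGLE_NEWLINE = re.compile(r"(?<!\n)\n(?!\n|\s*\d+\.\s)")


def clean_note(text):
    """Apply a rough first cleaning pass to one clinical note."""
    text = DEID.sub(" ", text)              # drop de-identified placeholders
    text = SEPARATOR.sub("", text)          # drop separator lines
    text = SINGLE_NEWLINE.sub(" ", text)    # unwrap artificially wrapped lines
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()


cleaned = clean_note(sample)
print(cleaned)
```

Replacing placeholders with a space rather than an empty string keeps neighbouring words from fusing together, and the extra whitespace is collapsed afterwards. Dates, acronyms, and punctuation are deliberately left untouched here, since how to treat them depends on the downstream task.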

