Text data is one of the most common forms of data. An abundance of articles, tweets, documents, books, and more surrounds us every day. The amount of insight you can extract is immense, and the tools for extracting it keep improving.

In Python, you can use external libraries to process text data. One issue with these libraries is that they are not lightweight and have a steep learning curve for beginners. In many cases, functions written in pure Python work well if your sole intention is to explore and get to know the data.

Below is a list of 10 such functions that I use most often. They are mainly **heuristic functions** that work properly only under certain assumptions. Each assumption is stated next to the function it applies to.

#1 Get Sentences

Assumption: Everything that starts with a capital letter and ends with a period, question mark, or exclamation mark is taken as a sentence.

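def get_sentences(text):
    import re
    # A sentence starts with a capital letter and runs up to the first '.', '!' or '?'
    pattern = r'([A-Z][^.!?]*[.!?])'
    pattern_compiled = re.compile(pattern)
    # Use the compiled pattern (the original compiled it but never used it)
    list_of_sentences = pattern_compiled.findall(text)
    return list_of_sentences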


# Test
text = """This is the most frequent question we're asked by prospective students. And our response? Absolutely! We've trained people from all walks of life."""
get_sentences(text)


# [
#     "This is the most frequent question we're asked by prospective students.",
#     'And our response?',
#     'Absolutely!',
#     "We've trained people from all walks of life."
# ]
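
The assumption is worth testing against your own data before relying on it. As a rough sketch (the sample sentence below is made up for illustration), abbreviations and decimal numbers contain periods, so the heuristic cuts sentences early and drops any fragment that does not begin with a capital letter:

```python
import re


def get_sentences(text):
    # Same heuristic as above: a capital letter, then everything
    # up to the first '.', '!' or '?'
    return re.findall(r'([A-Z][^.!?]*[.!?])', text)


# Made-up example: "Dr." and "3.30" both contain a period
tricky = "Dr. Smith arrived at 3.30 pm. He was late."
sentences = get_sentences(tricky)
print(sentences)
# ['Dr.', 'Smith arrived at 3.', 'He was late.']
```

Note that the fragment "30 pm." disappears entirely, because it does not start with a capital letter. For exploratory work this is often acceptable, but it is a good reason to inspect a few results by eye.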

#2 Get List of Items per Sentence

Assumption: A colon followed by a comma-separated list of items is taken as a list.

def get_listed_items_with_colon(text):
    import re
    list_of_items = []
    # Split into rough sentences on '.', '?' or '!' (raw string avoids invalid-escape warnings)
    list_of_sentences = re.split(r'[.?!]', text)
    for sentence in list_of_sentences:
        if ':' in sentence:
            # Everything after the first colon is treated as the list
            start_index = sentence.find(':')
            sub_sentence = sentence[start_index+1:]
            list_of_items.append([word.strip() for word in sub_sentence.split(',')])
    return list_of_items


# Test
text = """The house has everything I need: two bedrooms, a backyard, and a garage. I have several favorite genres of movies: drama, science fiction, and mystery."""
get_listed_items_with_colon(text)

# [ ['two bedrooms', 'a backyard', 'and a garage'], ['drama', 'science fiction', 'and mystery'] ]
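
The last item of each list keeps its leading "and". If that gets in the way, one possible cleanup is sketched below. The helper `strip_leading_conjunction` is a hypothetical addition, not one of the original functions, and it assumes the only conjunctions to remove are "and" and "or":

```python
def strip_leading_conjunction(items):
    # Hypothetical helper: drops a leading "and " or "or " from each item.
    # The trailing space in each prefix keeps words like "android" intact.
    cleaned = []
    for item in items:
        for conjunction in ('and ', 'or '):
            if item.lower().startswith(conjunction):
                item = item[len(conjunction):]
                break
        cleaned.append(item)
    return cleaned


items = ['two bedrooms', 'a backyard', 'and a garage']
print(strip_leading_conjunction(items))
# ['two bedrooms', 'a backyard', 'a garage']
```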


10 Pure Python Functions for Ad Hoc Text Analysis