Welcome to the second part of the Spark NLP article. The first part aimed to serve as an ice breaker for NLP practitioners and to warm them up to Spark NLP. The strongest bias against any Spark-based library comes from the school of thought that holds that “Spark code is a bit different from your regular Python script.” To counter this prejudice, learning strategies were shared, and if you have followed them, you are ready for the next level.

In this part of the article, we will compare spaCy to Spark NLP and dive deeper into Spark NLP modules and pipelines. I created a notebook that performs the same tasks in both spaCy and Spark NLP for a like-for-like comparison. spaCy performs well, but in terms of speed, memory consumption, and accuracy, Spark NLP outperforms it. As for ease of use, once the initial friction of Spark is overcome, I found Spark NLP to be at least on par with spaCy, thanks to the convenience of its pipelines.

To excel in the required skill set, I highly recommend working through the smart, comprehensive notebooks prepared by the Spark NLP creators, which provide numerous examples for real-life scenarios, together with the repo I created for practice.


**Day 8/9: Understanding Annotators/Transformers in Spark NLP and Text Preprocessing with Spark**

Built natively on Apache Spark and TensorFlow, the Spark NLP library provides simple, performant, and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. The library reuses the Spark ML pipeline API and integrates NLP functionality into it.

The library covers many common NLP tasks, including tokenization, stemming, lemmatization, part-of-speech tagging, sentiment analysis, spell checking, and named entity recognition, all of which are open source and can be used to train models with your own data. Spark NLP’s annotators utilize rule-based algorithms, machine learning, and TensorFlow running under the hood to power specific deep learning implementations.

In Spark NLP, all annotators are either Estimators or Transformers, just as in Spark ML, and come in two types: **AnnotatorApproach** and **AnnotatorModel**. Any annotator that trains on a DataFrame to produce a model is an AnnotatorApproach. Those that transform one DataFrame into another through some model are AnnotatorModels (e.g. WordEmbeddingsModel). By convention, an annotator does not take the _Model_ suffix if it doesn’t rely on a pre-trained annotator while transforming a DataFrame (e.g. Tokenizer). Here is a list of Annotators and their descriptions:


How to Wrap Your Head Around Spark NLP