Spark Monitor: An Extension for Jupyter Lab

This project was originally written by krishnan-r as a Google Summer of Code project for Jupyter Notebook. Check his website out here.

As a part of my internship as a Software Engineer at Yelp, I created this fork to update the extension to be compatible with JupyterLab - Yelp's choice for sharing and collaborating on notebooks.

About


SparkMonitor is an extension for JupyterLab that enables live monitoring of Apache Spark jobs spawned from a notebook. The extension provides several features to monitor and debug a Spark job from within the notebook interface itself.

[Screenshot: live job display below a notebook cell]

Requirements

  • JupyterLab 3 or newer
  • pyspark 3.X.X or newer (for compatibility with older pyspark versions, use jupyterlab-sparkmonitor 3.X)

Features

  • Automatically displays a live monitoring tool below cells that run Spark jobs in a Jupyter notebook
  • A table of jobs and stages with progress bars
  • A timeline which shows jobs, stages, and tasks
  • A graph showing the number of active tasks and executor cores over time
  • A notebook server extension that proxies the Spark UI and displays it in an iframe popup for more details
  • For a detailed list of features see the use case notebooks
  • Support for multiple SparkSessions (default port is 4040)
  • How it Works

Quick Start

To do a quick test of the extension:

This docker image has pyspark and several other related packages installed alongside the sparkmonitor extension.

docker run -it -p 8888:8888 itsjafer/sparkmonitor

Setting up the extension

pip install jupyterlab-sparkmonitor # install the extension

# set up ipython profile and add our kernel extension to it
ipython profile create --ipython-dir=.ipython
echo "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')" >>  .ipython/profile_default/ipython_config.py

# run jupyter lab
IPYTHONDIR=.ipython jupyter lab --watch

With the extension installed, a SparkConf object called conf will be available in your notebooks. You can use it as follows:

from pyspark import SparkContext

# start the Spark context using the SparkConf object the extension inserted
sc = SparkContext.getOrCreate(conf=conf)

# Monitor should spawn under the cell with 4 jobs
sc.parallelize(range(0,100)).count()
sc.parallelize(range(0,100)).count()
sc.parallelize(range(0,100)).count()
sc.parallelize(range(0,100)).count()
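
The injected conf already carries the extension's listener settings. A quick way to confirm this is to print the configuration with the standard pyspark SparkConf API (an illustrative check, not part of the original setup):

# dump the configuration the kernel extension injected;
# spark.extraListeners and spark.driver.extraClassPath should appear here
print(conf.toDebugString())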

If you already have your own Spark configuration, you will need to set spark.extraListeners to sparkmonitor.listener.JupyterSparkMonitorListener and spark.driver.extraClassPath to the listener JAR inside the sparkmonitor Python package (path/to/package/sparkmonitor/listener.jar):

from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .config('spark.extraListeners', 'sparkmonitor.listener.JupyterSparkMonitorListener')\
        .config('spark.driver.extraClassPath', 'venv/lib/python3.7/site-packages/sparkmonitor/listener.jar')\
        .getOrCreate()

# should spawn 4 jobs in a monitor below the cell
spark.sparkContext.parallelize(range(0,100)).count()
spark.sparkContext.parallelize(range(0,100)).count()
spark.sparkContext.parallelize(range(0,100)).count()
spark.sparkContext.parallelize(range(0,100)).count()
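
Rather than hard-coding a site-packages path, you can derive the JAR location from the installed package at runtime. This is a sketch under the assumption (suggested by the path above) that listener.jar sits at the top level of the installed sparkmonitor package:

import os
import sparkmonitor
from pyspark.sql import SparkSession

# locate listener.jar inside the installed sparkmonitor package
jar_path = os.path.join(os.path.dirname(sparkmonitor.__file__), "listener.jar")

spark = SparkSession.builder\
        .config('spark.extraListeners', 'sparkmonitor.listener.JupyterSparkMonitorListener')\
        .config('spark.driver.extraClassPath', jar_path)\
        .getOrCreate()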

Changelog

  • 1.0 - Initial Release
  • 2.0 - Migration to JupyterLab 2, support for multiple Spark sessions, and more accurate placement of monitors beneath the correct cell
  • 3.0 - Migrate to JupyterLab 3 as a prebuilt extension
  • 4.0 - pyspark 3.X compatibility; no longer compatible with pyspark 2.X or older

Development

If you'd like to develop the extension:

make all # Clean the directory, build the extension, and run it locally

Author: itsjafer
Source Code: https://github.com/itsjafer/jupyterlab-sparkmonitor
License: Apache-2.0 license

#jupyterlab #jupyter 

Carmen Grimes

How to Monitor Third Party API Integrations

Many enterprises and SaaS companies depend on a variety of external API integrations in order to build an awesome customer experience. Some integrations may outsource certain business functionality, such as handling payments or search, to companies like Stripe and Algolia. You may have integrated other partners which expand the functionality of your product offering. For example, if you want to add real-time alerts to an analytics tool, you might want to integrate the PagerDuty and Slack APIs into your application.

If you're like most companies, though, you'll soon realize you're integrating hundreds of different vendors and partners into your app. Any one of them could have performance or functional issues that impact your customer experience. Worse yet, the reliability of an integration may be less visible than that of your own APIs and backend. If the login functionality is broken, you'll have many customers complaining they cannot log into your website. However, if your Slack integration is broken, only the customers who added Slack to their account will be impacted. On top of that, since the integration is asynchronous, your customers may not realize the integration is broken until days later, when they notice they haven't received any alerts for some time.

How do you ensure your API integrations are reliable and high performing? After all, if you're selling a feature like real-time alerting, your alerts had better be real-time and have at-least-once guaranteed delivery. Dropping alerts because your Slack or PagerDuty integration failed is unacceptable from a customer experience perspective.

What to monitor

Latency

Exceedingly high latency in a specific API integration can be a signal that the integration is about to fail. Maybe your pagination scheme is incorrect, or the vendor has not indexed your data in a way that lets you query it efficiently.

Latency best practices

Average latency only tells you half the story. An API that consistently takes one second to complete is usually better than an API with high variance. For example, if an API takes only 30 milliseconds on average, but 1 out of 10 API calls takes up to five seconds, then you have high variance in your customer experience. This makes it much harder to track down bugs and harder to handle gracefully in your customer experience. This is why the 90th and 95th percentiles are important to look at.
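
To make the percentile point concrete, here is a small, tool-agnostic illustration (the latency values are made up) of computing the 90th and 95th percentiles from recorded response times with the Python standard library:

import statistics

# recorded response times in milliseconds (made-up sample data)
latencies_ms = [30, 28, 35, 31, 29, 27, 33, 30, 32, 5000]

cut_points = statistics.quantiles(latencies_ms, n=100)   # 99 percentile cut points
p90, p95 = cut_points[89], cut_points[94]
print(f"avg={statistics.mean(latencies_ms):.0f}ms  p90={p90:.0f}ms  p95={p95:.0f}ms")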

Reliability

Reliability is a key metric to monitor, especially since you're integrating APIs that you don't have control over. What percent of API calls are failing? In order to track reliability, you should have a rigid definition of what constitutes a failure.

Reliability best practices

While any API call that has a response status code in the 4xx or 5xx family may be considered an error, you might have specific business cases where the API appears to complete successfully yet the call should still be considered a failure. For example, a data API integration that consistently returns no matches or no content could be considered failing even though the status code is always 200 OK. Another API could be returning bogus or incomplete data. Data validation is critical for measuring whether the data returned is correct and up to date.
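
A sketch of such a business-level failure definition (the response shape and the "results" field are hypothetical; adapt them to the API you are integrating):

def is_failure(status_code: int, body: dict) -> bool:
    """Classify an API call as failed on HTTP errors or on 'successful'
    responses that carry no usable data (the 'results' field is hypothetical)."""
    if status_code >= 400:                       # 4xx / 5xx family
        return True
    if status_code == 200 and not body.get("results"):
        return True                              # 200 OK but empty payload
    return False

calls = [(200, {"results": [1, 2]}), (200, {"results": []}), (502, {})]
failures = sum(is_failure(code, body) for code, body in calls)
print(f"error rate: {failures / len(calls):.0%}")   # 2 of the 3 sample calls count as failures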

Not every API provider and integration partner follows the suggested status code mappings.

Availability

While reliability is specific to errors and functional correctness, availability (or uptime) is a pure infrastructure metric that measures how often a service has an outage, even if temporary. Availability is usually measured as a percentage of uptime per year, or as a number of 9's.

Availability %               Downtime per year      Downtime per month    Downtime per week      Downtime per day
90% ("one nine")             36.53 days             73.05 hours           16.80 hours            2.40 hours
99% ("two nines")            3.65 days              7.31 hours            1.68 hours             14.40 minutes
99.9% ("three nines")        8.77 hours             43.83 minutes         10.08 minutes          1.44 minutes
99.99% ("four nines")        52.60 minutes          4.38 minutes          1.01 minutes           8.64 seconds
99.999% ("five nines")       5.26 minutes           26.30 seconds         6.05 seconds           864.00 milliseconds
99.9999% ("six nines")       31.56 seconds          2.63 seconds          604.80 milliseconds    86.40 milliseconds
99.99999% ("seven nines")    3.16 seconds           262.98 milliseconds   60.48 milliseconds     8.64 milliseconds
99.999999% ("eight nines")   315.58 milliseconds    26.30 milliseconds    6.05 milliseconds      864.00 microseconds
99.9999999% ("nine nines")   31.56 milliseconds     2.63 milliseconds     604.80 microseconds    86.40 microseconds
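
The downtime figures follow directly from the availability percentage. For example, using a 365.25-day year, which is the convention behind the numbers above:

def downtime_per_year_hours(availability_pct: float) -> float:
    """Allowed downtime in hours per year for a given availability percentage
    (365.25-day year, the convention used in the table above)."""
    return (1 - availability_pct / 100) * 365.25 * 24

print(downtime_per_year_hours(99.9))   # roughly 8.77 hours ("three nines")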

Usage

Many API providers price their plans based on API usage. Even if the API is free, they most likely have some sort of rate limiting implemented to ensure bad actors are not starving out good clients. This means tracking your API usage with each integration partner is critical so you know when your current usage is approaching the plan limits or their rate limits.

Usage best practices

It's recommended to tie usage back to your end users even if the API integration is quite downstream from your customer experience. This enables measuring the direct ROI of specific integrations and finding trends. For example, let's say your product is a CRM, and you are paying Clearbit $199 a month to enrich up to 2,500 companies. That is a direct cost you incur, and it is tied to your customers' usage. If free-tier customers are consuming most of your Clearbit quota, you may want to reconsider your pricing strategy; potentially, Clearbit enrichment should be available on the paid tiers only to reduce your own cost.
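
A minimal sketch of tying usage back to individual customers and warning as you approach a plan limit (the quota value and in-memory counter are illustrative):

from collections import Counter

MONTHLY_QUOTA = 2500            # e.g. the Clearbit enrichment plan mentioned above
usage_by_customer = Counter()   # in-memory for illustration; a real system would persist this

def record_enrichment(customer_id: str) -> None:
    """Count one enrichment call against the customer who triggered it and
    warn when total usage approaches the plan limit."""
    usage_by_customer[customer_id] += 1
    total = sum(usage_by_customer.values())
    if total >= 0.8 * MONTHLY_QUOTA:
        print(f"warning: {total}/{MONTHLY_QUOTA} enrichments used this billing period")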

How to monitor API integrations

Monitoring API integrations seems like the correct remedy to stay on top of these issues. However, traditional Application Performance Monitoring (APM) tools like New Relic and AppDynamics focus more on monitoring the health of your own websites and infrastructure. This includes infrastructure metrics like memory usage and requests per minute, along with application-level health such as Apdex scores and latency. Of course, if you're consuming an API that's running in someone else's infrastructure, you can't just ask your third-party providers to install an APM agent that you have access to. This means you need a way to monitor the third-party APIs indirectly or via some other instrumentation methodology.
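
One lightweight approach (an illustrative sketch, not tied to any particular monitoring product) is to wrap every outbound third-party call and emit latency and status yourself:

import time
import requests   # any HTTP client works; requests is used here for illustration

def call_and_record(method: str, url: str, **kwargs):
    """Wrap an outbound third-party API call and record its latency and status."""
    start = time.perf_counter()
    response = requests.request(method, url, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # ship these fields to whatever metrics or logging pipeline you already use
    print({"vendor": url.split("/")[2], "status": response.status_code,
           "latency_ms": round(elapsed_ms, 1)})
    return response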

#monitoring #api integration #api monitoring #monitoring and alerting #monitoring strategies #monitoring tools #api integrations #monitoring microservices

Top Spark Development Companies | Best Spark Developers - TopDevelopers.co

An extensively researched list of top Apache Spark developers with ratings and reviews to help you find the best Spark development companies around the world.

Our thorough research into the qualities of the best Big Data Spark consulting and development service providers brings you this list of companies. In scenarios where businesses need prediction and analysis with prompt, fast data processing, Spark applications can be highly effective for various industry-specific management needs. The companies listed here have been skillfully boosting businesses through effective Spark consulting and customized Big Data solutions.

Check out this list of the best Spark development companies and the best Spark developers.

#spark development service providers #top spark development companies #best big data spark development #spark consulting #spark developers #spark application

Chloe Butler

Spark Monitor: An Extension for Jupyter Notebook

Note: This project is now maintained at https://github.com/swan-cern/jupyter-extensions

Google Summer of Code - Final Report

For the Google Summer of Code final report of this project, click here.

About


SparkMonitor is an extension for Jupyter Notebook that enables live monitoring of Apache Spark jobs spawned from a notebook. The extension provides several features to monitor and debug a Spark job from within the notebook interface itself.

[Screenshot: live job display below a notebook cell]

Features

  • Automatically displays a live monitoring tool below cells that run Spark jobs in a Jupyter notebook
  • A table of jobs and stages with progress bars
  • A timeline which shows jobs, stages, and tasks
  • A graph showing the number of active tasks and executor cores over time
  • A notebook server extension that proxies the Spark UI and displays it in an iframe popup for more details
  • For a detailed list of features see the use case notebooks
  • How it Works

Quick Installation

pip install sparkmonitor
jupyter nbextension install sparkmonitor --py --user --symlink 
jupyter nbextension enable sparkmonitor --py --user            
jupyter serverextension enable --py --user sparkmonitor
ipython profile create && echo "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')" >>  $(ipython profile locate default)/ipython_kernel_config.py

For more detailed instructions click here

To do a quick test of the extension:

docker run -it -p 8888:8888 krishnanr/sparkmonitor

Integration with ROOT and SWAN

At CERN, the SparkMonitor extension would find two main use cases:

  • Distributed analysis with ROOT and Apache Spark using the DistROOT module. Here is an example demonstrating this use case.
  • Integration with SWAN, a service for web-based analysis, via a modified container image for SWAN user sessions.

Author: krishnan-r
Source Code: https://github.com/krishnan-r/sparkmonitor
License: Apache-2.0 License

#jupyter #python #spark 

Dynamic Partition Pruning in Spark 3.0

With the release of Spark 3.0, big improvements were implemented to enable Spark to execute faster, and many new features came along with it. Among them is dynamic partition pruning. Before diving into what is new in dynamic partition pruning, let us understand what partition pruning is.

Partition Pruning in Spark

In a standard database, pruning means that the optimizer will avoid reading files that cannot contain the data you are looking for.
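
As an illustration (a sketch not taken from the original post, with a hypothetical dataset path and partition column), filtering on a partition column of a partitioned Parquet dataset lets Spark skip entire partition directories, which shows up as PartitionFilters in the scan node of the physical plan:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pruning-demo").getOrCreate()

# hypothetical dataset written with partitionBy("country"):
#   /data/sales/country=US/..., /data/sales/country=DE/..., ...
sales = spark.read.parquet("/data/sales")

# filtering on the partition column lets Spark skip whole partition directories;
# the scan node of the physical plan lists this under PartitionFilters
sales.filter(sales.country == "US").explain()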

#apache spark #scala #partitioning in apache spark #spark #spark sql