Jack Shaw

How to Build Massive AI Applications for Distributed Big Data

Distributed Deep Learning Library for Apache Spark

BigDL makes it easy for data scientists and data engineers to build end-to-end, distributed AI applications. The BigDL 2.0 release combines the original BigDL and Analytics Zoo projects, providing the following features:

  • DLlib: distributed deep learning library for Apache Spark (i.e., the original BigDL framework with Keras-style API and Spark ML pipeline support)
  • Orca: seamlessly scale out TensorFlow and PyTorch pipelines for distributed Big Data
  • RayOnSpark: run Ray programs directly on Big Data clusters
  • Chronos: scalable time series analysis using AutoML
  • PPML: privacy preserving big data analysis and machine learning (experimental)
  • Cluster Serving: distributed, real-time model serving

For more information, you may read the docs.


You can use BigDL on Google Colab without any installation. BigDL also includes a set of notebooks that you can directly open and run in Colab.

To install BigDL, we recommend using conda environments.

conda create -n my_env 
conda activate my_env
pip install bigdl 

To install the latest nightly build, use pip install --pre --upgrade bigdl; see the Python and Scala user guides for more details.

Getting Started with DLlib

DLlib is a distributed deep learning library for Apache Spark; with DLlib, users can write distributed deep learning applications as standard Spark programs (using either Scala or Python APIs).

First, call initNNContext at the beginning of the code:

import com.intel.analytics.bigdl.dllib.NNContext
val sc = NNContext.initNNContext()

Then, define the BigDL model using Keras-style API:

val input = Input[Float](inputShape = Shape(10))  
val dense = Dense[Float](12).inputs(input)  
val output = Activation[Float]("softmax").inputs(dense)  
val model = Model(input, output)
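
To make the shapes in the model above concrete, here is a minimal pure-Python sketch of the same forward pass: a dense layer mapping 10 inputs to 12 outputs, followed by softmax. It is not BigDL code; the weight initialization and the helper names `dense` and `softmax` are assumptions made for illustration only.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: one output per weight row, y_j = sum_i w_ji * x_i + b_j."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)          # fixed seed so the run is repeatable
n_in, n_out = 10, 12            # same sizes as Shape(10) and Dense(12)
weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
bias = [0.0] * n_out

x = [rng.random() for _ in range(n_in)]
probs = softmax(dense(x, weights, bias))
print(len(probs), round(sum(probs), 6))
```

The output is a vector of 12 positive values that sum to 1, which is what the softmax activation at the end of the Keras-style model produces for each input row.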

After that, use NNEstimator to train, predict with, and evaluate the model using Spark DataFrames and ML pipelines:

val trainingDF = spark.read.parquet("train_data")
val validationDF = spark.read.parquet("val_data")
val scaler = new MinMaxScaler().setInputCol("in").setOutputCol("value")
val estimator = NNEstimator(model, CrossEntropyCriterion())  
        .setBatchSize(size).setOptimMethod(new Adam()).setMaxEpoch(epoch)
val pipeline = new Pipeline().setStages(Array(scaler, estimator))

val pipelineModel = pipeline.fit(trainingDF)  
val predictions = pipelineModel.transform(validationDF)

See the NNframes and Keras API user guides for more details.
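
The scaler stage in the pipeline above is fitted on the training data and then reused unchanged on the validation data. The following pure-Python sketch shows that fit/transform split for min-max scaling to [0, 1]. It is an illustration only, not the Spark implementation; the function names `fit_min_max` and `transform_min_max` are hypothetical.

```python
def fit_min_max(rows):
    """Learn the per-feature minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def transform_min_max(rows, mins, maxs):
    """Rescale each feature to [0, 1] using the statistics learned in fit."""
    scaled = []
    for row in rows:
        out = []
        for x, lo, hi in zip(row, mins, maxs):
            # A constant feature has no range; map it to the midpoint.
            out.append(0.5 if hi == lo else (x - lo) / (hi - lo))
        scaled.append(out)
    return scaled


train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
valid = [[2.5, 40.0]]

mins, maxs = fit_min_max(train)             # fit on training data only
train_scaled = transform_min_max(train, mins, maxs)
valid_scaled = transform_min_max(valid, mins, maxs)
print(train_scaled)
print(valid_scaled)
```

Because the statistics come from the training set alone, a validation value outside the training range (40.0 here) is scaled to a value outside [0, 1]. This is expected behavior and avoids leaking validation data into the fitted stage.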

Getting Started with Orca

Most AI projects start with a Python notebook running on a single laptop; however, scaling that notebook to handle larger data sets in a distributed fashion usually takes considerable effort. The Orca library seamlessly scales out your single-node TensorFlow or PyTorch notebook across large clusters (so as to process distributed Big Data).
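
As background for what scaling out training means, the sketch below shows the general idea of synchronous data-parallel training in plain Python: each worker computes a gradient on its own shard, the gradients are averaged, and every worker applies the same update. This is a conceptual sketch under simplifying assumptions (a one-parameter linear model, equally sized shards, workers simulated by a loop) and does not describe how Orca is implemented.

```python
def gradient(w, shard):
    """Mean-squared-error gradient of the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, lr):
    """One synchronous step: every worker computes a gradient on its own
    shard, the gradients are averaged, and all workers apply the same update."""
    grads = [gradient(w, shard) for shard in shards]   # would run in parallel
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad


# Toy data generated by y = 3 * x, split into two equally sized shards.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[:4], data[4:]]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)
print(round(w, 4))
```

With equally sized shards, the averaged gradient equals the gradient over the full data set, so the distributed run follows the same trajectory as single-node training and recovers the slope of 3.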

First, initialize Orca Context:

from bigdl.orca import init_orca_context, OrcaContext

# cluster_mode can be "local", "k8s" or "yarn"
sc = init_orca_context(cluster_mode="yarn", cores=4, memory="10g", num_nodes=2) 

Next, perform data-parallel processing in Orca (supporting standard Spark DataFrames, TensorFlow Dataset, PyTorch DataLoader, Pandas, Pillow, etc.):

from pyspark.sql.functions import array

spark = OrcaContext.get_spark_session()
df = spark.read.parquet(file_path)
# Wrap each scalar column in a one-element array to match Input(shape=[1])
df = df.withColumn('user', array('user')) \
       .withColumn('item', array('item'))

Finally, use sklearn-style Estimator APIs in Orca to perform distributed TensorFlow, PyTorch or Keras training and inference:

from tensorflow import keras
from bigdl.orca.learn.tf.estimator import Estimator

user = keras.layers.Input(shape=[1])  
item = keras.layers.Input(shape=[1])  
feat = keras.layers.concatenate([user, item], axis=1)  
predictions = keras.layers.Dense(2, activation='softmax')(feat)  
model = keras.models.Model(inputs=[user, item], outputs=predictions)  

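# Compile before training; the optimizer, loss, and metric are illustrative assumptions.
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

est = Estimator.from_keras(keras_model=model)
# batch_size, epochs, and the 'label' column name are illustrative assumptions.
est.fit(data=df,
        batch_size=64,
        epochs=4,
        feature_cols=['user', 'item'],
        label_cols=['label'])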

See TensorFlow and PyTorch quickstart, as well as the document website, for more details.

Getting Started with RayOnSpark

Ray is an open source distributed framework for emerging AI applications. RayOnSpark allows users to directly run Ray programs on existing Big Data clusters, and directly write Ray code inline with their Spark code (so as to process the in-memory Spark RDDs or DataFrames).

from bigdl.orca import init_orca_context

# cluster_mode can be "local", "k8s" or "yarn"
sc = init_orca_context(cluster_mode="yarn", cores=4, memory="10g", num_nodes=2, init_ray_on_spark=True) 

import ray

# The decorator turns the class into a Ray actor, so Counter.remote() works.
@ray.remote
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

counters = [Counter.remote() for i in range(5)]
print(ray.get([c.increment.remote() for c in counters]))

See the RayOnSpark user guide and quickstart for more details.
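
For readers new to the actor pattern used in the Counter example, the following standard-library sketch mimics its two key properties: each actor keeps its own private state, and a method call returns immediately with a handle to a result that is resolved later. It is an analogy only, not Ray; the `ActorHandle` class and its `call` method are hypothetical names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor


class Counter:
    """Plain Python class holding per-instance state."""

    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n


class ActorHandle:
    """Hypothetical stand-in for an actor handle: owns one object and runs
    its method calls one at a time on a dedicated worker thread."""

    def __init__(self, cls):
        self._obj = cls()
        self._executor = ThreadPoolExecutor(max_workers=1)

    def call(self, method_name, *args):
        # Returns a Future immediately, like a remote call returning a reference.
        return self._executor.submit(getattr(self._obj, method_name), *args)

    def shutdown(self):
        self._executor.shutdown()


handles = [ActorHandle(Counter) for _ in range(5)]
futures = [h.call("increment") for h in handles]
results = [f.result() for f in futures]      # block until every call finishes
print(results)

second = [h.call("increment").result() for h in handles]
for h in handles:
    h.shutdown()
```

The first round prints `[1, 1, 1, 1, 1]` because the five counters do not share state; a second round of calls returns 2 from each, showing that every actor remembers its own count between calls.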

Getting Started with Chronos

Time series prediction takes observations from previous time steps as input and predicts the values at future time steps. The Chronos library makes it easy to build end-to-end time series analysis by applying AutoML to extremely large-scale time series prediction.
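
The framing described above (past observations in, future values out) can be sketched as a sliding window over the series. This is a minimal illustration in plain Python, not the Chronos API; the function name `make_windows` and the parameter names `lookback` and `horizon` are assumptions chosen for this example.

```python
def make_windows(series, lookback, horizon):
    """Turn a 1-D series into (past window, future window) training pairs."""
    samples = []
    for start in range(len(series) - lookback - horizon + 1):
        past = series[start:start + lookback]
        future = series[start + lookback:start + lookback + horizon]
        samples.append((past, future))
    return samples


series = [10, 11, 13, 12, 15, 16, 18]
pairs = make_windows(series, lookback=3, horizon=2)
for past, future in pairs:
    print(past, "->", future)
```

With seven observations, a lookback of 3, and a horizon of 2, the series yields three training pairs; the first maps `[10, 11, 13]` to `[12, 15]`.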

To train a time series model with AutoML, first initialize Orca Context:

from bigdl.orca import init_orca_context

# cluster_mode can be "local", "k8s" or "yarn"
init_orca_context(cluster_mode="yarn", cores=4, memory="10g", num_nodes=2, init_ray_on_spark=True)

Then, create TSDataset for your data.

from bigdl.chronos.data import TSDataset

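# dt_col and target_col name the timestamp and value columns of df; the
# column names and split ratios below are placeholder assumptions.
tsdata_train, tsdata_valid, tsdata_test = TSDataset.from_pandas(
    df,
    dt_col="dt_col",
    target_col="target_col",
    with_split=True,
    val_ratio=0.1,
    test_ratio=0.1)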

Next, create an AutoTSEstimator.

from bigdl.chronos.autots import AutoTSEstimator

autotsest = AutoTSEstimator(model='lstm')

Finally, call fit on AutoTSEstimator, which applies AutoML to find the best model and hyper-parameters; it returns a TSPipeline which can be used for prediction or evaluation.

# Train a pipeline with AutoML support
ts_pipeline = autotsest.fit(data=tsdata_train,
                            validation_data=tsdata_valid)


See the Chronos user guide and example for more details.
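
The idea behind the fit step, searching over candidate configurations and keeping the one that scores best on held-out validation data, can be illustrated without any library. The sketch below assumes a deliberately simple forecaster (the mean of the last `lookback` values) and an exhaustive search over a small grid; Chronos uses real models and its own search machinery, so treat every name here as hypothetical.

```python
def chronological_split(series, val_ratio, test_ratio):
    """Split in time order so validation and test data come after training."""
    n = len(series)
    n_test = int(n * test_ratio)
    n_val = int(n * val_ratio)
    n_train = n - n_val - n_test
    return series[:n_train], series[n_train:n_train + n_val], series[n_train + n_val:]


def validation_mse(history, targets, lookback):
    """Forecast each target as the mean of the previous `lookback` values."""
    values = list(history)
    total = 0.0
    for y in targets:
        pred = sum(values[-lookback:]) / lookback
        total += (pred - y) ** 2
        values.append(y)            # the true value becomes history
    return total / len(targets)


def grid_search(history, targets, candidates):
    """Try every candidate and keep the one with the lowest validation error."""
    best = None
    for lookback in candidates:
        score = validation_mse(history, targets, lookback)
        if best is None or score < best[1]:
            best = (lookback, score)
    return best


series = [float(t % 7) for t in range(70)]      # toy series with period 7
train, valid, test = chronological_split(series, val_ratio=0.1, test_ratio=0.1)
best_lookback, best_score = grid_search(train, valid, candidates=[1, 2, 3, 7])
print(best_lookback, round(best_score, 4))
```

On this toy series the search selects a lookback of 7, matching the period of the data. The test split is left untouched during the search so that it can still give an unbiased estimate of the chosen configuration.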

PPML (Privacy Preserving Machine Learning)

BigDL PPML provides a Trusted Cluster Environment for protecting the end-to-end Big Data AI pipeline. It combines various low-level hardware and software security technologies (e.g., Intel SGX, LibOS such as Graphene and Occlum, Federated Learning, etc.), and allows users to run unmodified Big Data analysis and ML/DL programs (such as Apache Spark, Apache Flink, TensorFlow, PyTorch, etc.) securely on private or public clouds.

See the PPML user guide for more details.

Citing BigDL

If you've found BigDL useful for your project, you may cite the paper as follows:

  title={BigDL: A Distributed Deep Learning Framework for Big Data},
  author={Dai, Jason (Jinquan) and Wang, Yiheng and Qiu, Xin and Ding, Ding and Zhang, Yao and Wang, Yanzhang and Jia, Xianyan and Zhang, Li (Cherry) and Wan, Yan and Li, Zhichao and Wang, Jiao and Huang, Shengsheng and Wu, Zhongyuan and Wang, Yang and Yang, Yuhao and She, Bowen and Shi, Dongjie and Lu, Qi and Huang, Kai and Song, Guoqiong},
  booktitle={Proceedings of the ACM Symposium on Cloud Computing},
  publisher={Association for Computing Machinery},

Author: intel-analytics
Source Code: https://github.com/intel-analytics/BigDL
License: Apache-2.0 License
