Redash: Make Your Company Data Driven

Redash is designed to enable anyone, regardless of the level of technical sophistication, to harness the power of data big and small. SQL users leverage Redash to explore, query, visualize, and share data from any data sources. Their work in turn enables anybody in their organization to use the data. Every day, millions of users at thousands of organizations around the world use Redash to develop insights and make data-driven decisions.

Redash features:

  1. Browser-based: Everything in your browser, with a shareable URL.
  2. Ease-of-use: Become immediately productive with data without the need to master complex software.
  3. Query editor: Quickly compose SQL and NoSQL queries with a schema browser and auto-complete.
  4. Visualization and dashboards: Create beautiful visualizations with drag and drop, and combine them into a single dashboard.
  5. Sharing: Collaborate easily by sharing visualizations and their associated queries, enabling peer review of reports and queries.
  6. Schedule refreshes: Automatically update your charts and dashboards at regular intervals you define.
  7. Alerts: Define conditions and be alerted instantly when your data changes.
  8. REST API: Everything that can be done in the UI is also available through the REST API.
  9. Broad support for data sources: Extensible data source API with native support for a long list of common databases and platforms.

Getting Started

Supported Data Sources

Redash supports more than 35 SQL and NoSQL data sources. It can also be extended to support more. Below is a list of built-in sources:

  • Amazon Athena
  • Amazon CloudWatch / Insights
  • Amazon DynamoDB
  • Amazon Redshift
  • ArangoDB
  • Axibase Time Series Database
  • Apache Cassandra
  • ClickHouse
  • CockroachDB
  • Couchbase
  • CSV
  • Databricks
  • DB2 by IBM
  • Dgraph
  • Apache Drill
  • Apache Druid
  • Eccenca Corporate Memory
  • Elasticsearch
  • Exasol
  • Microsoft Excel
  • Firebolt
  • Databend
  • Google Analytics
  • Google BigQuery
  • Google Spreadsheets
  • Graphite
  • Greenplum
  • Apache Hive
  • Apache Impala
  • InfluxDB
  • IBM Netezza Performance Server
  • JIRA (JQL)
  • JSON
  • Apache Kylin
  • OmniSciDB (Formerly MapD)
  • MariaDB
  • MemSQL
  • Microsoft Azure Data Warehouse / Synapse
  • Microsoft Azure SQL Database
  • Microsoft Azure Data Explorer / Kusto
  • Microsoft SQL Server
  • MongoDB
  • MySQL
  • Oracle
  • Apache Phoenix
  • Apache Pinot
  • PostgreSQL
  • Presto
  • Prometheus
  • Python
  • Qubole
  • Rockset
  • Salesforce
  • ScyllaDB
  • Shell Scripts
  • Snowflake
  • SQLite
  • TiDB
  • TreasureData
  • Trino
  • Uptycs
  • Vertica
  • Yandex AppMetrica
  • Yandex Metrica

Getting Help

Reporting Bugs and Contributing Code

  • Want to report a bug or request a feature? Please open an issue.
  • Want to help us build Redash? Fork the project, edit in a dev environment and make a pull request. We need all the help we can get!


Please email to report any security vulnerabilities. We will acknowledge receipt of your vulnerability and strive to send you regular updates about our progress. If you're curious about the status of your disclosure please feel free to email us again. If you want to encrypt your disclosure email, you can use this PGP key.

Download Details:

Author: Getredash
Source Code: 
License: BSD-2-Clause license

#python #javascript #visualization #mysql #bigquery #spark #dashboard 


Difference Between Data Lake and Delta Lake


A data lake is a repository that cheaply stores vast amounts of raw data in its native format.

It consists of current and historical data dumps in various formats, including XML, JSON, CSV, Parquet, etc.

Drawbacks of a Data Lake

  • No atomicity: writes are not all-or-nothing, so the lake may end up storing corrupt data.
  • No quality enforcement: inconsistent and unusable data gets created.
  • No consistency/isolation: it is impossible to read and append while an update is in progress.


Delta Lake allows us to incrementally improve data quality until the data is ready for consumption. Data flows through Delta Lake like water, from one stage to the next (Bronze -> Silver -> Gold).

  • Delta Lake brings full ACID transactions to Apache Spark. That means jobs either complete fully or not at all.
  • Delta Lake is open source under the Apache 2.0 license, so you can store large amounts of data without worrying about vendor lock-in.
  • Delta Lake is deeply powered by Apache Spark, meaning existing Spark jobs (batch/stream) can be converted without rewriting them from scratch.

Delta Lake Architecture

Data Lake vs Delta Lake

Delta Lake Architecture

Bronze Tables

Data may come from various sources and may be dirty. The bronze table is therefore a dumping ground for raw data.

Silver Tables

Consist of intermediate data with some cleanup applied.

They are queryable for easy debugging.

Gold Tables

Consist of clean data that is ready for consumption.

Original article source at:

#datalake #delta #spark #databricks #bigdata 


田辺 桃子


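Difference Between Data Lake and Delta Lake


A data lake is a repository that cheaply stores vast amounts of raw data in its native format.

It consists of current and historical data dumps in various formats, including XML, JSON, CSV, Parquet, etc.


  • No atomicity: writes are not all-or-nothing, so the lake may end up storing corrupt data.
  • No quality enforcement: inconsistent and unusable data gets created.
  • No consistency/isolation: it is impossible to read and append while an update is in progress.


Delta Lake allows us to incrementally improve data quality until the data is ready for consumption. Data flows through Delta Lake like water, from one stage to the next (Bronze -> Silver -> Gold).

  • Delta Lake brings full ACID transactions to Apache Spark. That means jobs either complete fully or not at all.
  • Delta Lake is open source under the Apache 2.0 license, so you can store large amounts of data without worrying about vendor lock-in.
  • Delta Lake is deeply powered by Apache Spark, meaning existing Spark jobs (batch/stream) can be converted without rewriting them from scratch.

Original article source at: https://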

#datalake #delta #spark #databricks #bigdata 


Difference between: Data Lake Vs Delta Lake


A data lake is a storage repository that cheaply stores vast amounts of raw data in its native format.

It consists of current and historical data dumps in various formats, including XML, JSON, CSV, Parquet, etc.

Drawbacks in Data Lake

  • No atomicity: writes are not all-or-nothing, so the lake may end up storing corrupt data.
  • No quality enforcement: inconsistent and unusable data gets created.
  • No consistency/isolation: it is impossible to read and append while an update is in progress.


Delta Lake allows us to incrementally improve data quality until the data is ready for consumption. Data flows through Delta Lake like water, from one stage to the next (Bronze -> Silver -> Gold).

  • Delta Lake brings full ACID transactions to Apache Spark. That means jobs either complete fully or not at all.
  • Delta Lake is open source under the Apache 2.0 license, so you can store large amounts of data without worrying about vendor lock-in.
  • Delta Lake is deeply powered by Apache Spark, meaning existing Spark jobs (batch/stream) can be converted without rewriting them from scratch.

Delta Lake Architecture

Data Lake Vs Delta Lake

Delta Lake Architecture

Bronze Tables

Data may come from various sources and may be dirty. The bronze table is therefore a dumping ground for raw data.

Silver Tables

Consist of intermediate data with some cleanup applied.

They are queryable for easy debugging.

Gold Tables

Consist of clean data that is ready for consumption.
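
The Bronze -> Silver -> Gold flow described above can be sketched in a few lines of plain Python. This is only an illustration of the staging idea, not Delta Lake or Spark code: the sample records, the cleanup rules, and the function names `to_silver` and `to_gold` are all assumptions made for the example.

```python
from collections import defaultdict

# Bronze: raw records exactly as they arrived (hypothetical sample data).
bronze = [
    {"user": "alice", "amount": "10.5"},
    {"user": "bob", "amount": "not-a-number"},  # dirty row
    {"user": "alice", "amount": "4.5"},
    {"user": None, "amount": "3.0"},            # dirty row
]


def to_silver(rows):
    """Keep rows that have a user and a parseable amount; cast the amount to float."""
    cleaned = []
    for row in rows:
        if not row["user"]:
            continue
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue
        cleaned.append({"user": row["user"], "amount": amount})
    return cleaned


def to_gold(rows):
    """Aggregate cleaned rows into per-user totals ready for reporting."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'alice': 15.0}
```

Each stage only reads the previous one, so a bad cleanup rule can be fixed and the later stages rebuilt without touching the raw bronze data.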

Original article source at:

#datalake #delta #spark #databricks #bigdata 


Nat Grady


An Easy-to-use, High Performance & Unified Analytics Database

Apache Doris

Apache Doris is an easy-to-use, high-performance and real-time analytical database based on MPP architecture, known for its extreme speed and ease of use. It only requires a sub-second response time to return query results under massive data and can support not only high-concurrent point query scenarios but also high-throughput complex analysis scenarios.

All this makes Apache Doris an ideal tool for scenarios including report analysis, ad-hoc query, unified data warehouse, and data lake query acceleration. On Apache Doris, users can build various applications, such as user behavior analysis, AB test platform, log retrieval analysis, user portrait analysis, and order analysis.

📈 Usage Scenarios

As shown in the figure below, after various data integration and processing, the data sources are usually stored in the real-time data warehouse Apache Doris and the offline data lake or data warehouse (in Apache Hive, Apache Iceberg or Apache Hudi).

Apache Doris is widely used in the following scenarios:

Reporting Analysis

  • Real-time dashboards
  • Reports for in-house analysts and managers
  • Highly concurrent user-oriented or customer-oriented report analysis, such as website analysis and ad reporting, which usually require thousands of QPS and response times measured in milliseconds. In one successful use case, a Chinese e-commerce giant uses Doris for ad reporting, where it receives 10 billion rows of data per day, handles over 10,000 QPS, and delivers a 99th-percentile query latency of 150 ms.

Ad-Hoc Query. Analyst-oriented self-service analytics with irregular query patterns and high throughput requirements. Xiaomi has built a growth analytics platform (Growth Analytics, GA) based on Doris that uses user behavior data for business growth analysis. It runs tens of thousands of SQL queries per day, with an average query latency of 10 seconds and a 95th-percentile query latency of 30 seconds or less.

Unified Data Warehouse Construction. Apache Doris allows users to build a unified data warehouse via one single platform and save the trouble of handling complicated software stacks. Chinese hot pot chain Haidilao has built a unified data warehouse with Doris to replace its old complex architecture consisting of Apache Spark, Apache Hive, Apache Kudu, Apache HBase, and Apache Phoenix.

Data Lake Query. Apache Doris avoids data copying by federating the data in Apache Hive, Apache Iceberg, and Apache Hudi using external tables, and thus achieves outstanding query performance.

🖥️ Core Concepts

📂 Architecture of Apache Doris

The overall architecture of Apache Doris is shown in the following figure. The Doris architecture is very simple, with only two types of processes.

Frontend (FE): user request access, query parsing and planning, metadata management, node management, etc.

Backend (BE): data storage and query plan execution

Both types of processes are horizontally scalable, and a single cluster can support up to hundreds of machines and tens of petabytes of storage capacity. Both also guarantee high availability of services and high reliability of data through consistency protocols. This highly integrated architecture design greatly reduces the operation and maintenance cost of a distributed system.

The overall architecture of Apache Doris

In terms of interfaces, Apache Doris adopts MySQL protocol, supports standard SQL, and is highly compatible with MySQL dialect. Users can access Doris through various client tools and it supports seamless connection with BI tools.

💾 Storage Engine

Doris uses a columnar storage engine, which encodes, compresses, and reads data by column. This enables a very high compression ratio and greatly reduces irrelevant data scans, thus making more efficient use of IO and CPU resources. Doris supports various index structures to minimize data scans:

  • Sorted Compound Key Index: Users can specify three columns at most to form a compound sort key. This can effectively prune data to better support highly concurrent reporting scenarios.
  • Z-order Index: This allows users to efficiently run range queries on any combination of fields in their schema.
  • MIN/MAX Indexing: This enables effective filtering of equivalence and range queries for numeric types.
  • Bloom Filter: This is very effective for equivalence filtering and pruning of high-cardinality columns.
  • Inverted Index: This enables fast search on any field.
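
To make the idea behind MIN/MAX indexing concrete, here is a minimal Python sketch of block skipping. It is not Doris code and says nothing about Doris's on-disk format: the `Block` class, the block size, and the sample column are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A chunk of one numeric column plus its precomputed min/max summary."""
    values: list
    min_value: int
    max_value: int


def make_blocks(column, block_size):
    """Split a column into fixed-size blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def range_scan(blocks, low, high):
    """Return values in [low, high] and the number of blocks actually read."""
    hits, blocks_read = [], 0
    for block in blocks:
        # Skip the block when its [min, max] range cannot overlap [low, high].
        if block.max_value < low or block.min_value > high:
            continue
        blocks_read += 1
        hits.extend(v for v in block.values if low <= v <= high)
    return hits, blocks_read


column = list(range(100))          # sorted data makes pruning most effective
blocks = make_blocks(column, 10)   # 10 blocks of 10 values each
hits, blocks_read = range_scan(blocks, 42, 47)
print(hits, blocks_read)           # [42, 43, 44, 45, 46, 47] 1
```

Only one of the ten blocks is read for this query. On unsorted data the block ranges overlap more, so fewer blocks can be skipped.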

💿 Storage Models

Doris supports a variety of storage models and has optimized them for different scenarios:

Aggregate Key Model: Merges the value columns of rows that share the same key, which significantly improves query performance.

Unique Key Model: Keys are unique in this model, and data with the same key is overwritten to achieve row-level data updates.

Duplicate Key Model: A detailed data model that stores rows as loaded, suited to the detailed storage of fact tables.
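
The difference between the three models is easiest to see by loading the same rows under each one. The Python sketch below only mimics the semantics described above. It is not Doris code, the function names are hypothetical, and SUM is picked as the aggregate function purely as an example.

```python
def load_duplicate(rows):
    """Duplicate Key: keep every row as loaded."""
    return list(rows)


def load_unique(rows):
    """Unique Key: a later row with the same key overwrites the earlier one."""
    table = {}
    for key, value in rows:
        table[key] = value
    return table


def load_aggregate(rows):
    """Aggregate Key: values sharing a key are merged (SUM here, as an example)."""
    table = {}
    for key, value in rows:
        table[key] = table.get(key, 0) + value
    return table


rows = [("page_a", 3), ("page_b", 5), ("page_a", 4)]
print(load_duplicate(rows))  # [('page_a', 3), ('page_b', 5), ('page_a', 4)]
print(load_unique(rows))     # {'page_a': 4, 'page_b': 5}
print(load_aggregate(rows))  # {'page_a': 7, 'page_b': 5}
```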

Doris also supports strongly consistent materialized views. Materialized views are automatically selected and updated, which greatly reduces maintenance costs for users.

🔍 Query Engine

Doris adopts the MPP model in its query engine to realize parallel execution between and within nodes. It also supports distributed shuffle join for multiple large tables so as to handle complex queries.

The Doris query engine is vectorized, with all memory structures laid out in a columnar format. This can largely reduce virtual function calls, improve cache hit rates, and make efficient use of SIMD instructions. Doris delivers a 5–10 times higher performance in wide table aggregation scenarios than non-vectorized engines.

Apache Doris uses Adaptive Query Execution technology to dynamically adjust the execution plan based on runtime statistics. For example, it can generate a runtime filter, push it to the probe side, and automatically push it down to the Scan node at the bottom, which drastically reduces the amount of data in the probe and increases join performance. Runtime filters in Doris support the In, Min/Max, and Bloom filter types.
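
A rough Python sketch of the runtime filter idea follows: the join keys seen on the build side are turned into a filter, and the filter is applied while scanning the probe side so that non-matching rows never reach the join. This is a simplified illustration under assumed data, not the Doris implementation. It assumes a non-empty build side and combines an In set with a Min/Max check.

```python
def build_runtime_filter(build_rows, key):
    """Collect the join keys seen on the build side as an IN set plus min/max bounds."""
    keys = {row[key] for row in build_rows}
    return {"in": keys, "min": min(keys), "max": max(keys)}


def scan_with_filter(probe_rows, key, runtime_filter):
    """Apply the filter during the scan so non-matching rows never reach the join."""
    for row in probe_rows:
        value = row[key]
        if value < runtime_filter["min"] or value > runtime_filter["max"]:
            continue  # cheap min/max check first
        if value in runtime_filter["in"]:
            yield row


def hash_join(build_rows, probe_rows, key):
    """Hash join that prunes the probe side with a runtime filter."""
    hash_table = {}
    for row in build_rows:
        hash_table.setdefault(row[key], []).append(row)
    rf = build_runtime_filter(build_rows, key)
    scanned = list(scan_with_filter(probe_rows, key, rf))
    joined = [{**b, **p} for p in scanned for b in hash_table[p[key]]]
    return joined, len(scanned)


# Hypothetical data: a small dimension table and a larger fact table.
dim = [{"id": 2, "name": "two"}, {"id": 5, "name": "five"}]
fact = [{"id": i % 10, "amount": i} for i in range(1000)]
joined, probed = hash_join(dim, fact, "id")
print(len(fact), probed, len(joined))  # 1000 200 200
```

Of the 1,000 fact rows, only the 200 whose key appears in the dimension table are passed to the join.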

🚅 Query Optimizer

In terms of optimizers, Doris uses a combination of CBO and RBO. The RBO supports constant folding, subquery rewriting, and predicate pushdown, while the CBO supports join reordering. The Doris CBO is under continuous optimization for more accurate statistical information collection and derivation, and more accurate cost model prediction.

Technical Overview: 🔗Introduction to Apache Doris

🎆 Why choose Apache Doris?

🎯 Easy to Use: Two processes, no other dependencies; online cluster scaling, automatic replica recovery; compatible with MySQL protocol, and using standard SQL.

🚀 High Performance: Extremely fast performance for low-latency and high-throughput queries with columnar storage engine, modern MPP architecture, vectorized query engine, pre-aggregated materialized view and data index.

🖥️ Single Unified: A single system can support real-time data serving, interactive data analysis and offline data processing scenarios.

⚛️ Federated Querying: Supports federated querying of data lakes such as Hive, Iceberg, Hudi, and databases such as MySQL and Elasticsearch.

Various Data Import Methods: Supports batch import from HDFS/S3 and stream import from MySQL Binlog/Kafka; supports micro-batch writing through HTTP interface and real-time writing using Insert in JDBC.

🚙 Rich Ecology: Spark uses Spark-Doris-Connector to read and write Doris; Flink-Doris-Connector enables Flink CDC to implement exactly-once data writing to Doris; DBT Doris Adapter is provided to transform data in Doris with DBT.

🙌 Contributors

Apache Doris graduated from the Apache Incubator and became a Top-Level Project in June 2022.

Currently, the Apache Doris community has gathered more than 400 contributors from nearly 200 companies in different industries, and the number of active contributors is close to 100 per month.

Monthly Active Contributors

Contributor over time

We deeply appreciate 🔗community contributors for their contribution to Apache Doris.

👨‍👩‍👧‍👦 Users

Apache Doris now has a wide user base in China and around the world, and as of today, Apache Doris is used in production environments in thousands of companies worldwide. More than 80% of the top 50 Internet companies in China in terms of market capitalization or valuation have been using Apache Doris for a long time, including Baidu, Meituan, Xiaomi, Jingdong, Bytedance, Tencent, NetEase, Kwai, Sina, 360, Mihoyo, and Ke Holdings. It is also widely used in some traditional industries such as finance, energy, manufacturing, and telecommunications.

The users of Apache Doris: 🔗

Add your company logo at Apache Doris Website: 🔗Add Your Company

👣 Get Started

📚 Docs

All Documentation 🔗Docs

⬇️ Download

All release and binary version 🔗Download

🗄️ Compile

See how to compile 🔗Compilation

📮 Install

See how to install and deploy 🔗Installation and deployment

🧩 Components

📝 Doris Connector

Doris provides connectors that let Spark and Flink read data stored in Doris and write data to Doris.



🌈 Community and Support

📤 Subscribe Mailing Lists

Mailing lists are the most recognized form of communication in the Apache community. See how to 🔗Subscribe Mailing Lists

🙋 Report Issues or Submit Pull Request

If you have any questions, feel free to file a 🔗GitHub Issue or post them in 🔗GitHub Discussion. You can also fix an issue yourself by submitting a 🔗Pull Request

🍻 How to Contribute

We welcome your suggestions, comments (including criticisms), and contributions. See 🔗How to Contribute and 🔗Code Submission Guide

⌨️ Doris Improvement Proposals (DSIP)

🔗Doris Improvement Proposal (DSIP) can be thought of as a collection of design documents for all major feature updates or improvements.

💬 Contact Us

Contact us through the following mailing list.

  • dev@doris.apache.org: Development-related discussions

🧰 Links

🎉 Version 1.2.2 is released! It is a fully evolved release, and all users are encouraged to upgrade to it. Check out the 🔗Release Notes here.

🎉 Version 1.1.5 is released. It is an LTS (Long-Term Support) release based on version 1.1. Check out the 🔗Release Notes here.

👀 Have a look at the 🔗Official Website for a comprehensive list of Apache Doris's core features, blogs and user cases.

Download Details:

Author: Apache
Source Code: 
License: Apache-2.0 license

#sql #bigquery #realtime #database #spark 


Royce Reinger


Spark: .NET for Apache® Spark™


.NET for Apache® Spark™

.NET for Apache Spark provides high performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular Dataframe and SparkSQL aspects of Apache Spark, for working with structured data, and Spark Structured Streaming, for working with streaming data.

.NET for Apache Spark is compliant with .NET Standard - a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer.

.NET for Apache Spark runs on Windows, Linux, and macOS using .NET Core, or Windows using .NET Framework. It also runs on all major cloud providers including Azure HDInsight Spark, Amazon EMR Spark, AWS & Azure Databricks.

Note: We currently have a Spark Project Improvement Proposal JIRA at SPIP: .NET bindings for Apache Spark to work with the community towards getting .NET support by default into Apache Spark. We highly encourage you to participate in the discussion.

Get Started

These instructions will show you how to run a .NET for Apache Spark app using .NET Core.


Building from Source

Building from source is very easy and the whole process (from cloning to being able to run your app) should take less than 15 minutes!

  • Windows
  • Ubuntu



There are two types of samples/apps in the .NET for Apache Spark repo:

  • Getting Started: .NET for Apache Spark code focused on simple and minimalistic scenarios.
  • End-End apps/scenarios: Real-world examples of industry-standard benchmarks, use cases, and business applications implemented using .NET for Apache Spark.

We welcome contributions to both categories!

Analytics Scenario



Dataframes and SparkSQL: Simple code snippets to help you get familiarized with the programmability experience of .NET for Apache Spark.

  • Basic (C#, F#; Getting Started)

Structured Streaming: Code snippets to show you how to utilize Apache Spark's Structured Streaming (2.3.1, 2.3.2, 2.4.1, Latest).

  • Word Count (C#, F#; Getting Started)
  • Windowed Word Count (C#, F#; Getting Started)
  • Word Count on data from Kafka (C#, F#; Getting Started)

TPC-H Queries: Code to show you how to author complex queries using .NET for Apache Spark.

  • TPC-H Functional (C#; End-to-end app)
  • TPC-H SparkSQL (C#; End-to-end app)


We welcome contributions! Please review our contribution guide.

Inspiration and Special Thanks

This project would not have been possible without the outstanding work from the following communities:

  • Apache Spark: Unified Analytics Engine for Big Data, the underlying backend execution engine for .NET for Apache Spark
  • Mobius: C# and F# language binding and extensions to Apache Spark, a precursor project to .NET for Apache Spark from the same Microsoft group.
  • PySpark: Python bindings for Apache Spark, one of the implementations .NET for Apache Spark derives inspiration from.
  • SparkR: one of the implementations .NET for Apache Spark derives inspiration from.
  • Apache Arrow: A cross-language development platform for in-memory data. This library provides .NET for Apache Spark with efficient ways to transfer column major data between the JVM and .NET CLR.
  • Pyrolite - Java and .NET interface to Python's pickle and Pyro protocols. This library provides .NET for Apache Spark with efficient ways to transfer row major data between the JVM and .NET CLR.
  • Databricks: Unified analytics platform. Many thanks to all the suggestions from them towards making .NET for Apache Spark run on Azure and AWS Databricks.

How to Engage, Contribute and Provide Feedback

The .NET for Apache Spark team encourages contributions, both issues and PRs. The first step is to find an existing issue you want to contribute to or, if you cannot find one, to open a new issue.


.NET for Apache Spark is an open source project under the .NET Foundation and does not come with Microsoft Support unless otherwise noted by the specific product. For issues with or questions about .NET for Apache Spark, please create an issue. The community is active and is monitoring submissions.

.NET Foundation

The .NET for Apache Spark project is part of the .NET Foundation.

Code of Conduct

This project has adopted the code of conduct defined by the Contributor Covenant to clarify expected behavior in our community. For more information, see the .NET Foundation Code of Conduct.

Supported Apache Spark

Apache Spark | .NET for Apache Spark

*2.4.2 is not supported.


.NET for Apache Spark releases are available here and NuGet packages are available here.

Download Details:

Author: Dotnet
Source Code: 
License: MIT license

#machinelearning #microsoft #emr #streaming #spark #csharp 


Oral Brekke


Difference between: MapReduce vs Spark

What is MapReduce in big data:

MapReduce is a programming model for processing large data sets in parallel across a cluster of computers. It is a key technology for handling big data. The model consists of two key functions: Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Reduce takes the output from Map as input and aggregates the tuples into a smaller set of tuples. The combination of these two functions allows for the efficient processing of large amounts of data by dividing the work into smaller, more manageable chunks.

Is there any point of learning MapReduce, then?

Definitely. Learning MapReduce is worth it if you’re interested in big data processing or work in data-intensive fields. MapReduce is a fundamental concept that gives you a basic understanding of how to process and analyze large data sets in a distributed environment. The principles of MapReduce still play a crucial role in modern big data processing frameworks, such as Apache Hadoop and Apache Spark, so understanding MapReduce provides a solid foundation for learning these technologies. Also, many organizations still use MapReduce for processing large data sets, making it a valuable skill to have in the job market.


Let’s understand this with a simple example:

Imagine we have a large dataset of words and we want to count the frequency of each word. Here’s how we could do it in MapReduce:


  • The map function takes each line of the input dataset and splits it into words.
  • For each word, the map function outputs a tuple (word, 1) indicating that the word has been found once.


  • The reduce function takes all the tuples with the same word and adds up the values (counts) for each word.
  • The reduce function outputs a tuple (word, count) for each unique word in the input dataset.
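
import java.util.StringTokenizer

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Mapper: splits each input line into words and emits (word, 1) for each word.
class TokenizerMapper extends Mapper[Object, Text, Text, IntWritable] {
  val one = new IntWritable(1)
  val word = new Text()

  override def map(key: Object, value: Text, context: Mapper[Object, Text, Text, IntWritable]#Context): Unit = {
    val itr = new StringTokenizer(value.toString)
    while (itr.hasMoreTokens) {
      word.set(itr.nextToken())
      context.write(word, one)
    }
  }
}

// Reducer: sums the counts emitted for each word.
class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  val result = new IntWritable

  override def reduce(key: Text, values: java.lang.Iterable[IntWritable], context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val valuesIter = values.iterator
    while (valuesIter.hasNext) {
      sum += valuesIter.next().get
    }
    result.set(sum)
    context.write(key, result)
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration
    val job = Job.getInstance(conf, "word count")
    // Register the mapper, the reducer, and the output types with the job.
    job.setJarByClass(classOf[TokenizerMapper])
    job.setMapperClass(classOf[TokenizerMapper])
    job.setReducerClass(classOf[IntSumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}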

This code defines a MapReduce job that splits each line of the input into words using the TokenizerMapper class, maps each word to a tuple (word, 1) and then reduces the tuples to count the frequency of each word using the IntSumReducer class. The job is configured using a Job object and the input and output paths are specified using FileInputFormat and FileOutputFormat. The job is then executed by calling waitForCompletion.
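
To follow the same data flow without a Hadoop cluster, the map, shuffle, and reduce steps can be imitated in plain Python. This is a single-process sketch meant only to show what each phase does. The function names and the sample lines are assumptions for the example, and the explicit `shuffle` step stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) tuple for every word on every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```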

And here’s how you could perform the same operation in Apache Spark:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    // Read the input file as an RDD of lines.
    val textFile = sc.textFile("<input_file>.txt")
    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    // Bring the results to the driver and print each (word, count) pair.
    counts.collect().foreach(println)
    sc.stop()
  }
}

This code sets up a SparkConf and SparkContext, reads in the input data using textFile, splits each line into words using flatMap, maps each word to a tuple (word, 1) using map, and reduces the tuples to count the frequency of each word using reduceByKey. The result is then printed using foreach.


MapReduce is a programming paradigm for processing large datasets in a distributed environment. The MapReduce process consists of two main phases: the map phase and the reduce phase. In the map phase, data is transformed into intermediate key-value pairs. In the reduce phase, the intermediate results are aggregated to produce the final output. Spark is a popular alternative to MapReduce. It provides a high-level API and in-memory processing that can make big data processing faster and easier. Whether to choose MapReduce or Spark depends on the specific needs of the task and the resources available.

Original article source at:

#hadoop #spark 


Royce Reinger


lakeFS: Data version control for your data lake | Git for data

lakeFS is data version control: Git for data

lakeFS is an open-source tool that transforms your object storage into a Git-like repository. It enables you to manage your data lake the way you manage your code.

With lakeFS you can build repeatable, atomic, and versioned data lake operations - from complex ETL jobs to data science and analytics.

lakeFS supports AWS S3, Azure Blob Storage, and Google Cloud Storage as its underlying storage service. It is API compatible with S3 and works seamlessly with all modern data frameworks such as Spark, Hive, AWS Athena, Presto, etc.

For more information, see the official lakeFS documentation.


ETL Testing with Isolated Dev/Test Environment

When working with a data lake, it's useful to have replicas of your production environment. These replicas allow you to test ETLs and understand changes to your data without impacting downstream data consumers.

Running ETL and transformation jobs directly in production without proper ETL testing is a guaranteed way to have data issues flow into dashboards, ML models, and other consumers sooner or later. The most common way to avoid making changes directly in production is to create and maintain multiple data environments and perform ETL testing on them: a dev environment to develop the data pipelines, and a test environment where pipeline changes are tested before they are pushed to production. With lakeFS you can create branches and get a copy of the full production data without copying anything. This makes ETL testing faster and easier.


Data changes frequently. This makes the task of keeping track of its exact state over time difficult. Often, people maintain only one state of their data: its current state.

This has a negative impact on the work, as it becomes hard to:

  • Debug a data issue.
  • Validate machine learning training accuracy (re-running a model over different data gives different results).
  • Comply with data audits.

In comparison, lakeFS exposes a Git-like interface to data that allows keeping track of more than just the current state of data. This makes reproducing its state at any point in time straightforward.

CI/CD for Data

Data pipelines feed processed data from data lakes to downstream consumers like business dashboards and machine learning models. As more and more organizations rely on data to enable business-critical decisions, data reliability and trust are of paramount concern. Thus, it's important to ensure that production data adheres to the data governance policies of businesses. These data governance requirements can be as simple as a file format validation or schema check, or as exhaustive as removing PII (Personally Identifiable Information) from all of an organization's data.

Thus, to ensure quality and reliability at each stage of the data lifecycle, data quality gates need to be implemented. That is, we need to run Continuous Integration (CI) tests on the data, and only if the data governance requirements are met can the data be promoted to production for business use.

Every time there is an update to production data, the best practice is to run CI tests and then promote (deploy) the data to production. With lakeFS you can create hooks that make sure that only data that passed these tests becomes part of production.
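
The quality gate idea can be sketched independently of any tool. The Python below is not the lakeFS hooks API. It is a toy gate with two assumed checks (a schema check and a naive email-based PII check), a hypothetical two-column schema, and an in-memory list standing in for production.

```python
import re

EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
REQUIRED_COLUMNS = {"order_id", "amount"}  # hypothetical schema


def check_schema(rows):
    """Every row must contain exactly the required columns."""
    return all(set(row) == REQUIRED_COLUMNS for row in rows)


def check_no_pii(rows):
    """No string value may look like an email address."""
    return not any(
        isinstance(value, str) and EMAIL_PATTERN.search(value)
        for row in rows
        for value in row.values()
    )


def promote_if_valid(staged_rows, production):
    """Append staged rows to production only when every check passes."""
    checks = {"schema": check_schema, "no_pii": check_no_pii}
    failed = [name for name, check in checks.items() if not check(staged_rows)]
    if failed:
        return False, failed
    production.extend(staged_rows)
    return True, []


production = []
good = [{"order_id": 1, "amount": 9.99}]
bad = [{"order_id": 2, "amount": "refund to jane@example.com"}]
print(promote_if_valid(good, production))  # (True, [])
print(promote_if_valid(bad, production))   # (False, ['no_pii'])
print(len(production))                     # 1
```

The important property is that a failed check leaves production untouched, so consumers only ever see data that passed every gate.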


A rollback operation is used to fix critical data errors immediately.

What is a critical data error? Think of a situation where erroneous or misformatted data causes a significant issue with an important service or function. In such situations, the first thing to do is stop the bleeding.

Rolling back returns data to a state in the past, before the error was present. You might not be showing all the latest data after a rollback, but at least you aren't showing incorrect data or raising errors. Since lakeFS provides versions of the data without making copies of the data, you can time travel between versions and roll back to the version of the data from before the error was introduced.
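
Why branching and rollback can work without copying data is easier to see in a toy model. The sketch below is an assumption-laden illustration, not how lakeFS is implemented: commits are immutable snapshots that map paths to object identifiers, and a branch is nothing more than a pointer to a commit. The `ToyRepo` class and the object names are hypothetical.

```python
class ToyRepo:
    """Toy model: commits are immutable snapshots, branches are just pointers."""

    def __init__(self):
        self.commits = [{}]            # commit 0 is the empty snapshot
        self.parents = [None]
        self.branches = {"main": 0}    # branch name -> commit index

    def branch(self, name, source):
        """Create a branch by copying a pointer; no data objects are duplicated."""
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        """Record a new snapshot (path -> object id) and advance the branch pointer."""
        parent = self.branches[branch]
        snapshot = {**self.commits[parent], **changes}
        self.commits.append(snapshot)
        self.parents.append(parent)
        self.branches[branch] = len(self.commits) - 1

    def rollback(self, branch):
        """Move the branch pointer back to the parent of its current commit."""
        parent = self.parents[self.branches[branch]]
        if parent is not None:
            self.branches[branch] = parent

    def read(self, branch):
        """Return the snapshot the branch currently points to."""
        return self.commits[self.branches[branch]]


repo = ToyRepo()
repo.commit("main", {"sales.csv": "object-v1"})
repo.branch("etl-test", "main")                    # isolated test environment
repo.commit("etl-test", {"sales.csv": "object-v2"})
repo.commit("main", {"sales.csv": "object-bad"})   # an erroneous write lands in production
repo.rollback("main")                              # stop the bleeding
print(repo.read("main"))      # {'sales.csv': 'object-v1'}
print(repo.read("etl-test"))  # {'sales.csv': 'object-v2'}
```

Creating the branch and rolling back both only move a pointer, which is why neither operation depends on the size of the data.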

Getting Started

Using Docker

Use this section to learn about lakeFS. For a production-suitable deployment, see the docs.

Ensure you have Docker installed on your computer.

Run the following command:

docker run --pull always --name lakefs -p 8000:8000 treeverse/lakefs run --local-settings

Open lakeFS in your web browser (the command above publishes it on port 8000) to set up an initial admin user. You will use this user to log in and send API requests.

Other quickstart methods

You can try lakeFS:

Setting up a repository

Once lakeFS is installed, you are ready to create your first repository!


Stay up to date and get lakeFS support via:

  • Share your lakeFS experience and get support on our Slack.
  • Follow us and join the conversation on Twitter.
  • Learn from video tutorials on our YouTube channel.
  • Read more on data versioning and other data lake best practices in our blog.
  • Feel free to contact us about anything else.

More information

Download Details:

Author: Treeverse
Source Code: 
License: Apache-2.0 license


Royce Reinger


Delta Lake: Storage Layer That Brings Scalable, ACID Transactions

Delta Lake

Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python.

The following are some of the more popular Delta Lake integrations; refer to the project documentation for the complete list:

  • Apache Spark™: This connector allows Apache Spark™ to read from and write to Delta Lake.
  • Apache Flink (Preview): This connector allows Apache Flink to write to Delta Lake.
  • PrestoDB: This connector allows PrestoDB to read from Delta Lake.
  • Trino: This connector allows Trino to read from and write to Delta Lake.
  • Delta Standalone: This library allows Scala and Java-based projects (including Apache Flink, Apache Hive, Apache Beam, and PrestoDB) to read from and write to Delta Lake.
  • Apache Hive: This connector allows Apache Hive to read from Delta Lake.
  • Delta Rust API: This library allows Rust (with Python and Ruby bindings) low level access to Delta tables and is intended to be used with data processing frameworks like datafusion, ballista, rust-dataframe, vega, etc.

Latest Binaries

See the online documentation for the latest release.

API Documentation


Delta Standalone library is a single-node Java library that can be used to read from and write to Delta tables. Specifically, this library provides APIs to interact with a table’s metadata in the transaction log, implementing the Delta Transaction Log Protocol to achieve the transactional guarantees of the Delta Lake format.

API Compatibility

There are two types of APIs provided by the Delta Lake project.

  • Direct Java/Scala/Python APIs - The classes and methods documented in the API docs are considered as stable public APIs. All other classes, interfaces, methods that may be directly accessible in code are considered internal, and they are subject to change across releases.
  • Spark-based APIs - You can read and write Delta tables through the DataFrameReader/Writer (i.e., spark.read, df.write, spark.readStream and df.writeStream). Options to these APIs will remain stable within a major release of Delta Lake (e.g., 1.x.x).
  • See the online documentation for the releases and their compatibility with Apache Spark versions.

Data Storage Compatibility

Delta Lake guarantees backward compatibility for all Delta Lake tables (i.e., newer versions of Delta Lake will always be able to read tables written by older versions of Delta Lake). However, we reserve the right to break forward compatibility as new features are introduced to the transaction protocol (i.e., an older version of Delta Lake may not be able to read a table produced by a newer version).

Breaking changes in the protocol are indicated by incrementing the minimum reader/writer version in the Protocol action.
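The compatibility rule above can be sketched as a simple version comparison. This is an illustration only, not Delta Lake code: the `Protocol` class, the helper names and the version numbers are hypothetical, and the check models just the "minimum reader/writer version" idea described in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protocol:
    """Minimum versions a client must implement to use a table (illustrative)."""
    min_reader_version: int
    min_writer_version: int


def can_read(client_reader_version, table):
    # A client can read the table if it implements at least the required reader version.
    return client_reader_version >= table.min_reader_version


def can_write(client_writer_version, table):
    # Likewise for writes, using the required writer version.
    return client_writer_version >= table.min_writer_version


old_table = Protocol(min_reader_version=1, min_writer_version=2)
new_table = Protocol(min_reader_version=2, min_writer_version=5)

print(can_read(2, old_table))  # True: newer client, older table (backward compatible)
print(can_read(1, new_table))  # False: older client, newer table (forward compatibility broken)
```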


Transaction Protocol

Delta Transaction Log Protocol document provides a specification of the transaction protocol.

Requirements for Underlying Storage Systems

Delta Lake ACID guarantees are predicated on the atomicity and durability guarantees of the storage system. Specifically, we require the storage system to provide the following.

  1. Atomic visibility: There must be a way for a file to be visible in its entirety or not visible at all.
  2. Mutual exclusion: Only one writer must be able to create (or rename) a file at the final destination.
  3. Consistent listing: Once a file has been written in a directory, all future listings for that directory must return that file.
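The mutual exclusion requirement can be illustrated on a local filesystem, where creating a file with the `O_EXCL` flag fails if the file already exists. The sketch below is a minimal example under that assumption; the `try_commit` helper and the file naming are hypothetical and are not taken from the Delta Lake code base.

```python
import os
import tempfile


def try_commit(log_dir, version, payload):
    """Create the log entry for `version`; return False if another writer got there first."""
    path = os.path.join(log_dir, f"{version:020d}.json")  # illustrative naming
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True


log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, '{"writer": "A"}')
second = try_commit(log_dir, 0, '{"writer": "B"}')
print(first, second)  # True False
```

Note that this only demonstrates mutual exclusion. It does not give atomic visibility, because a reader could observe the file while it is still being written; a real implementation needs the storage system to provide that guarantee as well.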

See the online documentation on Storage Configuration for details.

Concurrency Control

Delta Lake ensures serializability for concurrent reads and writes. Please see Delta Lake Concurrency Control for more details.

Reporting issues

We use GitHub Issues to track community-reported issues. You can also contact the community to get answers.


We welcome contributions to Delta Lake. See our contributing guide for more details.

We also adhere to the Delta Lake Code of Conduct.


Delta Lake is compiled using SBT.

To compile, run

build/sbt compile

To generate artifacts, run

build/sbt package

To execute tests, run

build/sbt test

To execute a single test suite, run

build/sbt 'testOnly'

To execute a single test within a single test suite, run

build/sbt 'testOnly *.OptimizeCompactionSuite -- -z "optimize command: on partitioned table - all partitions"'

Refer to SBT docs for more commands.

IntelliJ Setup

IntelliJ is the recommended IDE to use when developing Delta Lake. To import Delta Lake as a new project:

  1. Clone Delta Lake into, for example, ~/delta.
  2. In IntelliJ, select File > New Project > Project from Existing Sources... and select ~/delta.
  3. Under Import project from external model select sbt. Click Next.
  4. Under Project JDK specify a valid Java 1.8 JDK and opt to use SBT shell for project reload and builds.
  5. Click Finish.

Setup Verification

After waiting for IntelliJ to index, verify your setup by running a test suite in IntelliJ.

  1. Search for and open DeltaLogSuite
  2. Next to the class declaration, right click on the two green arrows and select Run 'DeltaLogSuite'


If you see errors of the form

Error:(46, 28) object DeltaSqlBaseParser is not a member of package
Error:(91, 22) not found: type DeltaSqlBaseParser
    val parser = new DeltaSqlBaseParser(tokenStream)

then follow these steps:

  1. Compile using the SBT CLI: build/sbt compile.
  2. Go to File > Project Structure... > Modules > delta-core.
  3. In the right panel under Source Folders remove any target folders, e.g. target/scala-2.12/src_managed/main [generated]
  4. Click Apply and then re-run your test.


There are two mediums of communication within the Delta Lake community.

Download Details:

Author: Delta-io
Source Code: 
License: Apache-2.0 license


Riley Lambert


Spark Tutorial for Beginners

Explore Spark in depth and build a strong foundation. You'll learn: Why do we need Spark when we have Hadoop? What is the need for RDDs? How is Spark faster than Hadoop? How does Spark achieve the speed and efficiency it claims? How is memory managed in Spark? How does fault tolerance work in Spark? And more.


Most courses and other online resources, including Spark's documentation, do not do a good job of helping students understand the foundational concepts. They explain what Spark is, what an RDD is, what "this" is and what "that" is, but students are most interested in understanding the core fundamentals and, more importantly, in answering questions like:

  • Why do we need Spark when we have Hadoop?
  • What is the need for RDDs?
  • How is Spark faster than Hadoop?
  • How does Spark achieve the speed and efficiency it claims?
  • How is memory managed in Spark?
  • How does fault tolerance work in Spark?

and that is exactly what you will learn in this Spark Starter Kit course. The aim of this course is to give you a strong foundation in Spark.

What you’ll learn

  • Learn about the similarities and differences between Spark and Hadoop.
  • Explore the challenges Spark tries to address; this will give you a good idea of the need for Spark.
  • Learn how Spark is faster than Hadoop, and understand the reasons behind Spark’s performance and efficiency.
  • Before we talk about what an RDD is, we explain in detail why something like an RDD is needed.
  • You will get a strong foundation in understanding RDDs in depth, and then we take a step further to point out and clarify some of the common misconceptions about RDDs among new Spark learners.
  • You will understand the types of dependencies between RDDs and, more importantly, see why dependencies are important.
  • We will walk you through, step by step, how the program we write gets translated into actual execution behind the scenes in a Spark cluster.
  • You will get a very good understanding of some of the key concepts behind Spark’s execution engine and the reasons why it is efficient.
  • Master fault tolerance by simulating a fault situation and examining how Spark recovers from it.
  • You will learn how memory and its contents are managed by Spark.
  • Understand the need for a new programming language like Scala.
  • Examine object-oriented programming vs. functional programming.
  • Explore Scala's features and functions.

Are there any course requirements or prerequisites?

  •        Basic Hadoop concepts.

Who this course is for:

  •        Anyone who is interested in distributed systems and computing and big data related technologies.


Royce Reinger


TransmogrifAI: An AutoML Library for Building Modular, Reusable


TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library written in Scala that runs on top of Apache Spark. It was developed with a focus on accelerating machine learning developer productivity through machine learning automation, and an API that enforces compile-time type-safety, modularity, and reuse. Through automation, it achieves accuracies close to hand-tuned models with almost 100x reduction in time.

Use TransmogrifAI if you need a machine learning library to:

  • Build production ready machine learning applications in hours, not months
  • Build machine learning models without getting a Ph.D. in machine learning
  • Build modular, reusable, strongly typed machine learning workflows

To understand the motivation behind TransmogrifAI check out these:

Skip to Quick Start and Documentation.

Predicting Titanic Survivors with TransmogrifAI

The Titanic dataset is an often-cited dataset in the machine learning community. The goal is to build a machine learnt model that will predict survivors from the Titanic passenger manifest. Here is how you would build the model using TransmogrifAI:

import com.salesforce.op._
import com.salesforce.op.readers._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
import spark.implicits._

// Read Titanic data as a DataFrame
val passengersData = DataReaders.Simple.csvCase[Passenger](path = pathToData).readDataset().toDF()

// Extract response and predictor Features
val (survived, predictors) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")

// Automated feature engineering
val featureVector = predictors.transmogrify()

// Automated feature validation and selection
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)

// Automated model selection
val pred = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()

// Setting up a TransmogrifAI workflow and training the model
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(pred).train()

println("Model summary:\n" + model.summaryPretty())

Model summary:

Evaluated Logistic Regression, Random Forest models with 3 folds and AuPR metric.
Evaluated 3 Logistic Regression models with AuPR between [0.6751930383321765, 0.7768725281794376]
Evaluated 16 Random Forest models with AuPR between [0.7781671467343991, 0.8104798040316159]

Selected model Random Forest classifier with parameters:
| Model Param           |     Value    |
| modelType             | RandomForest |
| featureSubsetStrategy |         auto |
| impurity              |         gini |
| maxBins               |           32 |
| maxDepth              |           12 |
| minInfoGain           |        0.001 |
| minInstancesPerNode   |           10 |
| numTrees              |           50 |
| subsamplingRate       |          1.0 |

Model evaluation metrics:
| Metric Name | Hold Out Set Value |  Training Set Value |
| Precision   |               0.85 |   0.773851590106007 |
| Recall      | 0.6538461538461539 |  0.6930379746835443 |
| F1          | 0.7391304347826088 |  0.7312186978297163 |
| AuROC       | 0.8821603927986905 |  0.8766642291593114 |
| AuPR        | 0.8225075757571668 |   0.850331080886535 |
| Error       | 0.1643835616438356 | 0.19682151589242053 |
| TP          |               17.0 |               219.0 |
| TN          |               44.0 |               438.0 |
| FP          |                3.0 |                64.0 |
| FN          |                9.0 |                97.0 |

Top model insights computed using correlation:
| Top Positive Insights |      Correlation     |
| sex = "female"        |   0.5177801026737666 |
| cabin = "OTHER"       |   0.3331391338844782 |
| pClass = 1            |   0.3059642953159715 |
| Top Negative Insights |      Correlation     |
| sex = "male"          |  -0.5100301587292186 |
| pClass = 3            |  -0.5075774968534326 |
| cabin = null          | -0.31463114463832633 |

Top model insights computed using CramersV:
|      Top Insights     |       CramersV       |
| sex                   |    0.525557139885501 |
| embarked              |  0.31582347194683386 |
| age                   |  0.21582347194683386 |
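The hold-out metrics in the summary above can be re-derived from the four confusion counts it reports, which is a useful sanity check when reading such a report. The short sketch below uses only the TP, TN, FP and FN values copied from the hold-out column and the standard metric definitions.

```python
# Hold-out confusion counts copied from the model summary above.
tp, tn, fp, fn = 17.0, 44.0, 3.0, 9.0

precision = tp / (tp + fp)                           # share of predicted survivors that survived
recall = tp / (tp + fn)                              # share of actual survivors that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
error = (fp + fn) / (tp + tn + fp + fn)              # share of wrong predictions

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} error={error:.4f}")
# precision=0.8500 recall=0.6538 f1=0.7391 error=0.1644
```

These values agree with the Precision, Recall, F1 and Error rows of the hold-out column.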

While this may seem a bit too magical, for those who want more control, TransmogrifAI also provides the flexibility to completely specify all the features being extracted and all the algorithms being applied in your ML pipeline. Visit our docs site for full documentation, getting started, examples, faq and other information.

Adding TransmogrifAI into your project

You can simply add TransmogrifAI as a regular dependency to an existing project. Start by picking TransmogrifAI version to match your project dependencies from the version matrix below (if not sure - take the stable version):

TransmogrifAI Version | Spark Version | Scala Version | Java Version
0.7.1 (unreleased, master), 0.7.0 (stable)
0.6.1, 0.6.0, 0.5.3, 0.5.2, 0.5.1,

For Gradle in build.gradle add:

repositories {
    // same repository the SBT example below adds via Resolver.jcenterRepo
    jcenter()
}
dependencies {
    // TransmogrifAI core dependency
    compile 'com.salesforce.transmogrifai:transmogrifai-core_2.11:0.7.0'

    // TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
    // compile 'com.salesforce.transmogrifai:transmogrifai-models_2.11:0.7.0'
}

For SBT in build.sbt add:

scalaVersion := "2.11.12"

resolvers += Resolver.jcenterRepo

// TransmogrifAI core dependency
libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.7.0"

// TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
// libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-models" % "0.7.0"

Then import TransmogrifAI into your code:

// TransmogrifAI functionality: feature types, feature builders, feature dsl, readers, aggregators etc.
import com.salesforce.op._
import com.salesforce.op.aggregators._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.readers._

// Spark enrichments (optional)
import com.salesforce.op.utils.spark.RichDataset._
import com.salesforce.op.utils.spark.RichRDD._
import com.salesforce.op.utils.spark.RichRow._
import com.salesforce.op.utils.spark.RichMetadata._
import com.salesforce.op.utils.spark.RichStructType._

Quick Start and Documentation

Visit our docs site for full documentation, getting started, examples, faq and other information.

See scaladoc for the programming API.


Internal Contributors (prior to release)

Download Details:

Author: Salesforce
Source Code: 
License: BSD-3-Clause license


Lawson Wehner


Convert Spark RDD into DataFrame and Dataset

In this blog, we will be talking about Spark RDD, Dataframe, Datasets, and how we can transform RDD into Dataframes and Datasets.

What is RDD?

An RDD is an immutable distributed collection of elements of your data. It is partitioned across the nodes in your cluster, so it can be operated on in parallel through a low-level API that offers transformations and actions.

RDDs are so integral to the function of Spark that the entire Spark API can be considered to be a collection of operations to create, transform, and export RDDs. Every algorithm implemented in Spark is effectively a series of transformative operations performed upon data represented as an RDD.

What is Dataframe?

A DataFrame is a Dataset that is organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API, users need to use Dataset<Row> to represent a DataFrame.

What is Dataset?

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.

A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have support for the Dataset API. But due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally: row.columnName).

Working with RDD

Prerequisites: In order to work with RDD we need to create a SparkContext object

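import org.apache.spark.{SparkConf, SparkContext}

val conf: SparkConf =
  new SparkConf()
    // master and app name are placeholders (the original values were lost); adjust for your cluster
    .setMaster("local[*]")
    .setAppName("AppName")
    // config key assumed from the "localhost" value; the original key was lost
    .set("spark.driver.host", "localhost")

val sc: SparkContext = new SparkContext(conf)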

There are 2 common ways to build the RDD:

* Pass your existing collection to SparkContext.parallelize method (you will do it mostly for tests or POC)

scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val rdd = sc.parallelize(data)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

* Read from external sources

val lines = sc.textFile("data.txt")

// length of each line
val lineLengths = lines.map(s => s.length)

// sum of all line lengths
val totalLength = lineLengths.reduce((a, b) => a + b)

Things get interesting when you want to convert your Spark RDD to a DataFrame. It might not be obvious why you would want to switch to a DataFrame or Dataset: you write less code, the code itself is more expressive, and many out-of-the-box optimizations are available for DataFrames and Datasets.

Working with DataFrame

DataFrame has two main advantages over RDD:

Prerequisites: To work with DataFrames we will need SparkSession

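import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession
  .builder()
  .appName("AppName") // placeholder name; the original value was lost
  .config("spark.master", "local")
  .getOrCreate()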


First, let’s sum up the main ways of creating the DataFrame:

  • From existing RDD using reflection

If you have structured or semi-structured data with simple, unambiguous data types, you can infer the schema using reflection.

import spark.implicits._

// for implicit conversions from Spark RDD to Dataframe

val dataFrame = rdd.toDF()
  • From existing RDD by programmatically specifying the schema
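import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// columnNames is kept from the original signature; the field names below are fixed
def dfSchema(columnNames: List[String]): StructType =
  StructType(
    Seq(
      StructField(name = "name", dataType = StringType, nullable = false),
      StructField(name = "age", dataType = IntegerType, nullable = false)
    )
  )

def row(line: List[String]): Row = Row(line(0), line(1).toInt)

val rdd: RDD[String] = ...

val schema = dfSchema(List("name", "age"))

val data = rdd.map(_.split(",").toList).map(row)

val dataFrame = spark.createDataFrame(data, schema)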
  • Loading data from a structured file (JSON, Parquet, CSV)
val dataFrame = spark.read.json("example.json")

val dataFrame = spark.read.csv("example.csv")

val dataFrame = spark.read.parquet("example.parquet")
  • External database via JDBC
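// url (the JDBC connection string) and prop (java.util.Properties) are assumed to be defined elsewhere; the first argument was lost in the original
val dataFrame = spark.read.jdbc(url, "person", prop)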
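For the programmatic-schema route listed above, the underlying idea is simply "split each text line, then coerce every field to the type the schema declares". The plain-Python sketch below shows that idea without Spark; the `SCHEMA` layout and the `to_row` helper are hypothetical and only mirror the name/age example.

```python
# Hypothetical schema description: (column name, Python type, nullable)
SCHEMA = [("name", str, False), ("age", int, False)]


def to_row(line, schema=SCHEMA):
    """Parse one comma-separated line into a dict that follows the schema."""
    values = line.split(",")
    if len(values) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(values)}")
    row = {}
    for raw, (name, typ, nullable) in zip(values, schema):
        if raw == "":
            if not nullable:
                raise ValueError(f"column {name!r} is not nullable")
            row[name] = None
        else:
            row[name] = typ(raw)  # e.g. "21" -> 21 for the age column
    return row


rows = [to_row(line) for line in ["Bob,21", "Mandy,22"]]
print(rows)  # [{'name': 'Bob', 'age': 21}, {'name': 'Mandy', 'age': 22}]
```

Declaring the schema up front means malformed input fails early, instead of surfacing later as a type error in a query.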

The DataFrame API is radically different from the RDD API because it is an API for building a relational query plan that Spark’s Catalyst optimizer can then execute.

Working with Dataset

The Dataset API aims to provide the best of both worlds: the familiar object-oriented programming style and compile-time type-safety of the RDD API but with the performance benefits of the Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the DataFrame API.

The idea behind Dataset “is to provide an API that allows users to easily perform transformations on domain objects, while also providing the performance and robustness advantages of the Spark SQL execution engine”. It represents competition to RDDs as they have overlapping functions.

Let’s say we have a case class. You can create a Dataset from it by implicit conversion or by hand.

case class FeedbackRow(manager_name: String, response_time: Double, satisfaction_level: Double)
  • By implicit conversion
// create Dataset via implicit conversions (needs import spark.implicits._)

val ds: Dataset[FeedbackRow] = dataFrame.as[FeedbackRow]

val theSameDS = spark.read.parquet("example.parquet").as[FeedbackRow]
  • By hand
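// create Dataset by hand

val ds1: Dataset[FeedbackRow] = dataFrame.map {

  // the third column index (5) is an assumption; the original value was lost
  row => FeedbackRow(row.getAs[String](0), row.getAs[Double](4), row.getAs[Double](5))

}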

  • From collection
import spark.implicits._

case class Person(name: String, age: Long)

val data = Seq(Person("Bob", 21), Person("Mandy", 22), Person("Julia", 19))

val ds = spark.createDataset(data)
  • From RDD
val rdd = sc.textFile("data.txt")

val ds = spark.createDataset(rdd)

Original article source at:


Train Machine Learning Model with SparkML and Python | Hands-on tutorial

Building and training a Machine Learning (ML) model with Spark is not hard. In this tutorial we will build a simple binary classification ML model with Spark. We will use Spark's built-in Logistic Regression algorithm and then evaluate the model by computing its performance metrics.

There are some differences from how we do it in Scikit-Learn. Spark provides a built-in SparkML engine with a rich SparkML API that you can leverage to build your own Machine Learning model.

In this tutorial we are using SparkUI v.3.2.1 with pyspark-shell.

The critical points you should pay attention to are:
- Datatypes (DTypes)
- String Indexer and One-Hot-Encoding for categorical features.
- Vector Assembler.
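As a rough mental model of the three steps listed above, the plain-Python sketch below indexes string categories by frequency, one-hot encodes an index, and concatenates everything into one flat feature vector. It is not the SparkML API: the function names are hypothetical, and Spark's own encoders differ in details such as vector format and default options.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: i for i, category in enumerate(ordered)}


def one_hot(index, size):
    """Return a dense 0/1 vector with a single 1.0 at the given index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def assemble(*parts):
    """Concatenate numbers and vectors into one flat feature vector."""
    out = []
    for p in parts:
        out.extend(p if isinstance(p, list) else [float(p)])
    return out


colors = ["red", "blue", "red", "green", "red", "blue"]
mapping = fit_string_index(colors)
row = assemble(one_hot(mapping["blue"], len(mapping)), 42)

print(mapping)  # {'red': 0, 'blue': 1, 'green': 2}
print(row)      # [0.0, 1.0, 0.0, 42.0]
```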

All these parts are explained and demonstrated in detail in this tutorial. You will also learn what SparkContext and SparkSession are and how they differ. You will be able to check the data schema, handle data types in a Spark DataFrame, and select features within your data. As required for ML modelling, you will also learn how to split your data into train and test sets.

You will also learn how to set up ML stages with Spark and build a custom ML Pipeline to build your Machine Learning model.

At the end, you will learn how to get model performance metrics, such as Precision, Recall, or ROC curve values.

The tutorial is prepared with Jupyter Notebook, using the Python programming language, so all the steps are executed with PySpark.

The content of the video:
0:00 - Intro
0:32 - Start of Hands-on with Jupyter Notebook
0:46 - 1. Import main dependencies for Spark and Python
1:14 - Theory: Spark Session vs. Spark Context
3:10 - 1. Continuing importing dependencies
3:28 - 2. Load External CSV data to Spark (as Spark DataFrame)
5:40 - 3. Train and Test splits
6:39 - 4. Check Data Types
8:27 - 5. One-Hot-Encoding with Spark
10:07 - Theory: StringIndexer and One-Hot-Encoder
11:01 - 5. Continuing with StringIndexer hands-on
12:19 - 6. Vector Assembling
12:55 - Theory: Vector Assembling in Spark
13:53 - 6. Continuing with Vector Assembling
15:24 - 7. Make Spark ML Pipeline
18:31 - 8. Train ML Model with Spark
20:07 - 9. Get Model Performance Metrics

Spark API and SparkML API methods used in the tutorial:
- Spark Datatypes
- PySpark SQL DataFrame Random Split
- StringIndexer
- OneHotEncoder
- VectorAssembler
- Spark DataFrame aggregation
- Count Distinct values from Spark DataFrame
- Group by to check feature distribution
- SparkML Pipelines
- Logistic Regression in Spark

Link to the Github repo to hand-on everything on your side (data file is included there):



Rupert Beatty


Spark RDDs: Transformation with Examples

Transformation is one of the RDD operations in Spark. Before moving on to it, let's first discuss what Spark and RDDs actually are.

What is Spark?

Apache Spark is an open-source cluster computing framework. Its main objective is to manage the data created in real time.

Hadoop MapReduce was the foundation upon which Spark was developed. Unlike MapReduce, which writes and reads data to and from computer hard drives, Spark was optimized to run in memory. As a result, Spark processes data far more quickly than disk-based alternatives.

What is RDD?

The fundamental abstraction of Spark is the RDD (Resilient Distributed Dataset). It is a collection of elements partitioned across the nodes of the cluster so that we can operate on it in parallel.

RDDs can be produced in one of two ways:

  • Parallelizing an existing collection in the driver program.
  • Referencing a dataset in any data source that offers a Hadoop InputFormat, such as a shared filesystem, HDFS, HBase, or any other external storage system.

Spark RDD Operations

RDDs provide two types of operations:

  • Transformations
  • Actions

A transformation is a function that generates a new RDD from existing RDDs, whereas an action is what we perform when we want to work with the actual dataset. Unlike a transformation, an action does not produce a new RDD; it computes a result.

Transformations with Examples

The role of a transformation in Spark is to create a new dataset from an existing one. Transformations are lazy: they are computed only when an action requires a result to be returned to the driver program.

Because they are inherently lazy, transformations are not carried out right away; they are executed when we call an action. The two most basic transformations are map() and filter().
The resulting RDD is always distinct from the parent RDD after the transformation. It could be smaller (e.g. filter(), distinct(), sample()), bigger (e.g. flatMap(), union(), cartesian()), or the same size (e.g. map()).
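Lazy evaluation is easy to observe with Python generators, which behave in a similar "nothing runs until a result is requested" way. The sketch below is only an analogy in plain Python, not Spark code; `traced_map` and `traced_filter` are hypothetical helpers that record when work actually happens.

```python
log = []  # records every element that is actually processed


def traced_map(fn, items):
    for x in items:
        log.append(f"map({x})")
        yield fn(x)


def traced_filter(pred, items):
    for x in items:
        log.append(f"filter({x})")
        if pred(x):
            yield x


numbers = [1, 2, 3, 4]

# Building the pipeline is like chaining transformations: no element is touched yet.
pipeline = traced_filter(lambda x: x > 15, traced_map(lambda x: x * 10, numbers))
before = list(log)

# Materializing the result plays the role of an action and triggers the work.
result = list(pipeline)

print(before)    # []
print(result)    # [20, 30, 40]
print(len(log))  # 8
```

Each element flows through the map and the filter only once the result is requested, which is the same reason Spark can plan a whole chain of transformations before executing any of it.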

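In this section, I will explain a few RDD transformations using a word count example in Scala. Before we start, let’s create an RDD by reading a text file. The text file used here is a dummy dataset; you can use any dataset.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark:SparkSession = SparkSession.builder()
  .master("local[*]")            // assumed: run locally; the original builder options were lost
  .appName("RDDTransformations") // hypothetical application name
  .getOrCreate()
val sc = spark.sparkContext

val rdd:RDD[String] = sc.textFile("src/main/scala/test.txt")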

flatMap() Transformation

After applying the function, the flatMap() transformation flattens the RDD and creates a new RDD. The example below first divides each record in an RDD by space before flattening it. Each entry in the resulting RDD only contains one word.

val rdd2 = rdd.flatMap(f=>f.split(" "))

map() Transformation

The map() transformation is used to apply any complex operation, such as adding or updating a column, and its output always has the same number of records as its input.

In our word count example, we add a new column with a value of 1 for each word. The result is an RDD of key-value pairs (supported by PairRDDFunctions), with the words of type String as keys and the value 1 of type Int. I’ve declared the type of the rdd3 variable for your understanding.

val rdd3:RDD[(String,Int)] = rdd2.map(m=>(m,1))

filter() Transformation

The records in an RDD can be filtered with the filter() transformation. In our example, we keep only the words that begin with “a”.

val rdd4 = rdd3.filter(a=> a._1.startsWith("a"))

reduceByKey() Transformation

reduceByKey() merges the values for each key using the supplied function. In our example, it sums the values for each word string. The resulting RDD contains each unique word and its count.

val rdd5 = rdd3.reduceByKey(_ + _)
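To see what the flatMap, map, filter and reduceByKey steps compute, here is the same word count pipeline in plain Python on a two-line input. This is a sketch for intuition only, with no Spark involved; the sample lines and the `reduce_by_key` helper are made up for the example.

```python
from itertools import chain

lines = ["a spark a day", "spark and hadoop"]  # stand-in for the text file

# flatMap: split every line and flatten the pieces into one list of words
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair every word with the value 1
pairs = [(w, 1) for w in words]

# filter: keep only the pairs whose word starts with "a"
starts_with_a = [p for p in pairs if p[0].startswith("a")]


def reduce_by_key(pairs, fn):
    """Merge the values of equal keys with fn, like reduceByKey does per key."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


counts = reduce_by_key(pairs, lambda a, b: a + b)

print(starts_with_a)  # [('a', 1), ('a', 1), ('and', 1)]
print(counts)         # {'a': 2, 'spark': 2, 'day': 1, 'and': 1, 'hadoop': 1}
```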


union() Transformation

With the union() function, we get the elements of both RDDs in a new RDD. The two RDDs must be of the same type for this function to work.
For instance, if RDD1’s elements are Spark, Spark, Hadoop, and Flink, and RDD2’s elements are Big data, Spark, and Flink, the resulting rdd1.union(rdd2) will have the following elements: Spark, Spark, Hadoop, Flink, Big data, Spark, and Flink.

val rdd6 = rdd5.union(rdd3)


intersection() Transformation

With the intersection() function, we get only the elements common to both RDDs in a new RDD. The key rule of this function is that the two RDDs should be of the same type.

val rdd7 = rdd1.intersection(rdd2)
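The difference between the two set-like transformations is easiest to see on the small example from the text. The plain-Python sketch below mimics the behavior with lists and sets: union keeps duplicates, while intersection returns each common element once. The variable names are only stand-ins for RDDs, and the sort is there just to make the printed order deterministic.

```python
rdd1 = ["Spark", "Spark", "Hadoop", "Flink"]
rdd2 = ["Big data", "Spark", "Flink"]

# union: all elements of both collections, duplicates included
union = rdd1 + rdd2

# intersection: elements present in both, without duplicates
intersection = sorted(set(rdd1) & set(rdd2))

print(union)         # ['Spark', 'Spark', 'Hadoop', 'Flink', 'Big data', 'Spark', 'Flink']
print(intersection)  # ['Flink', 'Spark']
```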


In this Spark RDD Transformations blog, you have learned different transformation functions and their usage with scala examples. In the next blog, we will learn about actions.

Happy Learning !!

Original article source at:


Rupert Beatty


Introduction to Spark with Python

Introduction to Spark with Python – PySpark for Beginners

Apache Spark is one of the most widely used frameworks for handling and working with Big Data, and Python is one of the most widely used programming languages for Data Analysis, Machine Learning and much more. So, why not use them together? This is where Spark with Python, also known as PySpark, comes into the picture.

With an average salary of $110,000 pa for an Apache Spark Developer, there’s no doubt that Spark is used in the industry a lot. Because of its rich library set, Python is used by the majority of Data Scientists and Analytics experts today. Integrating Python with Spark was a major gift to the community. Spark was developed in the Scala language, which is very similar to Java. It compiles the program code into bytecode for the JVM for Spark big data processing. To support Spark with Python, the Apache Spark community released PySpark. Ever since, PySpark Certification has been known to be one of the most sought-after skills throughout the industry due to the wide range of benefits that came from combining the best of both these worlds. In this Spark with Python blog, I’ll discuss the following topics.

Introduction to Apache Spark

Apache Spark is an open-source cluster-computing framework for real-time processing developed by the Apache Software Foundation. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.

Below are some of the features of Apache Spark which give it an edge over other frameworks:

Spark Features - Spark with Python - Edureka

  • Speed: It can be up to 100x faster than traditional large-scale data processing frameworks when data fits in memory.
  • Powerful Caching: A simple programming layer provides powerful caching and disk persistence capabilities.
  • Deployment: Can be deployed through Mesos, Hadoop via YARN, or Spark’s own cluster manager.
  • Real Time: Real-time computation and low latency because of in-memory computation.
  • Polyglot: One of the most important features of this framework is that it can be programmed in Scala, Java, Python and R.

Why go for Python?

Spark was written in Scala, which can make Scala programs almost 10 times faster than Python ones, but Scala is faster only when the number of cores being used is small. As most analysis and processing nowadays requires a large number of cores, the performance advantage of Scala is not that large.


For programmers Python is comparatively easier to learn because of its syntax and standard libraries. Moreover, it’s a dynamically typed language, which means RDDs can hold objects of multiple types.

Although Scala has Spark MLlib, it doesn’t have enough libraries and tools for Machine Learning and NLP purposes. Moreover, Scala lacks data visualization tools.

Setting up Spark with Python (PySpark)

I hope you guys know how to download Spark and install it. Once you’ve unzipped the Spark file, installed it and added its path to the .bashrc file, you need to type in source .bashrc

export SPARK_HOME=/usr/lib/hadoop/spark-2.1.0-bin-hadoop2.7
export PATH=$PATH:/usr/lib/hadoop/spark-2.1.0-bin-hadoop2.7/bin

To open pyspark shell you need to type in the command  ./bin/pyspark

Pyspark shell - Spark with python - Edureka

Spark in Industry

Apache Spark, because of its amazing features like in-memory processing, polyglot support and fast processing, is being used by many companies all around the globe for various purposes in various industries:

Companies using Spark - Spark with Python - Edureka

Yahoo uses Apache Spark for its Machine Learning capabilities to personalize its news and web pages, and also for targeted advertising. They use Spark with Python to find out what kind of news users are interested in reading, and to categorize the news stories to find out what kind of users would be interested in reading each category of news.

TripAdvisor uses Apache Spark to provide advice to millions of travelers by comparing hundreds of websites to find the best hotel prices for its customers. Reading the hotel reviews and processing them into a readable format is also done with the help of Apache Spark.

Alibaba, one of the world’s largest e-commerce platforms, runs some of the largest Apache Spark jobs in the world in order to analyze hundreds of petabytes of data on its e-commerce platform.

PySpark SparkContext and Data Flow

Talking about Spark with Python, working with RDDs is made possible by the library Py4J. The PySpark shell links the Python API to Spark core and initializes the SparkContext. The SparkContext is the heart of any Spark application.

  1. The SparkContext sets up internal services and establishes a connection to a Spark execution environment.
  2. The SparkContext object in the driver program coordinates all the distributed processes and allows resource allocation.
  3. Cluster managers provide executors, which are JVM processes with logic.
  4. The SparkContext object sends the application to the executors.
  5. The SparkContext executes tasks in each executor.

Pyspark Sparkcontext - Spark with Python - Edureka

PySpark KDD Use Case

Now let’s have a look at a use case based on the KDD’99 Cup (International Knowledge Discovery and Data Mining Tools Competition). Here we will take a fraction of the dataset because the original dataset is too big.

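import urllib
# Python 2 API; on Python 3 the same function is urllib.request.urlretrieve.
# The download URL was lost from this copy of the post; pass it as the first argument.
f = urllib.urlretrieve("", "kddcup.data_10_percent.gz")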

Now we can use this file to create our RDD.

data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)


Suppose we want to count how many normal. interactions we have in our dataset. We can filter our raw_data RDD as follows.

normal_raw_data = raw_data.filter(lambda x: 'normal.' in x)
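The filter predicate can be checked locally before it is run on the cluster. The sketch below uses plain Python rather than Spark, and the rows in sample_lines are hypothetical, shortened records (real rows in this dataset have 42 comma-separated fields). The is_normal function applies the same substring test as the lambda above.

```python
def is_normal(line):
    """Same test as the lambda passed to raw_data.filter()."""
    return "normal." in line


# Hypothetical, shortened records used only for illustration.
sample_lines = [
    "0,tcp,http,SF,181,5450,normal.",
    "0,icmp,ecr_i,SF,1032,0,smurf.",
    "0,tcp,http,SF,239,486,normal.",
]

normal_lines = [line for line in sample_lines if is_normal(line)]
print(len(normal_lines))  # 2
```

Because this is a substring test, it would also match a row that contained the text normal. in some other field; comparing the parsed label field exactly is stricter.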


Now we can count how many elements we have in the new RDD.

from time import time
t0 = time()
normal_count = normal_raw_data.count()
tt = time() - t0
print("There are {} 'normal' interactions".format(normal_count))
print("Count completed in {} seconds".format(round(tt,3)))


There are 97278 'normal' interactions
Count completed in 5.951 seconds
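The same start-the-clock, run, stop-the-clock pattern comes up for every action in this walkthrough, so it can be wrapped in a small helper. This is only a sketch: timed is a hypothetical name, not part of PySpark, and a plain Python callable stands in for the Spark action so the example runs without a cluster.

```python
from time import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    t0 = time()
    result = action()
    return result, round(time() - t0, 3)


# A local stand-in for an expensive action such as normal_raw_data.count
result, seconds = timed(lambda: sum(1 for _ in range(1000)))
print("There are {} items".format(result))
print("Count completed in {} seconds".format(seconds))
```

In the PySpark shell it could be called as normal_count, tt = timed(normal_raw_data.count).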


In this case we want to read our data file as a CSV-formatted one. We can do this by applying a lambda function to each element in the RDD as follows. Here we will use the map() transformation and the take() action.

from pprint import pprint
csv_data = raw_data.map(lambda x: x.split(","))
t0 = time()
head_rows = csv_data.take(5)
tt = time() - t0
print("Parse completed in {} seconds".format(round(tt,3)))


Parse completed in 1.715 seconds


Now we want to have each element in the RDD as a key-value pair where the key is the tag (e.g. normal) and the value is the whole list of elements that represents the row in the CSV formatted file. We could proceed as follows. Here we use the line.split() and map().

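def parse_interaction(line):
    # Split the CSV row; the tag (e.g. normal.) is the field at index 41
    elems = line.split(",")
    tag = elems[41]
    return (tag, elems)

key_csv_data = raw_data.map(parse_interaction)
head_rows = key_csv_data.take(5)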
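The parsing function can be tried out on a single line in plain Python before it is mapped over the RDD. In this sketch, sample_line is a synthetic record invented for illustration (41 placeholder values followed by the tag), not a real row from the dataset.

```python
def parse_interaction(line):
    """Return (tag, fields) for one CSV row; the tag is the field at index 41."""
    elems = line.split(",")
    tag = elems[41]
    return (tag, elems)


# Synthetic record: 41 placeholder feature values followed by the tag.
sample_line = ",".join(["0"] * 41 + ["normal."])

tag, elems = parse_interaction(sample_line)
print(tag)         # normal.
print(len(elems))  # 42
```

A row with fewer than 42 fields would raise an IndexError here, which is worth knowing before running the function over the whole file.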




Here we are going to use the collect() action. It will get all the elements of the RDD into the driver’s memory. For this reason, it has to be used with care when working with large RDDs.

t0 = time()
all_raw_data = raw_data.collect()
tt = time() - t0
print("Data collected in {} seconds".format(round(tt,3)))


Data collected in 17.927 seconds

That took longer than any other action we used before, of course. Every Spark worker node that has a fragment of the RDD has to be coordinated in order to retrieve its part and then reduce everything together.

As a last example combining all the previous ones, we want to collect all the normal interactions as key-value pairs.

# get data from file
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)
# parse into key-value pairs
key_csv_data = raw_data.map(parse_interaction)
# filter normal key interactions
normal_key_interactions = key_csv_data.filter(lambda x: x[0] == "normal.")
# collect all
t0 = time()
all_normal = normal_key_interactions.collect()
tt = time() - t0
normal_count = len(all_normal)
print("Data collected in {} seconds".format(round(tt,3)))
print("There are {} 'normal' interactions".format(normal_count))


Data collected in 12.485 seconds
There are 97278 'normal' interactions

So this is it, guys!

I hope you enjoyed this Spark with Python blog. If you are reading this, Congratulations! You are no longer a newbie to PySpark. Try out this simple example on your systems now.

Now that you have understood the basics of PySpark, check out the Python Spark Certification Training using PySpark by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. Edureka’s Python Spark Certification Training using PySpark is designed to provide you with the knowledge and skills that are required to become a successful Spark Developer using Python and to prepare you for the Cloudera Hadoop and Spark Developer Certification Exam (CCA175).

Got a question for us? Please mention it in the comments section and we will get back to you.
