1679683380
Redash is designed to enable anyone, regardless of their level of technical sophistication, to harness the power of data, big and small. SQL users leverage Redash to explore, query, visualize, and share data from any data source. Their work in turn enables anybody in their organization to use the data. Every day, millions of users at thousands of organizations around the world use Redash to develop insights and make data-driven decisions.
Redash features:
Redash supports more than 35 SQL and NoSQL data sources. It can also be extended to support more. Below is a list of built-in sources:
Please email security@redash.io to report any security vulnerabilities. We will acknowledge receipt of your vulnerability and strive to send you regular updates about our progress. If you're curious about the status of your disclosure please feel free to email us again. If you want to encrypt your disclosure email, you can use this PGP key.
Author: Getredash
Source Code: https://github.com/getredash/redash
License: BSD-2-Clause license
#python #javascript #visualization #mysql #bigquery #spark #dashboard
1678468080
A data lake is a repository that cheaply stores huge volumes of raw data in its native format.
It consists of dumps of current and historical data in various formats, including XML, JSON, CSV, Parquet, etc.
Delta Lake allows us to incrementally improve the quality until the data is ready for consumption. Data flows like water in Delta Lake from one stage to another (Bronze -> Silver -> Gold).
Bronze Tables
Data may come from various sources, which can be dirty. Thus, this is a dumping ground for raw data.
Silver Tables
Consists of intermediate data with some cleanup applied.
It is queryable for easy debugging.
Gold Tables
It consists of clean data that is ready for consumption.
Original article source: https://www.c-sharpcorner.com/
1678464260
Data Lake is a repository that can cheaply store large amounts of raw data in its native format.
It consists of dumps of current and historical data in various formats, including XML, JSON, CSV, Parquet, etc.
Delta Lake allows us to improve the quality step by step until the data is ready for use. Data flows from one stage to another like water in Delta Lake (Bronze -> Silver -> Gold).
Bronze Tables
Data may come from various sources, which may be dirty. Therefore, it is a dumping ground for raw data.
Silver Tables
Consists of intermediate data with some cleanup applied.
It is queryable for easy debugging.
Gold Tables
It consists of clean data that is ready for use at any time.
Original article source: https://www.c-sharpcorner.com/
1678460460
Data Lake is a storage repository that cheaply stores vast raw data in its native format.
It consists of current and historical data dumps in various formats, including XML, JSON, CSV, Parquet, etc.
Delta Lake allows us to incrementally improve the quality until it is ready for consumption. Data flows like water in Delta Lake from one stage to another stage (Bronze -> Silver -> Gold).
Bronze Tables
Data may come from various sources, which could be dirty. Thus, it is a dumping ground for raw data.
Silver Tables
Consists of intermediate data with some cleanup applied.
It is queryable for easy debugging.
Gold Tables
It consists of clean data, which is ready for consumption.
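To make the flow concrete, here is a minimal Spark/Scala sketch of the Bronze -> Silver -> Gold progression using Delta tables; the paths and column names are hypothetical and used only for illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("MedallionSketch").getOrCreate()

// Bronze: land the raw data as-is (the dumping ground)
val bronze = spark.read.json("/lake/raw/events")                          // hypothetical source path
bronze.write.format("delta").mode("append").save("/lake/bronze/events")

// Silver: intermediate data with some cleanup applied
val silver = spark.read.format("delta").load("/lake/bronze/events")
  .filter(col("event_id").isNotNull)                                      // hypothetical column
  .dropDuplicates("event_id")
silver.write.format("delta").mode("overwrite").save("/lake/silver/events")

// Gold: clean, aggregated data ready for consumption
val gold = spark.read.format("delta").load("/lake/silver/events")
  .groupBy("event_type").count()                                          // hypothetical column
gold.write.format("delta").mode("overwrite").save("/lake/gold/event_counts")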
Original article source at: https://www.c-sharpcorner.com/
1677730683
Apache Doris is an easy-to-use, high-performance and real-time analytical database based on an MPP architecture, known for its extreme speed and ease of use. It returns query results over massive data sets within sub-second response times and supports not only high-concurrency point query scenarios but also high-throughput complex analysis scenarios.
All this makes Apache Doris an ideal tool for scenarios including report analysis, ad-hoc query, unified data warehouse, and data lake query acceleration. On Apache Doris, users can build various applications, such as user behavior analysis, AB test platform, log retrieval analysis, user portrait analysis, and order analysis.
After various data integration and processing steps, the data is usually stored both in the real-time data warehouse Apache Doris and in an offline data lake or data warehouse (in Apache Hive, Apache Iceberg or Apache Hudi).
Apache Doris is widely used in the following scenarios:
Reporting Analysis
Ad-Hoc Query. Analyst-oriented self-service analytics with irregular query patterns and high throughput requirements. XiaoMi has built a growth analytics platform (Growth Analytics, GA) based on Doris, using user behavior data for business growth analysis, with an average query latency of 10 seconds and a 95th percentile query latency of 30 seconds or less, and tens of thousands of SQL queries per day.
Unified Data Warehouse Construction. Apache Doris allows users to build a unified data warehouse via one single platform and save the trouble of handling complicated software stacks. Chinese hot pot chain Haidilao has built a unified data warehouse with Doris to replace its old complex architecture consisting of Apache Spark, Apache Hive, Apache Kudu, Apache HBase, and Apache Phoenix.
Data Lake Query. Apache Doris avoids data copying by federating the data in Apache Hive, Apache Iceberg, and Apache Hudi using external tables, and thus achieves outstanding query performance.
The overall architecture of Apache Doris is very simple, with only two types of processes.
Frontend (FE): user request access, query parsing and planning, metadata management, node management, etc.
Backend (BE): data storage and query plan execution
Both types of processes are horizontally scalable, and a single cluster can support up to hundreds of machines and tens of petabytes of storage capacity. And these two types of processes guarantee high availability of services and high reliability of data through consistency protocols. This highly integrated architecture design greatly reduces the operation and maintenance cost of a distributed system.
In terms of interfaces, Apache Doris adopts MySQL protocol, supports standard SQL, and is highly compatible with MySQL dialect. Users can access Doris through various client tools and it supports seamless connection with BI tools.
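For example, because Doris speaks the MySQL protocol, you can query it from Scala with a plain MySQL JDBC driver on the classpath; this is only a sketch, and the host, FE query port (commonly 9030), credentials, database and table below are assumptions.

import java.sql.DriverManager

// Connect to a Doris FE node over the MySQL protocol (host, port and credentials are assumed)
val url = "jdbc:mysql://doris-fe.example.com:9030/demo_db"
val conn = DriverManager.getConnection(url, "root", "")
try {
  val stmt = conn.createStatement()
  // Any standard SQL supported by Doris works here; the table is hypothetical
  val rs = stmt.executeQuery("SELECT city, COUNT(*) AS cnt FROM user_events GROUP BY city")
  while (rs.next()) {
    println(s"${rs.getString("city")}: ${rs.getLong("cnt")}")
  }
} finally {
  conn.close()
}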
Doris uses a columnar storage engine, which encodes, compresses, and reads data by column. This enables a very high compression ratio and largely reduces irrelevant data scans, thus making more efficient use of IO and CPU resources. Doris supports various index structures to minimize data scans:
Doris supports a variety of storage models and has optimized them for different scenarios:
Aggregate Key Model: able to merge the value columns with the same keys and significantly improve performance
Unique Key Model: Keys are unique in this model and data with the same key will be overwritten to achieve row-level data updates.
Duplicate Key Model: This is a detailed data model capable of detailed storage of fact tables.
Doris also supports strongly consistent materialized views. Materialized views are automatically selected and updated, which greatly reduces maintenance costs for users.
Doris adopts the MPP model in its query engine to realize parallel execution between and within nodes. It also supports distributed shuffle join for multiple large tables so as to handle complex queries.
The Doris query engine is vectorized, with all memory structures laid out in a columnar format. This can largely reduce virtual function calls, improve cache hit rates, and make efficient use of SIMD instructions. Doris delivers a 5–10 times higher performance in wide table aggregation scenarios than non-vectorized engines.
Apache Doris uses Adaptive Query Execution technology to dynamically adjust the execution plan based on runtime statistics. For example, it can generate a runtime filter, push it to the probe side, and automatically push it down to the Scan node at the bottom, which drastically reduces the amount of data scanned on the probe side and increases join performance. The runtime filter in Doris supports In/Min/Max/Bloom filters.
In terms of optimizers, Doris uses a combination of CBO and RBO. The RBO supports constant folding, subquery rewriting and predicate pushdown, while the CBO supports join reordering. The Doris CBO is under continuous optimization for more accurate statistics collection and derivation, and more accurate cost model prediction.
Technical Overview: 🔗Introduction to Apache Doris
🎯 Easy to Use: Two processes, no other dependencies; online cluster scaling, automatic replica recovery; compatible with MySQL protocol, and using standard SQL.
🚀 High Performance: Extremely fast performance for low-latency and high-throughput queries with columnar storage engine, modern MPP architecture, vectorized query engine, pre-aggregated materialized view and data index.
🖥️ Single Unified: A single system can support real-time data serving, interactive data analysis and offline data processing scenarios.
⚛️ Federated Querying: Supports federated querying of data lakes such as Hive, Iceberg, Hudi, and databases such as MySQL and Elasticsearch.
⏩ Various Data Import Methods: Supports batch import from HDFS/S3 and stream import from MySQL Binlog/Kafka; supports micro-batch writing through HTTP interface and real-time writing using Insert in JDBC.
🚙 Rich Ecology: Spark uses Spark-Doris-Connector to read and write Doris; Flink-Doris-Connector enables Flink CDC to implement exactly-once data writing to Doris; DBT Doris Adapter is provided to transform data in Doris with DBT.
Apache Doris successfully graduated from the Apache Incubator and became a Top-Level Project in June 2022.
Currently, the Apache Doris community has gathered more than 400 contributors from nearly 200 companies in different industries, and the number of active contributors is close to 100 per month.
We deeply appreciate 🔗community contributors for their contribution to Apache Doris.
Apache Doris now has a wide user base in China and around the world, and as of today, Apache Doris is used in production environments in thousands of companies worldwide. More than 80% of the top 50 Internet companies in China in terms of market capitalization or valuation have been using Apache Doris for a long time, including Baidu, Meituan, Xiaomi, Jingdong, Bytedance, Tencent, NetEase, Kwai, Sina, 360, Mihoyo, and Ke Holdings. It is also widely used in some traditional industries such as finance, energy, manufacturing, and telecommunications.
The users of Apache Doris: 🔗https://doris.apache.org/users
Add your company logo at Apache Doris Website: 🔗Add Your Company
All Documentation 🔗Docs
All release and binary version 🔗Download
See how to compile 🔗Compilation
See how to install and deploy 🔗Installation and deployment
Doris provides connectors that allow Spark and Flink to read data stored in Doris, and also to write data to Doris.
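As a hedged illustration of the Spark side, reading a Doris table through the Spark-Doris-Connector looks roughly like the sketch below; the option names follow the connector documentation as I recall it and may differ between connector versions, and the FE address, database, table and credentials are assumptions.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DorisConnectorSketch").getOrCreate()

// Read a Doris table as a DataFrame (FE address, database, table and credentials are assumed)
val dorisDF = spark.read.format("doris")
  .option("doris.table.identifier", "demo_db.user_events")
  .option("doris.fenodes", "doris-fe.example.com:8030")   // FE HTTP port, typically 8030
  .option("user", "root")
  .option("password", "")
  .load()

dorisDF.show(10)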
Mail List is the most recognized form of communication in Apache community. See how to 🔗Subscribe Mailing Lists
If you run into any problems, feel free to file a 🔗GitHub Issue or post in 🔗GitHub Discussion, and fix it by submitting a 🔗Pull Request
We welcome your suggestions, comments (including criticisms), and contributions. See 🔗How to Contribute and 🔗Code Submission Guide
🔗Doris Improvement Proposal (DSIP) can be thought of as A Collection of Design Documents for all Major Feature Updates or Improvements.
Contact us through the following mailing list.
Name | Scope | Subscribe | Unsubscribe | Archives
---|---|---|---|---
dev@doris.apache.org | Development-related discussions | Subscribe | Unsubscribe | Archives
🎉 Version 1.2.2 released now! It is a fully evolved release and all users are encouraged to upgrade to it. Check out the 🔗Release Notes here.
🎉 Version 1.1.5 released now. It is an LTS (Long-Term Support) release based on version 1.1. Check out the 🔗Release Notes here.
👀 Have a look at the 🔗Official Website for a comprehensive list of Apache Doris's core features, blogs and use cases.
Author: Apache
Source Code: https://github.com/apache/doris
License: Apache-2.0 license
1676434620
.NET for Apache® Spark™
.NET for Apache Spark provides high performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular Dataframe and SparkSQL aspects of Apache Spark, for working with structured data, and Spark Structured Streaming, for working with streaming data.
.NET for Apache Spark is compliant with .NET Standard - a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code, allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer.
.NET for Apache Spark runs on Windows, Linux, and macOS using .NET Core, or Windows using .NET Framework. It also runs on all major cloud providers including Azure HDInsight Spark, Amazon EMR Spark, AWS & Azure Databricks.
Note: We currently have a Spark Project Improvement Proposal JIRA at SPIP: .NET bindings for Apache Spark to work with the community towards getting .NET support by default into Apache Spark. We highly encourage you to participate in the discussion.
These instructions will show you how to run a .NET for Apache Spark app using .NET Core.
Building from source is very easy and the whole process (from cloning to being able to run your app) should take less than 15 minutes!
Step-by-step getting started instructions are provided for both Windows and Ubuntu.
There are two types of samples/apps in the .NET for Apache Spark repo:
Getting Started - .NET for Apache Spark code focused on simple and minimalistic scenarios.
End-to-end apps/scenarios - Real-world examples of industry-standard benchmarks, use cases and business applications implemented using .NET for Apache Spark.
We welcome contributions to both categories!
Analytics Scenario | Description | Scenarios
---|---|---
Dataframes and SparkSQL | Simple code snippets to help you get familiarized with the programmability experience of .NET for Apache Spark. | Basic C#, F#
Structured Streaming | Code snippets to show you how to utilize Apache Spark's Structured Streaming (2.3.1, 2.3.2, 2.4.1, Latest) |
TPC-H Queries | Code to show you how to author complex queries using .NET for Apache Spark. | TPC-H Functional C#, TPC-H SparkSQL C#
We welcome contributions! Please review our contribution guide.
This project would not have been possible without the outstanding work from the following communities:
The .NET for Apache Spark team encourages contributions, both issues and PRs. The first step is finding an existing issue you want to contribute to or if you cannot find any, open an issue.
.NET for Apache Spark is an open source project under the .NET Foundation and does not come with Microsoft Support unless otherwise noted by the specific product. For issues with or questions about .NET for Apache Spark, please create an issue. The community is active and is monitoring submissions.
The .NET for Apache Spark project is part of the .NET Foundation.
This project has adopted the code of conduct defined by the Contributor Covenant to clarify expected behavior in our community. For more information, see the .NET Foundation Code of Conduct.
Apache Spark | .NET for Apache Spark |
---|---|
2.4* | v2.1.1 |
3.0 | |
3.1 | |
3.2 |
*2.4.2 is not supported.
.NET for Apache Spark releases are available here and NuGet packages are available here.
Author: Dotnet
Source Code: https://github.com/dotnet/spark
License: MIT license
1675807800
MapReduce is a programming model for processing large data sets in parallel across a cluster of computers. It is a key technology for handling big data. The model consists of two key functions: Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Reduce takes the output from the Map as input and aggregates the tuples into a smaller set of tuples. The combination of these two functions allows for the efficient processing of large amounts of data by dividing the work into smaller, more manageable chunks.
Definitely, learning MapReduce is worth it if you’re interested in big data processing or work in data-intensive fields. MapReduce is a fundamental concept that gives you a basic understanding of how to process and analyze large data sets in a distributed environment. The principles of MapReduce still play a crucial role in modern big data processing frameworks, such as Apache Hadoop and Apache Spark, so understanding MapReduce provides a solid foundation for learning these technologies. Also, many organizations still use MapReduce to process large data sets, making it a valuable skill to have in the job market.
Let’s understand this with a simple example:
Imagine we have a large dataset of words and we want to count the frequency of each word. Here’s how we could do it in MapReduce:
Map: split each line of input into words and emit a (word, 1) pair for every word.
Reduce: for each word, sum the emitted 1s to get the word's total frequency.
import java.util.StringTokenizer
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
class TokenizerMapper extends Mapper[Object, Text, Text, IntWritable] {
val one = new IntWritable(1)
val word = new Text()
override def map(key: Object, value: Text, context: Mapper[Object, Text, Text, IntWritable]#Context): Unit = {
val itr = new StringTokenizer(value.toString)
while (itr.hasMoreTokens) {
word.set(itr.nextToken)
context.write(word, one)
}
}
}
class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
val result = new IntWritable
override def reduce(key: Text, values: java.lang.Iterable[IntWritable], context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
var sum = 0
val valuesIter = values.iterator
while (valuesIter.hasNext) {
sum += valuesIter.next.get
}
result.set(sum)
context.write(key, result)
}
}
object WordCount {
def main(args: Array[String]): Unit = {
val conf = new Configuration
val job = Job.getInstance(conf, "word count")
job.setJarByClass(this.getClass)
job.setMapperClass(classOf[TokenizerMapper])
job.setCombinerClass(classOf[IntSumReducer])
job.setReducerClass(classOf[IntSumReducer])
job.setOutputKeyClass(classOf[Text])
job.setOutputValueClass(classOf[IntWritable])
FileInputFormat.addInputPath(job, new Path(args(0)))
FileOutputFormat.setOutputPath(job, new Path(args(1)))
System.exit(if (job.waitForCompletion(true)) 0 else 1)
}
}
This code defines a MapReduce job that splits each line of the input into words using the TokenizerMapper class, maps each word to a tuple (word, 1), and then reduces the tuples to count the frequency of each word using the IntSumReducer class. The job is configured using a Job object, and the input and output paths are specified using FileInputFormat and FileOutputFormat. The job is then executed by calling waitForCompletion.
And here’s how you could perform the same operation in Apache Spark:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object WordCount {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("WordCount")
val sc = new SparkContext(conf)
val textFile = sc.textFile("<input_file>.txt")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.foreach(println)
sc.stop()
}
}
This code sets up a SparkConf and SparkContext, reads in the input data using textFile, splits each line into words using flatMap, maps each word to a tuple (word, 1) using map, and reduces the tuples to count the frequency of each word using reduceByKey. The result is then printed using foreach.
MapReduce is a programming paradigm for processing large datasets in a distributed environment. The MapReduce process consists of two main phases: the map phase and the reduce phase. In the map phase, data is transformed into intermediate key-value pairs. In the reduce phase, the intermediate results are aggregated to produce the final output. Spark is a popular alternative to MapReduce. It provides a high-level API and in-memory processing that can make big data processing faster and easier. Whether to choose MapReduce or Spark depends on the specific needs of the task and the resources available.
Original article source at: https://blog.knoldus.com/
1675732620
lakeFS is an open-source tool that transforms your object storage into a Git-like repository. It enables you to manage your data lake the way you manage your code.
With lakeFS you can build repeatable, atomic, and versioned data lake operations - from complex ETL jobs to data science and analytics.
lakeFS supports AWS S3, Azure Blob Storage, and Google Cloud Storage as its underlying storage service. It is API compatible with S3 and works seamlessly with all modern data frameworks such as Spark, Hive, AWS Athena, Presto, etc.
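For example, because lakeFS is S3 API compatible, a Spark job can read from a branch through the lakeFS S3 gateway. The sketch below is only illustrative: the endpoint, credentials, repository name ("example-repo") and branch ("main") are assumptions, and the hadoop-aws (S3A) dependency is assumed to be available.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LakeFSSketch").getOrCreate()

// Point the S3A filesystem at the lakeFS S3 gateway (endpoint and keys are assumed)
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.endpoint", "https://lakefs.example.com")
hadoopConf.set("fs.s3a.access.key", "<lakefs-access-key-id>")
hadoopConf.set("fs.s3a.secret.key", "<lakefs-secret-access-key>")
hadoopConf.set("fs.s3a.path.style.access", "true")

// lakeFS paths follow s3a://<repository>/<branch>/<object path>
val df = spark.read.parquet("s3a://example-repo/main/collections/events/")
df.show(10)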
For more information, see the official lakeFS documentation.
When working with a data lake, it’s useful to have replicas of your production environment. These replicas allow you to test ETLs and understand changes to your data without impacting downstream data consumers.
Running ETL and transformation jobs directly in production without proper ETL testing is a guaranteed way to have data issues flow into dashboards, ML models, and other consumers sooner or later. The most common approach to avoid making changes directly in production is to create and maintain multiple data environments and perform ETL testing on them: a dev environment to develop the data pipelines, and a test environment where pipeline changes are tested before being pushed to production. With lakeFS you can create branches and get a copy of the full production data without copying anything. This enables a faster and easier process of ETL testing.
Data changes frequently. This makes the task of keeping track of its exact state over time difficult. Oftentimes, people maintain only one state of their data––its current state.
This has a negative impact on the work, as it becomes hard to:
In comparison, lakeFS exposes a Git-like interface to data that allows keeping track of more than just the current state of data. This makes reproducing its state at any point in time straightforward.
Data pipelines feed processed data from data lakes to downstream consumers like business dashboards and machine learning models. As more and more organizations rely on data to enable business-critical decisions, data reliability and trust are of paramount concern. Thus, it’s important to ensure that production data adheres to the data governance policies of businesses. These data governance requirements can be as simple as a file format validation or schema check, or as exhaustive as PII (Personally Identifiable Information) removal from all of an organization’s data.
Thus, to ensure quality and reliability at each stage of the data lifecycle, data quality gates need to be implemented. That is, we need to run Continuous Integration (CI) tests on the data, and only if the data governance requirements are met can the data be promoted to production for business use.
Every time there is an update to production data, the best practice is to run CI tests and then promote (deploy) the data to production. With lakeFS you can create hooks that make sure that only data that passed these tests becomes part of production.
A rollback operation is used to fix critical data errors immediately.
What is a critical data error? Think of a situation where erroneous or misformatted data causes a significant issue with an important service or function. In such situations, the first thing to do is stop the bleeding.
Rolling back returns data to a state in the past, before the error was present. You might not be showing all the latest data after a rollback, but at least you aren’t showing incorrect data or raising errors. Since lakeFS provides versions of the data without making copies of the data, you can time travel between versions and roll back to the version of the data before the error was introduced.
Use this section to learn about lakeFS. For a production-suitable deployment, see the docs.
Ensure you have Docker installed on your computer.
Run the following command:
docker run --pull always --name lakefs -p 8000:8000 treeverse/lakefs run --local-settings
Open http://127.0.0.1:8000/ in your web browser to set up an initial admin user. You will use this user to log in and send API requests.
You can try lakeFS:
Once lakeFS is installed, you are ready to create your first repository!
Stay up to date and get lakeFS support via:
Author: Treeverse
Source Code: https://github.com/treeverse/lakeFS
License: Apache-2.0 license
1675716540
Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python.
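To ground this, here is a minimal Spark/Scala sketch of writing and reading a Delta table; the table path and sample data are made up for illustration, and the Delta Lake package is assumed to be on the classpath.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DeltaSketch")
  // Enable Delta Lake's SQL extension and catalog, as described in the Delta docs
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

import spark.implicits._

// Write a small DataFrame as a Delta table (the path is hypothetical)
val people = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
people.write.format("delta").mode("overwrite").save("/tmp/delta/people")

// Read the current version, and time travel back to version 0
val current = spark.read.format("delta").load("/tmp/delta/people")
val v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/people")
current.show()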
The following are some of the more popular Delta Lake integrations, refer to delta.io/integrations for the complete list:
See the online documentation for the latest release.
Delta Standalone library is a single-node Java library that can be used to read from and write to Delta tables. Specifically, this library provides APIs to interact with a table’s metadata in the transaction log, implementing the Delta Transaction Log Protocol to achieve the transactional guarantees of the Delta Lake format.
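A rough sketch of that usage follows; the method names are based on the Delta Standalone API docs as I remember them and may differ between versions, and the table path is hypothetical.

import io.delta.standalone.DeltaLog
import org.apache.hadoop.conf.Configuration

// Open the transaction log of an existing Delta table (path is hypothetical)
val log = DeltaLog.forTable(new Configuration(), "/data/delta/events")

// Inspect the latest snapshot: its version and the data files it is made of
val snapshot = log.snapshot()
println(s"Table version: ${snapshot.getVersion}")
snapshot.getAllFiles.forEach(f => println(f.getPath))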
There are two types of APIs provided by the Delta Lake project.
Spark-based APIs - DataFrameReader/Writer (i.e. spark.read, df.write, spark.readStream and df.writeStream). Options to these APIs will remain stable within a major release of Delta Lake (e.g., 1.x.x).
Delta Lake guarantees backward compatibility for all Delta Lake tables (i.e., newer versions of Delta Lake will always be able to read tables written by older versions of Delta Lake). However, we reserve the right to break forward compatibility as new features are introduced to the transaction protocol (i.e., an older version of Delta Lake may not be able to read a table produced by a newer version).
Breaking changes in the protocol are indicated by incrementing the minimum reader/writer version in the Protocol action.
Delta Transaction Log Protocol document provides a specification of the transaction protocol.
Delta Lake ACID guarantees are predicated on the atomicity and durability guarantees of the storage system. Specifically, we require the storage system to provide the following.
See the online documentation on Storage Configuration for details.
Delta Lake ensures serializability for concurrent reads and writes. Please see Delta Lake Concurrency Control for more details.
We use GitHub Issues to track community reported issues. You can also contact the community for getting answers.
We welcome contributions to Delta Lake. See our CONTRIBUTING.md for more details.
We also adhere to the Delta Lake Code of Conduct.
Delta Lake is compiled using SBT.
To compile, run
build/sbt compile
To generate artifacts, run
build/sbt package
To execute tests, run
build/sbt test
To execute a single test suite, run
build/sbt 'testOnly org.apache.spark.sql.delta.optimize.OptimizeCompactionSuite'
To execute a single test within and a single test suite, run
build/sbt 'testOnly *.OptimizeCompactionSuite -- -z "optimize command: on partitioned table - all partitions"'
Refer to SBT docs for more commands.
IntelliJ is the recommended IDE to use when developing Delta Lake. To import Delta Lake as a new project:
1. Clone Delta Lake into, for example, ~/delta.
2. In IntelliJ, select File > New Project > Project from Existing Sources... and select ~/delta.
3. Under Import project from external model select sbt. Click Next.
4. Under Project JDK specify a valid Java 1.8 JDK and opt to use SBT shell for project reload and builds.
5. Click Finish.
After waiting for IntelliJ to index, verify your setup by running a test suite in IntelliJ: open DeltaLogSuite and select Run 'DeltaLogSuite'.
If you see errors of the form
Error:(46, 28) object DeltaSqlBaseParser is not a member of package io.delta.sql.parser
import io.delta.sql.parser.DeltaSqlBaseParser._
...
Error:(91, 22) not found: type DeltaSqlBaseParser
val parser = new DeltaSqlBaseParser(tokenStream)
then follow these steps:
1. Compile using build/sbt compile.
2. Go to File > Project Structure... > Modules > delta-core.
3. Under Source Folders remove any target folders, e.g. target/scala-2.12/src_managed/main [generated].
4. Click Apply and then re-run your test.
and then re-run your test.There are two mediums of communication within the Delta Lake community.
Author: Delta-io
Source Code: https://github.com/delta-io/delta
License: Apache-2.0 license
1675669635
Explore Spark in depth and get a strong foundation in Spark. You'll learn: Why do we need Spark when we have Hadoop? What is the need for RDD? How is Spark faster than Hadoop? How does Spark achieve the speed and efficiency it claims? How does memory get managed in Spark? How does fault tolerance work in Spark? And more.
Most courses and other online help, including Spark's documentation, are not good at helping students understand the foundational concepts. They explain what Spark is, what RDD is, what "this" is and what "that" is, but students are most interested in understanding the core fundamentals and, more importantly, in answering questions like:
and that is exactly what you will learn in this Spark Starter Kit course. The aim of this course is to give you a strong foundation in Spark.
#spark #hadoop #bigdata
1675330080
TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library written in Scala that runs on top of Apache Spark. It was developed with a focus on accelerating machine learning developer productivity through machine learning automation, and an API that enforces compile-time type-safety, modularity, and reuse. Through automation, it achieves accuracies close to hand-tuned models with almost 100x reduction in time.
Use TransmogrifAI if you need a machine learning library to:
To understand the motivation behind TransmogrifAI check out these:
Skip to Quick Start and Documentation.
The Titanic dataset is an often-cited dataset in the machine learning community. The goal is to build a machine learnt model that will predict survivors from the Titanic passenger manifest. Here is how you would build the model using TransmogrifAI:
import com.salesforce.op._
import com.salesforce.op.readers._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
import spark.implicits._
// Read Titanic data as a DataFrame
val passengersData = DataReaders.Simple.csvCase[Passenger](path = pathToData).readDataset().toDF()
// Extract response and predictor Features
val (survived, predictors) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")
// Automated feature engineering
val featureVector = predictors.transmogrify()
// Automated feature validation and selection
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)
// Automated model selection
val pred = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()
// Setting up a TransmogrifAI workflow and training the model
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(pred).train()
println("Model summary:\n" + model.summaryPretty())
Model summary:
Evaluated Logistic Regression, Random Forest models with 3 folds and AuPR metric.
Evaluated 3 Logistic Regression models with AuPR between [0.6751930383321765, 0.7768725281794376]
Evaluated 16 Random Forest models with AuPR between [0.7781671467343991, 0.8104798040316159]
Selected model Random Forest classifier with parameters:
|-----------------------|--------------|
| Model Param | Value |
|-----------------------|--------------|
| modelType | RandomForest |
| featureSubsetStrategy | auto |
| impurity | gini |
| maxBins | 32 |
| maxDepth | 12 |
| minInfoGain | 0.001 |
| minInstancesPerNode | 10 |
| numTrees | 50 |
| subsamplingRate | 1.0 |
|-----------------------|--------------|
Model evaluation metrics:
|-------------|--------------------|---------------------|
| Metric Name | Hold Out Set Value | Training Set Value |
|-------------|--------------------|---------------------|
| Precision | 0.85 | 0.773851590106007 |
| Recall | 0.6538461538461539 | 0.6930379746835443 |
| F1 | 0.7391304347826088 | 0.7312186978297163 |
| AuROC | 0.8821603927986905 | 0.8766642291593114 |
| AuPR | 0.8225075757571668 | 0.850331080886535 |
| Error | 0.1643835616438356 | 0.19682151589242053 |
| TP | 17.0 | 219.0 |
| TN | 44.0 | 438.0 |
| FP | 3.0 | 64.0 |
| FN | 9.0 | 97.0 |
|-------------|--------------------|---------------------|
Top model insights computed using correlation:
|-----------------------|----------------------|
| Top Positive Insights | Correlation |
|-----------------------|----------------------|
| sex = "female" | 0.5177801026737666 |
| cabin = "OTHER" | 0.3331391338844782 |
| pClass = 1 | 0.3059642953159715 |
|-----------------------|----------------------|
| Top Negative Insights | Correlation |
|-----------------------|----------------------|
| sex = "male" | -0.5100301587292186 |
| pClass = 3 | -0.5075774968534326 |
| cabin = null | -0.31463114463832633 |
|-----------------------|----------------------|
Top model insights computed using CramersV:
|-----------------------|----------------------|
| Top Insights | CramersV |
|-----------------------|----------------------|
| sex | 0.525557139885501 |
| embarked | 0.31582347194683386 |
| age | 0.21582347194683386 |
|-----------------------|----------------------|
While this may seem a bit too magical, for those who want more control, TransmogrifAI also provides the flexibility to completely specify all the features being extracted and all the algorithms being applied in your ML pipeline. Visit our docs site for full documentation, getting started, examples, faq and other information.
You can simply add TransmogrifAI as a regular dependency to an existing project. Start by picking TransmogrifAI version to match your project dependencies from the version matrix below (if not sure - take the stable version):
TransmogrifAI Version | Spark Version | Scala Version | Java Version |
---|---|---|---|
0.7.1 (unreleased, master), 0.7.0 (stable) | 2.4 | 2.11 | 1.8 |
0.6.1, 0.6.0, 0.5.3, 0.5.2, 0.5.1, 0.5.0 | 2.3 | 2.11 | 1.8 |
0.4.0, 0.3.4 | 2.2 | 2.11 | 1.8 |
For Gradle, in build.gradle add:
repositories {
jcenter()
mavenCentral()
}
dependencies {
// TransmogrifAI core dependency
compile 'com.salesforce.transmogrifai:transmogrifai-core_2.11:0.7.0'
// TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
// compile 'com.salesforce.transmogrifai:transmogrifai-models_2.11:0.7.0'
}
For SBT, in build.sbt add:
scalaVersion := "2.11.12"
resolvers += Resolver.jcenterRepo
// TransmogrifAI core dependency
libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.7.0"
// TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
// libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-models" % "0.7.0"
Then import TransmogrifAI into your code:
// TransmogrifAI functionality: feature types, feature builders, feature dsl, readers, aggregators etc.
import com.salesforce.op._
import com.salesforce.op.aggregators._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.readers._
// Spark enrichments (optional)
import com.salesforce.op.utils.spark.RichDataset._
import com.salesforce.op.utils.spark.RichRDD._
import com.salesforce.op.utils.spark.RichRow._
import com.salesforce.op.utils.spark.RichMetadata._
import com.salesforce.op.utils.spark.RichStructType._
Visit our docs site for full documentation, getting started, examples, faq and other information.
See scaladoc for the programming API.
Author: Salesforce
Source Code: https://github.com/salesforce/TransmogrifAI
License: BSD-3-Clause license
1672825278
In this blog, we will be talking about Spark RDD, Dataframe, Datasets, and how we can transform RDD into Dataframes and Datasets.
An RDD is an immutable, distributed collection of elements of your data, partitioned across the nodes of your cluster, that can be operated on in parallel with a low-level API that offers transformations and actions.
RDDs are so integral to the function of Spark that the entire Spark API can be considered to be a collection of operations to create, transform, and export RDDs. Every algorithm implemented in Spark is effectively a series of transformative operations performed upon data represented as an RDD.
A DataFrame is a Dataset that is organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API users need to use Dataset<Row> to represent a DataFrame.
A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have support for the Dataset API, but due to Python's dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally: row.columnName).
Prerequisites: In order to work with RDD we need to create a SparkContext object
import org.apache.spark.{SparkConf, SparkContext}

val conf: SparkConf =
new SparkConf()
.setMaster("local[*]")
.setAppName("AppName")
.set("spark.driver.host", "localhost")
val sc: SparkContext = new SparkContext(conf)
There are 2 common ways to build the RDD:
* Pass your existing collection to SparkContext.parallelize method (you will do it mostly for tests or POC)
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val rdd = sc.parallelize(data)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize
at <console>:26
* Read from external sources
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
Things get interesting when you want to convert your Spark RDD to a DataFrame. It might not be obvious why you would want to switch to Spark DataFrame or Dataset. You will write less code, the code itself will be more expressive, and there are a lot of out-of-the-box optimizations available for DataFrames and Datasets.
DataFrame has two main advantages over RDD:
Prerequisites: To work with DataFrames we will need SparkSession
val spark: SparkSession =
SparkSession
.builder()
.appName("AppName")
.config("spark.master", "local")
.getOrCreate()
First, let’s sum up the main ways of creating the DataFrame:
In case you have structured or semi-structured data with simple unambiguous data types, you can infer a schema using a reflection.
import spark.implicits._
// for implicit conversions from Spark RDD to Dataframe
val dataFrame = rdd.toDF()
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

def dfSchema(columnNames: List[String]): StructType =
StructType(
Seq(
StructField(name = "name", dataType = StringType, nullable = false),
StructField(name = "age", dataType = IntegerType, nullable = false)
)
)
def row(line: List[String]): Row = Row(line(0), line(1).toInt)
val rdd: RDD[String] = ...
val schema = dfSchema(Seq("name", "age"))
val data = rdd.map(_.split(",").to[List]).map(row)
val dataFrame = spark.createDataFrame(data, schema)
val dataFrame = spark.read.json("example.json")
val dataFrame = spark.read.csv("example.csv")
val dataFrame = spark.read.parquet("example.parquet")
val dataFrame = spark.read.jdbc(url,"person",prop)
The DataFrame API is radically different from the RDD API because it is an API for building a relational query plan that Spark’s Catalyst optimizer can then execute.
The Dataset API aims to provide the best of both worlds: the familiar object-oriented programming style and compile-time type-safety of the RDD API but with the performance benefits of the Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the DataFrame API.
The idea behind Dataset “is to provide an API that allows users to easily perform transformations on domain objects, while also providing the performance and robustness advantages of the Spark SQL execution engine”. It represents competition to RDDs as they have overlapping functions.
Let's say we have a case class. You can create a Dataset by implicit conversion or by hand.
case class FeedbackRow(manager_name: String, response_time: Double,
satisfaction_level: Double)
// create Dataset via implicit conversions
val ds: Dataset[FeedbackRow] = dataFrame.as[FeedbackRow]
val theSameDS = spark.read.parquet("example.parquet").as[FeedbackRow]
// create Dataset by hand
val ds1: Dataset[FeedbackRow] = dataFrame.map {
row => FeedbackRow(row.getAs[String](0), row.getAs[Double](4),
row.getAs[Double](5))
}
import spark.implicits._
case class Person(name: String, age: Long)
val data = Seq(Person("Bob", 21), Person("Mandy", 22), Person("Julia", 19))
val ds = spark.createDataset(data)
val rdd = sc.textFile("data.txt")
val ds = spark.createDataset(rdd)
Original article source at: https://blog.knoldus.com/
1671674702
Building and training a Machine Learning (#ML) model with Spark is not hard. In this tutorial we will build a simple binary classification ML model with Spark. We will use Spark's built-in Logistic Regression algorithm and then evaluate it by getting performance metrics from the model.
There are some differences from how we do it in Scikit-Learn. Spark provides a built-in SparkML engine with a rich #SparkML API which you can leverage to build your own Machine Learning model.
In this tutorial we are using Spark v3.2.1 with the pyspark shell.
The critical points you should pay attention to are:
- Datatypes (DTypes)
- String Indexer and One-Hot-Encoding for categorical features.
- Vector Assembler.
All these parts are explained and demonstrated in detail in this tutorial. You will also learn what SparkContext and SparkSession are (and the differences between them). You will then be able to check the data schema and handle data types in a Spark DataFrame, and select features within your data. As required for ML modelling, you will also learn how to split your data into train and test sets.
Here you also learn how to set up ML stages with Spark and build a custom ML Pipeline for your Machine Learning model.
At the end, you will learn how to get model performance metrics, such as Precision, Recall, or ROC curve values.
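The tutorial itself uses PySpark, but the stages map one-to-one onto Spark's Scala API. Below is a minimal sketch under assumed column names ("gender", "age", "label") and an assumed input file, showing how StringIndexer, OneHotEncoder, VectorAssembler and LogisticRegression are chained into a Pipeline:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BinaryClassificationSketch").getOrCreate()

// Load the data and split it into train and test sets (file and columns are assumed)
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42)

// Index the categorical string column, then one-hot encode the index
val indexer = new StringIndexer().setInputCol("gender").setOutputCol("genderIdx")
val encoder = new OneHotEncoder().setInputCols(Array("genderIdx")).setOutputCols(Array("genderVec"))

// Assemble the encoded and numeric columns into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("genderVec", "age"))
  .setOutputCol("features")

val lr = new LogisticRegression().setFeaturesCol("features").setLabelCol("label")

// Chain all stages into a Pipeline, train on the train set, score the test set
val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
val model = pipeline.fit(train)
val predictions = model.transform(test)
predictions.select("label", "prediction", "probability").show(5)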
The tutorial is prepared with Jupyter Notebook, using Python programming language, so all the steps are executed with #pyspark .
The content of the video:
0:00 - Intro
0:32 - Start of Hands-on with Jupyter Notebook
0:46 - 1. Import main dependencies for Spark and Python
1:14 - Theory: Spark Session vs. Spark Context
3:10 - 1. Continuing importing dependencies
3:28 - 2. Load External CSV data to Spark (as Spark DataFrame)
5:40 - 3. Train and Test splits
6:39 - 4. Check Data Types
8:27 - 5. One-Hot-Encoding with Spark
10:07 - Theory: StringIndexer and One-Hot-Encoder
11:01 - 5. Continuing with StringIndexer hands-on
12:19 - 6. Vector Assembling
12:55 - Theory: Vector Assembling in Spark
13:53 - 6. Continuing with Vector Assembling
15:24 - 7. Make Spark ML Pipeline
18:31 - 8. Train ML Model with Spark
20:07 - 9. Get Model Performance Metrics
Spark API and SparkML API method used in the tutorial (incl. documentation):
- Spark Datatypes (https://spark.apache.org/docs/latest/sql-ref-datatypes.html)
- PySpark SQL DataFrame Random Split (https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.randomSplit.html)
- StringIndexer (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StringIndexer.html)
- OneHotEncoder (https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.feature.OneHotEncoder.html)
- VectorAssembler (https://spark.apache.org/docs/latest/ml-features#vectorassembler)
- Spark DataFrame aggregation (https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.aggregate.html)
- Count Distinct values from Spark DataFrame (https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.countDistinct.html)
- Group by to check feature distribution (https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.groupby.html)
- SparkML Pipelines (https://spark.apache.org/docs/latest/ml-pipeline.html)
- Logistic Regression in Spark (https://spark.apache.org/docs/1.6.1/ml-classification-regression.html#logistic-regression)
Link to the Github repo to hand-on everything on your side (data file is included there): https://github.com/vb100/spark_ml_train_model
Subscribe: https://www.youtube.com/@DataScienceGarage/featured
1669925700
Transformation is one of the RDD operations in Spark. Before diving into it, let's first discuss what Spark and RDD actually are.
Apache Spark is an open-source cluster computing framework. Its main objective is to manage the data created in real time.
Spark was developed on the foundation of Hadoop MapReduce. Unlike competing approaches such as Hadoop's MapReduce, which writes and reads data to and from computer hard drives, Spark was optimized to run in memory. As a result, Spark processes data far more quickly than the alternatives.
The fundamental abstraction of Spark is the RDD (Resilient Distributed Dataset). It is a group of components that have been divided up across the cluster nodes so that we can process different parallel operations on it.
RDDs can be produced in one of two ways:
The RDD provides two types of operations:
A Transformation is a function that generates new RDDs from existing RDDs, whereas when we want to work with the actual dataset, we perform an Action. When an action is triggered, a result is returned instead of a new RDD being formed, unlike with a transformation.
The role of transformation in Spark is to create a new dataset from an existing one. Lazy transformations are those that are computed only when an action requires a result to be returned to the driver programme.
When we call an action, transformations are executed, since they are inherently lazy; they are not carried out right away. Two primary transformations are map() and filter().
The outcome RDD is always distinct from the parent RDD after the transformation. It could be smaller (filter, count, distinct, sample, for example), bigger (flatMap(), union(), Cartesian()), or the same size (e.g. map).
In this section, I will explain a few RDD transformations with a word count example in Scala. Before we start, let's first create an RDD by reading a text file. The text file used here is a dummy dataset; you can use any dataset.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark:SparkSession = SparkSession.builder()
.master("local[3]")
.appName("SparkByExamples.com")
.getOrCreate()
val sc = spark.sparkContext
val rdd:RDD[String] = sc.textFile("src/main/scala/test.txt")
The flatMap() transformation applies a function to each record and flattens the result into a new RDD. The example below first divides each record in the RDD by space and then flattens it. Each entry in the resulting RDD contains only one word.
val rdd2 = rdd.flatMap(f=>f.split(" "))
Any complex operations, such as adding a column or updating a column, are applied using the map() transformation, and the output of this transformation always has the same number of records as the input.
In our word count example, we are creating a new column and assigning a value of 1 to each word. The RDD produces a PairRDDFunction that has key-value pairs with the keys being words of type String and the values being 1 of type Int. I’ve defined the type of the rdd3 variable for your understanding.
val rdd3:RDD[(String,Int)]= rdd2.map(m=>(m,1))
The records in an RDD can be filtered with the filter() transformation. In our illustration, we keep only the words that begin with "a".
val rdd4 = rdd3.filter(a=> a._1.startsWith("a"))
reduceByKey() merges the values for each key using the supplied function. In our example, we sum the values for each word, so the output RDD contains each unique word together with its count.
val rdd5 = rdd3.reduceByKey(_ + _)
We can obtain the elements from both RDDs in the new RDD using the union() function. The two RDDs must be of the same type in order for this function to work.
For instance, if RDD1’s elements are Spark, Spark, Hadoop, and Flink, and RDD2’s elements are Big data, Spark, and Flink, the resulting rdd1.union(rdd2) will have the following elements: Spark, Spark, Spark, Hadoop, Flink, and Flink, Big data.
val rdd6 = rdd5.union(rdd3)
With the intersection() function, the new RDD contains only the elements common to both RDDs. The key rule of this function is that the two RDDs must be of the same type.
val rdd7 = rdd1.intersection(rdd2)
In this Spark RDD Transformations blog, you have learned different transformation functions and their usage with scala examples. In the next blog, we will learn about actions.
Happy Learning !!
Original article source at: https://blog.knoldus.com/
1669446600
Apache Spark is one of the most widely used frameworks when it comes to handling and working with Big Data, and Python is one of the most widely used programming languages for Data Analysis, Machine Learning and much more. So, why not use them together? This is where Spark with Python, also known as PySpark, comes into the picture.
With an average salary of $110,000 per year for an Apache Spark Developer, there's no doubt that Spark is used in the industry a lot. Because of its rich library set, Python is used by the majority of Data Scientists and Analytics experts today. Integrating Python with Spark was a major gift to the community. Spark was developed in the Scala language, which is very similar to Java. It compiles the program code into bytecode for the JVM for Spark big data processing. To support Spark with Python, the Apache Spark community released PySpark. Ever since, PySpark Certification has been known to be one of the most sought-after skills throughout the industry because of the wide range of benefits that came from combining the best of both these worlds. In this Spark with Python blog, I'll discuss the following topics.
Apache Spark is an open-source cluster-computing framework for real-time processing developed by the Apache Software Foundation. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
Below are some of the features of Apache Spark which gives it an edge over other frameworks:
Although Spark was designed in Scala, which makes it almost 10 times faster than Python, Scala is faster only when the number of cores being used is small. As most analysis and processing nowadays require a large number of cores, the performance advantage of Scala is not that significant.
For programmers Python is comparatively easier to learn because of its syntax and standard libraries. Moreover, it’s a dynamically typed language, which means RDDs can hold objects of multiple types.
Although Scala has SparkMLlib, it doesn't have enough libraries and tools for Machine Learning and NLP purposes. Moreover, Scala lacks data visualization libraries.
I hope you guys know how to download Spark and install it. So, once you've unzipped the Spark file, installed it and added its path to the .bashrc file, you need to run source .bashrc
export SPARK_HOME = /usr/lib/hadoop/spark-2.1.0-bin-hadoop2.7
export PATH = $PATH:/usr/lib/hadoop/spark-2.1.0-bin-hadoop2.7/bin
To open pyspark shell you need to type in the command ./bin/pyspark
Because of its amazing features like in-memory processing, polyglot support and fast processing, Apache Spark is being used by many companies all around the globe for various purposes in various industries:
Yahoo uses Apache Spark for its Machine Learning capabilities to personalize its news and web pages and also for targeted advertising. They use Spark with Python to find out what kind of news users are interested in reading and to categorize the news stories to find out what kind of users would be interested in reading each category of news.
TripAdvisor uses Apache Spark to provide advice to millions of travelers by comparing hundreds of websites to find the best hotel prices for its customers. Reading and processing the reviews of the hotels into a readable format is done with the help of Apache Spark.
One of the world’s largest e-commerce platform Alibaba runs some of the largest Apache Spark jobs in the world in order to analyze hundreds of petabytes of data on its e-commerce platform.
Talking about Spark with Python, working with RDDs is made possible by the library Py4j. PySpark Shell links the Python API to spark core and initializes the Spark Context. Spark Context is the heart of any spark application.
Now let's have a look at a use case from the KDD'99 Cup (International Knowledge Discovery and Data Mining Tools Competition). Here we will take a fraction of the dataset because the original dataset is too big.
import urllib.request
f = urllib.request.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)
Suppose we want to count how many 'normal.' interactions we have in our dataset. We can filter our raw_data RDD as follows.
normal_raw_data = raw_data.filter(lambda x: 'normal.' in x)
Now we can count how many elements we have in the new RDD.
from time import time
t0 = time()
normal_count = normal_raw_data.count()
tt = time() - t0
print "There are {} 'normal' interactions".format(normal_count)
print "Count completed in {} seconds".format(round(tt,3))
Output:
There are 97278 'normal' interactions
Count completed in 5.951 seconds
In this case we want to read our data file as a CSV formatted one. We can do this by applying a lambda function to each element in the RDD as follows. Here we will use the map() transformation and the take() action.
from pprint import pprint
csv_data = raw_data.map(lambda x: x.split(","))
t0 = time()
head_rows = csv_data.take(5)
tt = time() - t0
print "Parse completed in {} seconds".format(round(tt,3))
pprint(head_rows[0])
Output:
Parse completed in 1.715 seconds
[u'0',
u'tcp',
u'http',
u'SF',
u'181',
u'5450',
u'0',
u'0',
.
.
u'normal.']
Now we want to have each element in the RDD as a key-value pair where the key is the tag (e.g. normal) and the value is the whole list of elements that represents the row in the CSV formatted file. We could proceed as follows. Here we use the line.split() and map().
def parse_interaction(line):
elems = line.split(",")
tag = elems[41]
return (tag, elems)
key_csv_data = raw_data.map(parse_interaction)
head_rows = key_csv_data.take(5)
pprint(head_rows[0])
Output:
(u'normal.',
[u'0',
u'tcp',
u'http',
u'SF',
u'181',
u'5450',
u'0',
u'0',
u'0.00',
u'1.00',
.
.
.
.
u'normal.'])
Here we are going to use the collect() action. It will get all the elements of RDD into memory. For this reason, it has to be used with care when working with large RDDs.
t0 = time()
all_raw_data = raw_data.collect()
tt = time() - t0
print "Data collected in {} seconds".format(round(tt,3))
Output:
Data collected in 17.927 seconds
That took longer than any other action we used before, of course. Every Spark worker node that has a fragment of the RDD has to be coordinated in order to retrieve its part and then reduce everything together.
As a last example combining all the previous, we want to collect all the normal interactions as key-value pairs.
# get data from file
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)
# parse into key-value pairs
key_csv_data = raw_data.map(parse_interaction)
# filter normal key interactions
normal_key_interactions = key_csv_data.filter(lambda x: x[0] == "normal.")
# collect all
t0 = time()
all_normal = normal_key_interactions.collect()
tt = time() - t0
normal_count = len(all_normal)
print "Data collected in {} seconds".format(round(tt,3))
print "There are {} 'normal' interactions".format(normal_count)
Output:
Data collected in 12.485 seconds
There are 97278 normal interactions
So this is it, guys!
I hope you enjoyed this Spark with Python blog. If you are reading this, Congratulations! You are no longer a newbie to PySpark. Try out this simple example on your systems now.
Now that you have understood basics of PySpark, check out the Python Spark Certification Training using PySpark by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. Edureka’s Python Spark Certification Training using PySpark is designed to provide you the knowledge and skills that are required to become a successful Spark Developer using Python and prepare you for the Cloudera Hadoop and Spark Developer Certification Exam (CCA175).
Got a question for us? Please mention it in the comments section and we will get back to you.
Original article source at: https://www.edureka.co/