Hertha Walsh

Demystifying Apache Arrow

In my work as a data scientist, I’ve come across Apache Arrow in a range of seemingly unrelated circumstances. However, I’ve always struggled to describe exactly what it is and what it does.

The official description of Arrow is:

a cross-language development platform for in-memory analytics

which is quite abstract, and for good reason. The project is extremely ambitious and aims to provide the backbone for a wide range of data processing tasks. This means it sits at a low level, providing building blocks for higher-level, user-facing analytics tools like pandas or dplyr.

As a result, the importance of the project can be hard to grasp for users who only run into it occasionally in their day-to-day work, because much of what it does happens behind the scenes.

In this post I describe some of the user-facing features of Apache Arrow which I have run into in my work, and explain why they are all facets of more fundamental problems which Apache Arrow aims to solve.

By connecting these dots it becomes clear why Arrow is not just a useful tool to solve some practical problems today, but one of the most exciting emerging tools, with the potential to be the engine behind large parts of future data science workflows.

Faster CSV reading

A striking feature of Arrow is that it can read CSV files into pandas more than 10x faster than pandas.read_csv.

This is actually a two-step process: Arrow reads the data into memory as an Arrow table, which is really just a collection of record batches, and then converts the Arrow table into a pandas dataframe.

The speedup is thus a consequence of the underlying design of Arrow (a short pyarrow sketch of both steps follows the list below):

  • Arrow has its own in-memory storage format. When we use Arrow to load data into pandas, we are really loading data into the Arrow format (an in-memory format for data frames), and then translating this into the pandas in-memory format. Part of the speedup in reading CSVs therefore comes from the careful design of the Arrow columnar format itself.
  • Data in Arrow is stored in memory in record batches, a 2D data structure containing contiguous columns of data of equal length. A ‘table’ can be created from these batches without additional memory copying, because tables can have ‘chunked’ columns (i.e. sections of data, each representing a contiguous chunk of memory). This design means that a CSV can be read in parallel, rather than with the single-threaded approach of pandas.
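
To make these two steps concrete, here is a minimal pyarrow sketch (the file name data.csv and the tiny example batch are invented for illustration): it reads a CSV into an Arrow table, shows that the table’s columns are chunked, and then hands the result to pandas.

import pyarrow as pa
import pyarrow.csv as pv

# Step 1: read the CSV into an Arrow table (multi-threaded by default).
table = pv.read_csv("data.csv")

# Each column is a ChunkedArray: a sequence of contiguous pieces
# produced while reading the file in parallel.
print(table.column(0).num_chunks)

# A table can also be assembled from record batches without copying data.
batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])
table_from_batches = pa.Table.from_batches([batch, batch])

# Step 2: translate the Arrow table into a pandas dataframe.
df = table.to_pandas()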

#data-science #data-engineering #data

Apache Arrow and Distributed Compute with Kubernetes

“Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and inter-process communication. Languages currently supported include C, C++, Java, JavaScript, Python, and Ruby,” as quoted by the official website.

This project is a move to standardize the in-memory data representation used between libraries, systems, languages, and frameworks.
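
As a small illustration of the zero-copy streaming messaging mentioned in the quote above, here is a minimal pyarrow sketch (the two-column batch is made up for the example): it writes a record batch to a buffer in the Arrow IPC stream format and reads it back, which is the same mechanism another process, or another language’s Arrow implementation, could use to consume the data without re-serializing it.

import pyarrow as pa

# An example record batch; the columns and values are made up.
batch = pa.record_batch(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["id", "label"],
)

# Write the batch to an in-memory buffer using the Arrow IPC stream format.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

# Any Arrow implementation can read this buffer back as record batches.
reader = pa.ipc.open_stream(buf)
for received in reader:
    print(received.num_rows)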

#insights #apache #apache arrow

Arjun Goodwin

Reading Avro files using Apache Flink

In this blog, we will see how to read Avro files using Flink.

Before reading the files, let’s get an overview of Flink.

There are two types of processing: batch and real-time.

  • Batch processing: processing based on the data collected over time.
  • Real-time processing: processing based on immediate data for an instant result.

Real-time processing is in high demand, and Apache Flink is one of the leading tools for it.

Some of Flink’s features include:

  • High processing speed
  • Support for Scala and Java
  • Low latency
  • Fault tolerance
  • Scalability

Let’s get started.

Step 1:

Add the required dependencies in build.sbt:

name := "flink-demo"

version := "0.1"

scalaVersion := "2.12.8"

libraryDependencies ++= Seq(

"org.apache.flink" %% "flink-scala" % "1.10.0",

"org.apache.flink" % "flink-avro" % "1.10.0",

"org.apache.flink" %% "flink-streaming-scala" % "1.10.0"

)

Step 2:

The next step is to create a pointer to the environment in which this program runs. In Spark, the equivalent is the Spark context.

import org.apache.flink.streaming.api.scala._
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

Step 3:

Setting parallelism of x here will cause all operators (such as join, map, reduce) to run with x parallel instances.

I am using 1 as it is a demo application.

env.setParallelism(1)

#apache flink #flink #scala #streaming ##apache-flink ##avro files #apache #avro

Myrl Prosacco

Using Apache Flink for Kinesis to Kafka Connect

In this blog, we are going to use Kinesis as the source and Kafka as the sink.

Let’s get started.

Step 1:

Apache Flink provides Kinesis and Kafka connector dependencies. Let’s add them to our build.sbt:

name := "flink-demo"

version := "0.1"

scalaVersion := "2.12.8"

libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-scala" % "1.10.0",
  "org.apache.flink" %% "flink-connector-kinesis" % "1.10.0",
  "org.apache.flink" %% "flink-connector-kafka" % "1.10.0",
  "org.apache.flink" %% "flink-streaming-scala" % "1.10.0"
)

Step 2:

The next step is to create a pointer to the environment on which this program runs.

val env = StreamExecutionEnvironment.getExecutionEnvironment

Step 3:

Setting parallelism of x here will cause all operators (such as join, map, reduce) to run with x parallel instances.

I am using 1 as it is a demo application.

env.setParallelism(1)

Step 4:

Disabling AWS CBOR encoding, as we are testing locally:

System.setProperty("com.amazonaws.sdk.disableCbor", "true")
System.setProperty("org.apache.flink.kinesis.shaded.com.amazonaws.sdk.disableCbor", "true")

Step 5:

Defining Kinesis consumer properties.

  • Region
  • Stream position – TRIM_HORIZON, to read all the records available in the stream
  • AWS keys
  • Endpoint – do not worry about it; it is set to http://localhost:4568, as we will test Kinesis using Localstack


#apache flink #flink #scala ##apache-flink ##kinesis #apache #flink streaming #kafka #scala

Gilberto Block

Running Streaming ETL Pipelines with Apache Flink on Zeppelin Notebooks

A step-by-step tutorial for running Streaming ETL with Flink on Zeppelin. Let’s dive deeper into the Flink interpreter in Zeppelin Notebooks.

Apache Zeppelin 0.9 comes with a redesigned interpreter for Apache Flink that allows developers and data engineers to use Flink directly on Zeppelin notebooks for interactive data analysis. Over the next paragraphs, we describe why Streaming ETL is a great fit for stream processing frameworks like Apache Flink and we dive deeper into the Flink interpreter in Zeppelin Notebooks by showcasing a tutorial of how developers can run Streaming ETL data pipelines with Flink on Zeppelin.

Streaming ETL and Apache Flink

Extract-transform-load (ETL) is a common operation related to massaging and moving data between storage systems. ETL jobs have historically been triggered periodically, frequently copying data from transactional database systems to an analytical database or a data warehouse.

Streaming ETL pipelines serve a similar purpose to traditional ETL: they transform and enrich data and can move it from one storage system to another. However, streaming ETL pipelines differ from traditional ETL in that they operate continuously and are capable of both reading records from sources that continuously produce data and moving that data, with low latency, to its desired destination.

Streaming ETL is a common use case for Apache Flink because of its ability to address most common data transformation or enrichment tasks with Flink SQL (or the Table API) and its support for user-defined functions. Additionally, Flink provides a rich set of connectors to various storage systems such as Kafka, Kinesis, Elasticsearch, and JDBC database systems. It also features continuous sources for file systems that monitor directories, and sinks that write files in a time-bucketed fashion. Let us now describe how the Flink interpreter works in Zeppelin notebooks.

#open source #tutorial #apache flink #apache zeppelin #apache

Ursa Labs and Apache Arrow in 2019

For the last 3 years, the Apache Arrow project has been developing a language-independent open standard in-memory format for tabular data and a library ecosystem that builds on top of the format.

In this talk I will discuss the current status of the project and Ursa Labs, a new not-for-profit development group I founded to focus on Arrow development. I will give a preview of our Arrow development roadmap for 2019 and how it may impact the Python data science ecosystem.

Thanks for reading


Further reading about Apache

Apache Kafka Series - Learn Apache Kafka for Beginners v2

How To Install the Apache Web Server on CentOS 7

Apache Spark 2 with Scala - Hands On with Big Data!

Taming Big Data with Apache Spark and Python - Hands On!

#python #data-science #apache #apache-spark #big-data