Thomas Coper

An overview of Apache NiFi and toolkit cli deployments

Apache NiFi is a tool for creating data pipelines via a drag-and-drop interface, designed to automate the flow of data between systems. It fits into the ‘no code/low code’ category of tools, primarily aimed at companies that are less comfortable writing and managing code, or that want to avoid building a solution requiring a significant amount of engineering effort.

What can NiFi be used for?

NiFi is good at reliably transferring data between platforms, e.g. Kafka to Elasticsearch, and lets you perform some lightweight ETL along the way. It can enrich and prepare data, convert between data formats, modify named fields, and filter or route data to different destinations.
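To give a flavour of the deployment automation that the NiFi Toolkit CLI wraps, here is a minimal Python sketch that starts a process group through NiFi’s REST API; the unsecured instance on localhost:8080 and the use of the root group are assumptions for illustration, not a prescription from this article.

```python
import requests

# Assumed: an unsecured NiFi instance listening on localhost:8080.
NIFI_API = "http://localhost:8080/nifi-api"

# Look up the root process group of the flow.
root = requests.get(f"{NIFI_API}/flow/process-groups/root").json()
root_id = root["processGroupFlow"]["id"]

# Ask NiFi to schedule everything in that group.
requests.put(
    f"{NIFI_API}/flow/process-groups/{root_id}",
    json={"id": root_id, "state": "RUNNING"},
)
print(f"Requested RUNNING state for process group {root_id}")
```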

#deployment #dataflow #nifi #aws #apache


Arjun Goodwin

Setting Apache Nifi on Docker Containers

The system I work on is about to undergo a significant tech refresh. The goal is to remove the current hindrances that prevent us from ingesting more data. The remedy requires a product that receives inputs from external sources, processes them, and disseminates the outcomes to their destinations.

Apache Nifi implements the flow-based programming (FBP) paradigm; it is composed of black-box processes that exchange data across predefined connections (an excerpt from Wikipedia).

In short, Apache NiFi is a tool to process and distribute data. Its intuitive UI supports routing definitions, a variety of connectors (in/out), and many built-in processors. All these features combined make it a suitable candidate platform for our use case.

In view of our system’s future needs, we decided to evaluate Nifi thoroughly. The starting point is setting up an environment.

In this article, I’ll describe how to set up a Nifi environment using Docker images and run a simple predefined template; building a Nifi flow from scratch will be covered in another article. The three main parts of the article are:

  • Reviewing Apache Nifi concepts and building-blocks
  • Setting up Nifi and Nifi Registry (based on Docker images)
  • Loading a template and running it

Ready? Let’s start with the foundations.

Nifi Components and Concepts

Nifi is based on the following hierarchy:

  • Process Group: A collection of processors and their connections. A process group is the smallest unit that can be saved in version control (Nifi Registry). A process group can have input and output ports that allow Process Groups to be connected; with that, a data flow can be composed of more than one Process Group.
  • Processor: A processing unit whose input and output are (mostly) linked to other processors by connections. Each processor is a black box that executes a single operation; for example, processors can change the content or the attributes of a FlowFile (see below).
  • FlowFile: The logical unit of data, with two parts (content and attributes), that passes between Nifi processors. The FlowFile object is immutable, but its content and attributes can change during processing (see the scripting sketch after this list).
  • Connection: A queue that routes FlowFiles between processors. The routing logic is based on conditions related to the processor’s result; a connection is associated with one or more result types. A connection’s conditions are the relationships between processors, which can be static or dynamic. Static relationships are fixed (for example Success, Failure, Match, or Unmatch), while dynamic relationships are based on user-defined attributes of the FlowFile; the last section of this article exemplifies this feature with the RouteOnAttribute processor.
  • Port: The entry and exit points of a Process Group. Each Process Group can have one or more input or output ports, distinguished by their names.
  • Funnel: Combines the data from several connections into a single connection.
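To make FlowFile content and attributes more concrete, here is a minimal sketch of the kind of body an ExecuteScript processor (Jython engine) might run; the attribute name is made up for illustration, and session, REL_SUCCESS, and REL_FAILURE are the variables Nifi binds for scripted processors.

```python
# Sketch of an ExecuteScript (Jython) body. 'session' and 'REL_SUCCESS'
# are bindings NiFi provides to scripted processors.
flow_file = session.get()
if flow_file is not None:
    # The FlowFile reference is immutable; putAttribute returns an updated one.
    flow_file = session.putAttribute(flow_file, "processed.by", "execute-script-demo")
    session.transfer(flow_file, REL_SUCCESS)
```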

The Nifi flow below depicts these components:

(Figure: Process Group and Nifi elements)

After reviewing the Nifi data flow components, let’s see how to set up an environment.
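As a minimal sketch of that setup, the snippet below starts Nifi and Nifi Registry containers with the Python Docker SDK; the image tags, container names, and port mappings are assumptions for illustration (older Nifi images expose the HTTP UI on 8080 and the Registry on 18080), not necessarily the exact versions evaluated here.

```python
import docker

client = docker.from_env()

# Assumed image tags and ports; adjust to the versions you want to evaluate.
nifi = client.containers.run(
    "apache/nifi:1.11.4",
    name="nifi",
    ports={"8080/tcp": 8080},    # HTTP UI at http://localhost:8080/nifi
    detach=True,
)

registry = client.containers.run(
    "apache/nifi-registry:0.6.0",
    name="nifi-registry",
    ports={"18080/tcp": 18080},  # UI at http://localhost:18080/nifi-registry
    detach=True,
)

print(nifi.name, registry.name)
```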

#dataflow #software-development #apache-nifi #programming #docker #apache

Karlee Will

Smart Stocks With NiFi, Kafka, and Flink SQL

This article is a tutorial on using cloud-native applications for real-time analytics with continuous SQL on stock data.

I would like to track stocks from some companies frequently during the day, using Apache NiFi to read the REST API. After that, I have some streaming analytics to perform with Apache Flink SQL, and I also want permanent fast storage in Apache Kudu, queried with Apache Impala.

I will show you below how to build that as a cloud-native application in seconds.
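As a rough sketch of the first leg of this pipeline, polling a quote REST API and publishing to Kafka (which the article does inside NiFi rather than in code), the plain-Python equivalent could look like this; the endpoint, broker address, and topic name are placeholders.

```python
import json

import requests
from kafka import KafkaProducer  # kafka-python

# Placeholder endpoint and broker; substitute a real quote API and bootstrap servers.
QUOTE_URL = "https://example.com/api/quote?symbol=CLDR"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

quote = requests.get(QUOTE_URL).json()
producer.send("stocks", value=quote)  # assumed topic name
producer.flush()
```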

Source Code: https://github.com/tspannhw/SmartStocks

For a script that loads the schemas, tables, and alerts, see scripts/setup.sh:

Source Code: https://github.com/tspannhw/ApacheConAtHome2020

  • Kafka Topic
  • Kafka Schema
  • Kudu Table
  • Flink Prep
  • Flink SQL Client Run
  • Flink SQL Client Configuration

Once our automated admin has built our cloud environment and populated it with the goodness of our app, we can begin our continuous SQL. If you know your data, build a schema and share it with the registry.

One unique thing we added was a default value in our Avro schema, and we made the timestamp field a logicalType of timestamp-millis. This is helpful for timestamp-related queries in Flink SQL.
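For illustration only, such a field could be declared roughly as in the sketch below; the record and field names, and the use of fastavro for validation, are assumptions rather than the article’s actual schema.

```python
import fastavro

# Hypothetical Avro schema fragment: a timestamp-millis field with a default,
# so missing values do not break timestamp-related Flink SQL queries downstream.
stock_schema = {
    "type": "record",
    "name": "StockQuote",
    "fields": [
        {"name": "symbol", "type": "string"},
        {"name": "price", "type": "double"},
        {
            "name": "dt",
            "type": {"type": "long", "logicalType": "timestamp-millis"},
            "default": 0,
        },
    ],
}

parsed = fastavro.parse_schema(stock_schema)  # raises if the schema is invalid
```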

#sql #apache kafka #apache nifi #apache flink

Apache Spark Cluster on Docker

Apache Spark is arguably the most popular big data processing engine. With more than 25k stars on GitHub, the framework is an excellent starting point to learn parallel computing in distributed systems using Python, Scala and R.

To get started, you can run Apache Spark on your machine by using one of the many great Docker distributions available out there. Jupyter offers an excellent dockerized Apache Spark with a JupyterLab interface, but misses the framework’s distributed core by running it in a single container. Some GitHub projects offer a distributed cluster experience but lack the JupyterLab interface, undermining the usability provided by the IDE.

I believe a comprehensive environment to learn and practice Apache Spark code must keep its distributed nature while providing an awesome user experience.

This article is all about this belief.

In the next sections, I will show you how to build your own cluster. By the end, you will have a fully functional Apache Spark cluster built with Docker and shipped with a Spark master node, two Spark worker nodes and a JupyterLab interface. It will also include the Apache Spark Python API (PySpark) and a simulated Hadoop distributed file system (HDFS).

TL;DR

This article shows how to build an Apache Spark cluster in standalone mode using Docker as the infrastructure layer. It is shipped with the following:

  • Python 3.7 with PySpark 3.0.0 and Java 8;
  • Apache Spark 3.0.0 with one master and two worker nodes;
  • JupyterLab IDE 2.1.5;
  • Simulated HDFS 2.7.

To make the cluster, we need to create, build and compose the Docker images for JupyterLab and the Spark nodes. You can skip the tutorial by using the out-of-the-box distribution hosted on my GitHub.
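Before diving into the build, here is a minimal sketch of what connecting to such a cluster from the JupyterLab container could look like; the master URL spark://spark-master:7077 assumes a compose service named spark-master and is an illustration, not the distribution’s exact configuration.

```python
from pyspark.sql import SparkSession

# Assumed service name from the compose file; adjust to your setup.
spark = (
    SparkSession.builder
    .appName("cluster-smoke-test")
    .master("spark://spark-master:7077")
    .getOrCreate()
)

# A trivial distributed job to confirm the workers are reachable.
print(spark.range(1000).count())
spark.stop()
```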

Requirements

Docker 1.13.0+;

Docker Compose 3.0+.

Table of contents

  1. Cluster overview;
  2. Creating the images;
  3. Building the images;
  4. Composing the cluster;
  5. Creating a PySpark application.

#overviews #apache spark #docker #python #apache

Gilberto Block

Apache Spark on Dataproc vs. Google BigQuery

This post looks at research undertaken to provide interactive business intelligence reports and visualizations for thousands of end users. It aims to help architects and engineers who are moving to Google Cloud Platform select the best technology stack for their requirements and process large volumes of data in a cost-effective yet reliable manner.

Introduction

When it comes to Big Data infrastructure on Google Cloud Platform, the most popular choices data architects need to consider today are Google BigQuery, a serverless, highly scalable and cost-effective cloud data warehouse; Cloud Dataflow, based on Apache Beam; and Dataproc, a fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.

This variety also presents a challenge to architects and engineers moving to Google Cloud Platform: selecting the best technology stack for their requirements while processing large volumes of data in a cost-effective yet reliable manner.

In the following sections, we look at research we had undertaken to provide interactive business intelligence reports and visualizations for thousands of end users. Furthermore, as these users can concurrently generate a variety of such interactive reports, we need to design a system that can analyze billions of data points in real time.

Requirements

For technology evaluation purposes, we narrowed down to the following requirements:

  1. A raw dataset of 175 TB: the dataset is quite diverse, with scores of tables and columns consisting of metrics and dimensions derived from multiple sources.
  2. Catering to 30,000 unique users.
  3. Serving up to 60 concurrent queries for the platform users.

Given the size of the base dataset and the requirement for highly real-time querying, the problem statement calls for a solution in the Big Data domain.

Salient Features of Proposed Solution

The solution took into consideration the following three main characteristics of the desired system:

  1. Analyzing and classifying expected user queries and their frequency.
  2. Developing various pre-aggregations and projections to reduce data churn while serving various classes of user queries.
  3. Developing a state-of-the-art ‘Query Rewrite Algorithm’ to serve the user queries using a combination of aggregated datasets. This allows the Query Engine to serve the maximum number of user queries with the minimum number of aggregations.

Tech Stack Considerations

For benchmarking performance and the resulting cost implications, the following technology stacks on Google Cloud Platform were considered:

1. Cloud DataProc + Google Cloud Storage

For Distributed processing – Apache Spark on Cloud DataProc

For Distributed Storage – Apache Parquet File format stored in Google Cloud Storage
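As a rough sketch of what option 1 looks like from a Spark job on Dataproc, the snippet below reads Parquet from Cloud Storage; the bucket, path, and column names are placeholders assumed for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataproc-gcs-parquet").getOrCreate()

# Placeholder bucket/path; Dataproc's GCS connector resolves gs:// URIs.
events = spark.read.parquet("gs://example-bucket/warehouse/events/")

# Example aggregation over an assumed column.
events.groupBy("event_date").count().show()
```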

2. Cloud DataProc + Google BigQuery using Storage API

For Distributed processing – Apache Spark on Cloud DataProc

For Distributed Storage – BigQuery Native Storage (Capacitor File Format over Colossus Storage) accessible through BigQuery Storage API
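Option 2 can be sketched similarly, reading from BigQuery through the Storage API with the spark-bigquery connector; the table name is a placeholder and the connector is assumed to be available on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataproc-bigquery-read").getOrCreate()

# Placeholder table; requires the spark-bigquery connector on the classpath.
events = (
    spark.read.format("bigquery")
    .option("table", "my-project.analytics.events")
    .load()
)
events.groupBy("event_date").count().show()
```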

3. Native Google BigQuery for both Storage and processing – On Demand Queries

Using BigQuery Native Storage (Capacitor File Format over Colossus Storage) and execution on BigQuery Native MPP (Dremel Query Engine)

All the queries were run in an on-demand fashion. The project is billed on the total amount of data processed by the user queries.

4. Native Google BigQuery with fixed price model

Using BigQuery Native Storage (Capacitor File Format over Colossus Storage) and execution on BigQuery Native MPP (Dremel Query Engine)

Slot reservations were made and slot assignments were done for dedicated GCP projects. All the queries and their processing are done on the fixed number of BigQuery slots assigned to the project.
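For options 3 and 4 the processing happens entirely inside BigQuery, so client code reduces to submitting SQL to the Dremel engine, for example with the google-cloud-bigquery library; the project, dataset, and table names below are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up the project from the environment credentials

# Placeholder table; billed by bytes scanned on-demand,
# or against reserved slots under the fixed-price model.
query = """
    SELECT event_date, COUNT(*) AS events
    FROM `my-project.analytics.events`
    GROUP BY event_date
"""
for row in client.query(query).result():
    print(row.event_date, row.events)
```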

#overviews #apache spark #bigquery #google #apache