Apache

Use this topic (along with an appropriate programming-language tag) for programming questions relating to the Apache HTTP Server.

Gordon Murray

Thrift: Apache Thrift

Apache Thrift

Introduction

Thrift is a lightweight, language-independent software stack for point-to-point RPC implementation. Thrift provides clean abstractions and implementations for data transport, data serialization, and application level processing. The code generation system takes a simple definition language as input and generates code across programming languages that uses the abstracted stack to build interoperable RPC clients and servers.

Apache Thrift Layered Architecture

Thrift makes it easy for programs written in different programming languages to share data and call remote procedures. With support for 28 programming languages, chances are Thrift supports the languages that you currently use.
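For example, a minimal Python client for the Calculator service from the bundled tutorial might look like the sketch below. It assumes you have already run the compiler over the tutorial IDL (for instance with thrift -r --gen py tutorial.thrift) so that the generated tutorial package is importable, and that a tutorial server is listening on port 9090; adjust the names and port to your own service definitions.

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol

# Generated code from the tutorial IDL (assumed to be on PYTHONPATH).
from tutorial import Calculator

# Transport and protocol layers of the abstracted stack.
transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Calculator.Client(protocol)

transport.open()
print(client.add(1, 1))  # remote procedure call to the server
transport.close()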

Thrift is specifically designed to support non-atomic version changes across client and server code. This allows you to upgrade your server while still being able to service older clients; or have newer clients issue requests to older servers. An excellent community-provided write-up about thrift and compatibility when versioning an API can be found in the Thrift Missing Guide.

For more details on Thrift's design and implementation, see the Thrift whitepaper included in this distribution, or the README.md file in the subdirectory you are interested in.


Releases

Thrift does not maintain a specific release calendar at this time.

We strive to release twice yearly. Download the current release.

Project Hierarchy

thrift/

compiler/

Contains the Thrift compiler, implemented in C++.

lib/

Contains the Thrift software library implementation, subdivided by
language of implementation.

cpp/
go/
java/
php/
py/
rb/
...

test/

Contains sample Thrift files and test code across the target programming
languages.

tutorial/

Contains a basic tutorial that will teach you how to develop software
using Thrift.

Development

To build the project the same way Travis CI builds it, you should use Docker. We have comprehensive build instructions for Docker.

Requirements

See http://thrift.apache.org/docs/install for a list of build requirements (may be stale). Alternatively, see the docker build environments for a list of prerequisites.

Resources

More information about Thrift can be obtained on the Thrift webpage at:

 http://thrift.apache.org

Acknowledgments

Thrift was inspired by pillar, a lightweight RPC tool written by Adam D'Angelo, and also by Google's protocol buffers.

Installation

If you are building for the first time out of the source repository, you will need to generate the configure scripts. (This is not necessary if you downloaded a tarball.) From the top directory, do:

./bootstrap.sh

Once the configure scripts are generated, thrift can be configured. From the top directory, do:

./configure

You may need to specify the location of the boost files explicitly. If you installed boost in /usr/local, you would run configure as follows:

./configure --with-boost=/usr/local

Note that by default the thrift C++ library is typically built with debugging symbols included. If you want to customize these options you should use the CXXFLAGS option in configure, as such:

./configure CXXFLAGS='-g -O2'
./configure CFLAGS='-g -O2'
./configure CPPFLAGS='-DDEBUG_MY_FEATURE'

To enable gcov, which requires the options -fprofile-arcs and -ftest-coverage, run:

./configure  --enable-coverage

Run ./configure --help to see other configuration options

Please be aware that the Python library will ignore the --prefix option and just install wherever Python's distutils puts it (usually along the lines of /usr/lib/pythonX.Y/site-packages/). If you need to control where the Python modules are installed, set the PY_PREFIX variable. (DESTDIR is respected for Python and C++.)

Make thrift:

make

From the top directory, become superuser and do:

make install

Uninstall thrift:

make uninstall

Note that some language packages must be installed manually using build tools better suited to those languages (at the time of this writing, this applies to Java, Ruby, PHP).

Look for the README.md file in the lib/<language>/ folder for more details on the installation of each language library package.

Package Managers

Apache Thrift is available via a number of package managers, a list which is steadily growing. A more detailed overview can be found at the Apache Thrift web site under "Libraries" and/or in the respective READMEs for each language under /lib.

Testing

There are a large number of client library tests that can all be run from the top-level directory.

make -k check

This will make all of the libraries (as necessary), and run through the unit tests defined in each of the client libraries. If a single language fails, the make check will continue on and provide a synopsis at the end.

To run the cross-language test suite, please run:

make cross

This will run a set of tests that use different language clients and servers.


Download Details:

Author: Apache
Source Code: https://github.com/apache/thrift 
License: Apache-2.0 license

#c #dart #http #library #csharp #cplusplus #apache 

Gordon Murray

Install Apache Cassandra on Ubuntu 20.04

Apache Cassandra is a free and open-source NoSQL database with no single point of failure. It provides linear scalability and high availability without compromising performance. Apache Cassandra is used by many companies that have large, active data sets, including Reddit, Netflix, Instagram, and GitHub.

This article guides you through the installation of Apache Cassandra on Ubuntu 20.04.

Installing Apache Cassandra on Ubuntu is straightforward. We’ll install Java, enable the Apache Cassandra repository, import the repository GPG key, and install the Apache Cassandra server.

Installing Java

At the time of writing this article, the latest version of Apache Cassandra is 3.11 and requires OpenJDK 8 to be installed on the system.

Run the following commands as root or a user with sudo privileges to install OpenJDK:

sudo apt update
sudo apt install openjdk-8-jdk

Verify the Java installation by printing the Java version:

java -version

The output should look something like this:

openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~20.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)

Installing Apache Cassandra

Install the dependencies necessary to add a new repository over HTTPS:

sudo apt install apt-transport-https

Import the repository’s GPG key and add the Cassandra repository to the system:

wget -q -O - https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -
sudo sh -c 'echo "deb http://www.apache.org/dist/cassandra/debian 311x main" > /etc/apt/sources.list.d/cassandra.list'

Once the repository is enabled, update the packages list and install the latest version of Apache Cassandra:

sudo apt update
sudo apt install cassandra

The Apache Cassandra service will start automatically once the installation process is complete. You can verify it by typing:

nodetool status

You should see something similar to this:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load    Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  70 KiB  256     100.0%            2eaab399-be32-49c8-80d1-780dcbab694f  rack1

That’s it. At this point, you have Apache Cassandra installed on your Ubuntu server.

Configuring Apache Cassandra

Apache Cassandra data is stored in the /var/lib/cassandra directory, configuration files are located in /etc/cassandra, and Java start-up options can be configured in the /etc/default/cassandra file.

By default, Cassandra is configured to listen on localhost only. If the client connecting to the database is also running on the same host, you don’t need to change the default configuration file.

To interact with Cassandra through CQL (the Cassandra Query Language) you can use a command-line tool named cqlsh that is shipped with the Cassandra package.

cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.7 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>
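Besides cqlsh, application code can talk to the same local node through a client driver. The sketch below uses the third-party Python driver (installed separately with pip install cassandra-driver); the demo keyspace and table are made up for illustration:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])   # native protocol, default port 9042
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")
session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, 'alice'))

for row in session.execute("SELECT id, name FROM demo.users"):
    print(row.id, row.name)

cluster.shutdown()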

Renaming Apache Cassandra Cluster

The default Cassandra cluster is named “Test Cluster”. If you want to change the cluster name, perform the steps below:

Login to the Cassandra CQL terminal with cqlsh:

cqlsh

Run the following command to change the cluster name to “Linuxize Cluster”:

UPDATE system.local SET cluster_name = 'Linuxize Cluster' WHERE KEY = 'local';

Replace “Linuxize Cluster” with your desired name.

Once done, type exit to exit the console.

Open the cassandra.yaml configuration file and enter your new cluster name.

/etc/cassandra/cassandra.yaml

cluster_name: 'Linuxize Cluster'

Save and close the file.

Clear the system cache:

nodetool flush system

Restart the Cassandra service:

sudo systemctl restart cassandra

Conclusion

We’ve shown you how to install Apache Cassandra on Ubuntu 20.04. You can now visit the official Apache Cassandra Documentation page and learn how to get started with Cassandra.

If you hit a problem or have feedback, leave a comment below.

Original article source at: https://linuxize.com/

#ubuntu #apache #cassandra 

Desmond Gerber

Comparing Apache Prefork vs Worker MPMs

The Apache web server is the most widely used web server in the world. It has a modular architecture, which allows users to extend its functionality and customize it to their needs. One of Apache's important modules is the Multi-Processing Module (MPM), which accepts incoming requests and manages the processes or threads that handle them efficiently.

Apache provides two popular MPMs, Prefork and Worker, each with its own advantages and limitations. Choosing the right MPM for your website is critical to its performance and stability. In this article, we will compare the two MPMs in detail and help you make an informed decision.

Prefork MPM

The Prefork MPM is the traditional and default MPM in Apache web server. It creates multiple child processes to handle incoming requests, each running its own copy of the Apache web server. Each child process can handle only one request at a time, which makes it less efficient than other MPMs. However, it is still popular because of its stability and compatibility with older PHP and other scripts.

Advantages of Prefork MPM:

  1. Stable: Prefork MPM is known for its stability and reliability. Each child process runs independently of others, which ensures that if one process crashes, it does not affect other processes.
  2. Compatibility: Prefork MPM is compatible with older PHP and other scripts that are not thread-safe. It runs each script in a separate process, which avoids issues related to thread-safety.

Limitations of Prefork MPM:

  1. Resource-Intensive: Prefork MPM creates multiple processes, which consume a significant amount of system resources. It can cause high memory usage and slow down the server under heavy load.
  2. Limited Concurrency: Each child process can handle only one request at a time, which limits the number of concurrent connections that the server can handle.

Worker MPM

The Worker MPM is a newer MPM in Apache web server, which is designed to improve performance and scalability. It creates multiple threads within a single process, each handling a separate connection. It is more efficient than Prefork MPM in terms of resource usage and concurrency. However, it requires a more modern version of PHP and other scripts that are thread-safe.

Advantages of Worker MPM:

  1. Resource-efficient: Worker MPM creates multiple threads within a single process, which reduces the amount of system resources used. It can handle a larger number of connections without slowing down the server.
  2. High Concurrency: Each thread can handle a separate connection, which allows the server to handle a higher number of concurrent connections.

Limitations of Worker MPM:

  1. Stability: Worker MPM is less stable than Prefork MPM because all threads share the same process. If one thread crashes, it can affect other threads and cause the server to crash.
  2. Compatibility: Worker MPM is not compatible with older PHP and other scripts that are not thread-safe. It requires a more modern version of PHP and other scripts to work properly.

Comparing Apache Prefork vs. Worker MPMs

The following table compares the key features of Apache Prefork and Worker MPMs:

Feature        | Apache Prefork | Apache Worker
Architecture   | Process-based  | Thread-based
Scalability    | Poor           | Good
Memory Usage   | High           | Low
Performance    | Slow           | Fast
Compatibility  | Good           | Good
Stability      | Good           | Good
Flexibility    | Limited        | Flexible

As you can see, Apache Worker has several advantages over Apache Prefork. It is more scalable, uses less memory, and performs better for high-traffic websites. Apache Prefork, on the other hand, is simpler and more stable. It is still a good option for small websites or websites that do not receive a lot of traffic.

Configuring Apache Prefork and Worker

Here are some sample configurations for Apache Prefork and Worker:

Apache Prefork:

<IfModule mpm_prefork_module>
  ServerLimit 100
  StartServers 5
  MinSpareServers 5
  MaxSpareServers 10
  MaxClients 100
  MaxRequestsPerChild 0
</IfModule>

Apache Worker:

<IfModule mpm_worker_module>
  ServerLimit 100
  StartServers 2
  MaxClients 150
  MinSpareThreads 25
  MaxSpareThreads 75
  ThreadsPerChild 25
  MaxRequestsPerChild 0
</IfModule>

Wrap Up!

In conclusion, both the Prefork and Worker MPMs have their own advantages and disadvantages. It ultimately depends on the specific needs of your website and the amount of traffic it receives. If you’re unsure which MPM to use, it’s recommended to start with the default Prefork MPM and then switch to the Worker MPM if you experience high traffic and want to improve performance.

Original article source at: https://www.cloudbooklet.com/

#apache #worker #webserver 

What is Apache Druid?

Introduction to Apache Druid

Apache Druid is a real-time analytics database designed for fast analytics on large datasets. It is most often used to power use cases that need real-time ingestion, high uptime, and fast query performance. Druid can analyze billions of rows not only in batch but also in real time. It offers many integrations with technologies such as Apache Kafka, cloud storage, S3, Hive, HDFS, DataSketches, Redis, and more. It also follows an "immutable past, append-only future" model: past events happen once and never change, so they are immutable, and only appends take place for new events. It gives users fast, deep exploration of large-scale transaction data.

Characteristics of Apache Druid

Some of the exciting characteristics are:

  • Cloud-native, making horizontal scaling easy
  • Supports SQL for analyzing data
  • REST API enabled for querying or uploading data
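As a small illustration of the last two points, the sketch below posts a SQL query to Druid's SQL endpoint over HTTP using Python. It assumes a quickstart-style deployment with the router listening on localhost:8888 and a datasource named wikipedia; change both to match your cluster:

import requests  # pip install requests

sql = """
    SELECT channel, COUNT(*) AS edits
    FROM wikipedia
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
    GROUP BY channel
    ORDER BY edits DESC
    LIMIT 5
"""

# The router forwards SQL queries to the brokers.
response = requests.post("http://localhost:8888/druid/v2/sql/", json={"query": sql})
response.raise_for_status()
for row in response.json():
    print(row)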

What are its use cases?

Some of the common use cases of Druid are:

  • Clickstream analytics
  • Server metrics storage
  • OLAP/Business intelligence
  • Digital Marketing/Advertising analytics
  • Network Telemetry analytics
  • Supply chain analytics
  • Application performance metrics

What are its key features?

Druid’s core architecture combines ideas from data warehouses, log search systems, and time-series databases.

Columnar Storage Format

It uses column-oriented storage, so it loads only the columns required for a particular query. This helps with fast scans and aggregations.

Parallel Processing

It can process a query in parallel across the entire cluster, an approach also termed massively parallel processing (MPP).

Scalable Distributed System

Druid is mostly deployed in clusters ranging from tens to hundreds of servers, offering ingest rates of millions of records per second, query latencies of sub-second to a few seconds, and retention of trillions of records.

Real-time or Batch Ingestion

Druid can ingest data either in real-time (Ingested data can be queried immediately) or in batches.

Cloud-Native

It is a fault-tolerant architecture that won’t lose data. Once Druid ingests data, its copy is safely stored in deep storage (Cloud Storage, Amazon S3, Redis, HDFS, many more). Users' data can be easily recovered from this deep storage even if all the Druid’s servers fail. This replication ensures that queries are still possible while the system recovers.

Indexing

Druid uses concise and roaring compressed bitmap indexes to create indexes that help in faster filtering.

Timestamp Partitioning

All data in Druid must have a timestamp column, as the data is always partitioned by time and every query has a time filter.

Easy Integration with Existing Pipelines

Users can stream data natively into Druid from message buses like Kafka, Kinesis, and many more. It can also load batch files from data lakes like HDFS and Amazon S3.

General Architecture of Apache Druid

Druid is mainly composed of the following processes:

  • Coordinator – This process manages data availability on the cluster.
  • Overlord – This process controls the assignment of data ingestion workloads.
  • Broker – This helps handle queries from external clients.
  • Historical – This process stores queryable data.
  • Middle manager – This process is responsible for ingesting the data.
  • Router – These processes are used to route requests to Brokers, Coordinators, and Overlords. These processes are optional.

Apache Druid Architecture

The processes described above are organized into 3 types of servers: Master, Query, and Data.

Master

It runs the Coordinator and Overlord processes and manages data ingestion and availability: the Master is responsible for ingestion jobs and for coordinating the availability of data on the “Data” servers.

Query

It runs Broker and optional Router processes. It handles queries from external clients, providing the endpoints that users and client applications interact with and routing queries to Data servers or other Query servers.

Data

It runs the Middle Manager and Historical processes, which execute ingestion jobs and store the queryable data. Besides these three server types and six processes, Druid also requires storage for metadata and deep storage.

Metadata Storage

It is used to store the system's metadata (audit records, datasources, schemas, and so on). For experimentation, Apache Derby is suggested; Derby is the default metadata store for Druid, but it is not suitable for production. For production, MySQL or PostgreSQL is the better choice. The metadata store holds all the metadata the Druid cluster needs in order to work. Derby is not used in production because it does not support a multi-node cluster with high availability. MySQL is chosen as the metadata storage database to get:

  • Long term flexibility
  • Scaling on budget
  • Good with large datasets
  • Good high read speed

PostgreSQL, as a metadata storage database, is used to acquire:

  • Complex database designs
  • Performing customized procedures
  • Diverse indexing technique
  • Variety of replication methods
  • High read and write speed.

Deep Storage

Apache Druid uses separate storage for any data ingested, which makes it fault-tolerant. Some deep storage options are Cloud Storage, Amazon S3, HDFS, Redis, and many more.

Data Ingestion in Druid

Data in Druid is organized into segments that generally contain up to a few million rows each. Loading data into Druid is known as ingestion or indexing. Druid fully supports both batch ingestion and streaming ingestion. Some of the sources supported by Druid are Kinesis, cloud storage, Apache Kafka, and local storage. Druid requires some structure in the data it ingests: in general, data should consist of a timestamp, metrics, and dimensions.

Zookeeper for Apache Druid

Druid uses Apache Zookeeper to integrate all of its components. You can use the Zookeeper that ships with Druid for experiments, but you must install Zookeeper separately for production. A Druid cluster can only be as stable as its Zookeeper, which is responsible for most of the communication that keeps the cluster functioning, since Druid nodes are prevented from talking to each other directly.

Duties of a Zookeeper

Zookeeper is responsible for the following operations:

  • Segment “publishing” protocol from Historical
  • Coordinator leader election
  • Overlord and MiddleManager task management
  • Segment load/drop protocol between Coordinator and Historical
  • Overlord leader election

How to Keep a Zookeeper Stable?

For maximum Zookeeper stability, the user has to follow the following practices:

  • There should be a Zookeeper dedicated to Druid; avoid sharing it with any other products/applications.
  • Maintain an odd number of Zookeepers for increased reliability.
  • For highly available Zookeeper, 3-5 Zookeeper nodes are recommended. Users can either install Zookeeper on their own system or run 3 or 5 master servers and configure Zookeeper on them appropriately.
  • Co-locate Zookeeper with the master servers rather than with the data or query servers, because the query and data servers are far more work-intensive than the master nodes (Coordinator and Overlord).
  • To fully achieve high availability, it is recommended never to put Zookeeper behind a load balancer.

If Zookeeper goes down, the cluster will still operate, but it can neither add new data segments nor react effectively to the loss of one of the nodes. The failure of Zookeeper therefore leaves the cluster in a degraded state.

How to monitor Apache Druid?

Users can monitor Druid by using the metrics it generates. Druid generates metrics related to queries, coordination, and ingestion. These metrics are emitted as JSON objects, either to a runtime log file or over HTTP to a service like Kafka. Metric emission is disabled by default.

Fields of Metrics Emitted

Metrics emitted by Druid share a common set of fields.

  • Timestamp – the time at which metric was created
  • Metric – the name given to the metric
  • Service – the name of the service that emitted the metric
  • Host – the name of the host that emitted the metric
  • Value – the numeric value that is associated with the metric emitted
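For example, a query metric emitted by a Broker might look roughly like the JSON below; the exact field set and dimensions vary by Druid version and by which monitors are enabled, so treat this as illustrative only:

{
  "feed": "metrics",
  "timestamp": "2023-01-01T12:00:00.000Z",
  "service": "druid/broker",
  "host": "localhost:8082",
  "metric": "query/time",
  "value": 42,
  "dataSource": "wikipedia"
}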

Briefing About Available Metrics

Emitted metrics may have dimensions beyond those listed. The emission period defaults to 1 minute; you can change it with the `druid.monitoring.emissionPeriod` property. Metrics available are:

  • Query Metrics, mainly categorized as Broker, Historical, Real-time, Jetty and Cache
  • SQL Metrics (Only if SQL is enabled)
  • Ingestion Metrics (Kafka Indexing Service)
  • Real-time Metrics (Real-time process, available if Real-time Metrics Monitor is included)
  • Indexing Service
  • Coordination
  • JVM (Available if JVM Monitor module is included)
  • Event Receiver Firehose (available if Event Receiver Firehose Monitor module is included)
  • Sys (Available if Sys Monitor module is included)
  • General Health, mainly Historical

Conclusion

Apache Druid is among the best options on the market for analyzing data in clusters and providing quick insight into all the data processed, and with Zookeeper at its side, operating it becomes much easier. There are also many libraries for interacting with it. To validate that a service is running, you can use the jps command: since Druid nodes are Java processes, they show up in the output of 'jps -m'. With monitoring this straightforward and such a capable architecture, Druid is a pleasure for a DataOps engineer to work with.

Original article source at: https://www.xenonstack.com/

#apache #druid 

Royce Reinger

lakeFS: Data version control for your data lake | Git for data

lakeFS is data version control - Git for data

lakeFS is an open-source tool that transforms your object storage into a Git-like repository. It enables you to manage your data lake the way you manage your code.

With lakeFS you can build repeatable, atomic, and versioned data lake operations - from complex ETL jobs to data science and analytics.

lakeFS supports AWS S3, Azure Blob Storage, and Google Cloud Storage as its underlying storage service. It is API compatible with S3 and works seamlessly with all modern data frameworks such as Spark, Hive, AWS Athena, Presto, etc.

For more information, see the official lakeFS documentation.

Capabilities

ETL Testing with Isolated Dev/Test Environment

When working with a data lake, it’s useful to have replicas of your production environment. These replicas allow you to test ETL jobs and understand changes to your data without impacting downstream data consumers.

Running ETL and transformation jobs directly in production without proper ETL testing is a guaranteed way to have data issues flow into dashboards, ML models, and other consumers sooner or later. The most common approach to avoid making changes directly in production is to create and maintain multiple data environments and perform ETL testing on them: a dev environment to develop the data pipelines, and a test environment where pipeline changes are tested before being pushed to production. With lakeFS you can create branches and get a copy of the full production data without copying anything, which makes ETL testing faster and easier.

Reproducibility

Data changes frequently. This makes the task of keeping track of its exact state over time difficult. Oftentimes, people maintain only one state of their data––its current state.

This has a negative impact on the work, as it becomes hard to:

  • Debug a data issue.
  • Validate machine learning training accuracy (re-running a model over different data gives different results).
  • Comply with data audits.

In comparison, lakeFS exposes a Git-like interface to data that allows keeping track of more than just the current state of data. This makes reproducing its state at any point in time straightforward.

CI/CD for Data

Data pipelines feed processed data from data lakes to downstream consumers like business dashboards and machine learning models. As more and more organizations rely on data to enable business-critical decisions, data reliability and trust are of paramount concern. Thus, it’s important to ensure that production data adheres to the data governance policies of businesses. These data governance requirements can be as simple as a file format validation or schema check, or as exhaustive as removing all PII (Personally Identifiable Information) from an organization’s data.

Thus, to ensure quality and reliability at each stage of the data lifecycle, data quality gates need to be implemented. That is, we need to run Continuous Integration (CI) tests on the data, and only if the data governance requirements are met can the data be promoted to production for business use.

Every time there is an update to production data, the best practice is to run CI tests and then promote (deploy) the data to production. With lakeFS you can create hooks that make sure only data that passed these tests becomes part of production.

Rollback

A rollback operation is used to fix critical data errors immediately.

What is a critical data error? Think of a situation where erroneous or misformatted data causes a significant issue with an important service or function. In such situations, the first thing to do is stop the bleeding.

Rolling back returns data to a state in the past, before the error was present. You might not be showing all the latest data after a rollback, but at least you aren’t showing incorrect data or raising errors. Since lakeFS provides versions of the data without making copies of it, you can time travel between versions and roll back to the version of the data from before the error was introduced.

Getting Started

Using Docker

Use this section to learn about lakeFS. For a production-suitable deployment, see the docs.

Ensure you have Docker installed on your computer.

Run the following command:

docker run --pull always --name lakefs -p 8000:8000 treeverse/lakefs run --local-settings

Open http://127.0.0.1:8000/ in your web browser to set up an initial admin user. You will use this user to log in and send API requests.
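Because lakeFS speaks the S3 API, you can point standard S3 clients at this local instance. The Python sketch below uses boto3 (pip install boto3); the access keys are the ones generated for your admin user, and example-repo/main are an assumed repository and branch, so substitute your own:

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",          # lakeFS S3 gateway from the quickstart
    aws_access_key_id="<your-lakefs-access-key>",
    aws_secret_access_key="<your-lakefs-secret-key>",
)

# In the S3 gateway, the repository acts as the bucket and the branch is the key prefix.
s3.put_object(Bucket="example-repo", Key="main/datasets/example.csv", Body=b"id,value\n1,42\n")

response = s3.list_objects_v2(Bucket="example-repo", Prefix="main/")
for obj in response.get("Contents", []):
    print(obj["Key"])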

Other quickstart methods

You can try lakeFS:

Setting up a repository

Once lakeFS is installed, you are ready to create your first repository!

Community

Stay up to date and get lakeFS support via:

  • Share your lakeFS experience and get support on our Slack.
  • Follow us and join the conversation on Twitter.
  • Learn from video tutorials on our YouTube channel.
  • Read more on data versioning and other data lake best practices in our blog.
  • Feel free to contact us about anything else.

More information

Download Details:

Author: Treeverse
Source Code: https://github.com/treeverse/lakeFS 
License: Apache-2.0 license

#machinelearning #go #golang #apache #spark 

Jacob Banks

Install Apache Airflow on Windows

Learn how to install Apache Airflow on a Windows machine without Docker and how to write a DAG script. Airflow is a crucial tool for data engineers and scientists. Apache Airflow is a tool that helps you manage and schedule data pipelines.

According to the documentation, it lets you "programmatically author, schedule, and monitor workflows."

Airflow is a crucial tool for data engineers and scientists. In this article, I'll show you how to install it on Windows without Docker.

Although it's recommended to run Airflow with Docker, this method works for low-memory machines that are unable to run Docker.

Prerequisites:

This article assumes that you're familiar with using the command line and can set up your development environment as directed.

Requirements:

You need Python 3.8 or higher, Windows 10 or higher, and the Windows Subsystem for Linux (WSL2) to follow this tutorial.

What is Windows Subsystem for Linux (WSL2)?

WSL2 allows you to run Linux commands and programs on a Windows operating system.

It provides a Linux-compatible environment that runs natively on Windows, enabling users to use Linux command-line tools and utilities on a Windows machine.

You can read more here to install WSL2 on your machine.

With Python and WSL2 installed and activated on your machine, launch the terminal by searching for Ubuntu from the start menu.

Step 1: Set Up the Virtual Environment

To work with Airflow on Windows, you need to set up a virtual environment. To do this, you'll need to install the virtualenv package.

Note: Make sure you are in your home directory by typing:

cd ~
pip install virtualenv 

Create the virtual environment like this:

virtualenv airflow_env 

And then activate the environment:

 source airflow_env/bin/activate

Step 2: Set Up the Airflow Directory

Create a folder named airflow. Mine will be located at c/Users/[Username]. You can put yours wherever you prefer.

If you do not know how to navigate the terminal, you can follow the steps in the image below:

Create an Airflow directory from the terminal

Now that you have created this folder, you have to set it as an environment variable. Open a .bashrc script from the terminal with the command:

nano ~/.bashrc 

Then write the following:

AIRFLOW_HOME=/c/Users/[YourUsername]/airflow

Setup Airflow directory path as an environment variable

Press Ctrl+S to save, then Ctrl+X to exit the nano editor.

The path to the Airflow directory will be permanently saved as an environment variable. Anytime you open a new terminal, you can navigate to it by typing:

cd $AIRFLOW_HOME

Navigate to Airflow directory using the environment variable

Step 3: Install Apache Airflow

With the virtual environment still active and the current directory pointing to the created Airflow folder, install Apache Airflow:

 pip install apache-airflow 

Initialize the database:

airflow db init 

Create a folder named dags inside the airflow folder. This will be used to store all Airflow scripts.

View files and folders generated by Airflow db init

Step 4: Create an Airflow User

When Airflow is newly installed, you'll need to create a user. This user will be used to log in to the Airflow UI and perform some admin functions.

airflow users create --username admin --password admin --firstname admin --lastname admin --role Admin --email youremail@email.com

Check the created user:

airflow users list

Create an Airflow user and list the created user

Step 5: Run the Webserver

Run the scheduler with this command:

airflow scheduler 

Launch another terminal, activate the airflow virtual environment, cd to $AIRFLOW_HOME, and run the webserver:

airflow webserver 

If the default port 8080 is in use, change the port by typing:

airflow webserver --port <port number>

Log in to the UI using the username created earlier with "airflow users create".

In the UI, you can view pre-created DAGs that come with Airflow by default.

How to Create the first DAG

A DAG is a Python script for organizing and managing tasks in a workflow.

To create a DAG, navigate into the dags folder created inside the $AIRFLOW_HOME directory. Create a file named "hello_world_dag.py". Use VS Code if it's available.

Enter the code from the image below, and save it:

Example DAG script in VS Code editor
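The exact script from the screenshot isn't reproduced here, but a minimal DAG along those lines might look like the following sketch; the single PythonOperator task and the daily schedule are illustrative choices:

# dags/hello_world_dag.py
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    print("Hello, world!")


with DAG(
    dag_id="hello_world_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    hello_task = PythonOperator(
        task_id="hello_task",
        python_callable=say_hello,
    )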

Go to the Airflow UI and search for hello_world_dag. If it does not show up, try refreshing your browser.

That's it. This completes the installation of Apache Airflow on Windows.

Wrapping Up

This guide covered how to install Apache Airflow on a Windows machine without Docker and how to write a DAG script.

I do hope the steps outlined above helped you install airflow on your Windows machine without Docker.

Original article source at https://www.freecodecamp.org

#apache #windows 

Billy Chandler

Everything You Need to Know About Kafka - The Apache Kafka Handbook

Learn about Apache Kafka from the Apache Kafka Handbook: what it is, why you should learn it, and how to get started using it.

Apache Kafka is an open-source event streaming platform that can transport huge volumes of data at very low latency.

Companies like LinkedIn, Uber, and Netflix use Kafka to process trillions of events and petabytes of data each day.

Kafka was originally developed at LinkedIn, to help handle their real-time data feeds. It's now maintained by the Apache Software Foundation, and is widely adopted in industry (being used by 80% of Fortune 100 companies).

Why Should You Learn Apache Kafka?

Kafka lets you:

  • Publish and subscribe to streams of events
  • Store streams of events in the same order they happened
  • Process streams of events in real time

The main thing Kafka does is help you efficiently connect diverse data sources with the many different systems that might need to use that data.

Messy data integrations without Kafka, more organized data integrations with Kafka.

Kafka helps you connect data sources to the systems using that data

Some of the things you can use Kafka for include:

  • Personalizing recommendations for customers
  • Notifying passengers of flight delays
  • Payment processing in banking
  • Online fraud detection
  • Managing inventory and supply chains
  • Tracking order shipments
  • Collecting telemetry data from Internet of Things (IoT) devices

What all these uses have in common is that they need to take in and process data in real time, often at huge scales. This is something Kafka excels at. To give one example, Pinterest uses Kafka to handle up to 40 million events per second.

Kafka is distributed, which means it runs as a cluster of nodes spread across multiple servers. It's also replicated, meaning that data is copied in multiple locations to protect it from a single point of failure. This makes Kafka both scalable and fault-tolerant.

Kafka is also fast. It's optimized for high throughput, making effective use of disk storage and batched network requests.

This article will:

  • Introduce you to the core concepts behind Kafka
  • Show you how to install Kafka on your own computer
  • Get you started with the Kafka Command Line Interface (CLI)
  • Help you build a simple Java application that produces and consumes events via Kafka

Things the article won't cover:

  • More advanced Kafka topics, such as security, performance, and monitoring
  • Deploying a Kafka cluster to a server
  • Using managed Kafka services like Amazon MSK or Confluent Cloud

Table of Contents

  1. Event Streaming and Event-Driven Architectures
  2. Core Kafka Concepts
    a. Event Messages in Kafka
    b. Topics in Kafka
    c. Partitions in Kafka
    d. Offsets in Kafka
    e. Brokers in Kafka
    f. Replication in Kafka
    g. Producers in Kafka
    h. Consumers in Kafka
    i. Consumer Groups in Kafka
    j. Kafka Zookeeper
  3. How to Install Kafka on Your Computer
  4. How to Start Zookeeper and Kafka
  5. The Kafka CLI
    a. How to List Topics
    b. How to Create a Topic
    c. How to Describe Topics
    d. How to Partition a Topic
    e. How to Set a Replication Factor
    f. How to Delete a Topic
    g. How to use kafka-console-producer
    h. How to use kafka-console-consumer
    i. How to use kafka-consumer-groups
  6. How to Build a Kafka Client App with Java
    a. How to Set Up the Project
    b. How to Install the Dependencies
    c. How to Create a Kafka Producer
    d. How to Send Multiple Messages and Use Callbacks
    e. How to Create a Kafka Consumer
    f. How to Shut Down the Consumer
  7. Where to Take it From Here

Before we dive into Kafka, we need some context on event streaming and event-driven architectures.


Event Streaming and Event-Driven Architectures

An event is a record that something happened, as well as information about what happened. For example: a customer placed an order, a bank approved a transaction, inventory management updated stock levels.

Events can trigger one or more processes that respond to them. For example: sending an email receipt, transmitting funds to an account, updating a real-time dashboard.

Event streaming is the process of capturing events in real-time from sources (such as web applications, databases, or sensors) to create streams of events. These streams are potentially unending sequences of records.

The event stream can be stored, processed, and sent to different destinations, also called sinks. The destinations that consume the streams could be other applications, databases, or data pipelines for further processing.

As applications have become more complex, often being broken up into different microservices distributed across multiple data centers, many organizations have adopted an event-driven architecture for their applications.

This means that instead of parts of your application directly asking each other for updates about what happened, they each publish events to event streams. Other parts of the application continuously subscribe to these streams and only act when they receive an event that they are interested in.

This architecture helps ensure that if part of your application goes down, other parts won't also fail. Additionally, you can add new features by adding new subscribers to the event stream, without having to rewrite the existing codebase.

Core Kafka Concepts

Kafka has become one of the most popular ways to implement event streaming and event-driven architectures. But it does have a bit of a learning curve and you need to understand a couple of concepts before you can make effective use of it.

These core concepts are:

  • event messages
  • topics
  • partitions
  • offsets
  • brokers
  • producers
  • consumers
  • consumer groups
  • Zookeeper

Event Messages in Kafka

When you write data to Kafka, or read data from it, you do this in the form of messages. You'll also see them called events or records.

A message consists of:

  • a key
  • a value
  • a timestamp
  • a compression type
  • headers for metadata (optional)
  • partition and offset id (once the message is written to a topic)

A Kafka message consisting of key, value, timestamp, compression type, and headers

Every event in Kafka is, at its simplest, a key-value pair. These are serialized into binary, since Kafka itself handles arrays of bytes rather than complex language-specific objects.

Keys are usually strings or integers and aren't unique for every message. Instead, they point to a particular entity in the system, such as a specific user, order, or device. Keys can be null, but when they are included they are used for dividing topics into partitions (more on partitions below).

The message value contains details about the event that happened. This could be as simple as a string or as complex as an object with many nested properties. Values can be null, but usually aren't.

By default, the timestamp records when the message was created. You can overwrite this if your event actually occurred earlier and you want to record that time instead.

Messages are usually small (less than 1 MB) and sent in a standard data format, such as JSON, Avro, or Protobuf. Even so, they can be compressed to save on data. The compression type can be set to gzip, lz4, snappy, zstd, or none.

Events can also optionally have headers, which are key-value pairs of strings containing metadata, such as where the event originated from or where you want it routed to.

Once a message is sent into a Kafka topic, it also receives a partition number and offset id (more about these later).

Topics in Kafka

Kafka stores messages in a topic, an ordered sequence of events, also called an event log.

A Kafka topic containing messages, each with a unique offset

Different topics are identified by their names and will store different kinds of events. For example a social media application might have posts, likes, and comments topics to record every time a user creates a post, likes a post, or leaves a comment.

Multiple applications can write to and read from the same topic. An application might also read messages from one topic, filter or transform the data, and then write the result to another topic.

One important feature of topics is that they are append-only. When you write a message to a topic, it's added to the end of the log. Events in a topic are immutable. Once they're written to a topic, you can't change them.

A Producer writing events to topics and a Consumer reading events from topics.

A Producer writing events to topics and a Consumer reading events from topics

Unlike with messaging queues, reading an event from a topic doesn't delete it. Events can be read as often as needed, perhaps several times by multiple different applications.

Topics are also durable, holding onto messages for a specific period (by default 7 days) by saving them to physical storage on disk.

You can configure topics so that messages expire after a certain amount of time, or when a certain amount of storage is exceeded. You can even store messages indefinitely as long as you can pay for the storage costs.

Partitions in Kafka

In order to help Kafka to scale, topics can be divided into partitions. This breaks up the event log into multiple logs, each of which lives on a separate node in the Kafka cluster. This means that the work of writing and storing messages can be spread across multiple machines.

When you create a topic, you specify the amount of partitions it has. The partitions are themselves numbered, starting at 0. When a new event is written to a topic, it's appended to one of the topic's partitions.

A topic divided into three partitions

If messages have no key, they will be evenly distributed among partitions in a round robin manner: partition 0, then partition 1, then partition 2, and so on. This way, all partitions get an even share of the data but there's no guarantee about the ordering of messages.

Messages that have the same key will always be sent to the same partition, and in the same order. The key is run through a hashing function which turns it into an integer. This output is then used to select a partition.

Messages without keys are sent across partitions, while messages with the same keys are sent to the same partition

Messages within each partition are guaranteed to be ordered. For example, all messages with the same customer_id as their key will be sent to the same partition in the order in which Kafka received them.

Offsets in Kafka

Each message in a partition gets an id that is an incrementing integer, called an offset. Offsets start at 0 and are incremented every time Kafka writes a message to a partition. This means that each message in a given partition has a unique offset.

Three partitions with offsets. Offsets are unique within a partition but not between partitions

Offsets are not reused, even when older messages get deleted. They continue to increment, giving each new message in the partition a unique id.

When data is read from a partition, it is read in order from the lowest existing offset upwards. We'll see more about offsets when we cover Kafka consumers.

Brokers in Kafka

A single "server" running Kafka is called a broker. In reality, this might be a Docker container running in a virtual machine. But it can be a helpful mental image to think of brokers as individual servers.

A Kafka cluster made up of three brokers

Multiple brokers working together make up a Kafka cluster. There might be a handful of brokers in a cluster, or more than 100. When a client application connects to one broker, Kafka automatically connects it to every broker in the cluster.

By running as a cluster, Kafka becomes more scalable and fault-tolerant. If one broker fails, the others will take over its work to ensure there is no downtime or data loss.

Each broker manages a set of partitions and handles requests to write data to or read data from these partitions. Partitions for a given topic will be spread evenly across the brokers in a cluster to help with load balancing. Brokers also manage replicating partitions to keep their data backed up.

Partitions spread across brokers

Replication in Kafka

To protect against data loss if a broker fails, Kafka writes the same data to copies of a partition on multiple brokers. This is called replication.

The main copy of a partition is called the leader, while the replicas are called followers.

The data from the leader partition is copied to follower partitions on different brokers

When a topic is created, you set a replication factor for it. This controls how many replicas get written to. A replication factor of three is common, meaning data gets written to one leader and replicated to two followers. So even if two brokers failed, your data would still be safe.

Whenever you write messages to a partition, you're writing to the leader partition. Kafka then automatically copies these messages to the followers. As such, the logs on the followers will have the same messages and offsets as on the leader.

Followers that are up to date with the leader are called In-Sync Replicas (ISRs). Kafka considers a message to be committed once a minimum number of replicas have saved it to their logs. You can configure this to get higher throughput at the expense of less certainty that a message has been backed up.

Producers in Kafka

Producers are client applications that write events to Kafka topics. These apps aren't themselves part of Kafka – you write them.

Usually you will use a library to help manage writing events to Kafka. There is an official client library for Java as well as dozens of community-supported libraries for languages such as Scala, JavaScript, Go, Rust, Python, C#, and C++.

A Producer application writing to multiple topics

Producers are totally decoupled from consumers, which read from Kafka. They don't know about each other and their speed doesn't affect each other. Producers aren't affected if consumers fail, and the same is true for consumers.

If you need to, you could write an application that writes certain events to Kafka and reads other events from Kafka, making it both a producer and a consumer.

Producers take a key-value pair, generate a Kafka message, and then serialize it into binary for transmission across the network. You can adjust the configuration of producers to batch messages together based on their size or some fixed time limit to optimize writing messages to the Kafka brokers.

It's the producer that decides which partition of a topic to send each message to. Again, messages without keys will be distributed evenly among partitions, while messages with keys are all sent to the same partition.
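The hands-on example later in this handbook uses the official Java client, but as a quick illustration of these ideas, here is a producer sketch using the community kafka-python library (the library choice, the orders topic, and the localhost:9092 broker address are assumptions for illustration):

import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Kafka only sees byte arrays, so keys and values are serialized here.
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Messages with the same key (here, a customer id) always land on the same partition.
producer.send("orders", key="customer-42", value={"order_id": 1001, "amount": 25.0})
producer.flush()   # block until batched messages have actually been sent
producer.close()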

Consumers in Kafka

Consumers are client applications that read messages from topics in a Kafka cluster. Like with producers, you write these applications yourself and can make use of client libraries to support the programming language your application is built with.

A Consumer reading messages from multiple topics

Consumers can read from one or more partitions within a topic, and from one or more topics. Messages are read in order within a partition, from the lowest available offset to the highest. But if a consumer reads data from several partitions in the same topic, the message order between these partitions is not guaranteed.

For example, a consumer might read messages from partition 0, then partition 2, then partition 1, then back to partition 0. The messages from partition 0 will be read in order, but there might be messages from the other partitions mixed among them.

It's important to remember that reading a message does not delete it. The message is still available to be read by any other consumer that needs to access it. It's normal for multiple consumers to read from the same topic if they each have uses for the data in it.

By default, when a consumer starts up it will read from the current offset in a partition. But consumers can also be configured to go back and read from the oldest existing offset.

Consumers deserialize messages, converting them from binary into a collection of key-value pairs that your application can then work with. The format of a message should not change during a topic's lifetime or your producers and consumers won't be able to serialize and deserialize it correctly.

One thing to be aware of is that consumers request (pull) messages from Kafka; Kafka doesn't push messages to them. This protects consumers from becoming overwhelmed if Kafka is handling a high volume of messages. If you want to scale consumers, you can run multiple instances of a consumer together in a consumer group.
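Continuing the kafka-python sketch from the producer section, a matching consumer might look like this; the group_id ties it into a consumer group, which the next section explains (again, the library, topic, and addresses are illustrative assumptions):

import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="order-processors",       # consumers sharing this id form a consumer group
    auto_offset_reset="earliest",      # start from the oldest offset if none has been committed
    key_deserializer=lambda k: k.decode("utf-8"),
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:               # the consumer polls the brokers; Kafka does not push
    print(message.topic, message.partition, message.offset, message.key, message.value)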

Consumer Groups in Kafka

An application that reads from Kafka can create multiple instances of the same consumer to split up the work of reading from different partitions in a topic. These consumers work together as a consumer group.

When you create a consumer, you can assign it a group id. All consumers in a group will have the same group id.

You can create consumer instances in a group up to the number of partitions in a topic. So if you have a topic with 5 partitions, you can create up to 5 instances of the same consumer in a consumer group. If you ever have more consumers in a group than partitions, the extra consumer will remain idle.

Consumers in a consumer group reading messages from a topic's partitions

If you add another consumer instance to a consumer group, Kafka will automatically redistribute the partitions among the consumers in a process called rebalancing.

Each partition is only assigned to one consumer in a group, but a consumer can read from multiple partitions. Also, multiple different consumer groups (meaning different applications) can read from the same topic at the same time.

Kafka brokers use an internal topic called __consumer_offsets to keep track of which messages a specific consumer group has successfully processed.

As a consumer reads from a partition, it regularly saves the offset it has read up to and sends this data to the broker it is reading from. This is called the consumer offset and is handled automatically by most client libraries.

A Consumer committing the offsets it has read up to

If a consumer crashes, the consumer offset helps the remaining consumers to know where to start from when they take over reading from the partition.

The same thing happens if a new consumer is added to the group. The consumer group rebalances, the new consumer is assigned a partition, and it picks up reading from the consumer offset of that partition.

Kafka Zookeeper

One other topic that we briefly need to cover here is how Kafka clusters are managed. Currently this is usually done using Zookeeper, a service for managing and synchronizing distributed systems. Like Kafka, it's maintained by the Apache Software Foundation.

Kafka uses Zookeeper to manage the brokers in a cluster, and requires Zookeeper even if you're running a Kafka cluster with only one broker.

Recently, a proposal has been accepted to remove Zookeeper and have Kafka manage itself (KIP-500), but this is not yet widely used in production.

Zookeeper keeps track of things like:

  • Which brokers are part of a Kafka cluster
  • Which broker is the leader for a given partition
  • How topics are configured, such as the number of partitions and the location of replicas
  • Consumer groups and their members
  • Access Control Lists – who is allowed to write to and read from each topic

A Zookeeper ensemble managing the brokers in a Kafka cluster

Zookeeper itself runs as a cluster called an ensemble. This means that Zookeeper can keep working even if one node in the cluster fails. New data gets written to the ensemble's leader and replicated to the followers. Your Kafka brokers can read this data from any of the Zookeeper nodes in the ensemble.

Now that you understand the main concepts behind Kafka, let's get some hands-on practice working with Kafka.

You're going to install Kafka on your own computer, practice interacting with Kafka brokers from the command line, and then build a simple producer and consumer application with Java.

How to Install Kafka on Your Computer

At the time of writing this guide, the latest stable version of Kafka is 3.3.1. Check kafka.apache.org/downloads to see if there is a more recent stable version. If there is, you can replace "3.3.1" with the latest stable version in all of the following instructions.

Install Kafka on macOS

If you're using macOS, I recommend using Homebrew to install Kafka. It will make sure you have Java installed before it installs Kafka.

If you don't already have Homebrew installed, install it by following the instructions at brew.sh.

Next, run brew install kafka in a terminal. This will install Kafka's binaries at /usr/local/bin.

Finally, run kafka-topics --version in a terminal and you should see 3.3.1. If you do, you're all set.

If you installed Kafka by downloading the binaries rather than through Homebrew, you can add Kafka's bin directory to the PATH environment variable to make the commands easier to run. Open your ~/.bashrc (if using Bash) or ~/.zshrc (if using Zsh) and add the following line, replacing USERNAME with your username and the path with wherever you extracted Kafka:

PATH="$PATH:/Users/USERNAME/kafka_2.13-3.3.1/bin"

You'll need to open a new terminal for this change to take effect.

Now, if you run echo $PATH you should see that the Kafka bin directory has been added to your path.

Install Kafka on Windows (WSL2) and Linux

Kafka isn't natively supported on Windows, so you will need to use either WSL2 or Docker. I'm going to show you WSL2, since the steps are the same as on Linux.

To set up WSL2 on Windows, follow the instructions in the official docs.

From here on, the instructions are the same for both WSL2 and Linux.

First, install Java 11 by running the following commands:

wget -O- https://apt.corretto.aws/corretto.key | sudo apt-key add - 

sudo add-apt-repository 'deb https://apt.corretto.aws stable main'

sudo apt-get update; sudo apt-get install -y java-11-amazon-corretto-jdk

Once this has finished, run java -version and you should see something like:

openjdk version "11.0.17" 2022-10-18 LTS
OpenJDK Runtime Environment Corretto-11.0.17.8.1 (build 11.0.17+8-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.17.8.1 (build 11.0.17+8-LTS, mixed mode)

From your root directory, download Kafka with the following command:

wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.13-3.3.1.tgz

The 2.13 means it is using version 2.13 of Scala, while 3.3.1 refers to the Kafka version.

Extract the contents of the download with:

tar xzf kafka_2.13-3.3.1.tgz

If you run ls, you'll now see kafka_2.13-3.3.1 in your root directory.

To make it easier to work with Kafka, you can add Kafka to the PATH environment variable. Open your ~/.bashrc (if using Bash) or ~/.zshrc (if using Zsh) and add the following line, replacing USERNAME with your username:

PATH="$PATH:home/USERNAME/kafka_2.13-3.3.1/bin"

You'll need to open a new terminal for this change to take effect.

Now, if you run echo $PATH you should see that the Kafka bin directory has been added to your path.

Run kafka-topics.sh --version in a terminal and you should see 3.3.1. If you do, you're all set.

How to Start Zookeeper and Kafka

Since Kafka uses Zookeeper to manage clusters, you need to start Zookeeper before you start Kafka.

How to Start Kafka on macOS

In one terminal window, start Zookeeper with:

/usr/local/bin/zookeeper-server-start /usr/local/etc/zookeeper/zoo.cfg

In another terminal window, start Kafka with:

/usr/local/bin/kafka-server-start /usr/local/etc/kafka/server.properties

While using Kafka, you need to keep both these terminal windows open. Closing them will shut down Kafka.

How to Start Kafka on Windows (WSL2) and Linux

In one terminal window, start Zookeeper with:

~/kafka_2.13-3.3.1/bin/zookeeper-server-start.sh ~/kafka_2.13-3.3.1/config/zookeeper.properties

In another terminal window, start Kafka with:

~/kafka_2.13-3.3.1/bin/kafka-server-start.sh ~/kafka_2.13-3.3.1/config/server.properties

While using Kafka, you need to keep both these terminal windows open. Closing them will shut down Kafka.

Now that you have Kafka installed and running on your machine, it's time to get some hands-on practice.

The Kafka CLI

When you install Kafka, it comes with a Command Line Interface (CLI) that lets you create and manage topics, as well as produce and consume events.

First, make sure Zookeeper and Kafka are running in two terminal windows.

In a third terminal window, run kafka-topics.sh (on WSL2 or Linux) or kafka-topics (on macOS) to make sure the CLI is working. You'll see a list of all the options you can pass to the CLI.

A terminal displaying kafka-topics options

Note: When working with the Kafka CLI, the command will be kafka-topics.sh on WSL2 and Linux. It will be kafka-topics.sh on macOS if you directly installed the Kafka binaries and kafka-topics if you used Homebrew. So if you're using Homebrew, remove the .sh extension from the example commands in this section.

How to List Topics

To see the topics available on the Kafka broker on your local machine, use:

kafka-topics.sh --bootstrap-server localhost:9092 --list

This means "Connect to the Kafka broker running on localhost:9092 and list all topics there". --bootstrap-server refers to the Kafka broker you are trying to connect to and localhost:9092 is the IP address it's running at. You won't see any output since you haven't created any topics yet.

How to Create a Topic

To create a topic (with the default replication factor and number of partitions), use the --create and --topic options and pass them a topic name:

kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_first_topic

If you use an _ or . in your topic name, you will see the following warning:

WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.

Since Kafka could confuse my.first.topic with my_first_topic, it's best to only use either underscores or periods when naming topics.

How to Describe Topics

To describe the topics on a broker, use the --describe option:

kafka-topics.sh --bootstrap-server localhost:9092 --describe

This will print the details of all the topics on this broker, including the number of partitions and their replication factor. By default, these will both be set to 1.

If you add the --topic option and the name of a topic, it will describe only that topic:

kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my_first_topic

How to Partition a Topic

To create a topic with multiple partitions, use the --partitions option and pass it a number:

kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_second_topic --partitions 3

How to Set a Replication Factor

To create a topic with a replication factor higher than the default, use the --replication-factor option and pass it a number:

kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_third_topic --partitions 3 --replication-factor 3

You should get the following error:

ERROR org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.

Since you're only running one Kafka broker on your machine, you can't set a replication factor higher than one. If you were running a cluster with multiple brokers, you could set a replication factor as high as the total number of brokers.
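If you ever want to create topics from application code instead of the CLI, the Java AdminClient can do that as well. A minimal, illustrative sketch, assuming a single local broker (so the replication factor is left at 1; the topic name is made up):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // example topic with 3 partitions and a replication factor of 1
            NewTopic topic = new NewTopic("my_fourth_topic", 3, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
            System.out.println("Topic created");
        }
    }
}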

How to Delete a Topic

To delete a topic, use the --delete option and specify a topic with the --topic option:

kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic my_first_topic

You won't get any output to say the topic was deleted but you can check using --list or --describe.

How to Use kafka-console-producer

You can produce messages to a topic from the command line using kafka-console-producer.

Run kafka-console-producer.sh to see the options you can pass to it.

Terminal showing kafka-console-producer options

To create a producer connected to a specific topic, run:

kafka-console-producer.sh --bootstrap-server localhost:9092 --topic TOPIC_NAME

Let's produce messages to the my_first_topic topic.

kafka-console-producer.sh --bootstrap-server localhost:9092 --topic my_first_topic

Your prompt will change and you will be able to type text. Press enter to send that message. You can keep sending messages until you press ctrl + c.

Sending messages using kafka-console-producer

If you produce messages to a topic that doesn't exist, you'll get a warning, but the topic will be created and the messages will still get sent. It's better to create a topic in advance, however, so you can specify partitions and replication.

By default, the messages sent from kafka-console-producer have their keys set to null, and so they will be evenly distributed to all partitions.

You can set a key by using the --property option to set parse.key to true and key.separator to a character such as a colon (:).

For example, we can create a books topic and use the books' genre as a key.

kafka-topics.sh --bootstrap-server localhost:9092 --topic books --create

kafka-console-producer.sh --bootstrap-server localhost:9092 --topic books --property parse.key=true --property key.separator=:

Now you can enter keys and values in the format key:value. Anything to the left of the key separator will be interpreted as a message key, anything to the right as a message value.

science_fiction:All Systems Red
fantasy:Uprooted
horror:Mexican Gothic

Producing messages with keys and values

Now that you've produced messages to a topic from the command line, it's time to consume those messages from the command line.

How to Use kafka-console-consumer

You can consume messages from a topic from the command line using kafka-console-consumer.

Run kafka-console-consumer.sh to see the options you can pass to it.

Terminal showing kafka-console-consumer options

To create a consumer, run:

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic TOPIC_NAME

When you start a consumer, by default it will read messages as they are written to the end of the topic. It won't read messages that were previously sent to the topic.

If you want to read the messages you already sent to a topic, use the --from-beginning option to read from the beginning of the topic:

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my_first_topic --from-beginning

The messages might appear "out of order". Remember, messages are ordered within a partition but ordering can't be guaranteed between partitions. If you don't set a key, they will be sent round robin between partitions and ordering isn't guaranteed.

You can display additional information about messages, such as their key and timestamp, by using the --property option and setting the print property to true.

Use the --formatter option to set the message formatter and the --property option to select which message properties to print.

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my_first_topic --from-beginning --formatter kafka.tools.DefaultMessageFormatter --property print.timestamp=true --property print.key=true --property print.value=true

Consuming messages from a topic

We get the messages' timestamp, key, and value. Since we didn't assign any keys when we sent these messages to my_first_topic, their key is null.

How to Use kafka-consumer-groups

You can run consumers in a consumer group using the Kafka CLI. To view the documentation for this, run:

kafka-consumer-groups.sh

kafka-consumer-groups options

First, create a topic with three partitions. Each consumer in a group will consume from one partition. If there are more consumers than partitions, any extra consumers will be idle.

kafka-topics.sh --bootstrap-server localhost:9092 --topic fantasy_novels --create --partitions 3

You add a consumer to a group when you create it using the --group option. If you run the same command multiple times with the same group name, each new consumer will be added to the group.

To create the first consumer in your consumer group, run:

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic fantasy_novels --group fantasy_consumer_group 

Next, open two new terminal windows and run the same command again to add a second and third consumer to the consumer group.

Three consumers running in a consumer group

In a different terminal window, create a producer and send a few messages with keys to the topic.

Note: Since Kafka 2.4, Kafka will send messages in batches to one "sticky" partition for better performance. In order to demonstrate messages being sent round robin between partitions (without sending a large volume of messages), we can set the partitioner to RoundRobinPartitioner.

kafka-console-producer.sh --bootstrap-server localhost:9092 --topic fantasy_novels --property parse.key=true --property key.separator=: --property partitioner.class=org.apache.kafka.clients.producer.RoundRobinPartitioner

tolkien:The Lord of the Rings
le_guin:A Wizard of Earthsea
leckie:The Raven Tower
de_bodard:The House of Shattered Wings
okorafor:Who Fears Death
liu:The Grace of Kings

Messages spread between consumers in a consumer group

If you stop one of the consumers, the consumer group will rebalance and future messages will be sent to the remaining consumers.

Now that you have some experience working with Kafka from the command line, the next step is to build a small application that connects to Kafka.

How to Build a Kafka Client App with Java

We're going to build a simple Java app that both produces messages to and consumes messages from Kafka. For this we'll use the official Kafka Java client.

If at any point you get stuck, the full code for this project is available on GitHub.

Preliminaries

First of all, make sure you have Java (at least JDK 11) and Kafka installed.

We're going to send messages about characters from The Lord of the Rings. So let's create a topic for these messages with three partitions.

From the command line, run:

kafka-topics.sh --bootstrap-server localhost:9092 --create --topic lotr_characters --partitions 3

How to Set Up the Project

I recommend using IntelliJ for Java projects, so go ahead and install the Community Edition if you don't already have it. You can download it from jetbrains.com/idea.

In IntelliJ, select File, New, and Project.

Give your project a name and select a location for it on your computer. Make sure you have selected Java as the language, Maven as the build system, and that the JDK is at least Java 11. Then click Create.

Setting up a Maven project in IntelliJ

Note: If you're on Windows, IntelliJ can't use a JDK installed on WSL. To install Java on the Windows side of things, go to docs.aws.amazon.com/corretto/latest/corretto-11-ug/downloads-list and download the Windows installer. Follow the installation steps, open a command prompt, and run java -version. You should see something like:

openjdk version "11.0.18" 2023-01-17 LTS
OpenJDK Runtime Environment Corretto-11.0.18.10.1 (build 11.0.18+10-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.18.10.1 (build 11.0.18+10-LTS, mixed mode)

Once your Maven project finishes setting up, run the Main class to see "Hello world!" and make sure everything worked.

How to Install the Dependencies

Next, we're going to install our dependencies. Open up pom.xml and inside the <project> element, create a <dependencies> element.

We're going to use the Java Kafka client for interacting with Kafka and SLF4J for logging, so add the following inside your <dependencies> element:

<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->  
<dependency>  
    <groupId>org.apache.kafka</groupId>  
    <artifactId>kafka-clients</artifactId>  
    <version>3.3.1</version>  
</dependency>  
<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-api -->  
<dependency>  
    <groupId>org.slf4j</groupId>  
    <artifactId>slf4j-api</artifactId>  
    <version>2.0.6</version>  
</dependency>  
<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-simple -->  
<dependency>  
    <groupId>org.slf4j</groupId>  
    <artifactId>slf4j-simple</artifactId>  
    <version>2.0.6</version>  
</dependency>

The package names and version numbers might be red, meaning you haven't downloaded them yet. If this happens, click on View, Tool Windows, and Maven to open the Maven menu. Click on the Reload All Maven Projects icon and Maven will install these dependencies.

Reloading Maven dependencies in IntelliJ

Create a HelloKafka class in the same directory as your Main class and give it the following contents:

package org.example;

import org.slf4j.Logger;  
import org.slf4j.LoggerFactory;  
  
public class HelloKafka {  
    private static final Logger log = LoggerFactory.getLogger(HelloKafka.class);  
  
    public static void main(String[] args) {  
        log.info("Hello Kafka");  
    }  
}

To make sure your dependencies are installed, run this class and you should see [main] INFO org.example.HelloKafka - Hello Kafka printed to the IntelliJ console.

How to Create a Kafka Producer

Next, we're going to create a Producer class. You can call this whatever you want as long as it doesn't clash with another class. So don't use KafkaProducer as you'll need that class in a minute.

package org.example;  
  
import org.slf4j.Logger;  
import org.slf4j.LoggerFactory;  
  
public class Producer {  
    private static final Logger log = LoggerFactory.getLogger(Producer.class);  
  
    public static void main(String[] args) {  
        log.info("This class will produce messages to Kafka");  
    }  
}

All of our Kafka-specific code is going to go inside this class's main() method.

The first thing we need to do is configure a few properties for the producer. Add the following inside the main() method:

Properties properties = new Properties(); 

properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  
properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());  
properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

Properties stores a set of properties as pairs of strings. The ones we're using are:

  • ProducerConfig.BOOTSTRAP_SERVERS_CONFIG which specifies the IP address to use to access the Kafka cluster
  • ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG which specifies the serializer to use for message keys
  • ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG which specifies the serializer to use for message values

We're going to connect to our local Kafka cluster running on localhost:9092, and use the StringSerializer since both our keys and values will be strings.

Now we can create our producer and pass it the configuration properties.

KafkaProducer<String, String> producer = new KafkaProducer<>(properties);

To send a message, we need to create a ProducerRecord and pass it to our producer. A ProducerRecord contains a topic name and a value, and optionally a key and a partition number.

We're going to create the ProducerRecord with the topic to use, the message's key, and the message's value.

ProducerRecord<String, String> producerRecord = new ProducerRecord<>("lotr_characters", "hobbits", "Bilbo");

We can now use the producer's send() method to send the message to Kafka.

producer.send(producerRecord);

Finally, we need to call the close() method to stop the producer. This method handles any messages currently being processed by send() and then closes the producer.

producer.close();

Now it's time to run our producer. Make sure you have Zookeeper and Kafka running. Then run the main() method of the Producer class.

Sending a message from a producer in a Java Kafka client app

Note: On Windows, your producer might not be able to connect to a Kafka broker running on WSL. To fix this, you're going to need to do the following:

  • In a WSL terminal, navigate to Kafka's config folder: cd ~/kafka_2.13-3.3.1/config/
  • Open server.properties, for example with Nano: nano server.properties
  • Uncomment #listeners=PLAINTEXT://:9092
  • Replace it with listeners=PLAINTEXT://[::1]:9092
  • In your Producer class, replace "localhost:9092" with "[::1]:9092"

[::1], or 0:0:0:0:0:0:0:1, refers to the loopback address (or localhost) in IPv6. This is equivalent to 127.0.0.1 in IPv4.

If you change listeners, when you try to access the Kafka broker from the command line you'll also have to use the new IP address, so use --bootstrap-server ::1:9092 instead of --bootstrap-server localhost:9092 and it should work.

We can now check that Producer worked by using kafka-console-consumer in another terminal window to read from the lotr_characters topic and see the message printed to the console.

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic lotr_characters --from-beginning

kafka-console-consumer reading the message sent by the producer in our Java app

How to Send Multiple Messages and Use Callbacks

So far we're only sending one message. If we update Producer to send multiple messages, we'll be able to see how keys are used to divide messages between partitions. We can also take this opportunity to use a callback to view the sent message's metadata.

To do this, we're going to loop over a collection of characters to generate our messages. Because several characters share the same key (their race), we'll store them as a list of key/value pairs rather than in a map, where duplicate keys would overwrite each other.

So replace this:

ProducerRecord<String, String> producerRecord = new ProducerRecord<>("lotr_characters", "hobbits", "Bilbo");  

producer.send(producerRecord);

with this:

List<Map.Entry<String, String>> characters = List.of(
        Map.entry("hobbits", "Frodo"),
        Map.entry("hobbits", "Sam"),
        Map.entry("elves", "Galadriel"),
        Map.entry("elves", "Arwen"),
        Map.entry("humans", "Éowyn"),
        Map.entry("humans", "Faramir"));

for (Map.Entry<String, String> character : characters) {  
    ProducerRecord<String, String> producerRecord = new ProducerRecord<>("lotr_characters", character.getKey(), character.getValue());  
  
    producer.send(producerRecord, (RecordMetadata recordMetadata, Exception err) -> {  
        if (err == null) {  
            log.info("Message received. \n" +  
                    "topic [" + recordMetadata.topic() + "]\n" +  
                    "partition [" + recordMetadata.partition() + "]\n" +  
                    "offset [" + recordMetadata.offset() + "]\n" +  
                    "timestamp [" + recordMetadata.timestamp() + "]");  
        } else {  
            log.error("An error occurred while producing messages", err);  
        }  
    });  
}

Here, we're iterating over the collection, creating a ProducerRecord for each entry, and passing the record to send(). Behind the scenes, Kafka will batch these messages together to make fewer network requests. send() can also take a callback as a second argument. We're going to pass it a lambda which will run code when the send() request completes.

If the request completed successfully, we get back a RecordMetadata object with metadata about the message, which we can use to see things such as the partition and offset the message ended up in.

If we get back an exception, we could handle it by retrying to send the message, or alerting our application. In this case, we're just going to log the exception.
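If retrying is the behaviour you want when a send fails, the producer can also be configured to retry for you instead of (or in addition to) handling it in the callback. The settings below are illustrative only and are not added to the full listing later in this guide; recent Kafka clients already retry transient failures by default, so treat this as a sketch of which knobs exist rather than recommended values:

// Added alongside the other producer properties shown earlier.
// acks=all waits for the in-sync replicas to acknowledge each message,
// delivery.timeout.ms bounds the total time spent retrying a send,
// and enable.idempotence prevents duplicate records when a retried send
// was in fact already written by the broker.
properties.setProperty(ProducerConfig.ACKS_CONFIG, "all");
properties.setProperty(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");
properties.setProperty(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
properties.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");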

Run the main() method of the Producer class and you should see the message metadata get logged.

The producer logging the metadata of each message it sends

The full code for the Producer class should now be:

package org.example;  
  
import org.apache.kafka.clients.producer.KafkaProducer;  
import org.apache.kafka.clients.producer.ProducerConfig;  
import org.apache.kafka.clients.producer.ProducerRecord;  
import org.apache.kafka.clients.producer.RecordMetadata;  
import org.apache.kafka.common.serialization.StringSerializer;  
import org.slf4j.Logger;  
import org.slf4j.LoggerFactory;  
  
import java.util.List;  
import java.util.Map;  
import java.util.Properties;  
  
public class Producer {  
    private static final Logger log = LoggerFactory.getLogger(Producer.class);  
  
    public static void main(String[] args) {  
        log.info("This class produces messages to Kafka");  
   
        Properties properties = new Properties();
        properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); 
        properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());  
        properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());  
   
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);  
    
        List<Map.Entry<String, String>> characters = List.of(
                Map.entry("hobbits", "Frodo"),
                Map.entry("hobbits", "Sam"),
                Map.entry("elves", "Galadriel"),
                Map.entry("elves", "Arwen"),
                Map.entry("humans", "Éowyn"),
                Map.entry("humans", "Faramir"));

        for (Map.Entry<String, String> character : characters) {  
            ProducerRecord<String, String> producerRecord = new ProducerRecord<>("lotr_characters", character.getKey(), character.getValue());  
  
            producer.send(producerRecord, (RecordMetadata recordMetadata, Exception err) -> {  
                if (err == null) {  
                    log.info("Message received. \n" +  
                            "topic [" + recordMetadata.topic() + "]\n" +  
                            "partition [" + recordMetadata.partition() + "]\n" +  
                            "offset [" + recordMetadata.offset() + "]\n" +  
                            "timestamp [" + recordMetadata.timestamp() + "]");  
                } else {  
                    log.error("An error occurred while producing messages", err);  
                }  
            });  
        }
        producer.close();  
    }  
}

Next, we're going to create a consumer to read these messages from Kafka.

How to Create a Kafka Consumer

First, create a Consumer class. Again, you can call it whatever you want, but don't call it KafkaConsumer as you will need that class in a moment.

All the Kafka-specific code will go in Consumer's main() method.

package org.example;  
  
import org.slf4j.Logger;  
import org.slf4j.LoggerFactory;  
  
public class Consumer {  
    private static final Logger log = LoggerFactory.getLogger(Consumer.class);  
  
    public static void main(String[] args) {  
        log.info("This class consumes messages from Kafka");  
    }  
}

Next, configure the consumer properties.

Properties properties = new Properties();  
properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  
properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());  
properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());  
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "lotr_consumer_group");  
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

Just like with Producer, these properties are a set of string pairs. The ones we're using are:

  • ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG which specifies the IP address to use to access the Kafka cluster
  • ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG which specifies the deserializer to use for message keys
  • ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG which specifies the deserializer to use for message values
  • ConsumerConfig.GROUP_ID_CONFIG which specifies the consumer group this consumer belongs to
  • ConsumerConfig.AUTO_OFFSET_RESET_CONFIG which specifies the offset to start reading from

We're connecting to the Kafka cluster on localhost:9092, using string deserializers since our keys and values are strings, setting a group id for our consumer, and telling the consumer to read from the start of the topic.

Note: If you're running the consumer on Windows and accessing a Kafka broker running on WSL, you'll need to change "localhost:9092" to "[::1]:9092" or "0:0:0:0:0:0:0:1:9092", like you did in Producer.

Next, we create a KafkaConsumer and pass it the configuration properties.

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);

We need to tell the consumer which topic, or topics, to subscribe to. The subscribe() method takes in a collection of one or more strings, naming the topics you want to read from. Remember, consumers can subscribe to more than one topic at the same time. For this example, we'll use one topic, the lotr_characters topic.

String topic = "lotr_characters";  
 
consumer.subscribe(Arrays.asList(topic));

The consumer is now ready to start reading messages from the topic. It does this by regularly polling for new messages.

We'll use a while loop to repeatedly call the poll() method to check for new messages.

poll() takes a duration that sets how long it will wait for new records before returning. The records it fetches come back in an iterable called ConsumerRecords, which we can loop over to handle each individual ConsumerRecord.

In a real-world application, we would process this data or send it to some further destination, like a database or data pipeline. Here, we're just going to log the key, value, partition, and offset for each message we receive.

while(true){  
    ConsumerRecords<String, String> messages = consumer.poll(Duration.ofMillis(100));  
  
    for (ConsumerRecord<String, String> message : messages){  
        log.info("key [" + message.key() + "] value [" + message.value() +"]");  
        log.info("partition [" + message.partition() + "] offset [" + message.offset() + "]");  
    }  
}

Now it's time to run our consumer. Make sure you have Zookeeper and Kafka running. Run the Consumer class and you'll see the messages that Producer previously sent to the lotr_characters topic in Kafka.

The Kafka client app consuming messages that were previously produced to Kafka

How to Shut Down the Consumer

Right now, our consumer is running in an infinite loop and polling for new messages every 100 ms. This isn't a problem, but we should add safeguards to handle shutting down the consumer if an exception occurs.

We're going to wrap our code in a try-catch-finally block. If an exception occurs, we can handle it in the catch block.

The finally block will then call the consumer's close() method. This will close the socket the consumer is using, commit the offsets it has processed, and trigger a consumer group rebalance so any other consumers in the group can take over reading the partitions this consumer was handling.

try {
    // subscribe to topic(s)
    String topic = "lotr_characters";
    consumer.subscribe(Arrays.asList(topic));

    while (true) {
        // poll for new messages
        ConsumerRecords<String, String> messages = consumer.poll(Duration.ofMillis(100));

        // handle message contents
        for (ConsumerRecord<String, String> message : messages) {
            log.info("key [" + message.key() + "] value [" + message.value() + "]");
            log.info("partition [" + message.partition() + "] offset [" + message.offset() + "]");
        }
    }
} catch (Exception err) {
    // catch and handle exceptions
    log.error("Error: ", err);
} finally {
    // close consumer and commit offsets
    consumer.close();
    log.info("consumer is now closed");
}

Consumer will continuously poll its assigned topics for new messages and shut down safely if it experiences an exception.
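One optional refinement (not included in the full listing below) is a way to stop the loop deliberately, for example when the process is asked to shut down. The consumer's wakeup() method makes a blocked poll() throw a WakeupException, which falls into the catch block and then the finally block, where close() runs. A minimal sketch of a JVM shutdown hook that could be registered just before the try block:

// Registering this before the poll loop lets Ctrl + C stop the consumer cleanly.
// wakeup() forces the blocked poll() to throw, so the finally block closes the
// consumer and commits its offsets; the hook then waits for main() to finish.
final Thread mainThread = Thread.currentThread();
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    consumer.wakeup();
    try {
        mainThread.join();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}));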

The full code for the Consumer class should now be:

package org.example;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class Consumer {
    private static final Logger log = LoggerFactory.getLogger(Consumer.class);

    public static void main(String[] args) {
        log.info("This class consumes messages from Kafka");

        Properties properties = new Properties();
        properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "lotr_consumer_group");
        properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);

        try {
            String topic = "lotr_characters";
            consumer.subscribe(Arrays.asList(topic));

            while (true) {
                ConsumerRecords<String, String> messages = consumer.poll(Duration.ofMillis(100));

                for (ConsumerRecord<String, String> message : messages) {
                    log.info("key [" + message.key() + "] value [" + message.value() + "]");
                    log.info("partition [" + message.partition() + "] offset [" + message.offset() + "]");
                }
            }
        } catch (Exception err) {
            log.error("Error: ", err);
        } finally {
            consumer.close();
            log.info("The consumer is now closed");
        }
    }
}

You now have a basic Java application that can send messages to and read messages from Kafka. If you got stuck at any point, the full code is available on GitHub.

Where to Take it from Here

Congratulations on making it this far. You've learned:

  • the main concepts behind Kafka
  • how to communicate with Kafka from the command line
  • how to build a Java app that produces to and consumes from Kafka

There's plenty more to learn about Kafka, whether that's Kafka Connect for connecting Kafka to common data systems or the Kafka Streams API for processing and transforming your data.

I hope this guide has been helpful and made you excited to learn more about Kafka, event streaming, and real-time data processing.

Original article source at https://www.freecodecamp.org

#apache #kafka #java 

Everything You Need to Know About Kafka - The Apache Kafka Handbook
Hermann Frami

Camel-k: A Lightweight integration Platform, Born on Kubernetes

🐫 + ☁️ = Apache Camel K

Apache Camel K is a lightweight integration framework built from Apache Camel that runs natively on Kubernetes and is specifically designed for serverless and microservice architectures. Users of Camel K can instantly run integration code written in Camel DSL on their preferred Cloud provider.

How does it work?

⚙️ Installation

Camel K allows you to run integrations directly on a Kubernetes or OpenShift cluster. To use it, you need to be connected to a cloud environment or to a local cluster created for development purposes.

Installation procedure.

▶️ Running an Integration

You can use Camel DSL to define your Integration. Just save it in a file and use the kamel command line interface (download the latest release) to run it. As an example, just try running:

hello.groovy

from('timer:tick?period=3000')
  .setBody().constant('Hello world from Camel K')
  .to('log:info')

kamel run hello.groovy

You can even run your integrations in dev mode: change the code and see the changes applied instantly to the remote integration pod! We have provided more examples that you can use to inspire your next integration.

See more details.

🐫 All the power from Apache Camel components

You can use any of the Apache Camel components available. The related dependencies will be resolved automatically.

Discover more about dependencies and components.

☕ Not Just Java

Camel K supports multiple languages for writing integrations.

See all the languages available.

🏁 Traits

The details of how the integration is mapped into Kubernetes resources can be customized using traits.

More information is provided in the official documentation traits section.

☁️ Engineered thinking on Cloud Native

Since the inception of the project, our goal has been to bring Apache Camel to the cloud.

See the software architecture details.

❤️ Contributing

We love contributions and we want to make Camel K great!

Contributing is easy, just take a look at our developer’s guide.

Download Details:

Author: Apache
Source Code: https://github.com/apache/camel-k 
License: Apache-2.0 license

#serverless #apache #kubernetes #integration 

Camel-k: A Lightweight integration Platform, Born on Kubernetes

How to Customize Apache ShardingSphere high availability with MySQL

Learn how and why ShardingSphere can achieve database high availability using MySQL as an example.

Users have many options to customize and extend ShardingSphere's high availability (HA) solutions. Our team has completed two HA plans: A MySQL high availability solution based on MGR and an openGauss database high availability solution contributed by some community committers. The principles of the two solutions are the same.

Below is how and why ShardingSphere can achieve database high availability using MySQL as an example:

ShardingSphere high availability components

(Zhao Jinchao, CC BY-SA 4.0)

Prerequisite

ShardingSphere checks if the underlying MySQL cluster environment is ready by executing the following SQL statement. ShardingSphere cannot be started if any of the tests fail.

Check if MGR is installed:

SELECT * FROM information_schema.PLUGINS WHERE PLUGIN_NAME='group_replication'

View the MGR group member number. The underlying MGR cluster should consist of at least three nodes:

SELECT COUNT(*) FROM performance_schema.replication_group_members

Check whether the MGR cluster's group name is consistent with that in the configuration. The group name is the marker of an MGR group, and each group of an MGR cluster only has one group name:

SELECT * FROM performance_schema.global_variables WHERE VARIABLE_NAME='group_replication_group_name' 

Check if the current MGR is set as the single primary mode. Currently, ShardingSphere does not support dual-write or multi-write scenarios. It only supports single-write mode:

SELECT * FROM performance_schema.global_variables WHERE VARIABLE_NAME='group_replication_single_primary_mode'

Query all the node hosts, ports, and states in the MGR group cluster to check if the configured data source is correct:

SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE FROM performance_schema.replication_group_members

Dynamic primary database discovery

ShardingSphere finds the primary database URL according to the query master database SQL command provided by MySQL:

private String findPrimaryDataSourceURL(final Map<String, DataSource> dataSourceMap) {
    String result = "";
    String sql = "SELECT MEMBER_HOST, MEMBER_PORT FROM performance_schema.replication_group_members WHERE MEMBER_ID = "
            + "(SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'group_replication_primary_member')";
    for (DataSource each : dataSourceMap.values()) {
        try (Connection connection = each.getConnection();
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery(sql)) {
            if (resultSet.next()) {
                return String.format("%s:%s", resultSet.getString("MEMBER_HOST"), resultSet.getString("MEMBER_PORT"));
            }
        } catch (final SQLException ex) {
            log.error("An exception occurred while find primary data source url", ex);
        }
    }
    return result;
}

Compare the primary database URLs found above one by one with the dataSources URLs configured. The matched data source is the primary database. It will be updated to the current ShardingSphere memory and be perpetuated to the registry center, through which it will be distributed to other compute nodes in the cluster.

registry center

(Zhao Jinchao, CC BY-SA 4.0)

Dynamic secondary database discovery

There are two types of secondary database states in ShardingSphere: enable and disable. The secondary database state will be synchronized to the ShardingSphere memory to ensure that read traffic can be routed correctly.

Get all the nodes in the MGR group:

SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE FROM performance_schema.replication_group_members

Disable secondary databases:

private void determineDisabledDataSource(final String schemaName, final Map<String, DataSource> activeDataSourceMap,
                                         final List<String> memberDataSourceURLs, final Map<String, String> dataSourceURLs) {
    for (Entry<String, DataSource> entry : activeDataSourceMap.entrySet()) {
        boolean disable = true;
        String url = null;
        try (Connection connection = entry.getValue().getConnection()) {
            url = connection.getMetaData().getURL();
            for (String each : memberDataSourceURLs) {
                if (null != url && url.contains(each)) {
                    disable = false;
                    break;
                }
            }
        } catch (final SQLException ex) {
            log.error("An exception occurred while find data source urls", ex);
        }
        if (disable) {
            ShardingSphereEventBus.getInstance().post(new DataSourceDisabledEvent(schemaName, entry.getKey(), true));
        } else if (!url.isEmpty()) {
            dataSourceURLs.put(entry.getKey(), url);
        }
    }
}

Whether the secondary database is disabled is based on the data source configured and all the nodes in the MGR group.

ShardingSphere can check one by one whether the data source configured can obtain Connection properly and verify whether the data source URL contains nodes of the MGR group.

If Connection cannot be obtained or the verification fails, ShardingSphere will disable the data source by an event trigger and synchronize it to the registry center.

Enable secondary databases:

private void determineEnabledDataSource(final Map<String, DataSource> dataSourceMap, final String schemaName,
                                        final List<String> memberDataSourceURLs, final Map<String, String> dataSourceURLs) {
    for (String each : memberDataSourceURLs) {
        boolean enable = true;
        for (Entry<String, String> entry : dataSourceURLs.entrySet()) {
            if (entry.getValue().contains(each)) {
                enable = false;
                break;
            }
        }
        if (!enable) {
            continue;
        }
        for (Entry<String, DataSource> entry : dataSourceMap.entrySet()) {
            String url;
            try (Connection connection = entry.getValue().getConnection()) {
                url = connection.getMetaData().getURL();
                if (null != url && url.contains(each)) {
                    ShardingSphereEventBus.getInstance().post(new DataSourceDisabledEvent(schemaName, entry.getKey(), false));
                    break;
                }
            } catch (final SQLException ex) {
                log.error("An exception occurred while find enable data source urls", ex);
            }
        }
    }
}

After the crashed secondary database is recovered and added to the MGR group, the configuration will be checked to see whether the recovered data source is used. If yes, the event trigger will tell ShardingSphere that the data source needs to be enabled.

Heartbeat mechanism

The heartbeat mechanism is introduced to the HA module to ensure that the primary-secondary states are synchronized in real-time.

By integrating the ShardingSphere sub-project ElasticJob, the above processes are executed by the ElasticJob scheduler framework in the form of Job when the HA module is initialized, thus achieving the separation of function development and job scheduling.

Even if developers need to extend the HA function, they do not need to care about how jobs are developed and operated:

private void initHeartBeatJobs(final String schemaName, final Map<String, DataSource> dataSourceMap) {
    Optional<ModeScheduleContext> modeScheduleContext = ModeScheduleContextFactory.getInstance().get();
    if (modeScheduleContext.isPresent()) {
        for (Entry<String, DatabaseDiscoveryDataSourceRule> entry : dataSourceRules.entrySet()) {
            Map<String, DataSource> dataSources = dataSourceMap.entrySet().stream().filter(dataSource -> !entry.getValue().getDisabledDataSourceNames().contains(dataSource.getKey()))
                    .collect(Collectors.toMap(Entry::getKey, Entry::getValue));
            CronJob job = new CronJob(entry.getValue().getDatabaseDiscoveryType().getType() + "-" + entry.getValue().getGroupName(),
                each -> new HeartbeatJob(schemaName, dataSources, entry.getValue().getGroupName(), entry.getValue().getDatabaseDiscoveryType(), entry.getValue().getDisabledDataSourceNames())
                            .execute(null), entry.getValue().getHeartbeatProps().getProperty("keep-alive-cron"));
            modeScheduleContext.get().startCronJob(job);
        }
    }
}

Wrap up

So far, Apache ShardingSphere's HA feature has proven applicable for MySQL and openGauss HA solutions. Moving forward, it will integrate more MySQL HA products and support more database HA solutions.

As always, if you're interested, you're more than welcome to join us and contribute to the Apache ShardingSphere project.

Original article source at: https://opensource.com/

#mysql #apache 

How to Customize Apache ShardingSphere high availability with MySQL
Gordon Murray

Install PHP on Windows 10 and 11 (with Apache & MySQL)

This article explains how to install PHP 8.2 and Apache 2.4 on Windows 10 or 11 (64-bit).

Linux and macOS users often have Apache and PHP pre-installed or available via package managers. Windows requires a little more effort. The steps below may work with other editions of Windows, PHP, and Apache, but check the documentation of each dependency for specific instructions.

Why PHP?

PHP remains the most widespread and popular server-side programming language on the Web. It’s installed by most web hosts, and has a simple learning curve, close ties with the MySQL database, superb documentation, and a wide collection of libraries to cut your development time. PHP may not be perfect, but you should consider it for your next web application. It’s the language of choice for Facebook, Slack, Wikipedia, MailChimp, Etsy, and WordPress (the content management system which powers almost 45% of the web).

Why Install PHP Locally?

Installing PHP on your development PC allows you to create and test websites and applications without affecting the data or systems on your live server.

Alternative Installation Options

Before you jump in, there may be a simpler installation option…

Using an all-in-one package

All-in-one packages are available for Windows. They contain Apache, PHP, MySQL, and other useful dependencies in a single installation file. These packages include XAMPP, WampServer and Web.Developer.

These packages are easy to use, but they may not match your live server environment. Installing Apache and PHP manually will help you learn more about the system and configuration options.

Using a Linux virtual machine

Microsoft Hyper-V (provided in Windows Professional) and VirtualBox are free hypervisors which emulate a PC so you can install another operating system.

You can install any version of Linux, then follow its Apache and PHP installation instructions. Alternatively, distros such as Ubuntu Server provide them as standard (although they may not be the latest editions).

Using Windows Subsystem for Linux 2

WSL2 is also a virtual machine, but it’s tightly integrated into Windows so activities such as file sharing and localhost resolution are seamless. You can install a variety of Linux distros, so refer to the appropriate Apache and PHP instructions.

Using Docker

Docker creates a wrapper (known as a container) around pre-configured application dependencies such as Apache, PHP, MySQL, MongoDB, and most other web software. Containers look like full Linux Virtual Machines but are considerably more lightweight.

Once you’ve installed Docker Desktop on Windows, it’s easy to download, configure, and run Apache and PHP.

Docker is currently considered the best option for setting up a PHP development environment. Check out SitePoint’s article Setting Up a Modern PHP Development Environment with Docker for a complete guide to setting it up.

Installing Apache (optional)

The following sections describe how to install Apache and PHP directly on Windows.

PHP provides a built-in web server, which you can launch by navigating to a folder and running the PHP executable with an -S parameter to set the localhost port. For example:

cd myproject
php -S localhost:8000

You can then view PHP pages in a browser at http://localhost:8000.

This may be adequate for quick tests, but your live server will use Apache or similar web server software. Emulating that environment as closely as possible permits more advanced customization and should prevent development errors.

To install Apache, download the latest Win64 ZIP file from https://www.apachelounge.com/download/ and extract its Apache24 folder to the root of your C: drive. You’ll also need to install the Visual C++ Redistributable for Visual Studio 2015–2020 (vc_redist_x64); the page has a link at the top.

Open a cmd command prompt (not PowerShell) and start Apache with:

cd C:\Apache24\bin
httpd

You may need to accept a firewall exception before the server starts to run. Open http://localhost in a browser and an “It works!” message should appear. Note:

C:\Apache24\conf\httpd.conf is Apache’s configuration file if you need to change server settings.

C:\Apache24\htdocs is the web server’s root content folder. It contains a single index.html file with the “It works!” message.

If Apache fails to start, another application could be hogging port 80. (Skype is the prime candidate, and the Windows app won’t let you disable it!) If this occurs, edit C:\Apache24\conf\httpd.conf and change the line Listen 80 to Listen 8080 or any other free port. Restart Apache and, from that point onward, you can load web files at http://localhost:8080.

Stop the server by pressing Ctrl + C in the cmd terminal. The ReadMe file in the ZIP also provides instructions for installing Apache as a Windows service so it auto-starts on boot.

Installing PHP

Install PHP by following the steps below. Note that there’s more than one way to configure Apache and PHP, but this is possibly the quickest method.

Step 1: Download the PHP files

Get the latest PHP x64 Thread Safe ZIP package from https://windows.php.net/download/.

Step 2: Extract the files

Create a new php folder in the root of your C:\ drive and extract the content of the ZIP into it.

You can install PHP anywhere on your system, but you’ll need to change the paths referenced below if you use anything other than C:\php.

Step 3: Configure php.ini

PHP’s configuration file is php.ini. This doesn’t exist initially, so copy C:\php\php.ini-development to C:\php\php.ini. This default configuration provides a development setup which reports all PHP errors and warnings.

You can edit php.ini in a text editor, and you may need to change lines such as those suggested below (use search to find the setting). In most cases, you’ll need to remove a leading semicolon (;) to uncomment a value.

First, enable any required extensions according to the libraries you want to use. The following extensions should be suitable for most applications including WordPress:

extension=curl
extension=gd
extension=mbstring
extension=pdo_mysql

If you want to send emails using PHP’s mail() function, enter the details of an SMTP server in the [mail function] section (your ISP’s settings should be suitable):

[mail function]
; For Win32 only.
; http://php.net/smtp
SMTP = mail.myisp.com
; http://php.net/smtp-port
smtp_port = 25

; For Win32 only.
; http://php.net/sendmail-from
sendmail_from = my@emailaddress.com

Step 4: Add C:\php to the PATH environment variable

To ensure Windows can find the PHP executable, you must add it to the PATH environment variable. Click the Windows Start button and type “environment”, then click Edit the system environment variables. Select the Advanced tab, and click the Environment Variables button.

Scroll down the System variables list and click Path, followed by the Edit button. Click New and add C:\php.

PHP path environment variable

Note that older editions of Windows provide a single text box with paths separated by semi-colons (;).

Now OK your way out. You shouldn’t need to reboot, but you may need to close and restart any cmd terminals you have open.

Step 5: Configure PHP as an Apache module

Ensure Apache is not running and open its C:\Apache24\conf\httpd.conf configuration file in a text editor. Add the following lines to the bottom of the file to set PHP as an Apache module (change the file locations if necessary but use forward slashes rather than Windows backslashes):

# PHP8 module
PHPIniDir "C:/php"
LoadModule php_module "C:/php/php8apache2_4.dll"
AddType application/x-httpd-php .php

Optionally, change the DirectoryIndex setting to use index.php as the default in preference to index.html. The initial setting is:

<IfModule dir_module>
    DirectoryIndex index.html
</IfModule>

Change it to:

<IfModule dir_module>
    DirectoryIndex index.php index.html
</IfModule>

Save httpd.conf and test the updates from a cmd command line:

cd C:\Apache24\bin
httpd -t

Syntax OK will appear … unless you have errors in your configuration.

If all went well, start Apache with httpd.

Step 6: Test a PHP file

Create a new file named index.php in Apache’s web page root folder at C:\Apache24\htdocs. Add the following PHP code:

<?php
phpinfo();
?>

Open a web browser and enter your server address: http://localhost/. A PHP version page should appear, showing all PHP and Apache configuration settings.

You can now create PHP sites and applications in any subfolder of C:\Apache24\htdocs. If you need to work on more than one project, consider defining Apache Virtual Hosts so you can run separate codebases on different localhost subdomains or ports.

Best of luck!

Original article source at: https://www.sitepoint.com/

#window #php #mysql #apache 

Install PHP on Windows 10 and 11 (with Apache & MySQL)

How to Create A Highly Available Distributed Database

Follow this example of ShardingSphere's high availability and dynamic read/write splitting as the basis for your own configurations. 

Modern business systems must be highly available, reliable, and stable in the digital age. As the cornerstone of the current business system, databases are supposed to embrace high availability.

High availability (HA) allows databases to switch services between primary and secondary database nodes. HA automatically selects a primary, picking the best node when the previous one crashes.

MySQL high availability

There are plenty of MySQL high availability options, each with pros and cons. Below are several common high availability options:

  • Orchestrator is a MySQL HA and replication topology management tool written in Go. Its advantage lies in its support for manual adjustment of the primary-secondary topology, automatic failover, and automatic or manual recovery of primary nodes through a graphical web console. However, the program needs to be deployed separately and has a steep learning curve due to its complex configurations.
  • MHA is another mature solution. It provides primary/secondary switching and failover capabilities. The good thing about it is that it can ensure the least data loss in the switching process and works with semi-synchronous and asynchronous replication frameworks. However, only the primary node is monitored after MHA starts, and MHA doesn't provide the load balancing feature for the read database.
  • MGR implements group replication based on the distributed Paxos protocol to ensure data consistency. It is an official HA component provided by MySQL, and no extra deployment program is required. Instead, users only need to install the MGR plugin on each data source node. The tool features high consistency, fault tolerance, scalability, and flexibility.

Apache ShardingSphere high availability

Apache ShardingSphere's architecture actually separates storage from computing. The storage node represents the underlying database, such as MySQL, PostgreSQL, openGauss, etc., while compute node refers to ShardingSphere-JDBC or ShardingSphere-Proxy.

Accordingly, the high availability solutions for storage nodes and compute nodes are different. Stateless compute nodes need to perceive the changes in storage nodes. They also need to set up separate load balancers and have the capabilities of service discovery and request distribution. Stateful storage nodes must provide data synchronization, connection testing, primary node election, and so on.

Although ShardingSphere doesn't itself provide a highly available database, it helps users integrate database HA features such as primary-secondary switchover, fault discovery, and traffic-switching governance, building on the database's own HA through its capabilities of database discovery and dynamic perception.

When combined with the primary-secondary flow control feature in distributed scenarios, ShardingSphere can provide better read/write splitting solutions with high availability. Using DistSQL to dynamically adjust high availability rules and retrieve primary/secondary node information also makes ShardingSphere clusters easier to operate and manage.

Best practices

Apache ShardingSphere adopts a plugin-oriented architecture so that you can use all its enhanced capabilities independently or together. Its high availability function is often used with read/write splitting to distribute query requests to the secondary databases according to the load balancing algorithm to ensure system HA, relieve primary database pressure, and improve business system throughput.

Note that ShardingSphere HA implementation leans on its distributed governance capability. Therefore, it can only be used under the cluster mode for the time being. Meanwhile, read/write splitting rules are revised in ShardingSphere 5.1.0. Please refer to the official documentation about read/write splitting for details.

Consider the following HA+read/write splitting configuration with ShardingSphere DistSQL RAL statements as an example. The example begins with the configuration, requirements, and initial SQL.

Configuration

schemaName: database_discovery_db

dataSources:
  ds_0:
    url: jdbc:mysql://127.0.0.1:1231/demo_primary_ds?serverTimezone=UTC&useSSL=false
    username: root
    password: 123456
    connectionTimeoutMilliseconds: 3000
    idleTimeoutMilliseconds: 60000
    maxLifetimeMilliseconds: 1800000
    maxPoolSize: 50
    minPoolSize: 1
  ds_1:
    url: jdbc:mysql://127.0.0.1:1232/demo_primary_ds?serverTimezone=UTC&useSSL=false
    username: root
    password: 123456
    connectionTimeoutMilliseconds: 3000
    idleTimeoutMilliseconds: 60000
    maxLifetimeMilliseconds: 1800000
    maxPoolSize: 50
    minPoolSize: 1
  ds_2:
    url: jdbc:mysql://127.0.0.1:1233/demo_primary_ds?serverTimezone=UTC&useSSL=false
    username: root
    password: 123456
    connectionTimeoutMilliseconds: 3000
    idleTimeoutMilliseconds: 50000
    maxLifetimeMilliseconds: 1300000
    maxPoolSize: 50
    minPoolSize: 1

rules:
  - !READWRITE_SPLITTING
    dataSources:
      replication_ds:
        type: Dynamic
        props:
          auto-aware-data-source-name: mgr_replication_ds
  - !DB_DISCOVERY
    dataSources:
      mgr_replication_ds:
        dataSourceNames:
          - ds_0
          - ds_1
          - ds_2
        discoveryHeartbeatName: mgr-heartbeat
        discoveryTypeName: mgr
    discoveryHeartbeats:
      mgr-heartbeat:
        props:
          keep-alive-cron: '0/5 * * * * ?'
    discoveryTypes:
      mgr:
        type: MGR
        props:
          group-name: b13df29e-90b6-11e8-8d1b-525400fc3996

Requirements

  • ShardingSphere-Proxy 5.1.0 (Cluster mode + HA + dynamic read/write splitting rule)
  • Zookeeper 3.7.0
  • MySQL MGR cluster

SQL script

CREATE TABLE `t_user` (
  `id` INT(8) NOT NULL,
  `mobile` CHAR(20) NOT NULL,
  `idcard` VARCHAR(18) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
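
The walkthrough below drives everything from the mysql command line. Purely as an illustration (not from the original article), a roughly equivalent JDBC client against ShardingSphere-Proxy might look like the sketch below; the host, port (3307 is the Proxy's usual default), and credentials are assumptions, and a MySQL JDBC driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ProxySmokeTest {
    public static void main(String[] args) throws Exception {
        // Connect to ShardingSphere-Proxy over the MySQL protocol (address and credentials are assumptions).
        String url = "jdbc:mysql://127.0.0.1:3307/database_discovery_db?useSSL=false&serverTimezone=UTC";
        try (Connection conn = DriverManager.getConnection(url, "root", "root")) {
            // Writes are routed to the current primary (ds_0 at this point in the example).
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO t_user(id, mobile, idcard) VALUES (?, ?, ?)")) {
                insert.setInt(1, 10000);
                insert.setString(2, "13718687777");
                insert.setString(3, "141121xxxxx");
                insert.executeUpdate();
            }
            // Reads are load-balanced across the secondaries (ds_1 and ds_2).
            try (PreparedStatement select = conn.prepareStatement(
                    "SELECT id, mobile, idcard FROM t_user WHERE id = ?")) {
                select.setInt(1, 10000);
                try (ResultSet rs = select.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + " " + rs.getString("mobile"));
                    }
                }
            }
        }
    }
}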

First, view the primary-secondary relationship:

mysql> SHOW READWRITE_SPLITTING RULES;
+----------------+-----------------------------+------------------------+------------------------+--------------------+---------------------+
| name           | auto_aware_data_source_name | write_data_source_name | read_data_source_names | load_balancer_type | load_balancer_props |
+----------------+-----------------------------+------------------------+------------------------+--------------------+---------------------+
| replication_ds | mgr_replication_ds          | ds_0                   | ds_1,ds_2              | NULL               |                     |
+----------------+-----------------------------+------------------------+------------------------+--------------------+---------------------+
1 ROW IN SET (0.09 sec)

You can also look at the secondary database state:

mysql> SHOW READWRITE_SPLITTING READ RESOURCES;
+----------+---------+
| resource | STATUS  |
+----------+---------+
| ds_1     | enabled |
| ds_2     | enabled |
+----------+---------+

The results above show that the primary database is currently ds_0, while secondary databases are ds_1 and ds_2.

Next, test INSERT:

mysql> INSERT INTO t_user(id, mobile, idcard) VALUE (10000, '13718687777', '141121xxxxx');
Query OK, 1 ROW affected (0.10 sec)

View the ShardingSphere-Proxy log and see if the route node is the primary database ds_0.

[INFO ] 2022-02-28 15:28:21.495 [ShardingSphere-Command-2] ShardingSphere-SQL - Logic SQL: INSERT INTO t_user(id, mobile, idcard) value (10000, '13718687777', '141121xxxxx')
[INFO ] 2022-02-28 15:28:21.495 [ShardingSphere-Command-2] ShardingSphere-SQL - SQLStatement: MySQLInsertStatement(setAssignment=Optional.empty, onDuplicateKeyColumns=Optional.empty)
[INFO ] 2022-02-28 15:28:21.495 [ShardingSphere-Command-2] ShardingSphere-SQL - Actual SQL: ds_0 ::: INSERT INTO t_user(id, mobile, idcard) value (10000, '13718687777', '141121xxxxx')

Now test SELECT (repeat it twice):

mysql> SELECT id, mobile, idcard FROM t_user WHERE id = 10000;

View the ShardingSphere-Proxy log and see if the route node is ds_1 or ds_2.

[INFO ] 2022-02-28 15:34:07.912 [ShardingSphere-Command-4] ShardingSphere-SQL - Logic SQL: SELECT id, mobile, idcard FROM t_user WHERE id = 10000
[INFO ] 2022-02-28 15:34:07.913 [ShardingSphere-Command-4] ShardingSphere-SQL - SQLStatement: MySQLSelectStatement(table=Optional.empty, limit=Optional.empty, lock=Optional.empty, window=Optional.empty)
[INFO ] 2022-02-28 15:34:07.913 [ShardingSphere-Command-4] ShardingSphere-SQL - Actual SQL: ds_1 ::: SELECT id, mobile, idcard FROM t_user WHERE id = 10000
[INFO ] 2022-02-28 15:34:21.501 [ShardingSphere-Command-4] ShardingSphere-SQL - Logic SQL: SELECT id, mobile, idcard FROM t_user WHERE id = 10000
[INFO ] 2022-02-28 15:34:21.502 [ShardingSphere-Command-4] ShardingSphere-SQL - SQLStatement: MySQLSelectStatement(table=Optional.empty, limit=Optional.empty, lock=Optional.empty, window=Optional.empty)
[INFO ] 2022-02-28 15:34:21.502 [ShardingSphere-Command-4] ShardingSphere-SQL - Actual SQL: ds_2 ::: SELECT id, mobile, idcard FROM t_user WHERE id = 10000

Switch to the primary database

Close the primary database ds_0:

Close primary database

(Zhao Jinchao, CC BY-SA 4.0)

Then check through DistSQL (using SHOW READWRITE_SPLITTING RULES and SHOW READWRITE_SPLITTING READ RESOURCES, as above) whether the primary database has changed and whether the secondary database state is correct.


Now, INSERT another line of data:

mysql> INSERT INTO t_user(id, mobile, idcard) VALUE (10001, '13521207777', '110xxxxx');
Query OK, 1 ROW affected (0.04 sec)

View the ShardingSphere-Proxy log and see if the route node is the primary database ds_1:

[INFO ] 2022-02-28 15:40:26.784 [ShardingSphere-Command-6] ShardingSphere-SQL - Logic SQL: INSERT INTO t_user(id, mobile, idcard) value (10001, '13521207777', '110xxxxx')
[INFO ] 2022-02-28 15:40:26.784 [ShardingSphere-Command-6] ShardingSphere-SQL - SQLStatement: MySQLInsertStatement(setAssignment=Optional.empty, onDuplicateKeyColumns=Optional.empty)
[INFO ] 2022-02-28 15:40:26.784 [ShardingSphere-Command-6] ShardingSphere-SQL - Actual SQL: ds_1 ::: INSERT INTO t_user(id, mobile, idcard) value (10001, '13521207777', '110xxxxx')

Finally, test SELECT (repeat it twice):

mysql> SELECT id, mobile, idcard FROM t_user WHERE id = 10001;

View the ShardingSphere-Proxy log and see if the route node is ds_2:

[INFO ] 2022-02-28 15:42:00.651 [ShardingSphere-Command-7] ShardingSphere-SQL - Logic SQL: SELECT id, mobile, idcard FROM t_user WHERE id = 10001
[INFO ] 2022-02-28 15:42:00.651 [ShardingSphere-Command-7] ShardingSphere-SQL - SQLStatement: MySQLSelectStatement(table=Optional.empty, limit=Optional.empty, lock=Optional.empty, window=Optional.empty)
[INFO ] 2022-02-28 15:42:00.651 [ShardingSphere-Command-7] ShardingSphere-SQL - Actual SQL: ds_2 ::: SELECT id, mobile, idcard FROM t_user WHERE id = 10001
[INFO ] 2022-02-28 15:42:02.148 [ShardingSphere-Command-7] ShardingSphere-SQL - Logic SQL: SELECT id, mobile, idcard FROM t_user WHERE id = 10001
[INFO ] 2022-02-28 15:42:02.149 [ShardingSphere-Command-7] ShardingSphere-SQL - SQLStatement: MySQLSelectStatement(table=Optional.empty, limit=Optional.empty, lock=Optional.empty, window=Optional.empty)
[INFO ] 2022-02-28 15:42:02.149 [ShardingSphere-Command-7] ShardingSphere-SQL - Actual SQL: ds_2 ::: SELECT id, mobile, idcard FROM t_user WHERE id = 10001

Release the secondary databases

Release the secondary database

(Zhao Jinchao, CC BY-SA 4.0)

View the latest primary-secondary relationship changes through DistSQL. The state of the ds_0 node has recovered to enabled, and ds_0 has been integrated into read_data_source_names:

mysql> SHOW READWRITE_SPLITTING RULES;
+----------------+-----------------------------+------------------------+------------------------+--------------------+---------------------+
| name           | auto_aware_data_source_name | write_data_source_name | read_data_source_names | load_balancer_type | load_balancer_props |
+----------------+-----------------------------+------------------------+------------------------+--------------------+---------------------+
| replication_ds | mgr_replication_ds          | ds_1                   | ds_0,ds_2              | NULL               |                     |
+----------------+-----------------------------+------------------------+------------------------+--------------------+---------------------+
1 ROW IN SET (0.01 sec)

mysql> SHOW READWRITE_SPLITTING READ RESOURCES;
+----------+---------+
| resource | STATUS  |
+----------+---------+
| ds_0     | enabled |
| ds_2     | enabled |
+----------+---------+
2 ROWS IN SET (0.00 sec)

Wrap up

Database high availability is critical in today's business environments, and Apache ShardingSphere can help provide the necessary reliability. Based on the above example, you now know more about ShardingSphere's high availability and dynamic read/write splitting. Use this example as the basis for your own configurations. 

Original article source at: https://opensource.com/

#database #apache 

How to Create A Highly Available Distributed Database
Jammie  Yost

Jammie Yost

1672462440

Apache Maven Core

Apache Maven

Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information.

If you think you have found a bug, please file an issue in the Maven Issue Tracker.

Documentation

More information can be found on Apache Maven Homepage. Questions related to the usage of Maven should be posted on the Maven User List.

Where can I get the latest release?

You can download the release source from our download page.

Contributing

If you are interested in the development of Maven, please consult the documentation first and afterward you are welcome to join the developers mailing list to ask questions or discuss new ideas/features/bugs etc.

Take a look into the contribution guidelines.

License

This code is under the Apache License, Version 2.0, January 2004.

See the NOTICE file for required notices and attributions.

Donations

Do you like Apache Maven? Then donate back to the ASF to support the development.

Quick Build

If you want to bootstrap Maven, you'll need:

  • Java 8+
  • Maven 3.0.5 or later

Then run Maven, specifying a location into which the completed Maven distribution should be installed:

mvn -DdistributionTargetDir="$HOME/app/maven/apache-maven-4.0.x-SNAPSHOT" clean package
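
Once the build finishes, you can sanity-check the freshly built distribution by running it from the target directory used above (adjust the path if you changed distributionTargetDir):

$HOME/app/maven/apache-maven-4.0.x-SNAPSHOT/bin/mvn -version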

Download details:

Author: apache
Source code: https://github.com/apache/maven

License: Apache-2.0 license

#maven #apache 

Apache Maven Core
Gordon  Matlala

Gordon Matlala

1671203100

How to Apache Camel Exception Re-try Policy

Apache Camel is a rules-based routing and mediation engine that provides a Java object-based implementation of the Enterprise Integration Patterns, using an API (or declarative Java domain-specific language) to configure routing and mediation rules. Camel offers two ways to implement exception handling: a doTry/doCatch block and an onException block. A re-try policy defines the rules the Camel error handler applies when it performs redelivery attempts; for example, you can set how many times to retry, the delay between attempts, and so forth.
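
For reference, the first approach (doTry/doCatch) could look roughly like the sketch below in the Java DSL. This is an illustration only; the endpoint URI and exception type are placeholders rather than part of this example project.

package com.spring.route;

import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.builder.RouteBuilder;

public class DoTryRouteBuilder extends RouteBuilder {

    @Override
    public void configure() throws Exception {
        from("direct:start")                          // placeholder input endpoint
            .doTry()
                .process(new Processor() {
                    public void process(Exchange exchange) throws Exception {
                        // business logic that may fail
                        throw new IllegalStateException("simulated failure");
                    }
                })
            .doCatch(IllegalStateException.class)
                .log("Handled in doCatch: ${exception.message}")
            .end();
    }
}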

The project structure will be as follows-

apache-camel-exception-retry-architecture

The pom.xml will be as follows-

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.spring</groupId>
  <artifactId>camel-spring-integration</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  
  <dependencies>
  	<dependency>
		<groupId>org.apache.camel</groupId>
		<artifactId>camel-core</artifactId>
		<version>2.13.0</version>
	</dependency>

	<dependency>
		<groupId>org.apache.camel</groupId>
		<artifactId>camel-spring</artifactId>
		<version>2.13.0</version>
	</dependency>

	<dependency>
		<groupId>org.slf4j</groupId>
		<artifactId>slf4j-api</artifactId>
		<version>1.7.5</version>
	</dependency>

	<dependency>
		<groupId>org.slf4j</groupId>
		<artifactId>slf4j-log4j12</artifactId>
		<version>1.7.5</version>
	</dependency>
  </dependencies>
</project>

Added MainApplication class

package com.spring.main;
import org.springframework.context.support.AbstractApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;


public class MainApplication {
	
	/*
	 * Main application: it loads the routeBuilder bean from applicationContext.xml via
	 * ClassPathXmlApplicationContext and keeps the Camel context running for five minutes.
	 */
	
	public static void main(String[] args) {
        AbstractApplicationContext ctx = new ClassPathXmlApplicationContext("applicationContext.xml");
        ctx.start();
        System.out.println("Application started...");
        try {
        	System.out.println("inside try block");
        	System.out.println("--------------------- inputMessageBody ------------------- ");
            Thread.sleep(5 * 60 * 1000);
        }
        catch (InterruptedException e) {
            e.printStackTrace();
        }
        ctx.stop();
        ctx.close();
    }

}

Created CamelCustomException class to implement custom exception

package com.spring.exception;

/*
 * Created the custom exception... 
 */

public class CamelCustomException extends Exception {
	private static final long serialVersionUID = 2L;

}

Created applicationContext.xml

<beans xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:camel="http://camel.apache.org/schema/spring"
	xsi:schemaLocation="http://www.springframework.org/schema/beans 
		http://www.springframework.org/schema/beans/spring-beans.xsd          
		http://camel.apache.org/schema/spring 
		http://camel.apache.org/schema/spring/camel-spring.xsd">

	<bean id="routeBuilder" class="com.spring.route.SimpleRouteBuilder" />

	<camelContext xmlns="http://camel.apache.org/schema/spring">
		<routeBuilder ref="routeBuilder" />
		
	</camelContext>
	
</beans>

We modify MyProcessor as follows: if the input body contains the text “test”, the custom exception is thrown.

package com.spring.processor;

import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import com.spring.exception.CamelCustomException;

public class MyProcessor implements Processor {
	
/*
 * (non-Javadoc)
 * @see org.apache.camel.Processor#process(org.apache.camel.Exchange)
 * It is Camel processor class which implements Processor. 
 */

    public void process(Exchange exchange) throws Exception {
    	String inputMessageBody = exchange.getIn().getBody(String.class);
    	System.out.println("\n" + inputMessageBody);
        if (inputMessageBody.contains("test"))
            throw new CamelCustomException();     
    }

}

Our SimpleRouteBuilder class is as before-

package com.spring.route;

import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.builder.RouteBuilder;

import com.spring.exception.CamelCustomException;
import com.spring.processor.MyProcessor;
import com.spring.processor.RetryProcessor;

public class SimpleRouteBuilder extends RouteBuilder {
	
	/*
	 * (non-Javadoc)
	 * @see org.apache.camel.builder.RouteBuilder#configure()
	 * It is a SimpleRouteBuilder class which facilitates  Routes
	 */

    @Override
    public void configure() throws Exception {
    	
    	onException(CamelCustomException.class).process(new Processor() {

            public void process(Exchange exchng) throws Exception {
                System.out.println("Exception is handling by onException{} block");
            }
        })
    	.log("Received body ${body}").handled(true);
    	
    	from("file:/home/knoldus/Downloads/Softwares/Workspace/input?noop=true")
    	.process(new MyProcessor())
    	.to("file:/home/knoldus/Downloads/Softwares/Workspace/output");
    	
    }

}

After running MainApplication.java, the output will be as follows-

camel-with-exception

After the exception is thrown, it is caught by the onException{} block. We will now define the redelivery policy so that each message is redelivered 3 times before the exception is finally handled.

Modify the applicationContext.xml

1.) maximumRedeliveries is the maximum number of times a message can be redelivered.

2.) redeliveryDelay is the delay (in ms) between retry attempts.

<beans xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:camel="http://camel.apache.org/schema/spring"
	xsi:schemaLocation="http://www.springframework.org/schema/beans 
		http://www.springframework.org/schema/beans/spring-beans.xsd          
		http://camel.apache.org/schema/spring 
		http://camel.apache.org/schema/spring/camel-spring.xsd">

	<bean id="routeBuilder" class="com.spring.route.SimpleRouteBuilder" />

	<camelContext xmlns="http://camel.apache.org/schema/spring">
		<routeBuilder ref="routeBuilder" />
		<redeliveryPolicyProfile id="localRedeliveryPolicyProfile"
			retryAttemptedLogLevel="WARN" maximumRedeliveries="3"
			redeliveryDelay="1" />
	</camelContext>
	
</beans>
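
For comparison only (this example keeps the XML profile), roughly the same limits can also be set directly on the onException clause in the Java DSL, as in the sketch below.

package com.spring.route;

import org.apache.camel.LoggingLevel;
import org.apache.camel.builder.RouteBuilder;

import com.spring.exception.CamelCustomException;

public class DslRedeliveryRouteBuilder extends RouteBuilder {

    @Override
    public void configure() throws Exception {
        // Same limits as the XML redeliveryPolicyProfile, expressed in the Java DSL.
        onException(CamelCustomException.class)
            .maximumRedeliveries(3)                      // maximumRedeliveries="3"
            .redeliveryDelay(1)                          // redeliveryDelay="1" (milliseconds)
            .retryAttemptedLogLevel(LoggingLevel.WARN)   // retryAttemptedLogLevel="WARN"
            .log("Received body ${body}")
            .handled(true);

        from("file:/home/knoldus/Downloads/Softwares/Workspace/input?noop=true")
            .to("file:/home/knoldus/Downloads/Softwares/Workspace/output");
    }
}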

Configure the redelivery policy in the route of the SimpleRouteBuilder class so that it applies the redelivery logic.

package com.spring.route;

import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.builder.RouteBuilder;

import com.spring.exception.CamelCustomException;
import com.spring.processor.MyProcessor;
import com.spring.processor.RetryProcessor;

public class SimpleRouteBuilder extends RouteBuilder {
	
	/*
	 * (non-Javadoc)
	 * @see org.apache.camel.builder.RouteBuilder#configure()
	 * It is a SimpleRouteBuilder class which facilitates  Routes
	 */

    @Override
    public void configure() throws Exception {
    	
    	onException(CamelCustomException.class).process(new Processor() {

            public void process(Exchange exchange) throws Exception {
                System.out.println("Exception is handling by onException{} block");
            }
        })
    	.redeliveryPolicyRef("localRedeliveryPolicyProfile")
    	.log("Received body ${body}").handled(true);
    	
    	from("file:/home/knoldus/Downloads/Softwares/Workspace/input?noop=true")
    	.process(new MyProcessor())
    	.to("file:/home/knoldus/Downloads/Softwares/Workspace/output"); 	
    }

}

After running MainApplication.java, the output will be as follows-

camel-exception-retry

Create the RetryProcessor class. It is responsible for processing the logic during retry attempts.

package com.spring.processor;


import org.apache.camel.Exchange;
import org.apache.camel.Processor;

/**
 * @author knoldus
 * Invoked on each redelivery attempt: it replaces the message body so that
 * MyProcessor no longer throws the exception for the retried message.
 */


public class RetryProcessor implements Processor {

    public void process(Exchange exchange) throws Exception {    
    	exchange.getIn().setBody("replaced new body...");
        
    }

}

Configure onRedelivery before the redelivery policy configuration in the route of the SimpleRouteBuilder class:

package com.spring.route;

import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.builder.RouteBuilder;

import com.spring.exception.CamelCustomException;
import com.spring.processor.MyProcessor;
import com.spring.processor.RetryProcessor;

public class SimpleRouteBuilder extends RouteBuilder {
	
	/*
	 * (non-Javadoc)
	 * @see org.apache.camel.builder.RouteBuilder#configure()
	 * It is a SimpleRouteBuilder class which facilitates  Routes
	 */

    @Override
    public void configure() throws Exception {
    	
    	onException(CamelCustomException.class).process(new Processor() {

            public void process(Exchange exchange) throws Exception {
                System.out.println("Exception is handling by onException{} block");
            }
        })
    	.onRedelivery(new RetryProcessor())
    	.redeliveryPolicyRef("localRedeliveryPolicyProfile")
    	.log("Received body ${body}").handled(true);
    	
    	from("file:/home/knoldus/Downloads/Softwares/Workspace/input?noop=true")
    	.process(new MyProcessor())
    	.to("file:/home/knoldus/Downloads/Softwares/Workspace/output");
    	
    }

}

In the retry processor we changed the exchange body, so the exception is not thrown again.

camel-exception-retry-with-onDelivery

Conclusion

In this blog, we have covered how to implement and configure exception handling with a retry policy in Apache Camel using Spring, and you are now ready to apply it in your own projects. For more, you can refer to the documentation: https://people.apache.org/~dkulp/camel/redeliverypolicy.html

Original article source at: https://blog.knoldus.com/

#apache #policy 

How to Apache Camel Exception Re-try Policy
Sheldon  Grant

Sheldon Grant

1670593211

Apache Pulsar Architecture and Benefits

Introduction to Apache Pulsar

Apache Pulsar is a multi-tenant, high-performance server-to-server messaging system originally developed at Yahoo. It was open-sourced in late 2016 and later became a top-level project of the Apache Software Foundation (ASF). Pulsar follows the pub-sub pattern: producers publish messages to a given Pulsar topic, and consumers (also called subscribers) subscribe to a topic to receive messages from it and send back an acknowledgement.

Once a subscription has been created, Pulsar retains all messages, even if the consumer gets disconnected; a retained message is deleted only after a consumer acknowledges that it has been successfully processed.

Apache Pulsar Topics: topics are well-defined, named channels for transmitting messages from producers to consumers. Topic names have a well-defined URL structure.

Namespaces: a namespace is a logical nomenclature within a tenant. A tenant can create multiple namespaces via the admin API. A namespace allows an application to create and manage a hierarchy of topics, and any number of topics can be created under a namespace.

Apache Pulsar Subscription Modes

A subscription is a named configuration rule that determines how messages are delivered to consumers. There are three subscription modes in Apache Pulsar.

Exclusive

Apache Pulsar Subscription Mode Exclusive

In Exclusive mode, only a single consumer is allowed to attach to the subscription. If more than one consumer attempts to subscribe to a topic using the same subscription, the consumer receives an error. Exclusive is the default subscription mode.
 

Failover

Apache Pulsar Subscription Failover

In Failover mode, multiple consumers attach to the same topic. The consumers are sorted lexically by name, and the first consumer is the master consumer, which receives all the messages. When the master consumer disconnects, the next consumer in line receives the messages.
 

Shared 

Apache Pulsar Subscription Mode Shared

In Shared (round-robin) mode, multiple consumers can attach to the same subscription, and messages are delivered across them in a round-robin manner; each message is delivered to only one consumer. When a consumer disconnects, the messages that were sent to it but not acknowledged are rescheduled to the other consumers. Limitations of shared mode-

  • Message ordering is not guaranteed.
  • You can’t use cumulative acknowledgement with shared mode.


Routing Modes

The routing mode determines which partition of a topic a message will be published to. Routing is necessary when publishing to partitioned topics. There are three types of routing modes.

Round Robin Partition 

If no key is provided, the producer publishes messages across all available partitions in a round-robin way to achieve maximum throughput. Round-robin routing is not applied per individual message but per batching-delay boundary, which ensures effective batching. If a key is specified on the message, the partitioned producer hashes the key and assigns all messages with that key to the corresponding partition. This is the default mode.

Single Partition

If no key is provided, the producer randomly picks a single partition and publishes all messages to that partition. If a key is specified for the message, the partitioned producer hashes the key and assigns the message to the corresponding partition.

Custom Partition

A user can create a custom routing mode by using the Java client and implementing the MessageRouter interface; the router is invoked for each message to choose the partition it is published to.
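
As an illustration only (not from the original article), such a router could look like the sketch below with the Pulsar Java client; the service URL, topic name, and routing rule are invented for the example.

import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageRouter;
import org.apache.pulsar.client.api.MessageRoutingMode;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.TopicMetadata;

public class CustomRoutingExample {

    // Route messages whose key starts with "vip-" to partition 0, spread the rest by key hash.
    static class VipRouter implements MessageRouter {
        @Override
        public int choosePartition(Message<?> msg, TopicMetadata metadata) {
            String key = msg.hasKey() ? msg.getKey() : "";
            if (key.startsWith("vip-")) {
                return 0;
            }
            return Math.abs(key.hashCode()) % metadata.numPartitions();
        }
    }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")            // assumed local broker
                .build();

        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/orders")      // assumed partitioned topic
                .messageRoutingMode(MessageRoutingMode.CustomPartition)
                .messageRouter(new VipRouter())
                .create();

        producer.newMessage().key("vip-42").value("priority order").send();
        producer.close();
        client.close();
    }
}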

Apache Pulsar Architecture

A Pulsar cluster consists of several parts. One or more brokers handle and load-balance incoming messages from producers, dispatch messages to consumers, communicate with the Pulsar configuration store to handle various coordination tasks, and store messages in BookKeeper instances.

  • A BookKeeper cluster consisting of one or more bookies handles persistent storage of messages.
  • A ZooKeeper cluster, called the configuration store, handles coordination tasks that involve multiple clusters.

Brokers

The broker is a stateless component that runs an HTTP server and a dispatcher. The HTTP server exposes a REST API for administrative tasks and topic lookup for producers and consumers. The dispatcher is an asynchronous TCP server over a custom binary protocol used for all data transfers.

Clusters

A Pulsar instance usually consists of one or more Pulsar clusters. A cluster consists of one or more brokers, a ZooKeeper quorum used for cluster-level configuration and coordination, and an ensemble of bookies used for persistent storage of messages.

Metadata store

Pulsar uses Apache ZooKeeper for metadata storage, cluster configuration, and coordination.

Persistent storage

Pulsar provides guaranteed message delivery: if a message reaches a Pulsar broker successfully, it will be delivered to its intended target.

Pulsar Clients

Pulsar has client APIs for Java, Go, Python, and C++. The client API encapsulates and optimizes Pulsar's client-broker communication protocol and exposes a simple, intuitive API for applications. The current official Pulsar client libraries support transparent reconnection and connection failover to brokers, queue messages until they are acknowledged by the broker, and include heuristics such as connection retries with backoff. A minimal producer/consumer sketch follows.
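
The sketch below is illustrative only (not from the original article); the service URL, topic, and subscription name are assumptions, and the Pulsar Java client library is expected on the classpath.

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionType;

public class HelloPulsar {
    public static void main(String[] args) throws Exception {
        // The client handles connection pooling, reconnection, and failover internally.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")       // assumed broker address
                .build();

        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("my-topic")
                .create();
        producer.send("hello pulsar");

        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("my-topic")
                .subscriptionName("my-subscription")
                .subscriptionType(SubscriptionType.Shared)   // Exclusive is the default; Shared shown here
                .subscribe();

        Message<String> msg = consumer.receive();
        System.out.println("Received: " + msg.getValue());
        consumer.acknowledge(msg);                           // messages are retained until acknowledged

        consumer.close();
        producer.close();
        client.close();
    }
}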

Client setup phase

When an application wants to create a producer or consumer, the Pulsar client library initiates a setup phase composed of two steps:

  1. The client attempts to determine the owner of the topic by sending an HTTP lookup request to the broker. The request can reach any active broker, which, by looking at the cached metadata in ZooKeeper, either tells the client which broker is serving the topic or assigns the topic to the least-loaded broker if nobody is serving it.
  2. Once the client library has the broker address, it creates a TCP connection (or reuses an existing connection from the pool) and authenticates it. Within this connection, binary commands from the custom protocol are exchanged between the broker and the client. At this point, the client sends a command to create a consumer or producer to the broker, which complies after validating the authorization policy.

Geo-Replication

Apache Pulsar's geo-replication enables messages to be produced in one geolocation and consumed in another. In the diagram above, whenever producers P1, P2, and P3 publish a message to the topic T1 on clusters A, B, and C respectively, those messages are instantly replicated across clusters. Once replicated, consumers C1 and C2 can consume the messages from their respective clusters. Without geo-replication, consumers C1 and C2 would not be able to consume messages published by producer P3.

Multi-Tenancy

Pulsar was created from the ground up as a multi-tenant system. Tenants are spread across a cluster, and each tenant can have its own authentication and authorization scheme applied to it. Tenants are also the administrative unit at which storage quotas, message TTL, and isolation policies are managed.

Tenants

To each tenant in a particular Pulsar instance you can assign:

  • An authorization scheme.
  • The set of clusters to which the tenant's configuration applies.


Authentication and Authorization

Pulsar supports an authentication mechanism that can be configured at the broker, and it also supports authorization to identify the client and its access rights on topics and tenants.

Tiered Storage

Pulsar's architecture allows topic backlogs to grow very large, which can become expensive over time. One way to alleviate this cost is tiered storage: older messages in the backlog are moved from BookKeeper to cheaper storage, and clients can still access the older backlog.

Schema Registry

Type safety is paramount in communication between producers and consumers. For type safety in messaging, Pulsar has adopted two basic approaches:

Client-side approach

In this approach message producers and consumers are responsible for not only serializing and deserializing messages (which consist of raw bytes) but also “knowing” which types are being transmitted via which topics. 

Server-side approach

In this approach, producers and consumers inform the system which data types can be transmitted via the topic. The messaging system then enforces type safety and ensures that producers and consumers remain in sync.

How do schemas work?

Pulsar schemas are applied and enforced at the topic level. Producers and consumers upload schemas to Pulsar when requested. A Pulsar schema consists of:

  • Name: the name of the topic to which the schema is applied.
  • Payload: binary representation of the schema.
  • User-defined properties as a string/string map

It supports the following schema formats:

  • JSON
  • Protobuf
  • Avro
  • String (used for UTF-8-encoded strings)

If no schema is defined, producers and consumers handle raw bytes.
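
Purely as an illustration (not from the original article), a schema-enabled producer with the Java client might look like the sketch below; the User class and topic name are invented for the example.

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class SchemaExample {

    // Example POJO; Pulsar derives a JSON schema from its fields and uploads it to the broker.
    public static class User {
        public String name;
        public int age;
    }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // assumed broker address
                .build();

        // The topic is now typed: the broker can enforce that producers and consumers agree on the schema.
        Producer<User> producer = client.newProducer(Schema.JSON(User.class))
                .topic("users")
                .create();

        User u = new User();
        u.name = "alice";
        u.age = 30;
        producer.send(u);

        producer.close();
        client.close();
    }
}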

What are the Pros and Cons?

The pros and cons of Apache Pulsar are described below:

Pros

  • Feature-rich – persistent/nonpersistent topics
  • Multi-tenancy
  • More flexible client API, including CompletableFutures and a fluent interface
  • The Java client has had, to date, little to no Javadoc.

Cons

  •  Community base is small.
  •  The reader can’t read the last message in the topic [you need to skim through all the messages]
  •  Higher operational complexity – ZooKeeper + Broker nodes + BookKeeper + all clustered.
  • Java client components are thread-safe – the consumer can acknowledge messages from different threads.

Apache Pulsar Multi-Layered Architecture

Pulsar multilayered Architecture

Difference between Apache Kafka and Apache Pulsar

  1. Kafka: more mature, with higher-level APIs. Pulsar: incorporates improved design ideas from Kafka along with its existing capabilities.
  2. Kafka: streaming built on top of Kafka Streams. Pulsar: a unified messaging model and API, with streaming via exclusive or failover subscriptions and queuing via shared subscriptions.
  3. Kafka: producer-topic-consumer group-consumer. Pulsar: producer-topic-subscription-consumer.
  4. Kafka: restricts fluidity and flexibility. Pulsar: provides fluidity and flexibility.
  5. Kafka: messages are deleted based on retention; if a consumer doesn’t read messages before the retention period expires, it loses data. Pulsar: messages are only deleted after all subscriptions have consumed them, so there is no data loss even if the consumers of a subscription are down for a long time; messages can also be kept for a configured retention period even after all subscriptions have consumed them.

Drawbacks of Kafka

  1. High Latency
  2. Poor Scalability
  3. Difficulty supporting global architecture (fulfilled by pulsar with the help of geo-replication)
  4. High OpEx (operation expenditure)

How Apache Pulsar is better than Kafka

  1. Pulsar has shown notable improvements in both latency and throughput when compared with Kafka. Pulsar is approximately 2.5 times faster and has about 40% lower latency than Kafka.
  2. Kafka, in many scenarios, has shown that it doesn’t do well when there are thousands of topics and partitions, even if the data is not massive. Fortunately, Pulsar is designed to serve hundreds of thousands of topics in a deployed cluster.
  3. Kafka stores data and logs in dedicated files and directories on the brokers, which creates trouble at the time of scaling (files are flushed to disk periodically). In contrast, scaling is effortless with Pulsar because its brokers are stateless; Pulsar uses bookies to store data, so scaling is not rocket science.
  4. Kafka brokers are designed to work together within a single region of the network, so it is not easy to build a multi-datacenter architecture with them. Pulsar, by contrast, offers geo-replication, with which users can replicate their data synchronously or asynchronously among any number of clusters.
  5. Multi-tenancy is a feature that can be of great use as it provides different types of defined tenants that are specific to the needs of a particular client or organization. In layman language, it’s like describing a set of properties so that each specific property satisfies the need for a specific group of clients/consumers using it.

Even though it looks like Kafka lags behind Pulsar, KIPs (Kafka Improvement Proposals) cover almost all of these drawbacks in their discussions, and users can hope to see the changes in upcoming versions of Kafka.

Kafka to Pulsar – Users can easily migrate to Pulsar from Kafka, as Pulsar natively supports working directly with Kafka data through the provided connectors, or Kafka application data can be imported into Pulsar quite easily.

Pulsar SQL uses Presto to query the older messages that are kept in the backlog (Apache BookKeeper).

Conclusion

Apache Pulsar is a powerful stream-processing platform that has been able to learn from previously existing systems. It has a layered architecture, complemented by a number of great out-of-the-box features such as multi-tenancy, zero rebalancing downtime, geo-replication, a built-in proxy, durability, and TLS-based authentication/authorization. Compared to other platforms, Pulsar can give you the ultimate tools with more capabilities.

Original article source at: https://www.xenonstack.com/

#kafka #apache #architecture #benefits 

Apache Pulsar Architecture and Benefits
Bongani  Ngema

Bongani Ngema

1670579700

Stream Processing with Apache Flink and NATS

Stream Processing with Apache Flink

Apache Flink is a stream processing framework developed by the Apache Software Foundation. It is an open source streaming dataflow engine that provides communication and data distribution for distributed computations over data streams. Apache Flink is a distributed data processing platform used in big data applications, primarily for the analysis of data stored in Hadoop clusters. It is capable of handling both batch and stream processing jobs, and it is an alternative to MapReduce. Some of its best features are as follows -


Unified framework - Flink is a unified framework that lets you build a single data workflow combining streaming, batch, and SQL. Flink can also process graphs with its own Gelly library and use machine learning algorithms from its FlinkML library. Apart from this, Flink also supports iterative algorithms and interactive queries.

Custom Memory Manager - Flink implements its memory management inside the JVM and its features are as follows

  • C++ style memory management inside the JVM.
  • User data stored in serialized bytes array in JVM.
  • Memory can be quickly allocated and deallocated.

Native Closed Loop Iteration Operators: Flink has dedicated support for iterative computations. It iterates on data by using its streaming architecture, and the concept of an iterative algorithm is tightly bound into the Flink query optimizer.

Use Cases of Apache Flink

It is one of the best options to develop and run several types of applications because of its extensive features. Some of the use cases of Flink are as follows -

Event Driven Applications - An event-driven application is a type of stateful application that ingests events from one or more event streams and reacts to the incoming events. Event-driven applications are built on stateful stream processing. Some event-driven applications are as follows -

  • Fraud Detection
  • Anomaly Detection
  • Web Application

Data Analytics Applications - These applications extract information from raw data. With a proper stream processing engine, analytics can also be done in real time.

Some of the data analytics Applications are as follows -

  • Quality monitoring of networks
  • Analysis of product updates and experiment evaluation
  • Large scale graph analysis

Data Pipeline Applications - For converting and moving data from one system to another, ETL (Extract, Transform, and Load) is the general approach, and ETL jobs are periodically triggered to copy data from the transactional database to the analytical database.

Some of the data pipeline applications are as follows -

  • Continuous ETL operations in e-commerce
  • Real-time search index building in e-commerce

Challenges for enabling IoT in Data Processing

There are several challenges faced by IoT industries when it comes to data processing; some of them are as follows -

  • Devices produce far more data than users do.
  • IoT users also expect real-time information that they can act on immediately.
  • Connectivity can never be guaranteed in the IoT industries.
  • Integrating and Managing IoT data.

Solutions for Stream Processing Using Apache Flink in IoT

Several solutions behind streaming processing using Apache Flink in IoT are as follows -

  1. Real-Time Data Processing - Many IoT use cases require immediate information and an immediate action to follow. Apache Flink is one way to handle such real-time data processing.
  2. Event Time for Ordering Data in IoT - When data from devices travels through a network, it is essential to account for latency and network failures; even if the data is sent to a more stable system, latency increases with distance from the data center.
  3. Tools for Dealing with Messy Data - Pre-processing the data is generally the hardest part of the process, and with IoT it is even harder to control the source. Streaming does not fix the problem by itself, but it provides several tools for it, such as windowing, which groups data from a particular time window together for further processing (see the sketch after this list).
  4. Segmentation Allows for Parallel Processing - Users of IoT devices are usually more interested in computing over subsets of the data than over the complete data. Flink introduces the concept of grouping by key for this purpose: once a stream is partitioned by key, it can be processed in parallel.
  5. Local State is Crucial to Performance - Apache Flink lets us keep data close to the computation, so calculations are performed with the help of local state.
  6. Data Streaming is Conceptually Simple - Although one has to learn how to manage state properly in Flink, once you are familiar with it you can focus on the core logic of the application and leave the rest to the framework.
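
The sketch below is an illustration only (not from the original article) of keyed, windowed processing with Flink's DataStream API in Java; the sensor IDs, readings, and window size are made up, and it assumes a Flink 1.x dependency on the classpath.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SensorWindowJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In a real IoT deployment the stream would come from a connector (Kafka, MQTT, a NATS bridge, ...);
        // a few in-memory readings stand in for device data here.
        env.fromElements(
                Tuple2.of("sensor-1", 21.0),
                Tuple2.of("sensor-2", 19.5),
                Tuple2.of("sensor-1", 22.3))
            // Segmentation: partition the stream by device id so each key can be processed in parallel.
            .keyBy(value -> value.f0)
            // Windowing: group readings that fall into the same 10-second processing-time window.
            .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
            // Local state: the running sum per device and window is kept in Flink's managed state.
            .sum(1)
            .print();

        env.execute("per-device windowed sum");
    }
}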

What is NATS?

NATS is an open source messaging system which consists of a server, clients, and a connector framework (a Java-based framework used for connecting it to other services). The server is written in the Go programming language. NATS provides high-performance, flexible messaging capabilities, and its essential design principles are performance, scalability, and ease of use. NATS provides some unique features, listed below (a minimal publish/subscribe sketch in Java follows the list):

  • Auto-discovery - The ability to discover routes to other servers makes clustering easy. To get a better network between the nodes, auto-discovery can be combined with embedded servers.
  • Optional Persistence - The NATS server can optionally persist messages to ensure their delivery; because persistence is optional, the core server remains very lightweight.
  • Clustered Mode Server - NATS servers can be clustered together, with distributed queuing across the cluster.
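
The sketch below is an illustration only (not from the original article) of basic publish/subscribe with the official NATS Java client (jnats); the server URL and subject name are assumptions.

import io.nats.client.Connection;
import io.nats.client.Dispatcher;
import io.nats.client.Nats;

import java.nio.charset.StandardCharsets;
import java.time.Duration;

public class HelloNats {
    public static void main(String[] args) throws Exception {
        // Connect to a locally running nats-server (default port 4222); the URL is an assumption.
        Connection nc = Nats.connect("nats://localhost:4222");

        // Asynchronous subscriber: the dispatcher invokes the callback for each message on the subject.
        Dispatcher dispatcher = nc.createDispatcher(msg ->
                System.out.println("Received: " + new String(msg.getData(), StandardCharsets.UTF_8)));
        dispatcher.subscribe("sensors.temperature");

        // Publisher: fire-and-forget, at-most-once delivery in core NATS.
        nc.publish("sensors.temperature", "21.5".getBytes(StandardCharsets.UTF_8));

        nc.flush(Duration.ofSeconds(1));   // make sure the message has reached the server
        Thread.sleep(500);                 // give the callback a moment to run
        nc.close();
    }
}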

Solutions for Stream Processing Using NATS in IoT

Several solutions for stream processing using NATS in IoT are as follows -

Multiple qualities of service (QoS)

  • At-most-once delivery - NATS delivers messages to immediately eligible subscribers and does not preserve the messages for other subscribers.
  • At-least-once delivery - Messages are preserved until delivery to the subscribers has been confirmed, or until storage has been exhausted.
  • Load balancing - An application may produce a massive number of requests, and a dynamically scalable pool of worker application instances can be used to ensure that SLAs or other performance targets are met.
  • Fault tolerance - The application needs to be highly resilient to a network that may be beyond its control, and the underlying data communication must seamlessly recover from connectivity outages, so NATS provides proper fault-tolerance capability.

Use Cases of NATS

It is one of the most straightforward and powerful messaging systems and offers multiple quality of Services. Some of the best use cases of Nats are as follows -

  • Command and control - Sending the commands for running the applications or devices and receiving back the status from the devices or the applications like satellite telemetry and IoT.
  • Addressing, discovery - Sending data to specific application instances, devices, or users, or discovering all the application instances or devices that are connected to the infrastructure.
  • High throughput message fan-out - A small number of publishers need to frequently send data to a much larger group of subscribers, many of which share a common interest in specific data sets or categories.

Conclusion 

Proper management of data streams helps enterprises meet the demands of a real-time world. To make streaming analytics part of your analytics approach, we advise taking the steps described above.

Original article source at: https://www.xenonstack.com/

#apache #stream #nats #iot 

Stream Processing with Apache Flink and NATS