Introduction
Thrift is a lightweight, language-independent software stack for point-to-point RPC implementation. Thrift provides clean abstractions and implementations for data transport, data serialization, and application level processing. The code generation system takes a simple definition language as input and generates code across programming languages that uses the abstracted stack to build interoperable RPC clients and servers.
Thrift makes it easy for programs written in different programming languages to share data and call remote procedures. With support for 28 programming languages, chances are Thrift supports the languages that you currently use.
Thrift is specifically designed to support non-atomic version changes across client and server code. This allows you to upgrade your server while still being able to service older clients, or to have newer clients issue requests to older servers. An excellent community-provided write-up about Thrift and compatibility when versioning an API can be found in the Thrift Missing Guide.
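To illustrate why such non-atomic upgrades can work, here is a plain-Python sketch (not Thrift's actual generated code): a reader that defaults missing optional fields and ignores unknown ones can handle both older and newer writers. The `User` record shape below is hypothetical.

```python
# Plain-Python sketch (not Thrift's generated code) of tolerant decoding:
# default missing optional fields, ignore unknown ones.
def decode_user(fields):
    return {
        "id": fields["id"],                  # required in every version
        "name": fields.get("name", ""),      # optional
        "email": fields.get("email", None),  # optional, added in a later version
    }

# An older client omits 'email'; a newer server still decodes the message.
print(decode_user({"id": 7, "name": "ada"}))

# A newer client adds a field an older-style reader does not know; it is ignored.
print(decode_user({"id": 8, "name": "lin", "email": "lin@example.com", "extra": 1}))
```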
For more details on Thrift's design and implementation, see the Thrift whitepaper included in this distribution, or the README.md file in your particular subdirectory of interest.
Status
Releases
Thrift does not maintain a specific release calendar at this time.
We strive to release twice yearly. Download the current release.
Project Hierarchy
thrift/
compiler/
Contains the Thrift compiler, implemented in C++.
lib/
Contains the Thrift software library implementation, subdivided by
language of implementation.
cpp/
go/
java/
php/
py/
rb/
...
test/
Contains sample Thrift files and test code across the target programming
languages.
tutorial/
Contains a basic tutorial that will teach you how to develop software
using Thrift.
Development
To build the project the same way Travis CI builds it, you should use Docker. We have comprehensive build instructions for Docker.
Requirements
See http://thrift.apache.org/docs/install for a list of build requirements (may be stale). Alternatively, see the docker build environments for a list of prerequisites.
Resources
More information about Thrift can be obtained on the Thrift webpage at:
http://thrift.apache.org
Acknowledgments
Thrift was inspired by pillar, a lightweight RPC tool written by Adam D'Angelo, and also by Google's protocol buffers.
Installation
If you are building for the first time out of the source repository, you will need to generate the configure scripts. (This is not necessary if you downloaded a tarball.) From the top directory, do:
./bootstrap.sh
Once the configure scripts are generated, thrift can be configured. From the top directory, do:
./configure
You may need to specify the location of the boost files explicitly. If you installed boost in /usr/local, you would run configure as follows:
./configure --with-boost=/usr/local
Note that by default the Thrift C++ library is built with debugging symbols included. If you want to customize these options, use the CXXFLAGS, CFLAGS, or CPPFLAGS variables with configure, as follows:
./configure CXXFLAGS='-g -O2'
./configure CFLAGS='-g -O2'
./configure CPPFLAGS='-DDEBUG_MY_FEATURE'
To enable gcov coverage reporting, which requires the -fprofile-arcs and -ftest-coverage compiler options, run:
./configure --enable-coverage
Run ./configure --help to see other configuration options.
Please be aware that the Python library will ignore the --prefix option and just install wherever Python's distutils puts it (usually along the lines of /usr/lib/pythonX.Y/site-packages/). If you need to control where the Python modules are installed, set the PY_PREFIX variable. (DESTDIR is respected for Python and C++.)
Make thrift:
make
From the top directory, become superuser and do:
make install
Uninstall thrift:
make uninstall
Note that some language packages must be installed manually using build tools better suited to those languages (at the time of this writing, this applies to Java, Ruby, PHP).
Look for the README.md file in the lib/<language>/ folder for more details on the installation of each language library package.
Package Managers
Apache Thrift is available via a number of package managers, a list which is steadily growing. A more detailed overview can be found on the Apache Thrift website under "Libraries" and/or in the respective READMEs for each language under /lib
Testing
There are a large number of client library tests that can all be run from the top-level directory.
make -k check
This will make all of the libraries (as necessary), and run through the unit tests defined in each of the client libraries. If a single language fails, the make check will continue on and provide a synopsis at the end.
To run the cross-language test suite, please run:
make cross
This will run a set of tests that use different language clients and servers.
Author: Apache
Source Code: https://github.com/apache/thrift
License: Apache-2.0 license
Apache Cassandra is a free and open-source NoSQL database with no single point of failure. It provides linear scalability and high availability without compromising performance. Apache Cassandra is used by many companies that have large, active data sets, including Reddit, Netflix, Instagram, and GitHub.
This article guides you through the installation of Apache Cassandra on Ubuntu 20.04.
Installing Apache Cassandra on Ubuntu is straightforward. We’ll install Java, enable the Apache Cassandra repository, import the repository GPG key, and install the Apache Cassandra server.
At the time of writing this article, the latest version of Apache Cassandra is 3.11
and requires OpenJDK 8 to be installed on the system.
Run the following commands as root or as a user with sudo privileges to install OpenJDK:
sudo apt update
sudo apt install openjdk-8-jdk
Verify the Java installation by printing the Java version:
java -version
The output should look something like this:
openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~20.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)
Install the dependencies necessary to add a new repository over HTTPS:
sudo apt install apt-transport-https
Import the repository’s GPG key and add the Cassandra repository to the system:
wget -q -O - https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -
sudo sh -c 'echo "deb http://www.apache.org/dist/cassandra/debian 311x main" > /etc/apt/sources.list.d/cassandra.list'
Once the repository is enabled, update the packages list and install the latest version of Apache Cassandra:
sudo apt update
sudo apt install cassandra
The Apache Cassandra service will start automatically after the installation process is complete. You can verify it by typing:
nodetool status
You should see something similar to this:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 70 KiB 256 100.0% 2eaab399-be32-49c8-80d1-780dcbab694f rack1
That’s it. At this point, you have Apache Cassandra installed on your Ubuntu server.
Apache Cassandra data is stored in the /var/lib/cassandra directory, configuration files are located in /etc/cassandra, and Java start-up options can be configured in the /etc/default/cassandra file.
By default, Cassandra is configured to listen on localhost only. If the client connecting to the database is also running on the same host, you don’t need to change the default configuration file.
To interact with Cassandra through CQL (the Cassandra Query Language) you can use a command-line tool named cqlsh
that is shipped with the Cassandra package.
cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.7 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>
The default Cassandra cluster is named “Test Cluster”. If you want to change the cluster name, perform the steps below:
Log in to the Cassandra CQL terminal with cqlsh:
cqlsh
Run the following command to change the cluster name to “Linuxize Cluster”:
UPDATE system.local SET cluster_name = 'Linuxize Cluster' WHERE KEY = 'local';
Replace “Linuxize Cluster” with your desired name.
Once done, type exit to exit the console.
Open the cassandra.yaml
configuration file and enter your new cluster name.
/etc/cassandra/cassandra.yaml
cluster_name: 'Linuxize Cluster'
Save and close the file.
Clear the system cache:
nodetool flush system
Restart the Cassandra service:
sudo systemctl restart cassandra
We’ve shown you how to install Apache Cassandra on Ubuntu 20.04. You can now visit the official Apache Cassandra Documentation page and learn how to get started with Cassandra.
If you hit a problem or have feedback, leave a comment below.
Original article source at: https://linuxize.com/
Apache web server is the most widely used web server in the world. It comes with a modular architecture, which allows users to extend its functionality and customize it according to their needs. One of the important modules of Apache is Multi-Processing Module (MPM), which handles incoming requests and manages multiple processes or threads to handle them efficiently.
Apache provides two popular MPMs, Prefork and Worker, each with its own advantages and limitations. Choosing the right MPM for your website is critical to its performance and stability. In this article, we will compare the two MPMs in detail and help you make an informed decision.
The Prefork MPM is the traditional and default MPM in Apache web server. It creates multiple child processes to handle incoming requests, each running its own copy of the Apache web server. Each child process can handle only one request at a time, which makes it less efficient than other MPMs. However, it is still popular because of its stability and compatibility with older PHP and other scripts.
The Worker MPM is a newer MPM in the Apache web server, designed to improve performance and scalability. It creates multiple threads within a single process, each handling a separate connection. It is more efficient than the Prefork MPM in terms of resource usage and concurrency. However, it requires thread-safe versions of PHP and other scripts.
The following table compares the key features of Apache Prefork and Worker MPMs:
| Feature | Apache Prefork | Apache Worker |
|---|---|---|
| Architecture | Process-based | Thread-based |
| Scalability | Poor | Good |
| Memory Usage | High | Low |
| Performance | Slow | Fast |
| Compatibility | Good | Good |
| Stability | Good | Good |
| Flexibility | Limited | Flexible |
As you can see, Apache Worker has several advantages over Apache Prefork. It is more scalable, uses less memory, and performs better for high-traffic websites. Apache Prefork, on the other hand, is simpler and more stable. It is still a good option for small websites or websites that do not receive a lot of traffic.
Here are some sample configurations for Apache Prefork and Worker:
<IfModule mpm_prefork_module>
ServerLimit 100
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxClients 100
MaxRequestsPerChild 0
</IfModule>
<IfModule mpm_worker_module>
ServerLimit 100
StartServers 2
MaxClients 150
MinSpareThreads 25
MaxSpareThreads 75
ThreadsPerChild 25
MaxRequestsPerChild 0
</IfModule>
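The scalability difference in the table can be made concrete with some back-of-the-envelope arithmetic over the sample configurations above (a sketch of Apache's accounting, not its exact algorithm):

```python
# In the worker MPM, MaxClients caps total threads, and Apache runs
# ceil(MaxClients / ThreadsPerChild) child processes to provide them.
def worker_processes_needed(max_clients, threads_per_child):
    return -(-max_clients // threads_per_child)  # ceiling division

# With the sample worker configuration above (MaxClients 150, ThreadsPerChild 25):
print(worker_processes_needed(150, 25))  # 6 processes serve 150 connections

# Prefork needs one process per concurrent connection, so the same
# concurrency would cost 150 processes, hence the higher memory usage.
print(150)
```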
In conclusion, both the Prefork and Worker MPMs have their own advantages and disadvantages. It ultimately depends on the specific needs of your website and the amount of traffic it receives. If you’re unsure which MPM to use, it’s recommended to start with the default Prefork MPM and then switch to the Worker MPM if you experience high traffic and want to improve performance.
Original article source at: https://www.cloudbooklet.com/
Apache Druid is a real-time analytics database designed for rapid analytics on large datasets. It is most often used to power use cases where real-time ingestion, high uptime, and fast query performance are needed. Druid can be used to analyze billions of rows not only in batch but also in real time. It offers many integrations with different technologies such as Apache Kafka, cloud storage, S3, Hive, HDFS, DataSketches, Redis, and more. It also follows the principle of an immutable past and an append-only future: past events happen once and never change, while appends only take place for new events. It provides users with fast, deep exploration of large-scale transaction data.
Some of the exciting characteristics are:
Some of the common use cases of Druid are:
Druid’s core architecture combines ideas from data warehouses, log search systems, and time-series databases.
It uses column-oriented storage, so it loads only the columns needed for a particular query. This helps with fast scans and aggregations.
It can process a query in parallel across the entire cluster, an approach also termed Massively Parallel Processing.
Druid is mostly deployed in clusters of tens to hundreds of servers, offering ingest rates of millions of records per second, query latencies of sub-second to a few seconds, and retention of trillions of records.
Druid can ingest data either in real-time (Ingested data can be queried immediately) or in batches.
It is a fault-tolerant architecture that won’t lose data. Once Druid ingests data, a copy is safely stored in deep storage (cloud storage, Amazon S3, Redis, HDFS, and many more). Users' data can easily be recovered from this deep storage even if all of Druid’s servers fail. This replication also ensures that queries are still possible while the system recovers.
Druid uses compressed bitmap indexes (Concise and Roaring) to enable faster filtering.
All data in Druid must have a timestamp column, as the data is always partitioned by time, and every query has a time filter.
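As an illustration of this time partitioning (a sketch, not Druid's internal code), rows can be assigned to segments by bucketing their timestamps, here at hourly granularity:

```python
from datetime import datetime, timezone

# Sketch: every row carries a timestamp, and rows are grouped into
# time-bucketed segments, here one segment per hour.
def segment_bucket(ts):
    # The row's segment is identified by the hour it falls into.
    return ts.strftime("%Y-%m-%dT%H:00Z")

row_time = datetime(2023, 3, 18, 14, 35, 9, tzinfo=timezone.utc)
print(segment_bucket(row_time))  # 2023-03-18T14:00Z
```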
Users can easily stream data natively using Druid from message buses like Kafka, kinesis, and many more. It can also load batch files from the data lakes like HDFS and Amazon S3.
Druid is mainly composed of the following processes:
The processes described above are organized into 3 types of servers: Master, Query, and Data.
It runs the Coordinator and Overlord processes. Basically, it manages data ingestion and availability: the Master server is responsible for ingestion jobs and for coordinating the availability of data on the Data servers.
It runs Broker and optional Router processes. Basically, it handles queries from external clients, providing the endpoints that users and client applications interact with and routing queries to Data servers or other Query servers.
It runs Middle Manager and Historical processes. These execute ingestion jobs and store the queryable data. Other than these three servers and six processes, Druid also requires storage for metadata and deep storage.
It is used to store the metadata of the system (audit records, datasources, schemas, and so on). Apache Derby is the default metadata store for Druid and is suggested for experimental purposes, but it is not suitable for production because it does not support a multi-node cluster with high availability. For production, MySQL or PostgreSQL is the best choice. The metadata store holds all of the metadata that a Druid cluster needs in order to work. MySQL as a metadata storage database is used to acquire:
PostgreSQL, as a metadata storage database, is used to acquire:
Apache Druid uses separate storage for any data ingested, which makes it fault-tolerant. Some deep storage technologies are cloud storage, Amazon S3, HDFS, and Redis, among others.
Data in Druid is organized into segments, each generally containing up to a few million rows. Loading data into Druid is known as ingestion or indexing. Druid fully supports both batch and streaming ingestion. Some of the technologies supported by Druid are Kinesis, cloud storage, Apache Kafka, and local storage. Druid requires some structure in the data it ingests; in general, data should consist of a timestamp, dimensions, and metrics.
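For example, a single ingested row might look like the following JSON, with a timestamp column, dimensions, and metrics (all field names here are hypothetical):

```python
import json

# A hypothetical ingested row: a timestamp column, dimensions to
# filter/group by, and metrics to aggregate.
row = {
    "timestamp": "2023-03-18T14:35:09Z",  # required time column
    "page": "/checkout",                  # dimension
    "country": "DE",                      # dimension
    "latency_ms": 182,                    # metric
    "bytes_sent": 5120,                   # metric
}

print(json.dumps(row))
```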
Druid uses Apache ZooKeeper to integrate all of these services. The ZooKeeper bundled with Druid is fine for experiments, but for production you have to run your own ZooKeeper installation. A Druid cluster can only be as stable as its ZooKeeper. ZooKeeper is responsible for most of the communication that keeps the Druid cluster functioning, since Druid nodes do not talk to each other directly.
Zookeeper is responsible for the following operations:
For maximum Zookeeper stability, the user has to follow the following practices:
If ZooKeeper goes down, the cluster will continue to operate, but it will neither be able to add new data segments nor react effectively to the loss of a node. The failure of ZooKeeper therefore puts the cluster in a degraded state.
Users can monitor Druid using the metrics it generates. Druid generates metrics related to queries, coordination, and ingestion. These metrics are emitted as JSON objects, either to a runtime log file or over HTTP (to a service like Kafka). Metric emission is disabled by default.
Metrics emitted by Druid share a common set of fields.
Emitted metrics may have dimensions beyond those listed. The emission period defaults to 1 minute; it can be changed via the `druid.monitoring.emissionPeriod` property. Metrics available are:
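A metric event might look like the following JSON object; the field names shown follow the common-field pattern described above but should be treated as an approximation rather than Druid's exact schema:

```python
import json

# Hypothetical Druid metric event following the common-field pattern.
metric_event = {
    "timestamp": "2023-03-18T14:35:00.000Z",
    "service": "druid/broker",   # which Druid process emitted it
    "host": "broker-01:8082",
    "metric": "query/time",      # metric name
    "value": 47,                 # e.g. milliseconds for this query
}

print(json.dumps(metric_event))
```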
Apache Druid is among the best options on the market when it comes to analyzing data in clusters and providing quick insight into all the data processed. With ZooKeeper at its side, one can ease their work with it and thrive in the DataOps market. There are also many libraries for interacting with it. To validate that a service is running, one can use the jps command: since Druid nodes are Java processes, they show up in the output of `jps -m`. With that much ease in monitoring Druid and working with such a vast architecture, it really is the last bite of ice cream for a DataOps engineer.
Original article source at: https://www.xenonstack.com/
lakeFS is an open-source tool that transforms your object storage into a Git-like repository. It enables you to manage your data lake the way you manage your code.
With lakeFS you can build repeatable, atomic, and versioned data lake operations - from complex ETL jobs to data science and analytics.
lakeFS supports AWS S3, Azure Blob Storage, and Google Cloud Storage as its underlying storage service. It is API compatible with S3 and works seamlessly with all modern data frameworks such as Spark, Hive, AWS Athena, Presto, etc.
For more information, see the official lakeFS documentation.
When working with a data lake, it’s useful to have replicas of your production environment. These replicas allow you to test ETL jobs and understand changes to your data without impacting downstream data consumers.
Running ETL and transformation jobs directly in production without proper ETL testing is a guaranteed way to have data issues flow into dashboards, ML models, and other consumers sooner or later. The most common approach to avoid making changes directly in production is to create and maintain multiple data environments and perform ETL testing on them: a dev environment to develop the data pipelines, and a test environment where pipeline changes are tested before pushing them to production. With lakeFS you can create branches and get a copy of the full production data without copying anything. This enables a faster and easier process of ETL testing.
Data changes frequently. This makes the task of keeping track of its exact state over time difficult. Oftentimes, people maintain only one state of their data: its current state.
This has a negative impact on the work, as it becomes hard to:
In comparison, lakeFS exposes a Git-like interface to data that allows keeping track of more than just the current state of data. This makes reproducing its state at any point in time straightforward.
Data pipelines feed processed data from data lakes to downstream consumers like business dashboards and machine learning models. As more and more organizations rely on data to enable business-critical decisions, data reliability and trust are of paramount concern. Thus, it’s important to ensure that production data adheres to the data governance policies of businesses. These data governance requirements can be as simple as a file format validation or schema check, or as exhaustive as removing PII (Personally Identifiable Information) from all of an organization’s data.
Thus, to ensure quality and reliability at each stage of the data lifecycle, data quality gates need to be implemented. That is, we need to run Continuous Integration (CI) tests on the data, and only if data governance requirements are met can the data be promoted to production for business use.
Every time there is an update to production data, the best practice is to run CI tests and then promote (deploy) the data to production. With lakeFS you can create hooks that make sure only data that passed these tests will become part of production.
A rollback operation is used to fix critical data errors immediately.
What is a critical data error? Think of a situation where erroneous or misformatted data causes a significant issue with an important service or function. In such situations, the first thing to do is stop the bleeding.
Rolling back returns data to a state in the past, before the error was present. You might not be showing all the latest data after a rollback, but at least you aren’t showing incorrect data or raising errors. Since lakeFS provides versions of the data without making copies, you can time travel between versions and roll back to the version of the data from before the error was introduced.
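The mechanics can be sketched with a toy versioned store (not lakeFS's implementation): commits are snapshots of pointers to immutable objects, so a rollback only moves a branch reference and copies no data:

```python
# Toy model of versioning without copying: commits are snapshots of
# {path: object_id} pointers; rollback moves the branch pointer.
class TinyVersionedStore:
    def __init__(self):
        self.commits = [{}]   # history of snapshots
        self.head = 0         # branch pointer

    def commit(self, changes):
        snapshot = {**self.commits[self.head], **changes}
        self.commits.append(snapshot)
        self.head = len(self.commits) - 1
        return self.head

    def rollback(self, commit_id):
        self.head = commit_id  # no data copied, only the pointer moves

    def read(self, path):
        return self.commits[self.head].get(path)

store = TinyVersionedStore()
good = store.commit({"daily_report.csv": "obj-a1"})
store.commit({"daily_report.csv": "obj-bad"})  # a broken ETL run
store.rollback(good)                            # stop the bleeding
print(store.read("daily_report.csv"))
```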
Use this section to learn about lakeFS. For a production-suitable deployment, see the docs.
Ensure you have Docker installed on your computer.
Run the following command:
docker run --pull always --name lakefs -p 8000:8000 treeverse/lakefs run --local-settings
Open http://127.0.0.1:8000/ in your web browser to set up an initial admin user. You will use this user to log in and send API requests.
You can try lakeFS:
Once lakeFS is installed, you are ready to create your first repository!
Stay up to date and get lakeFS support via:
Author: Treeverse
Source Code: https://github.com/treeverse/lakeFS
License: Apache-2.0 license
Apache Airflow is a tool that helps you manage and schedule data pipelines. According to the documentation, it lets you "programmatically author, schedule, and monitor workflows."
Airflow is a crucial tool for data engineers and scientists. In this article, I'll show you how to install it on Windows without Docker and how to write a DAG script.
Although it's recommended to run Airflow with Docker, this method works for low-memory machines that are unable to run Docker.
This article assumes that you're familiar with using the command line and can set up your development environment as directed.
You need Python 3.8 or higher, Windows 10 or higher, and the Windows Subsystem for Linux (WSL2) to follow this tutorial.
WSL2 allows you to run Linux commands and programs on a Windows operating system.
It provides a Linux-compatible environment that runs natively on Windows, enabling users to use Linux command-line tools and utilities on a Windows machine.
See Microsoft's documentation for instructions on installing WSL2 on your machine.
With Python and WSL2 installed and activated on your machine, launch the terminal by searching for Ubuntu from the start menu.
To work with Airflow on Windows, you need to set up a virtual environment. To do this, you'll need to install the virtualenv package.
Note: make sure you are in your home directory by typing:
cd ~
Then install the virtualenv package:
pip install virtualenv
Create the virtual environment like this:
virtualenv airflow_env
And then activate the environment:
source airflow_env/bin/activate
Create a folder named airflow. Mine will be located at c/Users/[Username]. You can put yours wherever you prefer.
If you do not know how to navigate the terminal, you can follow the steps in the image below:
Create an Airflow directory from the terminal
Now that you have created this folder, you have to set it as an environment variable. Open the ~/.bashrc script from the terminal with the command:
nano ~/.bashrc
Then write the following:
export AIRFLOW_HOME=/c/Users/[YourUsername]/airflow
Setup Airflow directory path as an environment variable
Press Ctrl+S to save and Ctrl+X to exit the nano editor.
The Airflow directory path will now be permanently saved as an environment variable. Anytime you open a new terminal, you can recover the value of the variable by typing:
cd $AIRFLOW_HOME
Navigate to Airflow directory using the environment variable
With the virtual environment still active and the current directory pointing to the created Airflow folder, install Apache Airflow:
pip install apache-airflow
Initialize the database:
airflow db init
Create a folder named dags inside the airflow folder. This will be used to store all Airflow scripts.
View files and folders generated by Airflow db init
When Airflow is newly installed, you'll need to create a user. This user will be used to log in to the Airflow UI and perform some admin functions.
airflow users create --username admin --password admin --firstname admin --lastname admin --role Admin --email youremail@email.com
Check the created user:
airflow users list
Create an Airflow user and list the created user
Run the scheduler with this command:
airflow scheduler
Launch another terminal, activate the airflow virtual environment, cd to $AIRFLOW_HOME, and run the webserver:
airflow webserver
If the default port 8080 is in use, change the port by typing:
airflow webserver --port <port number>
Log in to the UI using the username created earlier with "airflow users create".
In the UI, you can view pre-created DAGs that come with Airflow by default.
A DAG is a Python script for organizing and managing tasks in a workflow.
To create a DAG, navigate into the dags folder created inside the $AIRFLOW_HOME directory. Create a file named "hello_world_dag.py". Use VS Code if it's available.
Enter the code from the image below, and save it:
Example DAG script in VS Code editor
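If the screenshot is unavailable, the following minimal DAG (a reconstruction, not the exact script from the image) will do. It requires Airflow to be installed in your environment and defines a single Python task:

```python
# hello_world_dag.py: a minimal "hello world" DAG, reconstructed as a
# stand-in for the code shown in the original screenshot.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    print("Hello, world!")


with DAG(
    dag_id="hello_world_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    hello_task = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
    )
```

Save this file inside the dags folder you created earlier.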
Go to the Airflow UI and search for hello_world_dag. If it does not show up, try refreshing your browser.
That's it. This completes the installation of Apache Airflow on Windows.
This guide covered how to install Apache Airflow on a Windows machine without Docker and how to write a DAG script.
I do hope the steps outlined above helped you install airflow on your Windows machine without Docker.
Original article source at https://www.freecodecamp.org
Learn about Apache Kafka from the Apache Kafka Handbook, and learn how to get started using it. Why should you learn Apache Kafka?
Apache Kafka is an open-source event streaming platform that can transport huge volumes of data at very low latency.
Companies like LinkedIn, Uber, and Netflix use Kafka to process trillions of events and petabytes of data each day.
Kafka was originally developed at LinkedIn, to help handle their real-time data feeds. It's now maintained by the Apache Software Foundation, and is widely adopted in industry (being used by 80% of Fortune 100 companies).
Kafka lets you:
The main thing Kafka does is help you efficiently connect diverse data sources with the many different systems that might need to use that data.
Kafka helps you connect data sources to the systems using that data
Some of the things you can use Kafka for include:
What all these uses have in common is that they need to take in and process data in real time, often at huge scales. This is something Kafka excels at. To give one example, Pinterest uses Kafka to handle up to 40 million events per second.
Kafka is distributed, which means it runs as a cluster of nodes spread across multiple servers. It's also replicated, meaning that data is copied in multiple locations to protect it from a single point of failure. This makes Kafka both scalable and fault-tolerant.
Kafka is also fast. It's optimized for high throughput, making effective use of disk storage and batched network requests.
This article will:
Things the article won't cover:
kafka-console-producer
kafka-console-consumer
kafka-consumer-groups
Before we dive into Kafka, we need some context on event streaming and event-driven architectures.
An event is a record that something happened, as well as information about what happened. For example: a customer placed an order, a bank approved a transaction, inventory management updated stock levels.
Events can trigger one or more processes to respond to them. For example: sending an email receipt, transmitting funds to an account, updating a real-time dashboard.
Event streaming is the process of capturing events in real-time from sources (such as web applications, databases, or sensors) to create streams of events. These streams are potentially unending sequences of records.
The event stream can be stored, processed, and sent to different destinations, also called sinks. The destinations that consume the streams could be other applications, databases, or data pipelines for further processing.
As applications have become more complex, often being broken up into different microservices distributed across multiple data centers, many organizations have adopted an event-driven architecture for their applications.
This means that instead of parts of your application directly asking each other for updates about what happened, they each publish events to event streams. Other parts of the application continuously subscribe to these streams and only act when they receive an event that they are interested in.
This architecture helps ensure that if part of your application goes down, other parts won't also fail. Additionally, you can add new features by adding new subscribers to the event stream, without having to rewrite the existing codebase.
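The pattern can be sketched with a minimal in-memory event bus (plain Python, no Kafka involved): producers publish to named streams, and only interested subscribers react:

```python
# Minimal in-memory publish/subscribe sketch of an event-driven design.
class EventBus:
    def __init__(self):
        self.subscribers = {}  # stream name -> list of callbacks

    def subscribe(self, stream, callback):
        self.subscribers.setdefault(stream, []).append(callback)

    def publish(self, stream, event):
        for callback in self.subscribers.get(stream, []):
            callback(event)

bus = EventBus()
received = []

# The (hypothetical) email service only cares about the 'orders' stream.
bus.subscribe("orders", lambda e: received.append(f"email receipt for {e['id']}"))

bus.publish("orders", {"id": 42, "total": 9.99})
bus.publish("clicks", {"page": "/home"})  # no subscriber: nothing breaks

print(received)
```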
Kafka has become one of the most popular ways to implement event streaming and event-driven architectures. But it does have a bit of a learning curve and you need to understand a couple of concepts before you can make effective use of it.
These core concepts are:
When you write data to Kafka, or read data from it, you do this in the form of messages. You'll also see them called events or records.
A message consists of:
A Kafka message consisting of key, value, timestamp, compression type, and headers
Every event in Kafka is, at its simplest, a key-value pair. These are serialized into binary, since Kafka itself handles arrays of bytes rather than complex language-specific objects.
Keys are usually strings or integers and aren't unique for every message. Instead, they point to a particular entity in the system, such as a specific user, order, or device. Keys can be null, but when they are included they are used for dividing topics into partitions (more on partitions below).
The message value contains details about the event that happened. This could be as simple as a string or as complex as an object with many nested properties. Values can be null, but usually aren't.
By default, the timestamp records when the message was created. You can overwrite this if your event actually occurred earlier and you want to record that time instead.
Messages are usually small (less than 1 MB) and sent in a standard data format, such as JSON, Avro, or Protobuf. Even so, they can be compressed to save on data. The compression type can be set to gzip, lz4, snappy, zstd, or none.
Events can also optionally have headers, which are key-value pairs of strings containing metadata, such as where the event originated from or where you want it routed to.
Once a message is sent into a Kafka topic, it also receives a partition number and offset id (more about these later).
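Putting those fields together, a message can be modeled like this (a sketch of the message structure, not any particular client library's API):

```python
import json
import time

# A Kafka message as described above: key, value, timestamp, and headers.
message = {
    "key": "user-1042",                       # identifies an entity; not unique per message
    "value": {"action": "like", "post_id": 77},
    "timestamp_ms": int(time.time() * 1000),  # creation time by default
    "headers": {"origin": "web-frontend"},    # optional string metadata
}

# Kafka itself handles only byte arrays, so key and value get serialized.
key_bytes = message["key"].encode("utf-8")
value_bytes = json.dumps(message["value"]).encode("utf-8")

print(key_bytes)
print(value_bytes)
```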
Kafka stores messages in a topic, an ordered sequence of events, also called an event log.
A Kafka topic containing messages, each with a unique offset
Different topics are identified by their names and will store different kinds of events. For example, a social media application might have posts, likes, and comments topics to record every time a user creates a post, likes a post, or leaves a comment.
Multiple applications can write to and read from the same topic. An application might also read messages from one topic, filter or transform the data, and then write the result to another topic.
One important feature of topics is that they are append-only. When you write a message to a topic, it's added to the end of the log. Events in a topic are immutable. Once they're written to a topic, you can't change them.
A Producer writing events to topics and a Consumer reading events from topics
Unlike with messaging queues, reading an event from a topic doesn't delete it. Events can be read as often as needed, perhaps several times by multiple different applications.
Topics are also durable, holding onto messages for a specific period (by default 7 days) by saving them to physical storage on disk.
You can configure topics so that messages expire after a certain amount of time, or when a certain amount of storage is exceeded. You can even store messages indefinitely as long as you can pay for the storage costs.
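Both expiry conditions correspond to per-topic configuration settings. These are Kafka's real config names; the values shown are just examples:

```properties
# Keep messages for 7 days (the default), then delete them
retention.ms=604800000
# Or cap the storage used per partition; -1 means no size limit
retention.bytes=-1
```

You can set these per topic, for example with the --config option when creating a topic.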
In order to help Kafka to scale, topics can be divided into partitions. This breaks up the event log into multiple logs, each of which lives on a separate node in the Kafka cluster. This means that the work of writing and storing messages can be spread across multiple machines.
When you create a topic, you specify the amount of partitions it has. The partitions are themselves numbered, starting at 0. When a new event is written to a topic, it's appended to one of the topic's partitions.
A topic divided into three partitions
If messages have no key, they will be evenly distributed among partitions in a round robin manner: partition 0, then partition 1, then partition 2, and so on. This way, all partitions get an even share of the data but there's no guarantee about the ordering of messages.
Messages that have the same key will always be sent to the same partition, and in the same order. The key is run through a hashing function which turns it into an integer. This output is then used to select a partition.
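Kafka's default partitioner actually uses a murmur2 hash of the key bytes; the sketch below substitutes Java's hashCode() and a plain counter just to illustrate the two behaviors described above:

```java
// Simplified sketch of partition selection. Kafka's default partitioner uses a
// murmur2 hash (and, since Kafka 2.4, "sticky" batching for null keys); this
// toy version shows the round-robin vs. hash-the-key distinction only.
public class PartitionerSketch {
    private int roundRobin = 0;

    int partitionFor(String key, int numPartitions) {
        if (key == null) {
            // No key: spread messages evenly across partitions
            return roundRobin++ % numPartitions;
        }
        // Same key always hashes to the same partition, preserving per-key ordering
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        PartitionerSketch p = new PartitionerSketch();
        System.out.println(p.partitionFor("customer-7", 3)); // always the same partition
        System.out.println(p.partitionFor(null, 3));         // 0, then 1, then 2, ...
    }
}
```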
Messages without keys are sent across partitions, while messages with the same keys are sent to the same partition
Messages within each partition are guaranteed to be ordered. For example, all messages with the same customer_id as their key will be sent to the same partition in the order in which Kafka received them.
Each message in a partition gets an id that is an incrementing integer, called an offset. Offsets start at 0 and are incremented every time Kafka writes a message to a partition. This means that each message in a given partition has a unique offset.
Offsets are unique within a partition but not between partitions
Offsets are not reused, even when older messages get deleted. They continue to increment, giving each new message in the partition a unique id.
When data is read from a partition, it is read in order from the lowest existing offset upwards. We'll see more about offsets when we cover Kafka consumers.
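A partition behaves like the toy append-only log below (plain Java, illustrative only: each append gets the next offset, and reads start from an offset and never remove anything):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a single partition: an append-only log where each message
// receives the next offset, and reads go from a given offset upwards.
public class PartitionLog {
    private final List<String> log = new ArrayList<>();

    long append(String message) {
        log.add(message);
        return log.size() - 1; // the offset assigned to this message
    }

    List<String> readFrom(long offset) {
        // Reading never removes messages; any consumer can re-read them
        return log.subList((int) offset, log.size());
    }

    public static void main(String[] args) {
        PartitionLog p = new PartitionLog();
        System.out.println(p.append("order created"));  // offset 0
        System.out.println(p.append("order shipped"));  // offset 1
        System.out.println(p.readFrom(1));              // [order shipped]
    }
}
```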
A single "server" running Kafka is called a broker. In reality, this might be a Docker container running in a virtual machine. But it can be a helpful mental image to think of brokers as individual servers.
A Kafka cluster made up of three brokers
Multiple brokers working together make up a Kafka cluster. There might be a handful of brokers in a cluster, or more than 100. When a client application connects to one broker, it discovers and can connect to every broker in the cluster.
By running as a cluster, Kafka becomes more scalable and fault-tolerant. If one broker fails, the others will take over its work to ensure there is no downtime or data loss.
Each broker manages a set of partitions and handles requests to write data to or read data from these partitions. Partitions for a given topic will be spread evenly across the brokers in a cluster to help with load balancing. Brokers also manage replicating partitions to keep their data backed up.
Partitions spread across brokers
To protect against data loss if a broker fails, Kafka writes the same data to copies of a partition on multiple brokers. This is called replication.
The main copy of a partition is called the leader, while the replicas are called followers.
The data from the leader partition is copied to follower partitions on different brokers
When a topic is created, you set a replication factor for it. This controls how many replicas get written to. A replication factor of three is common, meaning data gets written to one leader and replicated to two followers. So even if two brokers failed, your data would still be safe.
Whenever you write messages to a partition, you're writing to the leader partition. Kafka then automatically copies these messages to the followers. As such, the logs on the followers will have the same messages and offsets as on the leader.
Followers that are up to date with the leader are called In-Sync Replicas (ISRs). Kafka considers a message to be committed once a minimum number of replicas have saved it to their logs. You can configure this to get higher throughput at the expense of less certainty that a message has been backed up.
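That "minimum number of replicas" knob is the min.insync.replicas setting, a real broker and topic config (the value below is illustrative):

```properties
# A message only counts as committed once this many replicas
# (including the leader) have it. With a replication factor of 3,
# min.insync.replicas=2 tolerates one failed replica.
min.insync.replicas=2
```

Producers that set acks=all will then wait for that many in-sync replicas to acknowledge each write before considering it successful.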
Producers are client applications that write events to Kafka topics. These apps aren't themselves part of Kafka – you write them.
Usually you will use a library to help manage writing events to Kafka. There is an official client library for Java as well as dozens of community-supported libraries for languages such as Scala, JavaScript, Go, Rust, Python, C#, and C++.
A Producer application writing to multiple topics
Producers are totally decoupled from consumers, which read from Kafka. They don't know about each other and their speed doesn't affect each other. Producers aren't affected if consumers fail, and the same is true for consumers.
If you need to, you could write an application that writes certain events to Kafka and reads other events from Kafka, making it both a producer and a consumer.
Producers take a key-value pair, generate a Kafka message, and then serialize it into binary for transmission across the network. You can adjust the configuration of producers to batch messages together based on their size or some fixed time limit to optimize writing messages to the Kafka brokers.
It's the producer that decides which partition of a topic to send each message to. Again, messages without keys will be distributed evenly among partitions, while messages with keys are all sent to the same partition.
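The batching behavior can be sketched like this. It's a toy model: real producers batch per partition, flush on a byte limit (batch.size) and a time limit (linger.ms), and send batches over the network; this version only shows the size trigger:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of producer batching: messages accumulate in a buffer and are
// "sent" as one request once the batch reaches a size limit.
public class BatchingSketch {
    private final List<String> batch = new ArrayList<>();
    private final int batchSize;
    int batchesSent = 0;

    BatchingSketch(int batchSize) { this.batchSize = batchSize; }

    void send(String message) {
        batch.add(message);
        if (batch.size() >= batchSize) flush();
    }

    void flush() {
        if (!batch.isEmpty()) {
            batchesSent++; // one network request covers the whole batch
            batch.clear();
        }
    }

    public static void main(String[] args) {
        BatchingSketch producer = new BatchingSketch(3);
        for (int i = 0; i < 7; i++) producer.send("msg-" + i);
        producer.flush(); // close() would flush any remaining messages
        System.out.println(producer.batchesSent); // 3
    }
}
```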
Consumers are client applications that read messages from topics in a Kafka cluster. Like with producers, you write these applications yourself and can make use of client libraries to support the programming language your application is built with.
A Consumer reading messages from multiple topics
Consumers can read from one or more partitions within a topic, and from one or more topics. Messages are read in order within a partition, from the lowest available offset to the highest. But if a consumer reads data from several partitions in the same topic, the message order between these partitions is not guaranteed.
For example, a consumer might read messages from partition 0, then partition 2, then partition 1, then back to partition 0. The messages from partition 0 will be read in order, but there might be messages from the other partitions mixed among them.
It's important to remember that reading a message does not delete it. The message is still available to be read by any other consumer that needs to access it. It's normal for multiple consumers to read from the same topic if they each have uses for the data in it.
By default, when a consumer starts up it will read from the current offset in a partition. But consumers can also be configured to go back and read from the oldest existing offset.
Consumers deserialize messages, converting them from binary into a collection of key-value pairs that your application can then work with. The format of a message should not change during a topic's lifetime or your producers and consumers won't be able to serialize and deserialize it correctly.
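For string keys and values, serialization is essentially a UTF-8 round trip, which is what Kafka's StringSerializer and StringDeserializer do by default. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;

// Kafka moves plain byte arrays: producers serialize values into bytes
// and consumers deserialize them back. This mirrors the string serde.
public class SerdeSketch {
    static byte[] serialize(String value) {
        return value == null ? null : value.getBytes(StandardCharsets.UTF_8);
    }

    static String deserialize(byte[] data) {
        return data == null ? null : new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] wire = serialize("hello kafka");
        System.out.println(deserialize(wire)); // hello kafka
    }
}
```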
One thing to be aware of is that consumers pull messages from Kafka; Kafka doesn't push messages to them. This protects consumers from becoming overwhelmed if Kafka is handling a high volume of messages. If you want to scale consumers, you can run multiple instances of a consumer together in a consumer group.
An application that reads from Kafka can create multiple instances of the same consumer to split up the work of reading from different partitions in a topic. These consumers work together as a consumer group.
When you create a consumer, you can assign it a group id. All consumers in a group will have the same group id.
You can create consumer instances in a group up to the number of partitions in a topic. So if you have a topic with 5 partitions, you can create up to 5 instances of the same consumer in a consumer group. If you ever have more consumers in a group than partitions, the extra consumers will remain idle.
Consumers in a consumer group reading messages from a topic's partitions
If you add another consumer instance to a consumer group, Kafka will automatically redistribute the partitions among the consumers in a process called rebalancing.
Each partition is only assigned to one consumer in a group, but a consumer can read from multiple partitions. Also, multiple different consumer groups (meaning different applications) can read from the same topic at the same time.
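The assignment rules can be sketched as a toy round-robin assignment. Kafka's real assignors (such as the range and sticky assignors) are more sophisticated, but the invariants are the same: every partition goes to exactly one consumer, and consumers beyond the partition count sit idle:

```java
import java.util.ArrayList;
import java.util.List;

// Toy partition assignment for a consumer group: partition p goes to
// consumer p % numConsumers. Extra consumers receive an empty list (idle).
public class AssignmentSketch {
    static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> assignment = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) assignment.add(new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(p % numConsumers).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 5 partitions, 3 consumers: [[0, 3], [1, 4], [2]]
        System.out.println(assign(5, 3));
        // 3 partitions, 5 consumers: the last two consumers are idle
        System.out.println(assign(3, 5));
    }
}
```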
Kafka brokers use an internal topic called __consumer_offsets to keep track of which messages a specific consumer group has successfully processed.
As a consumer reads from a partition, it regularly saves the offset it has read up to and sends this data to the broker it is reading from. This is called the consumer offset and is handled automatically by most client libraries.
A Consumer committing the offsets it has read up to
If a consumer crashes, the consumer offset helps the remaining consumers to know where to start from when they take over reading from the partition.
The same thing happens if a new consumer is added to the group. The consumer group rebalances, the new consumer is assigned a partition, and it picks up reading from the consumer offset of that partition.
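A toy model of committed offsets makes the takeover behavior concrete (illustrative only; in reality the broker stores these commits in the __consumer_offsets topic):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of consumer offsets: the group commits the offset it has
// processed up to, so a replacement consumer knows where to resume
// after a crash or a rebalance.
public class OffsetTracker {
    private final Map<Integer, Long> committed = new HashMap<>();

    void commit(int partition, long offset) {
        committed.put(partition, offset);
    }

    long resumeFrom(int partition) {
        // A new consumer picks up just after the last committed offset
        return committed.getOrDefault(partition, -1L) + 1;
    }

    public static void main(String[] args) {
        OffsetTracker group = new OffsetTracker();
        group.commit(0, 41);                     // processed up to offset 41
        System.out.println(group.resumeFrom(0)); // 42: where the next consumer starts
        System.out.println(group.resumeFrom(1)); // 0: nothing committed yet
    }
}
```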
One other topic that we briefly need to cover here is how Kafka clusters are managed. Currently this is usually done using Zookeeper, a service for managing and synchronizing distributed systems. Like Kafka, it's maintained by the Apache Foundation.
Kafka uses Zookeeper to manage the brokers in a cluster, and requires Zookeeper even if you're running a Kafka cluster with only one broker.
Recently, a proposal has been accepted to remove Zookeeper and have Kafka manage itself (KIP-500), but this is not yet widely used in production.
Zookeeper keeps track of things like which brokers are in the cluster, which broker is the leader for each partition, and how topics are configured.
A Zookeeper ensemble managing the brokers in a Kafka cluster
Zookeeper itself runs as a cluster called an ensemble. This means that Zookeeper can keep working even if one node in the cluster fails. New data gets written to the ensemble's leader and replicated to the followers. Your Kafka brokers can read this data from any of the Zookeeper nodes in the ensemble.
Now that you understand the main concepts behind Kafka, let's get some hands-on practice working with Kafka.
You're going to install Kafka on your own computer, practice interacting with Kafka brokers from the command line, and then build a simple producer and consumer application with Java.
At the time of writing this guide, the latest stable version of Kafka is 3.3.1. Check kafka.apache.org/downloads to see if there is a more recent stable version. If there is, you can replace "3.3.1" with the latest stable version in all of the following instructions.
If you're using macOS, I recommend using Homebrew to install Kafka. It will make sure you have Java installed before it installs Kafka.
If you don't already have Homebrew installed, install it by following the instructions at brew.sh.
Next, run brew install kafka in a terminal. This will install Kafka's binaries at /usr/local/bin.
Finally, run kafka-topics --version in a terminal and you should see 3.3.1. If you do, you're all set.
To make it easier to work with Kafka, you can add Kafka to the PATH environment variable. Open your ~/.bashrc (if using Bash) or ~/.zshrc (if using Zsh) and add the following line, replacing USERNAME with your username:
PATH="$PATH:/Users/USERNAME/kafka_2.13-3.3.1/bin"
You'll need to close your terminal for this change to take effect.
Now, if you run echo $PATH you should see that the Kafka bin directory has been added to your path.
Kafka isn't natively supported on Windows, so you will need to use either WSL2 or Docker. I'm going to show you WSL2 since it's the same steps as Linux.
To set up WSL2 on Windows, follow the instructions in the official docs.
From here on, the instructions are the same for both WSL2 and Linux.
First, install Java 11 by running the following commands:
wget -O- https://apt.corretto.aws/corretto.key | sudo apt-key add -
sudo add-apt-repository 'deb https://apt.corretto.aws stable main'
sudo apt-get update; sudo apt-get install -y java-11-amazon-corretto-jdk
Once this has finished, run java -version and you should see something like:
openjdk version "11.0.17" 2022-10-18 LTS
OpenJDK Runtime Environment Corretto-11.0.17.8.1 (build 11.0.17+8-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.17.8.1 (build 11.0.17+8-LTS, mixed mode)
From your root directory, download Kafka with the following command:
wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.13-3.3.1.tgz
The 2.13 means it is using version 2.13 of Scala, while 3.3.1 refers to the Kafka version.
Extract the contents of the download with:
tar xzf kafka_2.13-3.3.1.tgz
If you run ls, you'll now see kafka_2.13-3.3.1 in your root directory.
To make it easier to work with Kafka, you can add Kafka to the PATH environment variable. Open your ~/.bashrc (if using Bash) or ~/.zshrc (if using Zsh) and add the following line, replacing USERNAME with your username:
PATH="$PATH:/home/USERNAME/kafka_2.13-3.3.1/bin"
You'll need to close your terminal for this change to take effect.
Now, if you run echo $PATH you should see that the Kafka bin directory has been added to your path.
Run kafka-topics.sh --version in a terminal and you should see 3.3.1. If you do, you're all set.
Since Kafka uses Zookeeper to manage clusters, you need to start Zookeeper before you start Kafka.
In one terminal window, start Zookeeper with:
/usr/local/bin/zookeeper-server-start /usr/local/etc/zookeeper/zoo.cfg
In another terminal window, start Kafka with:
/usr/local/bin/kafka-server-start /usr/local/etc/kafka/server.properties
While using Kafka, you need to keep both these terminal windows open. Closing them will shut down Kafka.
In one terminal window, start Zookeeper with:
~/kafka_2.13-3.3.1/bin/zookeeper-server-start.sh ~/kafka_2.13-3.3.1/config/zookeeper.properties
In another terminal window, start Kafka with:
~/kafka_2.13-3.3.1/bin/kafka-server-start.sh ~/kafka_2.13-3.3.1/config/server.properties
While using Kafka, you need to keep both these terminal windows open. Closing them will shut down Kafka.
Now that you have Kafka installed and running on your machine, it's time to get some hands-on practice.
When you install Kafka, it comes with a Command Line Interface (CLI) that lets you create and manage topics, as well as produce and consume events.
First, make sure Zookeeper and Kafka are running in two terminal windows.
In a third terminal window, run kafka-topics.sh (on WSL2 or Linux) or kafka-topics (on macOS) to make sure the CLI is working. You'll see a list of all the options you can pass to the CLI.
kafka-topics options
Note: When working with the Kafka CLI, the command will be kafka-topics.sh on WSL2 and Linux. On macOS, it will be kafka-topics.sh if you directly installed the Kafka binaries, or kafka-topics if you used Homebrew. So if you're using Homebrew, remove the .sh extension from the example commands in this section.
To see the topics available on the Kafka broker on your local machine, use:
kafka-topics.sh --bootstrap-server localhost:9092 --list
This means "Connect to the Kafka broker running on localhost:9092 and list all topics there". --bootstrap-server refers to the Kafka broker you are trying to connect to, and localhost:9092 is the host and port it's listening on. You won't see any output since you haven't created any topics yet.
To create a topic (with the default replication factor and number of partitions), use the --create and --topic options and pass them a topic name:
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_first_topic
If you use an _ or . in your topic name, you will see the following warning:
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Since Kafka could confuse my.first.topic with my_first_topic, it's best to use only underscores or only periods when naming topics.
To describe the topics on a broker, use the --describe option:
kafka-topics.sh --bootstrap-server localhost:9092 --describe
This will print the details of all the topics on this broker, including the number of partitions and their replication factor. By default, these will both be set to 1.
If you add the --topic option and the name of a topic, it will describe only that topic:
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my_first_topic
To create a topic with multiple partitions, use the --partitions option and pass it a number:
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_second_topic --partitions 3
To create a topic with a replication factor higher than the default, use the --replication-factor option and pass it a number:
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_third_topic --partitions 3 --replication-factor 3
You should get the following error:
ERROR org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.
Since you're only running one Kafka broker on your machine, you can't set a replication factor higher than one. If you were running a cluster with multiple brokers, you could set a replication factor as high as the total number of brokers.
To delete a topic, use the --delete option and specify a topic with the --topic option:
kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic my_first_topic
You won't get any output to say the topic was deleted, but you can check using --list or --describe.
kafka-console-producer
You can produce messages to a topic from the command line using kafka-console-producer.
Run kafka-console-producer.sh to see the options you can pass to it.
kafka-console-producer options
To create a producer connected to a specific topic, run:
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic TOPIC_NAME
Let's produce messages to the my_first_topic topic.
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic my_first_topic
Your prompt will change and you will be able to type text. Press enter to send each message. You can keep sending messages until you press ctrl + c.
Sending messages using kafka-console-producer
If you produce messages to a topic that doesn't exist, you'll get a warning, but the topic will be created and the messages will still get sent. It's better to create a topic in advance, however, so you can specify partitions and replication.
By default, the messages sent from kafka-console-producer have their keys set to null, and so they will be evenly distributed to all partitions.
You can set a key by using the --property option to set parse.key to true and providing a key separator, such as :. For example, we can create a books topic and use each book's genre as a key.
kafka-topics.sh --bootstrap-server localhost:9092 --topic books --create
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic books --property parse.key=true --property key.separator=:
Now you can enter keys and values in the format key:value. Anything to the left of the key separator will be interpreted as a message key, and anything to the right as a message value.
science_fiction:All Systems Red
fantasy:Uprooted
horror:Mexican Gothic
Producing messages with keys and values
Now that you've produced messages to a topic from the command line, it's time to consume those messages from the command line.
kafka-console-consumer
You can consume messages from a topic from the command line using kafka-console-consumer.
Run kafka-console-consumer.sh to see the options you can pass to it.
kafka-console-consumer options
To create a consumer, run:
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic TOPIC_NAME
When you start a consumer, by default it will read messages as they are written to the end of the topic. It won't read messages that were previously sent to the topic.
If you want to read the messages you already sent to a topic, use the --from-beginning option to read from the beginning of the topic:
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my_first_topic --from-beginning
The messages might appear "out of order". Remember, messages are ordered within a partition but ordering can't be guaranteed between partitions. If you don't set a key, they will be sent round robin between partitions and ordering isn't guaranteed.
You can display additional information about messages, such as their key and timestamp. Use the --formatter option to set the message formatter and the --property option to select which message properties to print.
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my_first_topic --from-beginning --formatter kafka.tools.DefaultMessageFormatter --property print.timestamp=true --property print.key=true --property print.value=true
Consuming messages from a topic
We get each message's timestamp, key, and value. Since we didn't assign any keys when we sent these messages to my_first_topic, their key is null.
kafka-consumer-groups
You can run consumers in a consumer group using the Kafka CLI. To view the documentation for this, run:
kafka-consumer-groups.sh
kafka-consumer-groups options
First, create a topic with three partitions. Each consumer in a group will consume from one partition. If there are more consumers than partitions, any extra consumers will be idle.
kafka-topics.sh --bootstrap-server localhost:9092 --topic fantasy_novels --create --partitions 3
You add a consumer to a group when you create it, using the --group option. If you run the same command multiple times with the same group name, each new consumer will be added to the group.
To create the first consumer in your consumer group, run:
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic fantasy_novels --group fantasy_consumer_group
Next, open two new terminal windows and run the same command again to add a second and third consumer to the consumer group.
Three consumers running in a consumer group
In a different terminal window, create a producer and send a few messages with keys to the topic.
Note: Since Kafka 2.4, Kafka will send messages in batches to one "sticky" partition for better performance. In order to demonstrate messages being sent round robin between partitions (without sending a large volume of messages), we can set the partitioner to RoundRobinPartitioner.
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic fantasy_novels --property parse.key=true --property key.separator=: --property partitioner.class=org.apache.kafka.clients.producer.RoundRobinPartitioner
tolkien:The Lord of the Rings
le_guin:A Wizard of Earthsea
leckie:The Raven Tower
de_bodard:The House of Shattered Wings
okorafor:Who Fears Death
liu:The Grace of Kings
Messages spread between consumers in a consumer group
If you stop one of the consumers, the consumer group will rebalance and future messages will be sent to the remaining consumers.
Now that you have some experience working with Kafka from the command line, the next step is to build a small application that connects to Kafka.
We're going to build a simple Java app that both produces messages to and consumes messages from Kafka. For this we'll use the official Kafka Java client.
If at any point you get stuck, the full code for this project is available on GitHub.
First of all, make sure you have Java (at least JDK 11) and Kafka installed.
We're going to send messages about characters from The Lord of the Rings. So let's create a topic for these messages with three partitions.
From the command line, run:
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic lotr_characters --partitions 3
I recommend using IntelliJ for Java projects, so go ahead and install the Community Edition if you don't already have it. You can download it from jetbrains.com/idea.
In IntelliJ, select File, New, and Project.
Give your project a name and select a location for it on your computer. Make sure you have selected Java as the language, Maven as the build system, and that the JDK is at least Java 11. Then click Create.
Setting up a Maven project in IntelliJ
Note: If you're on Windows, IntelliJ can't use a JDK installed on WSL. To install Java on the Windows side, go to docs.aws.amazon.com/corretto/latest/corretto-11-ug/downloads-list and download the Windows installer. Follow the installation steps, open a command prompt, and run java -version. You should see something like:
openjdk version "11.0.18" 2023-01-17 LTS
OpenJDK Runtime Environment Corretto-11.0.18.10.1 (build 11.0.18+10-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.18.10.1 (build 11.0.18+10-LTS, mixed mode)
Once your Maven project finishes setting up, run the Main class to see "Hello world!" and make sure everything worked.
Next, we're going to install our dependencies. Open up pom.xml and, inside the <project> element, create a <dependencies> element.
We're going to use the Java Kafka client for interacting with Kafka and SLF4J for logging, so add the following inside your <dependencies> element:
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>3.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-api -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>2.0.6</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-simple -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>2.0.6</version>
</dependency>
The package names and version numbers might be red, meaning you haven't downloaded them yet. If this happens, click on View, Tool Windows, and Maven to open the Maven menu. Click on the Reload All Maven Projects icon and Maven will install these dependencies.
Reloading Maven dependencies in IntelliJ
Create a HelloKafka class in the same directory as your Main class and give it the following contents:
package org.example;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class HelloKafka {
    private static final Logger log = LoggerFactory.getLogger(HelloKafka.class);

    public static void main(String[] args) {
        log.info("Hello Kafka");
    }
}
To make sure your dependencies are installed, run this class and you should see [main] INFO org.example.HelloKafka - Hello Kafka printed to the IntelliJ console.
Next, we're going to create a Producer class. You can call this whatever you want as long as it doesn't clash with another class, so don't use KafkaProducer as you'll need that class in a minute.
package org.example;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Producer {
    private static final Logger log = LoggerFactory.getLogger(Producer.class);

    public static void main(String[] args) {
        log.info("This class will produce messages to Kafka");
    }
}
All of our Kafka-specific code is going to go inside this class's main() method.
The first thing we need to do is configure a few properties for the producer. Add the following inside the main() method:
Properties properties = new Properties();
properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
Properties stores a set of properties as pairs of strings. The ones we're using are:
- ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, which specifies the address to use to access the Kafka cluster
- ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, which specifies the serializer to use for message keys
- ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, which specifies the serializer to use for message values
We're going to connect to our local Kafka cluster running on localhost:9092, and use the StringSerializer since both our keys and values will be strings.
Now we can create our producer and pass it the configuration properties.
KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
To send a message, we need to create a ProducerRecord and pass it to our producer. A ProducerRecord contains a topic name and, optionally, a key, value, and partition number. We're going to create the ProducerRecord with the topic to use, the message's key, and the message's value.
ProducerRecord<String, String> producerRecord = new ProducerRecord<>("lotr_characters", "hobbits", "Bilbo");
We can now use the producer's send() method to send the message to Kafka.
producer.send(producerRecord);
Finally, we need to call the close() method to stop the producer. This method handles any messages currently being processed by send() and then closes the producer.
producer.close();
Now it's time to run our producer. Make sure you have Zookeeper and Kafka running, then run the main() method of the Producer class.
Sending a message from a producer in a Java Kafka client app
Note: On Windows, your producer might not be able to connect to a Kafka broker running on WSL. To fix this, you're going to need to do the following:
1. Change into the Kafka config directory: cd ~/kafka_2.13-3.3.1/config/
2. Open server.properties, for example with Nano: nano server.properties
3. Change the line #listeners=PLAINTEXT://:9092 to listeners=PLAINTEXT://[::1]:9092
4. In your Producer class, replace "localhost:9092" with "[::1]:9092"
[::1], or 0:0:0:0:0:0:0:1, refers to the loopback address (or localhost) in IPv6. This is equivalent to 127.0.0.1 in IPv4.
If you change listeners, when you try to access the Kafka broker from the command line you'll also have to use the new address, so use --bootstrap-server [::1]:9092 instead of --bootstrap-server localhost:9092 and it should work.
We can now check that Producer worked by using kafka-console-consumer in another terminal window to read from the lotr_characters topic and see the message printed to the console.
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic lotr_characters --from-beginning
kafka-console-consumer reading the message sent by the producer in our Java app
So far we're only sending one message. If we update Producer to send multiple messages, we'll be able to see how keys are used to divide messages between partitions. We can also take this opportunity to use a callback to view each sent message's metadata.
To do this, we're going to loop over a collection of characters to generate our messages.
So replace this:
ProducerRecord<String, String> producerRecord = new ProducerRecord<>("lotr_characters", "hobbits", "Bilbo");
producer.send(producerRecord);
with this:
// A list of entries rather than a HashMap, so duplicate keys such as "hobbits" are kept
List<Map.Entry<String, String>> characters = List.of(
        Map.entry("hobbits", "Frodo"),
        Map.entry("hobbits", "Sam"),
        Map.entry("elves", "Galadriel"),
        Map.entry("elves", "Arwen"),
        Map.entry("humans", "Éowyn"),
        Map.entry("humans", "Faramir"));
for (Map.Entry<String, String> character : characters) {
    ProducerRecord<String, String> producerRecord = new ProducerRecord<>("lotr_characters", character.getKey(), character.getValue());
    producer.send(producerRecord, (RecordMetadata recordMetadata, Exception err) -> {
        if (err == null) {
            log.info("Message acknowledged. \n" +
                    "topic [" + recordMetadata.topic() + "]\n" +
                    "partition [" + recordMetadata.partition() + "]\n" +
                    "offset [" + recordMetadata.offset() + "]\n" +
                    "timestamp [" + recordMetadata.timestamp() + "]");
        } else {
            log.error("An error occurred while producing messages", err);
        }
    });
}
Here, we're iterating over the collection, creating a ProducerRecord for each entry, and passing the record to send(). Behind the scenes, Kafka will batch these messages together to make fewer network requests. send() can also take a callback as a second argument. We're going to pass it a lambda which will run code when the send() request completes.
If the request completed successfully, we get back a RecordMetadata object with metadata about the message, which we can use to see things such as the partition and offset the message ended up in.
If we get back an exception, we could handle it by retrying to send the message, or alerting our application. In this case, we're just going to log the exception.
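If we did want to retry on failure, it's worth bounding the number of attempts so a persistent error can't loop forever. A minimal stand-alone sketch of that pattern (the retry helper below is illustrative, not part of the Kafka client API):

```java
import java.util.function.Supplier;

// Illustrative bounded-retry helper; not part of the Kafka client API.
public class RetrySketch {
    static <T> T retry(Supplier<T> action, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt failed
    }

    public static void main(String[] args) {
        int[] failures = {0};
        // A fake "send" that fails twice, then succeeds on the third attempt.
        int value = RetrySketch.<Integer>retry(() -> {
            if (failures[0]++ < 2) {
                throw new RuntimeException("simulated send failure");
            }
            return 42;
        }, 3);
        System.out.println(value);
    }
}
```

Note that the real producer already retries transient errors internally (the retries config), so application-level retries like this are only for errors the client reports as final.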
Run the main() method of the Producer class and you should see the message metadata get logged.
The full code for the Producer class should now be:
package org.example;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class Producer {
    private static final Logger log = LoggerFactory.getLogger(Producer.class);

    public static void main(String[] args) {
        log.info("This class produces messages to Kafka");
        Properties properties = new Properties();
        properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
        // A list of entries rather than a HashMap, so duplicate keys such as "hobbits" are kept
        List<Map.Entry<String, String>> characters = List.of(
                Map.entry("hobbits", "Frodo"),
                Map.entry("hobbits", "Sam"),
                Map.entry("elves", "Galadriel"),
                Map.entry("elves", "Arwen"),
                Map.entry("humans", "Éowyn"),
                Map.entry("humans", "Faramir"));
        for (Map.Entry<String, String> character : characters) {
            ProducerRecord<String, String> producerRecord = new ProducerRecord<>("lotr_characters", character.getKey(), character.getValue());
            producer.send(producerRecord, (RecordMetadata recordMetadata, Exception err) -> {
                if (err == null) {
                    log.info("Message acknowledged. \n" +
                            "topic [" + recordMetadata.topic() + "]\n" +
                            "partition [" + recordMetadata.partition() + "]\n" +
                            "offset [" + recordMetadata.offset() + "]\n" +
                            "timestamp [" + recordMetadata.timestamp() + "]");
                } else {
                    log.error("An error occurred while producing messages", err);
                }
            });
        }
        producer.close();
    }
}
Next, we're going to create a consumer to read these messages from Kafka.
First, create a Consumer class. Again, you can call it whatever you want, but don't call it KafkaConsumer, as you will need that class in a moment.
All the Kafka-specific code will go in Consumer's main() method.
package org.example;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Consumer {
    private static final Logger log = LoggerFactory.getLogger(Consumer.class);

    public static void main(String[] args) {
        log.info("This class consumes messages from Kafka");
    }
}
Next, configure the consumer properties.
Properties properties = new Properties();
properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "lotr_consumer_group");
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
Just like with Producer, these properties are a set of string pairs. The ones we're using are:
- ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, which specifies the IP address to use to access the Kafka cluster
- ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, which specifies the deserializer to use for message keys
- ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, which specifies the deserializer to use for message values
- ConsumerConfig.GROUP_ID_CONFIG, which specifies the consumer group this consumer belongs to
- ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, which specifies the offset to start reading from
We're connecting to the Kafka cluster on localhost:9092, using string deserializers since our keys and values are strings, setting a group id for our consumer, and telling the consumer to read from the start of the topic.
Note: If you're running the consumer on Windows and accessing a Kafka broker running on WSL, you'll need to change "localhost:9092" to "[::1]:9092" or "0:0:0:0:0:0:0:1:9092", like you did in Producer.
Next, we create a KafkaConsumer and pass it the configuration properties.
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
We need to tell the consumer which topic, or topics, to subscribe to. The subscribe() method takes in a collection of one or more strings naming the topics you want to read from. Remember, consumers can subscribe to more than one topic at the same time. For this example, we'll use one topic: lotr_characters.
String topic = "lotr_characters";
consumer.subscribe(Arrays.asList(topic));
The consumer is now ready to start reading messages from the topic. It does this by regularly polling for new messages. We'll use a while loop to repeatedly call the poll() method to check for new messages. poll() takes in a duration for how long it should read at a time. It then batches these messages into an iterable called ConsumerRecords. We can then iterate over ConsumerRecords and do something with each individual ConsumerRecord.
In a real-world application, we would process this data or send it to some further destination, like a database or data pipeline. Here, we're just going to log the key, value, partition, and offset for each message we receive.
while (true) {
    ConsumerRecords<String, String> messages = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> message : messages) {
        log.info("key [" + message.key() + "] value [" + message.value() + "]");
        log.info("partition [" + message.partition() + "] offset [" + message.offset() + "]");
    }
}
Now it's time to run our consumer. Make sure you have ZooKeeper and Kafka running. Run the Consumer class and you'll see the messages that Producer previously sent to the lotr_characters topic in Kafka.
The Kafka client app consuming messages that were previously produced to Kafka
Right now, our consumer is running in an infinite loop and polling for new messages every 100 ms. This isn't a problem, but we should add safeguards to handle shutting down the consumer if an exception occurs.
We're going to wrap our code in a try-catch-finally block. If an exception occurs, we can handle it in the catch block. The finally block will then call the consumer's close() method. This closes the socket the consumer is using, commits the offsets it has processed, and triggers a consumer group rebalance so any other consumers in the group can take over reading the partitions this consumer was handling.
try {
    // subscribe to topic(s)
    String topic = "lotr_characters";
    consumer.subscribe(Arrays.asList(topic));
    while (true) {
        // poll for new messages
        ConsumerRecords<String, String> messages = consumer.poll(Duration.ofMillis(100));
        // handle message contents
        for (ConsumerRecord<String, String> message : messages) {
            log.info("key [" + message.key() + "] value [" + message.value() + "]");
            log.info("partition [" + message.partition() + "] offset [" + message.offset() + "]");
        }
    }
} catch (Exception err) {
    // catch and handle exceptions
    log.error("Error: ", err);
} finally {
    // close consumer and commit offsets
    consumer.close();
    log.info("consumer is now closed");
}
Consumer will continuously poll its assigned topics for new messages and shut down safely if it experiences an exception.
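The try-catch-finally shape is what guarantees close() runs even when poll() throws. A tiny stand-alone illustration of that ordering (no Kafka involved; the event names are made up for the example):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone illustration: finally runs even when the loop body throws,
// which is why consumer.close() belongs in the finally block.
public class ShutdownSketch {
    static List<String> run() {
        List<String> events = new ArrayList<>();
        try {
            events.add("polling");
            throw new RuntimeException("broker unreachable"); // simulated failure
        } catch (RuntimeException e) {
            events.add("caught: " + e.getMessage());
        } finally {
            events.add("closed"); // consumer.close() would go here
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```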
The full code for the Consumer class should now be:
package org.example;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class Consumer {
    private static final Logger log = LoggerFactory.getLogger(Consumer.class);

    public static void main(String[] args) {
        log.info("This class consumes messages from Kafka");
        Properties properties = new Properties();
        properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "lotr_consumer_group");
        properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
        try {
            String topic = "lotr_characters";
            consumer.subscribe(Arrays.asList(topic));
            while (true) {
                ConsumerRecords<String, String> messages = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> message : messages) {
                    log.info("key [" + message.key() + "] value [" + message.value() + "]");
                    log.info("partition [" + message.partition() + "] offset [" + message.offset() + "]");
                }
            }
        } catch (Exception err) {
            log.error("Error: ", err);
        } finally {
            consumer.close();
            log.info("The consumer is now closed");
        }
    }
}
You now have a basic Java application that can send messages to and read messages from Kafka. If you got stuck at any point, the full code is available on GitHub.
Congratulations on making it this far. You've set up a Java project with the Kafka client, produced messages to a topic, and consumed them with a consumer group.
There's plenty more to learn about Kafka, whether that's Kafka Connect for connecting Kafka to common data systems or the Kafka Streams API for processing and transforming your data.
I hope this guide has been helpful and made you excited to learn more about Kafka, event streaming, and real-time data processing.
Original article source at https://www.freecodecamp.org
1673116920
🐫 + ☁️ = Apache Camel K
Apache Camel K is a lightweight integration framework built from Apache Camel that runs natively on Kubernetes and is specifically designed for serverless and microservice architectures. Users of Camel K can instantly run integration code written in Camel DSL on their preferred cloud provider.
Camel K allows you to run integrations directly on a Kubernetes or OpenShift cluster. To use it, you need to be connected to a cloud environment or to a local cluster created for development purposes.
You can use Camel DSL to define your Integration. Just save it in a file and use the kamel command-line interface (download the latest release) to run it. As an example, try running:
hello.groovy
from('timer:tick?period=3000')
.setBody().constant('Hello world from Camel K')
.to('log:info')
kamel run hello.groovy
You can even run your integrations in dev mode. Change the code and see the changes automatically applied (instantly) to the remote integration pod! We have provided more examples that you can use to inspire your next Integration development.
You can use any of the Apache Camel components available. The related dependencies will be resolved automatically.
Discover more about dependencies and components.
Camel K supports multiple languages for writing integrations.
See all the languages available.
The details of how the integration is mapped into Kubernetes resources can be customized using traits.
More information is provided in the official documentation traits section.
Since the inception of the project, our goal has been to bring Apache Camel to the cloud.
See the software architecture details.
We love contributions and we want to make Camel K great!
Contributing is easy, just take a look at our developer’s guide.
Author: Apache
Source Code: https://github.com/apache/camel-k
License: Apache-2.0 license
Learn how and why ShardingSphere can achieve database high availability using MySQL as an example.
Users have many options to customize and extend ShardingSphere's high availability (HA) solutions. Our team has completed two HA plans: a MySQL high availability solution based on MGR, and an openGauss database high availability solution contributed by some community committers. The principles of the two solutions are the same.
Here is how and why ShardingSphere achieves database high availability, using MySQL as the example:
(Zhao Jinchao, CC BY-SA 4.0)
ShardingSphere checks if the underlying MySQL cluster environment is ready by executing the following SQL statement. ShardingSphere cannot be started if any of the tests fail.
Check if MGR is installed:
SELECT * FROM information_schema.PLUGINS WHERE PLUGIN_NAME='group_replication'
View the MGR group member number. The underlying MGR cluster should consist of at least three nodes:
SELECT COUNT(*) FROM performance_schema.replication_group_members
Check whether the MGR cluster's group name is consistent with that in the configuration. The group name is the marker of an MGR group, and each group of an MGR cluster only has one group name:
SELECT * FROM performance_schema.global_variables WHERE VARIABLE_NAME='group_replication_group_name'
Check if the current MGR is set as the single primary mode. Currently, ShardingSphere does not support dual-write or multi-write scenarios. It only supports single-write mode:
SELECT * FROM performance_schema.global_variables WHERE VARIABLE_NAME='group_replication_single_primary_mode'
Query all the node hosts, ports, and states in the MGR group cluster to check if the configured data source is correct:
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE FROM performance_schema.replication_group_members
ShardingSphere finds the primary database URL using the query for the primary member that MySQL provides:
private String findPrimaryDataSourceURL(final Map<String, DataSource> dataSourceMap) {
    String result = "";
    String sql = "SELECT MEMBER_HOST, MEMBER_PORT FROM performance_schema.replication_group_members WHERE MEMBER_ID = "
            + "(SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'group_replication_primary_member')";
    for (DataSource each : dataSourceMap.values()) {
        try (Connection connection = each.getConnection();
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery(sql)) {
            if (resultSet.next()) {
                return String.format("%s:%s", resultSet.getString("MEMBER_HOST"), resultSet.getString("MEMBER_PORT"));
            }
        } catch (final SQLException ex) {
            log.error("An exception occurred while finding the primary data source URL", ex);
        }
    }
    return result;
}
Compare the primary database URL found above with the URLs configured in dataSources, one by one. The matched data source is the primary database. It will be updated to the current ShardingSphere memory and persisted to the registry center, through which it will be distributed to other compute nodes in the cluster.
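The matching step boils down to checking whether a configured JDBC URL contains the host:port string that MGR reported for the primary. A stand-alone sketch of that comparison (the class and data source names here are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of matching the MGR-reported primary (host:port)
// against configured JDBC URLs; names are invented for the example.
public class PrimaryMatchSketch {
    static String findPrimaryName(Map<String, String> dataSourceUrls, String primaryHostPort) {
        for (Map.Entry<String, String> entry : dataSourceUrls.entrySet()) {
            if (entry.getValue().contains(primaryHostPort)) {
                return entry.getKey(); // this configured data source is the primary
            }
        }
        return null; // no configured data source matches the reported primary
    }

    public static void main(String[] args) {
        Map<String, String> urls = new LinkedHashMap<>();
        urls.put("ds_0", "jdbc:mysql://127.0.0.1:1231/demo_primary_ds");
        urls.put("ds_1", "jdbc:mysql://127.0.0.1:1232/demo_primary_ds");
        System.out.println(findPrimaryName(urls, "127.0.0.1:1232"));
    }
}
```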
(Zhao Jinchao, CC BY-SA 4.0)
There are two secondary database states in ShardingSphere: enabled and disabled. The secondary database state is synchronized to the ShardingSphere memory to ensure that read traffic can be routed correctly.
Get all the nodes in the MGR group:
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE FROM performance_schema.replication_group_members
Disable secondary databases:
private void determineDisabledDataSource(final String schemaName, final Map<String, DataSource> activeDataSourceMap,
                                         final List<String> memberDataSourceURLs, final Map<String, String> dataSourceURLs) {
    for (Entry<String, DataSource> entry : activeDataSourceMap.entrySet()) {
        boolean disable = true;
        String url = null;
        try (Connection connection = entry.getValue().getConnection()) {
            url = connection.getMetaData().getURL();
            for (String each : memberDataSourceURLs) {
                if (null != url && url.contains(each)) {
                    disable = false;
                    break;
                }
            }
        } catch (final SQLException ex) {
            log.error("An exception occurred while finding data source URLs", ex);
        }
        if (disable) {
            ShardingSphereEventBus.getInstance().post(new DataSourceDisabledEvent(schemaName, entry.getKey(), true));
        } else if (null != url && !url.isEmpty()) {
            dataSourceURLs.put(entry.getKey(), url);
        }
    }
}
Whether a secondary database is disabled is determined from the configured data sources and the nodes in the MGR group. ShardingSphere checks one by one whether each configured data source can obtain a Connection properly, and verifies whether the data source URL contains a node of the MGR group. If a Connection cannot be obtained or the verification fails, ShardingSphere will disable the data source by an event trigger and synchronize the change to the registry center.
Enable secondary databases:
private void determineEnabledDataSource(final Map<String, DataSource> dataSourceMap, final String schemaName,
                                        final List<String> memberDataSourceURLs, final Map<String, String> dataSourceURLs) {
    for (String each : memberDataSourceURLs) {
        boolean enable = true;
        for (Entry<String, String> entry : dataSourceURLs.entrySet()) {
            if (entry.getValue().contains(each)) {
                enable = false;
                break;
            }
        }
        if (!enable) {
            continue;
        }
        for (Entry<String, DataSource> entry : dataSourceMap.entrySet()) {
            String url;
            try (Connection connection = entry.getValue().getConnection()) {
                url = connection.getMetaData().getURL();
                if (null != url && url.contains(each)) {
                    ShardingSphereEventBus.getInstance().post(new DataSourceDisabledEvent(schemaName, entry.getKey(), false));
                    break;
                }
            } catch (final SQLException ex) {
                log.error("An exception occurred while finding enabled data source URLs", ex);
            }
        }
    }
}
After the crashed secondary database is recovered and added to the MGR group, the configuration will be checked to see whether the recovered data source is used. If yes, the event trigger will tell ShardingSphere that the data source needs to be enabled.
The heartbeat mechanism is introduced to the HA module to ensure that the primary-secondary states are synchronized in real-time.
By integrating the ShardingSphere sub-project ElasticJob, the above processes are executed by the ElasticJob scheduler framework in the form of Job when the HA module is initialized, thus achieving the separation of function development and job scheduling.
Even if developers need to extend the HA function, they do not need to care about how jobs are developed and operated:
private void initHeartBeatJobs(final String schemaName, final Map<String, DataSource> dataSourceMap) {
    Optional<ModeScheduleContext> modeScheduleContext = ModeScheduleContextFactory.getInstance().get();
    if (modeScheduleContext.isPresent()) {
        for (Entry<String, DatabaseDiscoveryDataSourceRule> entry : dataSourceRules.entrySet()) {
            Map<String, DataSource> dataSources = dataSourceMap.entrySet().stream().filter(dataSource -> !entry.getValue().getDisabledDataSourceNames().contains(dataSource.getKey()))
                    .collect(Collectors.toMap(Entry::getKey, Entry::getValue));
            CronJob job = new CronJob(entry.getValue().getDatabaseDiscoveryType().getType() + "-" + entry.getValue().getGroupName(),
                    each -> new HeartbeatJob(schemaName, dataSources, entry.getValue().getGroupName(), entry.getValue().getDatabaseDiscoveryType(), entry.getValue().getDisabledDataSourceNames())
                            .execute(null), entry.getValue().getHeartbeatProps().getProperty("keep-alive-cron"));
            modeScheduleContext.get().startCronJob(job);
        }
    }
}
So far, Apache ShardingSphere's HA feature has proven applicable for MySQL and openGauss HA solutions. Moving forward, it will integrate more MySQL HA products and support more database HA solutions.
As always, if you're interested, you're more than welcome to join us and contribute to the Apache ShardingSphere project.
Original article source at: https://opensource.com/
This article explains how to install PHP 8.2 and Apache 2.4 on Windows 10 or 11 (64-bit).
Linux and macOS users often have Apache and PHP pre-installed or available via package managers. Windows requires a little more effort. The steps below may work with other editions of Windows, PHP, and Apache, but check the documentation of each dependency for specific instructions.
PHP remains the most widespread and popular server-side programming language on the Web. It’s installed by most web hosts, and has a simple learning curve, close ties with the MySQL database, superb documentation, and a wide collection of libraries to cut your development time. PHP may not be perfect, but you should consider it for your next web application. It’s the language of choice for Facebook, Slack, Wikipedia, MailChimp, Etsy, and WordPress (the content management system which powers almost 45% of the web).
Installing PHP on your development PC allows you to create and test websites and applications without affecting the data or systems on your live server.
Before you jump in, there may be simpler installation options…
All-in-one packages are available for Windows. They contain Apache, PHP, MySQL, and other useful dependencies in a single installation file. These packages include XAMPP, WampServer and Web.Developer.
These packages are easy to use, but they may not match your live server environment. Installing Apache and PHP manually will help you learn more about the system and configuration options.
Microsoft Hyper-V (provided in Windows Professional) and VirtualBox are free hypervisors which emulate a PC so you can install another operating system.
You can install any version of Linux, then follow its Apache and PHP installation instructions. Alternatively, distros such as Ubuntu Server provide them as standard (although they may not be the latest editions).
WSL2 is also a virtual machine, but it's tightly integrated into Windows, so activities such as file sharing and localhost resolution are seamless. You can install a variety of Linux distros, so refer to the appropriate Apache and PHP instructions.
Docker creates a wrapper (known as a container) around pre-configured application dependencies such as Apache, PHP, MySQL, MongoDB, and most other web software. Containers look like full Linux Virtual Machines but are considerably more lightweight.
Once you’ve installed Docker Desktop on Windows, it’s easy to download, configure, and run Apache and PHP.
Docker is currently considered the best option for setting up a PHP development environment. Check out SitePoint’s article Setting Up a Modern PHP Development Environment with Docker for a complete guide to setting it up.
The following sections describe how to install Apache and PHP directly on Windows.
PHP provides a built-in web server, which you can launch by navigating to a folder and running the PHP executable with an -S parameter to set the localhost port. For example:
cd myproject
php -S localhost:8000
You can then view PHP pages in a browser at http://localhost:8000.
This may be adequate for quick tests, but your live server will use Apache or similar web server software. Emulating that environment as closely as possible permits more advanced customization and should prevent development errors.
To install Apache, download the latest Win64 ZIP file from https://www.apachelounge.com/download/ and extract its Apache24 folder to the root of your C: drive. You'll also need to install the Visual C++ Redistributable for Visual Studio 2015–2020 (vc_redist_x64); the page has a link at the top.
Open a cmd command prompt (not PowerShell) and start Apache with:
cd C:\Apache24\bin
httpd
You may need to accept a firewall exception before the server starts to run. Open http://localhost in a browser and an “It works!” message should appear. Note:
- C:\Apache24\conf\httpd.conf is Apache's configuration file if you need to change server settings.
- C:\Apache24\htdocs is the web server's root content folder. It contains a single index.html file with the “It works!” message.
If Apache fails to start, another application could be hogging port 80. (Skype is the prime candidate, and the Windows app won't let you disable it!) If this occurs, edit C:\Apache24\conf\httpd.conf and change the line Listen 80 to Listen 8080 or any other free port. Restart Apache and, from that point onward, you can load web files at http://localhost:8080.
Stop the server by pressing Ctrl + C in the cmd terminal. The ReadMe file in the ZIP also provides instructions for installing Apache as a Windows service so it auto-starts on boot.
Install PHP by following the steps below. Note that there’s more than one way to configure Apache and PHP, but this is possibly the quickest method.
Get the latest PHP x64 Thread Safe ZIP package from https://windows.php.net/download/.
Create a new php folder in the root of your C:\ drive and extract the content of the ZIP into it.
You can install PHP anywhere on your system, but you'll need to change the paths referenced below if you use anything other than C:\php.
PHP's configuration file is php.ini. This doesn't exist initially, so copy C:\php\php.ini-development to C:\php\php.ini. This default configuration provides a development setup which reports all PHP errors and warnings.
You can edit php.ini in a text editor, and you may need to change lines such as those suggested below (use search to find the setting). In most cases, you'll need to remove a leading semicolon (;) to uncomment a value.
First, enable any required extensions according to the libraries you want to use. The following extensions should be suitable for most applications including WordPress:
extension=curl
extension=gd
extension=mbstring
extension=pdo_mysql
If you want to send emails using PHP's mail() function, enter the details of an SMTP server in the [mail function] section (your ISP's settings should be suitable):
[mail function]
; For Win32 only.
; http://php.net/smtp
SMTP = mail.myisp.com
; http://php.net/smtp-port
smtp_port = 25
; For Win32 only.
; http://php.net/sendmail-from
sendmail_from = my@emailaddress.com
To ensure Windows can find the PHP executable, you must add C:\php to the PATH environment variable. Click the Windows Start button and type “environment”, then click Edit the system environment variables. Select the Advanced tab, and click the Environment Variables button.
Scroll down the System variables list and click Path, followed by the Edit button. Click New and add C:\php.
Note that older editions of Windows provide a single text box with paths separated by semicolons (;).
Now OK your way out. You shouldn't need to reboot, but you may need to close and restart any cmd terminals you have open.
Ensure Apache is not running and open its C:\Apache24\conf\httpd.conf
configuration file in a text editor. Add the following lines to the bottom of the file to set PHP as an Apache module (change the file locations if necessary but use forward slashes rather than Windows backslashes):
# PHP8 module
PHPIniDir "C:/php"
LoadModule php_module "C:/php/php8apache2_4.dll"
AddType application/x-httpd-php .php
Optionally, change the DirectoryIndex setting to use index.php as the default in preference to index.html. The initial setting is:
<IfModule dir_module>
DirectoryIndex index.html
</IfModule>
Change it to:
<IfModule dir_module>
DirectoryIndex index.php index.html
</IfModule>
Save httpd.conf and test the updates from a cmd command line:
cd C:\Apache24\bin
httpd -t
Syntax OK will appear … unless you have errors in your configuration.
If all went well, start Apache with httpd.
Create a new file named index.php in Apache's web page root folder at C:\Apache24\htdocs. Add the following PHP code:
<?php
phpinfo();
?>
Open a web browser and enter your server address: http://localhost/. A PHP version page should appear, showing all PHP and Apache configuration settings.
You can now create PHP sites and applications in any subfolder of C:\Apache24\htdocs. If you need to work on more than one project, consider defining Apache Virtual Hosts so you can run separate codebases on different localhost subdomains or ports.
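If you do set up Virtual Hosts, a minimal httpd.conf entry might look like the following (the myproject name and path are assumptions for illustration; see the Apache Virtual Host documentation for the full options):

```apacheconf
# Hypothetical example: serve C:\Apache24\htdocs\myproject at http://myproject.localhost
<VirtualHost *:80>
    ServerName myproject.localhost
    DocumentRoot "C:/Apache24/htdocs/myproject"
</VirtualHost>
```

As with the PHP module configuration earlier, use forward slashes in paths rather than Windows backslashes.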
Best of luck!
Original article source at: https://www.sitepoint.com/
Follow this example of ShardingSphere's high availability and dynamic read/write splitting as the basis for your own configurations.
Modern business systems must be highly available, reliable, and stable in the digital age. As the cornerstone of the current business system, databases are supposed to embrace high availability.
High availability (HA) allows databases to switch services between primary and secondary database nodes. HA automatically selects a primary, picking the best node when the previous one crashes.
There are plenty of MySQL high availability options, each with its own pros and cons.
Apache ShardingSphere's architecture actually separates storage from computing. The storage node represents the underlying database, such as MySQL, PostgreSQL, openGauss, etc., while compute node refers to ShardingSphere-JDBC or ShardingSphere-Proxy.
Accordingly, the high availability solutions for storage nodes and compute nodes are different. Stateless compute nodes need to perceive the changes in storage nodes. They also need to set up separate load balancers and have the capabilities of service discovery and request distribution. Stateful storage nodes must provide data synchronization, connection testing, primary node election, and so on.
Although ShardingSphere doesn't itself provide a highly available database, it can help users integrate database HA solutions (such as primary-secondary switchover, fault discovery, and traffic-switching governance) through its capabilities of database discovery and dynamic perception.
When combined with the primary-secondary flow control feature in distributed scenarios, ShardingSphere can provide better high availability read/write splitting solutions. It will be easier to operate and manage ShardingSphere clusters using DistSQL's dynamic high availability adjustment rules to get primary/secondary nodes' information.
Apache ShardingSphere adopts a plugin-oriented architecture so that you can use all its enhanced capabilities independently or together. Its high availability function is often used with read/write splitting to distribute query requests to the secondary databases according to the load balancing algorithm to ensure system HA, relieve primary database pressure, and improve business system throughput.
Note that ShardingSphere's HA implementation relies on its distributed governance capability, so it can only be used in cluster mode for the time being. Meanwhile, read/write splitting rules were revised in ShardingSphere 5.1.0; please refer to the official documentation on read/write splitting for details.
Consider the following HA+read/write splitting configuration with ShardingSphere DistSQL RAL statements as an example. The example begins with the configuration, requirements, and initial SQL.
schemaName: database_discovery_db

dataSources:
  ds_0:
    url: jdbc:mysql://127.0.0.1:1231/demo_primary_ds?serverTimezone=UTC&useSSL=false
    username: root
    password: 123456
    connectionTimeoutMilliseconds: 3000
    idleTimeoutMilliseconds: 60000
    maxLifetimeMilliseconds: 1800000
    maxPoolSize: 50
    minPoolSize: 1
  ds_1:
    url: jdbc:mysql://127.0.0.1:1232/demo_primary_ds?serverTimezone=UTC&useSSL=false
    username: root
    password: 123456
    connectionTimeoutMilliseconds: 3000
    idleTimeoutMilliseconds: 60000
    maxLifetimeMilliseconds: 1800000
    maxPoolSize: 50
    minPoolSize: 1
  ds_2:
    url: jdbc:mysql://127.0.0.1:1233/demo_primary_ds?serverTimezone=UTC&useSSL=false
    username: root
    password: 123456
    connectionTimeoutMilliseconds: 3000
    idleTimeoutMilliseconds: 50000
    maxLifetimeMilliseconds: 1300000
    maxPoolSize: 50
    minPoolSize: 1

rules:
- !READWRITE_SPLITTING
  dataSources:
    replication_ds:
      type: Dynamic
      props:
        auto-aware-data-source-name: mgr_replication_ds
- !DB_DISCOVERY
  dataSources:
    mgr_replication_ds:
      dataSourceNames:
        - ds_0
        - ds_1
        - ds_2
      discoveryHeartbeatName: mgr-heartbeat
      discoveryTypeName: mgr
  discoveryHeartbeats:
    mgr-heartbeat:
      props:
        keep-alive-cron: '0/5 * * * * ?'
  discoveryTypes:
    mgr:
      type: MGR
      props:
        group-name: b13df29e-90b6-11e8-8d1b-525400fc3996
CREATE TABLE `t_user` (
`id` INT(8) NOT NULL,
`mobile` CHAR(20) NOT NULL,
`idcard` VARCHAR(18) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
First, view the primary-secondary relationship:
mysql> SHOW READWRITE_SPLITTING RULES;
+----------------+-----------------------------+------------------------+------------------------+--------------------+---------------------+
| name | auto_aware_data_source_name | write_data_source_name | read_data_source_names | load_balancer_type | load_balancer_props |
+----------------+-----------------------------+------------------------+------------------------+--------------------+---------------------+
| replication_ds | mgr_replication_ds | ds_0 | ds_1,ds_2 | NULL | |
+----------------+-----------------------------+------------------------+------------------------+--------------------+---------------------+
1 row in set (0.09 sec)
You can also look at the secondary database state:
mysql> SHOW READWRITE_SPLITTING READ RESOURCES;
+----------+---------+
| resource | status  |
+----------+---------+
| ds_1 | enabled |
| ds_2 | enabled |
+----------+---------+
The results above show that the primary database is currently ds_0, while secondary databases are ds_1 and ds_2.
Next, test INSERT:
mysql> INSERT INTO t_user(id, mobile, idcard) VALUE (10000, '13718687777', '141121xxxxx');
Query OK, 1 row affected (0.10 sec)
View the ShardingSphere-Proxy log and see if the route node is the primary database ds_0.
[INFO ] 2022-02-28 15:28:21.495 [ShardingSphere-Command-2] ShardingSphere-SQL - Logic SQL: INSERT INTO t_user(id, mobile, idcard) value (10000, '13718687777', '141121xxxxx')
[INFO ] 2022-02-28 15:28:21.495 [ShardingSphere-Command-2] ShardingSphere-SQL - SQLStatement: MySQLInsertStatement(setAssignment=Optional.empty, onDuplicateKeyColumns=Optional.empty)
[INFO ] 2022-02-28 15:28:21.495 [ShardingSphere-Command-2] ShardingSphere-SQL - Actual SQL: ds_0 ::: INSERT INTO t_user(id, mobile, idcard) value (10000, '13718687777', '141121xxxxx')
Now test SELECT (repeat it twice):
mysql> SELECT id, mobile, idcard FROM t_user WHERE id = 10000;
View the ShardingSphere-Proxy log and see if the route node is ds_1 or ds_2.
[INFO ] 2022-02-28 15:34:07.912 [ShardingSphere-Command-4] ShardingSphere-SQL - Logic SQL: SELECT id, mobile, idcard FROM t_user WHERE id = 10000
[INFO ] 2022-02-28 15:34:07.913 [ShardingSphere-Command-4] ShardingSphere-SQL - SQLStatement: MySQLSelectStatement(table=Optional.empty, limit=Optional.empty, lock=Optional.empty, window=Optional.empty)
[INFO ] 2022-02-28 15:34:07.913 [ShardingSphere-Command-4] ShardingSphere-SQL - Actual SQL: ds_1 ::: SELECT id, mobile, idcard FROM t_user WHERE id = 10000
[INFO ] 2022-02-28 15:34:21.501 [ShardingSphere-Command-4] ShardingSphere-SQL - Logic SQL: SELECT id, mobile, idcard FROM t_user WHERE id = 10000
[INFO ] 2022-02-28 15:34:21.502 [ShardingSphere-Command-4] ShardingSphere-SQL - SQLStatement: MySQLSelectStatement(table=Optional.empty, limit=Optional.empty, lock=Optional.empty, window=Optional.empty)
[INFO ] 2022-02-28 15:34:21.502 [ShardingSphere-Command-4] ShardingSphere-SQL - Actual SQL: ds_2 ::: SELECT id, mobile, idcard FROM t_user WHERE id = 10000
Close the primary database ds_0:
(Zhao Jinchao, CC BY-SA 4.0)
Use DistSQL to check whether the primary database has changed and whether the secondary database state is correct.
Now, INSERT another line of data:
mysql> INSERT INTO t_user(id, mobile, idcard) VALUE (10001, '13521207777', '110xxxxx');
Query OK, 1 row affected (0.04 sec)
View the ShardingSphere-Proxy log and see if the route node is the primary database ds_1:
[INFO ] 2022-02-28 15:40:26.784 [ShardingSphere-Command-6] ShardingSphere-SQL - Logic SQL: INSERT INTO t_user(id, mobile, idcard) value (10001, '13521207777', '110xxxxx')
[INFO ] 2022-02-28 15:40:26.784 [ShardingSphere-Command-6] ShardingSphere-SQL - SQLStatement: MySQLInsertStatement(setAssignment=Optional.empty, onDuplicateKeyColumns=Optional.empty)
[INFO ] 2022-02-28 15:40:26.784 [ShardingSphere-Command-6] ShardingSphere-SQL - Actual SQL: ds_1 ::: INSERT INTO t_user(id, mobile, idcard) value (10001, '13521207777', '110xxxxx')
Finally, test SELECT (repeat it twice):
mysql> SELECT id, mobile, idcard FROM t_user WHERE id = 10001;
View the ShardingSphere-Proxy log and see if the route node is ds_2:
[INFO ] 2022-02-28 15:42:00.651 [ShardingSphere-Command-7] ShardingSphere-SQL - Logic SQL: SELECT id, mobile, idcard FROM t_user WHERE id = 10001
[INFO ] 2022-02-28 15:42:00.651 [ShardingSphere-Command-7] ShardingSphere-SQL - SQLStatement: MySQLSelectStatement(table=Optional.empty, limit=Optional.empty, lock=Optional.empty, window=Optional.empty)
[INFO ] 2022-02-28 15:42:00.651 [ShardingSphere-Command-7] ShardingSphere-SQL - Actual SQL: ds_2 ::: SELECT id, mobile, idcard FROM t_user WHERE id = 10001
[INFO ] 2022-02-28 15:42:02.148 [ShardingSphere-Command-7] ShardingSphere-SQL - Logic SQL: SELECT id, mobile, idcard FROM t_user WHERE id = 10001
[INFO ] 2022-02-28 15:42:02.149 [ShardingSphere-Command-7] ShardingSphere-SQL - SQLStatement: MySQLSelectStatement(table=Optional.empty, limit=Optional.empty, lock=Optional.empty, window=Optional.empty)
[INFO ] 2022-02-28 15:42:02.149 [ShardingSphere-Command-7] ShardingSphere-SQL - Actual SQL: ds_2 ::: SELECT id, mobile, idcard FROM t_user WHERE id = 10001
(Zhao Jinchao, CC BY-SA 4.0)
After ds_0 comes back, view the latest primary-secondary relationship changes through DistSQL. The state of the ds_0 node has recovered to enabled, and ds_0 is now included in read_data_source_names:
mysql> SHOW READWRITE_SPLITTING RULES;
+----------------+-----------------------------+------------------------+------------------------+--------------------+---------------------+
| name | auto_aware_data_source_name | write_data_source_name | read_data_source_names | load_balancer_type | load_balancer_props |
+----------------+-----------------------------+------------------------+------------------------+--------------------+---------------------+
| replication_ds | mgr_replication_ds | ds_1 | ds_0,ds_2 | NULL | |
+----------------+-----------------------------+------------------------+------------------------+--------------------+---------------------+
1 row in set (0.01 sec)
mysql> SHOW READWRITE_SPLITTING READ RESOURCES;
+----------+---------+
| resource | status  |
+----------+---------+
| ds_0 | enabled |
| ds_2 | enabled |
+----------+---------+
2 rows in set (0.00 sec)
Database high availability is critical in today's business environments, and Apache ShardingSphere can help provide the necessary reliability. Based on the above example, you now know more about ShardingSphere's high availability and dynamic read/write splitting. Use this example as the basis for your own configurations.
Original article source at: https://opensource.com/
1672462440
Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information.
If you think you have found a bug, please file an issue in the Maven Issue Tracker.
More information can be found on Apache Maven Homepage. Questions related to the usage of Maven should be posted on the Maven User List.
You can download the release source from our download page.
If you are interested in the development of Maven, please consult the documentation first and afterward you are welcome to join the developers mailing list to ask questions or discuss new ideas/features/bugs etc.
Take a look into the contribution guidelines.
This code is under the Apache License, Version 2.0, January 2004.
See the NOTICE file for required notices and attributions.
Do you like Apache Maven? Then donate back to the ASF to support the development.
If you want to bootstrap Maven, you'll need:
mvn -DdistributionTargetDir="$HOME/app/maven/apache-maven-4.0.x-SNAPSHOT" clean package
Author: apache
Source code: https://github.com/apache/maven
License: Apache-2.0 license
1671203100
Apache Camel is a rule-based routing and mediation engine that provides a Java object-based implementation of the Enterprise Integration Patterns, using an API (or declarative Java domain-specific language) to configure routing and mediation rules. We can implement exception handling in two ways: using a doTry block or an onException block. A retry policy defines rules for when the Camel error handler performs retry attempts; for example, you can set up rules that state how many times to retry, the delay between attempts, and so forth.
The project structure will be as follows-
The pom.xml will be as follows-
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.spring</groupId>
<artifactId>camel-spring-integration</artifactId>
<version>0.0.1-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.camel</groupId>
<artifactId>camel-core</artifactId>
<version>2.13.0</version>
</dependency>
<dependency>
<groupId>org.apache.camel</groupId>
<artifactId>camel-spring</artifactId>
<version>2.13.0</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.5</version>
</dependency>
</dependencies>
</project>
Add the MainApplication class:
package com.spring.main;
import org.springframework.context.support.AbstractApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;
public class MainApplication {
/*
 * Main application. It invokes the routeBuilder bean from the applicationContext.xml file via
 * ClassPathXmlApplicationContext.
 */
public static void main(String[] args) {
AbstractApplicationContext ctx = new ClassPathXmlApplicationContext("applicationContext.xml");
ctx.start();
System.out.println("Application started...");
try {
System.out.println("inside try block");
System.out.println("--------------------- inputMessageBody ------------------- ");
Thread.sleep(5 * 60 * 1000);
}
catch (InterruptedException e) {
e.printStackTrace();
}
ctx.stop();
ctx.close();
}
}
Create the CamelCustomException class to implement a custom exception:
package com.spring.exception;
/*
* Created the custom exception...
*/
public class CamelCustomException extends Exception {
private static final long serialVersionUID = 2L;
}
Create applicationContext.xml:
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:camel="http://camel.apache.org/schema/spring"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd
http://camel.apache.org/schema/spring
http://camel.apache.org/schema/spring/camel-spring.xsd">
<bean id="routeBuilder" class="com.spring.route.SimpleRouteBuilder" />
<camelContext xmlns="http://camel.apache.org/schema/spring">
<routeBuilder ref="routeBuilder" />
</camelContext>
</beans>
We modify MyProcessor as follows: if the input contains the text "test", a custom exception is thrown.
package com.spring.processor;
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import com.spring.exception.CamelCustomException;
public class MyProcessor implements Processor {
/*
* (non-Javadoc)
* @see org.apache.camel.Processor#process(org.apache.camel.Exchange)
* It is Camel processor class which implements Processor.
*/
public void process(Exchange exchange) throws Exception {
String inputMessageBody = exchange.getIn().getBody(String.class);
System.out.println("\n" + inputMessageBody);
if (inputMessageBody.contains("test"))
throw new CamelCustomException();
}
}
Our SimpleRouteBuilder class is as follows:
package com.spring.route;
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.builder.RouteBuilder;
import com.spring.exception.CamelCustomException;
import com.spring.processor.MyProcessor;
import com.spring.processor.RetryProcessor;
public class SimpleRouteBuilder extends RouteBuilder {
/*
* (non-Javadoc)
* @see org.apache.camel.builder.RouteBuilder#configure()
* It is a SimpleRouteBuilder class which facilitates Routes
*/
@Override
public void configure() throws Exception {
onException(CamelCustomException.class).process(new Processor() {
public void process(Exchange exchange) throws Exception {
System.out.println("Exception is handling by onException{} block");
}
})
.log("Received body ${body}").handled(true);
from("file:/home/knoldus/Downloads/Softwares/Workspace/input?noop=true")
.process(new MyProcessor())
.to("file:/home/knoldus/Downloads/Softwares/Workspace/output");
}
}
After running MainApplication, the output will be as follows-
After the exception is thrown, it is caught by the onException{} block. We will now define the redelivery policy, so that each message is tried 3 times before the exception is handled.
Modify the applicationContext.xml
1.) maximumRedeliveries is the maximum number of times a message can be redelivered.
2.) redeliveryDelay is the delay (in ms) between retry attempts.
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:camel="http://camel.apache.org/schema/spring"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd
http://camel.apache.org/schema/spring
http://camel.apache.org/schema/spring/camel-spring.xsd">
<bean id="routeBuilder" class="com.spring.route.SimpleRouteBuilder" />
<camelContext xmlns="http://camel.apache.org/schema/spring">
<routeBuilder ref="routeBuilder" />
<redeliveryPolicyProfile id="localRedeliveryPolicyProfile"
retryAttemptedLogLevel="WARN" maximumRedeliveries="3"
redeliveryDelay="1" />
</camelContext>
</beans>
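The semantics of maximumRedeliveries and redeliveryDelay configured above can be sketched as a plain-Java simulation. This is a simplified model of the behavior, not Camel's actual error handler:

```java
// Simplified simulation of Camel's redelivery policy: the original attempt
// plus up to maximumRedeliveries retries, sleeping redeliveryDelay ms between
// attempts; if every attempt fails, control passes to the onException handler.
public class RedeliverySimulation {
    interface Processor { void process(String body) throws Exception; }

    public static int deliver(String body, Processor processor,
                              int maximumRedeliveries, long redeliveryDelay,
                              Processor onException) throws Exception {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                processor.process(body);
                return attempts;               // delivered successfully
            } catch (Exception e) {
                if (attempts > maximumRedeliveries) {
                    onException.process(body); // handled(true): exception is swallowed
                    return attempts;
                }
                Thread.sleep(redeliveryDelay); // wait before the next redelivery
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // A processor that always fails, as MyProcessor does for "test" bodies.
        int attempts = deliver("test message",
                body -> { throw new Exception("boom"); },
                3, 1,
                body -> System.out.println("Exception is handling by onException{} block"));
        System.out.println("attempts = " + attempts); // 1 original + 3 redeliveries = 4
    }
}
```

With maximumRedeliveries="3", a message that keeps failing is attempted four times in total (the original delivery plus three redeliveries) before the onException handler takes over, which matches the behavior shown in the route below.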
Configure the redelivery policy in the route of the SimpleRouteBuilder class, so that it can process the redelivery logic:
package com.spring.route;
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.builder.RouteBuilder;
import com.spring.exception.CamelCustomException;
import com.spring.processor.MyProcessor;
import com.spring.processor.RetryProcessor;
public class SimpleRouteBuilder extends RouteBuilder {
/*
* (non-Javadoc)
* @see org.apache.camel.builder.RouteBuilder#configure()
* It is a SimpleRouteBuilder class which facilitates Routes
*/
@Override
public void configure() throws Exception {
onException(CamelCustomException.class).process(new Processor() {
public void process(Exchange exchange) throws Exception {
System.out.println("Exception is handling by onException{} block");
}
})
.redeliveryPolicyRef("localRedeliveryPolicyProfile")
.log("Received body ${body}").handled(true);
from("file:/home/knoldus/Downloads/Softwares/Workspace/input?noop=true")
.process(new MyProcessor())
.to("file:/home/knoldus/Downloads/Softwares/Workspace/output");
}
}
After running MainApplication, the output will be as follows-
Create the RetryProcessor class. It is responsible for processing the logic during retry attempts.
package com.spring.processor;
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
/**
* @author knoldus
* If the body contains text "test" then it will set new body
*/
public class RetryProcessor implements Processor {
public void process(Exchange exchange) throws Exception {
exchange.getIn().setBody("replaced new body...");
}
}
Configure onRedelivery before the redelivery policy configuration in the route of the SimpleRouteBuilder class:
package com.spring.route;
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.builder.RouteBuilder;
import com.spring.exception.CamelCustomException;
import com.spring.processor.MyProcessor;
import com.spring.processor.RetryProcessor;
public class SimpleRouteBuilder extends RouteBuilder {
/*
* (non-Javadoc)
* @see org.apache.camel.builder.RouteBuilder#configure()
* It is a SimpleRouteBuilder class which facilitates Routes
*/
@Override
public void configure() throws Exception {
onException(CamelCustomException.class).process(new Processor() {
public void process(Exchange exchange) throws Exception {
System.out.println("Exception is handling by onException{} block");
}
})
.onRedelivery(new RetryProcessor())
.redeliveryPolicyRef("localRedeliveryPolicyProfile")
.log("Received body ${body}").handled(true);
from("file:/home/knoldus/Downloads/Softwares/Workspace/input?noop=true")
.process(new MyProcessor())
.to("file:/home/knoldus/Downloads/Softwares/Workspace/output");
}
}
In the retry processor we changed the exchange body, so the exception is not thrown again on redelivery.
In this blog, we covered how to implement and configure exception handling with a retry policy in Apache Camel using Spring. You are now ready to implement it yourself. For more, you can refer to the documentation: https://people.apache.org/~dkulp/camel/redeliverypolicy.html
Original article source at: https://blog.knoldus.com/
1670593211
Apache Pulsar is a multi-tenant, high-performance server-to-server messaging system. It was developed at Yahoo, first open-sourced in late 2016, and subsequently incubated under the Apache Software Foundation (ASF), graduating to a top-level project in 2018. Pulsar works on the pub-sub pattern, with producers and consumers (also called subscribers); the topic is the core of the pub-sub model. Producers publish their messages on a given Pulsar topic, and consumers subscribe to a topic to receive messages from it and send an acknowledgement.
Pulsar retains all messages in a subscription until they are acknowledged; a message is deleted only after a consumer acknowledges that it has been processed. Apache Pulsar topics are well-defined, named channels for transmitting messages from producers to consumers, and topic names are well-formed URLs.
Namespaces: a namespace is a logical grouping within a tenant. A tenant can create multiple namespaces via the admin API, and a namespace allows an application to create and manage a hierarchy of topics. Any number of topics can be created under a namespace.
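Pulsar topic names follow a fixed URL structure that encodes the tenant and namespace. The small sketch below composes and parses that structure; it is an illustrative helper, not the Pulsar client API:

```java
// Pulsar topic names have the form:
//   {persistent|non-persistent}://tenant/namespace/topic
public class TopicName {
    public static String build(boolean persistent, String tenant,
                               String namespace, String topic) {
        return (persistent ? "persistent" : "non-persistent")
                + "://" + tenant + "/" + namespace + "/" + topic;
    }

    /** Extracts the "tenant/namespace" part of a full topic name. */
    public static String namespaceOf(String fullTopicName) {
        String rest = fullTopicName.substring(fullTopicName.indexOf("://") + 3);
        String[] parts = rest.split("/");
        return parts[0] + "/" + parts[1];
    }

    public static void main(String[] args) {
        String t = build(true, "my-tenant", "my-namespace", "orders");
        System.out.println(t);              // persistent://my-tenant/my-namespace/orders
        System.out.println(namespaceOf(t)); // my-tenant/my-namespace
    }
}
```

The tenant and namespace segments are exactly the hierarchy described above: a tenant contains namespaces, and a namespace contains topics.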
A subscription is a named configuration rule that determines how messages are delivered to consumers. There are three subscription modes in Apache Pulsar:
In exclusive mode, only a single consumer is allowed to attach to the subscription. If more than one consumer attempts to subscribe to a topic using the same subscription, the consumer receives an error. Exclusive is the default subscription mode.
In failover mode, multiple consumers attach to the same topic. The consumers are sorted lexically by name, and the first consumer is the master consumer, which receives all the messages. When the master consumer disconnects, the next consumer in order receives the messages.
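The failover selection rule described above (consumers sorted lexically by name, with the first acting as master) can be sketched as follows. This is a simulation of the rule, not Pulsar's implementation:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simulates a failover subscription: consumers are sorted lexically by name,
// the first one is the master and receives all messages; when it disconnects,
// the next consumer in order takes over.
public class FailoverSubscription {
    private final List<String> consumers = new ArrayList<>();

    public void connect(String consumerName) {
        consumers.add(consumerName);
        Collections.sort(consumers); // keep lexical order
    }

    public void disconnect(String consumerName) {
        consumers.remove(consumerName);
    }

    /** The consumer that currently receives messages, or null if none. */
    public String master() {
        return consumers.isEmpty() ? null : consumers.get(0);
    }

    public static void main(String[] args) {
        FailoverSubscription sub = new FailoverSubscription();
        sub.connect("consumer-b");
        sub.connect("consumer-a");
        System.out.println(sub.master()); // consumer-a (lexically first)
        sub.disconnect("consumer-a");
        System.out.println(sub.master()); // consumer-b takes over
    }
}
```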
In shared (round-robin) mode, multiple consumers can attach to the same subscription, and messages are delivered across the consumers in a round-robin manner. When a consumer disconnects, the messages that were sent to it but not acknowledged are re-scheduled to the other consumers. Limitations of shared mode: message ordering is not guaranteed, and cumulative acknowledgement cannot be used.
The routing mode determines which partition of a topic a message will be published to. There are three types of routing modes, and routing is necessary when publishing to partitioned topics.
If no key is provided to the producer, it publishes messages across all available partitions in a round-robin way to achieve maximum throughput. Round-robin is not applied per individual message but per batch (messages within the same batching-delay boundary go to the same partition), which ensures effective batching. If a key is specified on the message, the partitioned producer hashes the key and assigns the message to the corresponding partition. This is the default mode.
If no key is provided, the producer randomly picks a single partition and publishes all messages to that partition. If a key is specified for the message, the partitioned producer hashes the key and assigns the message to the corresponding partition.
The user can create a custom routing mode by using the Java client and implementing the MessageRouter interface; the custom router is called to choose a partition for each message.
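The two built-in routing behaviors described above, key hashing when a key is present and rotation otherwise, can be sketched like this. It is a simplified model; Pulsar's real hashing and batch-level round-robin differ:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of partition routing: messages with a key always hash to
// the same partition; keyless messages rotate round-robin across partitions.
public class PartitionRouter {
    private final int numPartitions;
    private final AtomicInteger roundRobin = new AtomicInteger();

    public PartitionRouter(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    public int choosePartition(String key) {
        if (key == null || key.isEmpty()) {
            // No key: rotate across partitions for throughput.
            return Math.floorMod(roundRobin.getAndIncrement(), numPartitions);
        }
        // Key present: hash so the same key always maps to the same partition.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        PartitionRouter router = new PartitionRouter(3);
        // Same key -> same partition, which preserves per-key ordering.
        System.out.println(router.choosePartition("user-42")
                == router.choosePartition("user-42")); // true
        // Keyless messages rotate: 0, 1, 2, 0, ...
        System.out.println(router.choosePartition(null)); // 0
        System.out.println(router.choosePartition(null)); // 1
    }
}
```

The key-hash branch is what gives per-key ordering; the round-robin branch trades ordering for even load across partitions.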
A Pulsar cluster consists of several parts. One or more brokers handle and load-balance incoming messages from producers, dispatch messages to consumers, and communicate with the Pulsar configuration store to handle various coordination tasks. Messages are stored in BookKeeper instances.
The broker is a stateless component that runs an HTTP server and a dispatcher. The HTTP server exposes a REST API for administrative tasks and for topic lookup by producers and consumers. The dispatcher is an async TCP server over a custom binary protocol used for all data transfers.
A Pulsar instance usually consists of one or more Pulsar clusters. A cluster consists of one or more brokers, a ZooKeeper quorum used for cluster-level configuration and coordination, and an ensemble of bookies used for persistent storage of messages.
Pulsar uses Apache ZooKeeper for metadata storage, cluster configuration, and coordination.
Pulsar provides surety of message delivery. If a message reaches a Pulsar broker successfully, it will be delivered to the target that’s intended for it.
Pulsar has client APIs for Java, Go, Python, and C++. The client API encapsulates and optimizes Pulsar's client-broker communication protocol and exposes a simple, intuitive API to applications. The current official Pulsar client libraries support transparent reconnection and connection failover to brokers, queue messages until they are acknowledged by the broker, and include heuristics such as connection retries with backoff.
When an application wants to create a producer or consumer, the Pulsar client library initiates a setup phase composed of two steps:
Apache Pulsar's geo-replication enables messages to be produced in one geo-location and consumed in another. For example, whenever producers P1, P2, and P3 publish messages to topic T1 on clusters A, B, and C respectively, those messages are instantly replicated across the clusters. Once replicated, consumers C1 and C2 can consume the messages from their respective clusters. Without geo-replication, C1 and C2 would not be able to consume messages published by producer P3.
Pulsar was created from the ground up as a multi-tenant system, and Apache Pulsar supports multi-tenancy. Tenants are spread across a cluster, and each can have its own authentication and authorization scheme applied to it. Tenants are also the administrative unit at which storage quotas, message TTL, and isolation policies are managed.
To each tenant in a particular Pulsar instance, you can assign:
Pulsar supports an authentication mechanism that can be configured at the broker, and it also supports authorization to identify a client and its access rights on topics and tenants.
Pulsar's architecture allows topic backlogs to grow very large, which can become expensive over time. One way to alleviate this cost is tiered storage: older messages in the backlog can be moved from BookKeeper to cheaper storage, while clients can still access the older backlog.
Type safety is paramount in communication between producers and consumers in Pulsar. For type safety in messaging, Pulsar adopted two basic approaches:
In this approach message producers and consumers are responsible for not only serializing and deserializing messages (which consist of raw bytes) but also “knowing” which types are being transmitted via which topics.
In this approach, producers and consumers inform the system which data types can be transmitted via the topic. With this approach, the messaging system enforces type safety and ensures that both producers and consumers remain in sync.
Pulsar schemas are applied and enforced at the topic level; producers and consumers upload schemas to Pulsar. A Pulsar schema consists of:
It supports the following schema formats:
If no schema is defined, producers and consumers handle raw bytes.
The pros and cons of Apache Pulsar are described below:
S.No. | Kafka | Apache Pulsar |
1 | It is more mature and has higher-level APIs. | It incorporates improved design elements from Kafka alongside its existing capabilities. |
2 | Built on top of Kafka Streams. | Unified messaging model and API. |
3 | Producer-topic-consumer group-consumer. | Producer-topic-subscription-consumer. |
4 | Restricts fluidity and flexibility. | Provides fluidity and flexibility. |
5 | Messages are deleted based on retention; if a consumer doesn't read messages before the retention period expires, it loses data. | Messages are deleted only after all subscriptions have consumed them, so there is no data loss even if the consumers of a subscription are down for a long time. Messages can also be kept for a configured retention period after all subscriptions have consumed them. |
Drawbacks of Kafka
Even though it looks like Kafka lags behind Pulsar, KIPs (Kafka Improvement Proposals) cover almost all of these drawbacks in their discussions, and users can expect to see the changes in upcoming versions of Kafka.
Kafka to Pulsar: users can easily migrate from Kafka to Pulsar, as Pulsar natively supports working directly with Kafka data through the provided connectors, or Kafka application data can be imported into Pulsar quite easily.
Pulsar SQL uses Presto to query over the old messages that are kept in backlog (Apache BookKeeper).
Apache Pulsar is a powerful stream-processing platform that has been able to learn from previously existing systems. It has a layered architecture, complemented by a number of great out-of-the-box features such as multi-tenancy, zero-rebalancing downtime, geo-replication, proxies, durability, and TLS-based authentication/authorization. Compared to other platforms, Pulsar gives you the ultimate tools with more capabilities.
Original article source at: https://www.xenonstack.com/
1670579700
Apache Flink is a stream processing framework developed by the Apache Software Foundation. It is an open-source streaming dataflow engine that provides communication and data distribution for distributed computations over data streams. Flink is a distributed data processing platform used in big-data applications, primarily for the analysis of data stored in Hadoop clusters. It is capable of handling both batch and stream processing jobs and is an alternative to MapReduce. Some of its best features are as follows -
Unified framework - Flink is a unified framework that allows building a single data workflow covering streaming, batch, and SQL. Flink can also process graphs with its own Gelly library and use machine learning algorithms from its FlinkML library. Apart from this, Flink supports iterative algorithms and interactive queries.
Custom memory manager - Flink implements its own memory management inside the JVM; its features are as follows -
Native closed-loop iteration operators - Flink has dedicated support for iterative computations. It iterates on data using its streaming architecture, and the concept of an iterative algorithm is tightly bound into the Flink query optimizer.
It is one of the best options for developing and running several types of applications because of its extensive features. Some Flink use cases are as follows -
Event-driven applications - an event-driven application is a type of stateful application that ingests events from one or more event streams and reacts to the incoming events. Event-driven applications build on stateful stream processing. Some event-driven applications are as follows -
Data analytics applications - these applications extract information from raw data. With a proper stream processing engine, analytics can also be done in real time.
Some of the data analytics Applications are as follows -
Some of the data pipeline applications are as follows -
There are several challenges faced by IoT industries when it comes to data processing; some of them are as follows -
Several solutions behind streaming processing using Apache Flink in IoT are as follows -
NATS is an open-source messaging system consisting of a server, a client, and a connector framework (a Java-based framework used for connecting it with other services). Its server is written in the Go programming language. It provides highly performant and flexible messaging capabilities; the essential design principles that make it easy to use are performance and scalability. Features of NATS - it provides some of the best and most distinctive features; some of them are as follows -
Several solutions behind stream processing using NATS.io in IoT are as follows -
It is one of the most straightforward and powerful messaging systems and offers multiple qualities of service. Some of the best use cases of NATS are as follows -
The proper management of data streams has helped enterprises meet the demands of a real-time world. To adopt streaming analytics as your analytics approach, we advise taking the following steps.
Original article source at: https://www.xenonstack.com/