1598289240
In this video, You will learn how to create Kafka Producer and Consumer with Spring Boot in Java.
You can start the Zookeeper and Kafka servers by using the below commands.
Start Zookeeper:
$ zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
Start Kafka server:
$ kafka-server-start /usr/local/etc/kafka/server.properties
GitHub Link: https://github.com/shameed1910/springboot-kafka
Subscribe : https://www.youtube.com/channel/UC3hc8Ql0_OzMY7XI1gH9g8g
#spring #spring-boot #kafka
1654075127
Amazon Aurora is a relational database management system (RDBMS) developed by AWS(Amazon Web Services). Aurora gives you the performance and availability of commercial-grade databases with full MySQL and PostgreSQL compatibility. In terms of high performance, Aurora MySQL and Aurora PostgreSQL have shown an increase in throughput of up to 5X over stock MySQL and 3X over stock PostgreSQL respectively on similar hardware. In terms of scalability, Aurora achieves enhancements and innovations in storage and computing, horizontal and vertical functions.
Aurora supports up to 128TB of storage capacity and supports dynamic scaling of storage layer in units of 10GB. In terms of computing, Aurora supports scalable configurations for multiple read replicas. Each region can have an additional 15 Aurora replicas. In addition, Aurora provides multi-primary architecture to support four read/write nodes. Its Serverless architecture allows vertical scaling and reduces typical latency to under a second, while the Global Database enables a single database cluster to span multiple AWS Regions in low latency.
Aurora already provides great scalability with the growth of user data volume. Can it handle more data and support more concurrent access? You may consider using sharding to support the configuration of multiple underlying Aurora clusters. To this end, a series of blogs, including this one, provides you with a reference in choosing between Proxy and JDBC for sharding.
AWS Aurora offers a single relational database. Primary-secondary, multi-primary, and global database, and other forms of hosting architecture can satisfy various architectural scenarios above. However, Aurora doesn’t provide direct support for sharding scenarios, and sharding has a variety of forms, such as vertical and horizontal forms. If we want to further increase data capacity, some problems have to be solved, such as cross-node database Join
, associated query, distributed transactions, SQL sorting, page turning, function calculation, database global primary key, capacity planning, and secondary capacity expansion after sharding.
It is generally accepted that when the capacity of a MySQL table is less than 10 million, the time spent on queries is optimal because at this time the height of its BTREE
index is between 3 and 5. Data sharding can reduce the amount of data in a single table and distribute the read and write loads to different data nodes at the same time. Data sharding can be divided into vertical sharding and horizontal sharding.
1. Advantages of vertical sharding
2. Disadvantages of vertical sharding
Join
can only be implemented by interface aggregation, which will increase the complexity of development.3. Advantages of horizontal sharding
4. Disadvantages of horizontal sharding
Join
is poor.Based on the analysis above, and the available studis on popular sharding middleware, we selected ShardingSphere, an open source product, combined with Amazon Aurora to introduce how the combination of these two products meets various forms of sharding and how to solve the problems brought by sharding.
ShardingSphere is an open source ecosystem including a set of distributed database middleware solutions, including 3 independent products, Sharding-JDBC, Sharding-Proxy & Sharding-Sidecar.
The characteristics of Sharding-JDBC are:
Hybrid Structure Integrating Sharding-JDBC and Applications
Sharding-JDBC’s core concepts
Data node: The smallest unit of a data slice, consisting of a data source name and a data table, such as ds_0.product_order_0.
Actual table: The physical table that really exists in the horizontal sharding database, such as product order tables: product_order_0, product_order_1, and product_order_2.
Logic table: The logical name of the horizontal sharding databases (tables) with the same schema. For instance, the logic table of the order product_order_0, product_order_1, and product_order_2 is product_order.
Binding table: It refers to the primary table and the joiner table with the same sharding rules. For example, product_order table and product_order_item are sharded by order_id, so they are binding tables with each other. Cartesian product correlation will not appear in the multi-tables correlating query, so the query efficiency will increase greatly.
Broadcast table: It refers to tables that exist in all sharding database sources. The schema and data must consist in each database. It can be applied to the small data volume that needs to correlate with big data tables to query, dictionary table and configuration table for example.
Download the example project code locally. In order to ensure the stability of the test code, we choose shardingsphere-example-4.0.0
version.
git clone
https://github.com/apache/shardingsphere-example.git
Project description:
shardingsphere-example
├── example-core
│ ├── config-utility
│ ├── example-api
│ ├── example-raw-jdbc
│ ├── example-spring-jpa #spring+jpa integration-based entity,repository
│ └── example-spring-mybatis
├── sharding-jdbc-example
│ ├── sharding-example
│ │ ├── sharding-raw-jdbc-example
│ │ ├── sharding-spring-boot-jpa-example #integration-based sharding-jdbc functions
│ │ ├── sharding-spring-boot-mybatis-example
│ │ ├── sharding-spring-namespace-jpa-example
│ │ └── sharding-spring-namespace-mybatis-example
│ ├── orchestration-example
│ │ ├── orchestration-raw-jdbc-example
│ │ ├── orchestration-spring-boot-example #integration-based sharding-jdbc governance function
│ │ └── orchestration-spring-namespace-example
│ ├── transaction-example
│ │ ├── transaction-2pc-xa-example #sharding-jdbc sample of two-phase commit for a distributed transaction
│ │ └──transaction-base-seata-example #sharding-jdbc distributed transaction seata sample
│ ├── other-feature-example
│ │ ├── hint-example
│ │ └── encrypt-example
├── sharding-proxy-example
│ └── sharding-proxy-boot-mybatis-example
└── src/resources
└── manual_schema.sql
Configuration file description:
application-master-slave.properties #read/write splitting profile
application-sharding-databases-tables.properties #sharding profile
application-sharding-databases.properties #library split profile only
application-sharding-master-slave.properties #sharding and read/write splitting profile
application-sharding-tables.properties #table split profile
application.properties #spring boot profile
Code logic description:
The following is the entry class of the Spring Boot application below. Execute it to run the project.
The execution logic of demo is as follows:
As business grows, the write and read requests can be split to different database nodes to effectively promote the processing capability of the entire database cluster. Aurora uses a reader/writer endpoint
to meet users' requirements to write and read with strong consistency, and a read-only endpoint
to meet the requirements to read without strong consistency. Aurora's read and write latency is within single-digit milliseconds, much lower than MySQL's binlog
-based logical replication, so there's a lot of loads that can be directed to a read-only endpoint
.
Through the one primary and multiple secondary configuration, query requests can be evenly distributed to multiple data replicas, which further improves the processing capability of the system. Read/write splitting can improve the throughput and availability of system, but it can also lead to data inconsistency. Aurora provides a primary/secondary architecture in a fully managed form, but applications on the upper-layer still need to manage multiple data sources when interacting with Aurora, routing SQL requests to different nodes based on the read/write type of SQL statements and certain routing policies.
ShardingSphere-JDBC provides read/write splitting features and it is integrated with application programs so that the complex configuration between application programs and database clusters can be separated from application programs. Developers can manage the Shard
through configuration files and combine it with ORM frameworks such as Spring JPA and Mybatis to completely separate the duplicated logic from the code, which greatly improves the ability to maintain code and reduces the coupling between code and database.
Create a set of Aurora MySQL read/write splitting clusters. The model is db.r5.2xlarge. Each set of clusters has one write node and two read nodes.
application.properties spring boot
Master profile description:
You need to replace the green ones with your own environment configuration.
# Jpa automatically creates and drops data tables based on entities
spring.jpa.properties.hibernate.hbm2ddl.auto=create-drop
spring.jpa.properties.hibernate.dialect=org.hibernate.dialect.MySQL5Dialect
spring.jpa.properties.hibernate.show_sql=true
#spring.profiles.active=sharding-databases
#spring.profiles.active=sharding-tables
#spring.profiles.active=sharding-databases-tables
#Activate master-slave configuration item so that sharding-jdbc can use master-slave profile
spring.profiles.active=master-slave
#spring.profiles.active=sharding-master-slave
application-master-slave.properties sharding-jdbc
profile description:
spring.shardingsphere.datasource.names=ds_master,ds_slave_0,ds_slave_1
# data souce-master
spring.shardingsphere.datasource.ds_master.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds_master.password=Your master DB password
spring.shardingsphere.datasource.ds_master.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds_master.jdbc-url=Your primary DB data sourceurl spring.shardingsphere.datasource.ds_master.username=Your primary DB username
# data source-slave
spring.shardingsphere.datasource.ds_slave_0.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds_slave_0.password= Your slave DB password
spring.shardingsphere.datasource.ds_slave_0.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds_slave_0.jdbc-url=Your slave DB data source url
spring.shardingsphere.datasource.ds_slave_0.username= Your slave DB username
# data source-slave
spring.shardingsphere.datasource.ds_slave_1.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds_slave_1.password= Your slave DB password
spring.shardingsphere.datasource.ds_slave_1.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds_slave_1.jdbc-url= Your slave DB data source url
spring.shardingsphere.datasource.ds_slave_1.username= Your slave DB username
# Routing Policy Configuration
spring.shardingsphere.masterslave.load-balance-algorithm-type=round_robin
spring.shardingsphere.masterslave.name=ds_ms
spring.shardingsphere.masterslave.master-data-source-name=ds_master
spring.shardingsphere.masterslave.slave-data-source-names=ds_slave_0,ds_slave_1
# sharding-jdbc configures the information storage mode
spring.shardingsphere.mode.type=Memory
# start shardingsphere log,and you can see the conversion from logical SQL to actual SQL from the print
spring.shardingsphere.props.sql.show=true
As shown in the ShardingSphere-SQL log
figure below, the write SQL is executed on the ds_master
data source.
As shown in the ShardingSphere-SQL log
figure below, the read SQL is executed on the ds_slave
data source in the form of polling.
[INFO ] 2022-04-02 19:43:39,376 --main-- [ShardingSphere-SQL] Rule Type: master-slave
[INFO ] 2022-04-02 19:43:39,376 --main-- [ShardingSphere-SQL] SQL: select orderentit0_.order_id as order_id1_1_, orderentit0_.address_id as address_2_1_,
orderentit0_.status as status3_1_, orderentit0_.user_id as user_id4_1_ from t_order orderentit0_ ::: DataSources: ds_slave_0
---------------------------- Print OrderItem Data -------------------
Hibernate: select orderiteme1_.order_item_id as order_it1_2_, orderiteme1_.order_id as order_id2_2_, orderiteme1_.status as status3_2_, orderiteme1_.user_id
as user_id4_2_ from t_order orderentit0_ cross join t_order_item orderiteme1_ where orderentit0_.order_id=orderiteme1_.order_id
[INFO ] 2022-04-02 19:43:40,898 --main-- [ShardingSphere-SQL] Rule Type: master-slave
[INFO ] 2022-04-02 19:43:40,898 --main-- [ShardingSphere-SQL] SQL: select orderiteme1_.order_item_id as order_it1_2_, orderiteme1_.order_id as order_id2_2_, orderiteme1_.status as status3_2_,
orderiteme1_.user_id as user_id4_2_ from t_order orderentit0_ cross join t_order_item orderiteme1_ where orderentit0_.order_id=orderiteme1_.order_id ::: DataSources: ds_slave_1
Note: As shown in the figure below, if there are both reads and writes in a transaction, Sharding-JDBC routes both read and write operations to the master library. If the read/write requests are not in the same transaction, the corresponding read requests are distributed to different read nodes according to the routing policy.
@Override
@Transactional // When a transaction is started, both read and write in the transaction go through the master library. When closed, read goes through the slave library and write goes through the master library
public void processSuccess() throws SQLException {
System.out.println("-------------- Process Success Begin ---------------");
List<Long> orderIds = insertData();
printData();
deleteData(orderIds);
printData();
System.out.println("-------------- Process Success Finish --------------");
}
The Aurora database environment adopts the configuration described in Section 2.2.1.
3.2.4.1 Verification process description
Spring-Boot
project2. Perform a failover on Aurora’s console
3. Execute the Rest API
request
4. Repeatedly execute POST
(http://localhost:8088/save-user) until the call to the API failed to write to Aurora and eventually recovered successfully.
5. The following figure shows the process of executing code failover. It takes about 37 seconds from the time when the latest SQL write is successfully performed to the time when the next SQL write is successfully performed. That is, the application can be automatically recovered from Aurora failover, and the recovery time is about 37 seconds.
application.properties spring boot
master profile description
# Jpa automatically creates and drops data tables based on entities
spring.jpa.properties.hibernate.hbm2ddl.auto=create-drop
spring.jpa.properties.hibernate.dialect=org.hibernate.dialect.MySQL5Dialect
spring.jpa.properties.hibernate.show_sql=true
#spring.profiles.active=sharding-databases
#Activate sharding-tables configuration items
#spring.profiles.active=sharding-tables
#spring.profiles.active=sharding-databases-tables
# spring.profiles.active=master-slave
#spring.profiles.active=sharding-master-slave
application-sharding-tables.properties sharding-jdbc
profile description
## configure primary-key policy
spring.shardingsphere.sharding.tables.t_order.key-generator.column=order_id
spring.shardingsphere.sharding.tables.t_order.key-generator.type=SNOWFLAKE
spring.shardingsphere.sharding.tables.t_order.key-generator.props.worker.id=123
spring.shardingsphere.sharding.tables.t_order_item.actual-data-nodes=ds.t_order_item_$->{0..1}
spring.shardingsphere.sharding.tables.t_order_item.table-strategy.inline.sharding-column=order_id
spring.shardingsphere.sharding.tables.t_order_item.table-strategy.inline.algorithm-expression=t_order_item_$->{order_id % 2}
spring.shardingsphere.sharding.tables.t_order_item.key-generator.column=order_item_id
spring.shardingsphere.sharding.tables.t_order_item.key-generator.type=SNOWFLAKE
spring.shardingsphere.sharding.tables.t_order_item.key-generator.props.worker.id=123
# configure the binding relation of t_order and t_order_item
spring.shardingsphere.sharding.binding-tables[0]=t_order,t_order_item
# configure broadcast tables
spring.shardingsphere.sharding.broadcast-tables=t_address
# sharding-jdbc mode
spring.shardingsphere.mode.type=Memory
# start shardingsphere log
spring.shardingsphere.props.sql.show=true
1. DDL operation
JPA automatically creates tables for testing. When Sharding-JDBC routing rules are configured, the client
executes DDL, and Sharding-JDBC automatically creates corresponding tables according to the table splitting rules. If t_address
is a broadcast table, create a t_address
because there is only one master instance. Two physical tables t_order_0
and t_order_1
will be created when creating t_order
.
2. Write operation
As shown in the figure below, Logic SQL
inserts a record into t_order
. When Sharding-JDBC is executed, data will be distributed to t_order_0
and t_order_1
according to the table splitting rules.
When t_order
and t_order_item
are bound, the records associated with order_item
and order
are placed on the same physical table.
3. Read operation
As shown in the figure below, perform the join
query operations to order
and order_item
under the binding table, and the physical shard is precisely located based on the binding relationship.
The join
query operations on order
and order_item
under the unbound table will traverse all shards.
Create two instances on Aurora: ds_0
and ds_1
When the sharding-spring-boot-jpa-example
project is started, tables t_order
, t_order_item
,t_address
will be created on two Aurora instances.
application.properties springboot
master profile description
# Jpa automatically creates and drops data tables based on entities
spring.jpa.properties.hibernate.hbm2ddl.auto=create
spring.jpa.properties.hibernate.dialect=org.hibernate.dialect.MySQL5Dialect
spring.jpa.properties.hibernate.show_sql=true
# Activate sharding-databases configuration items
spring.profiles.active=sharding-databases
#spring.profiles.active=sharding-tables
#spring.profiles.active=sharding-databases-tables
#spring.profiles.active=master-slave
#spring.profiles.active=sharding-master-slave
application-sharding-databases.properties sharding-jdbc
profile description
spring.shardingsphere.datasource.names=ds_0,ds_1
# ds_0
spring.shardingsphere.datasource.ds_0.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds_0.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds_0.jdbc-url= spring.shardingsphere.datasource.ds_0.username=
spring.shardingsphere.datasource.ds_0.password=
# ds_1
spring.shardingsphere.datasource.ds_1.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds_1.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds_1.jdbc-url=
spring.shardingsphere.datasource.ds_1.username=
spring.shardingsphere.datasource.ds_1.password=
spring.shardingsphere.sharding.default-database-strategy.inline.sharding-column=user_id
spring.shardingsphere.sharding.default-database-strategy.inline.algorithm-expression=ds_$->{user_id % 2}
spring.shardingsphere.sharding.binding-tables=t_order,t_order_item
spring.shardingsphere.sharding.broadcast-tables=t_address
spring.shardingsphere.sharding.default-data-source-name=ds_0
spring.shardingsphere.sharding.tables.t_order.actual-data-nodes=ds_$->{0..1}.t_order
spring.shardingsphere.sharding.tables.t_order.key-generator.column=order_id
spring.shardingsphere.sharding.tables.t_order.key-generator.type=SNOWFLAKE
spring.shardingsphere.sharding.tables.t_order.key-generator.props.worker.id=123
spring.shardingsphere.sharding.tables.t_order_item.actual-data-nodes=ds_$->{0..1}.t_order_item
spring.shardingsphere.sharding.tables.t_order_item.key-generator.column=order_item_id
spring.shardingsphere.sharding.tables.t_order_item.key-generator.type=SNOWFLAKE
spring.shardingsphere.sharding.tables.t_order_item.key-generator.props.worker.id=123
# sharding-jdbc mode
spring.shardingsphere.mode.type=Memory
# start shardingsphere log
spring.shardingsphere.props.sql.show=true
1. DDL operation
JPA automatically creates tables for testing. When Sharding-JDBC’s library splitting and routing rules are configured, the client
executes DDL, and Sharding-JDBC will automatically create corresponding tables according to table splitting rules. If t_address
is a broadcast table, physical tables will be created on ds_0
and ds_1
. The three tables, t_address
, t_order
and t_order_item
will be created on ds_0
and ds_1
respectively.
2. Write operation
For the broadcast table t_address
, each record written will also be written to the t_address
tables of ds_0
and ds_1
.
The tables t_order
and t_order_item
of the slave library are written on the table in the corresponding instance according to the slave library field and routing policy.
3. Read operation
Query order
is routed to the corresponding Aurora instance according to the routing rules of the slave library .
Query Address
. Since address
is a broadcast table, an instance of address
will be randomly selected and queried from the nodes used.
As shown in the figure below, perform the join
query operations to order
and order_item
under the binding table, and the physical shard is precisely located based on the binding relationship.
As shown in the figure below, create two instances on Aurora: ds_0
and ds_1
When the sharding-spring-boot-jpa-example
project is started, physical tables t_order_01
, t_order_02
, t_order_item_01
,and t_order_item_02
and global table t_address
will be created on two Aurora instances.
application.properties springboot
master profile description
# Jpa automatically creates and drops data tables based on entities
spring.jpa.properties.hibernate.hbm2ddl.auto=create
spring.jpa.properties.hibernate.dialect=org.hibernate.dialect.MySQL5Dialect
spring.jpa.properties.hibernate.show_sql=true
# Activate sharding-databases-tables configuration items
#spring.profiles.active=sharding-databases
#spring.profiles.active=sharding-tables
spring.profiles.active=sharding-databases-tables
#spring.profiles.active=master-slave
#spring.profiles.active=sharding-master-slave
application-sharding-databases.properties sharding-jdbc
profile description
spring.shardingsphere.datasource.names=ds_0,ds_1
# ds_0
spring.shardingsphere.datasource.ds_0.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds_0.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds_0.jdbc-url= 306/dev?useSSL=false&characterEncoding=utf-8
spring.shardingsphere.datasource.ds_0.username=
spring.shardingsphere.datasource.ds_0.password=
spring.shardingsphere.datasource.ds_0.max-active=16
# ds_1
spring.shardingsphere.datasource.ds_1.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds_1.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds_1.jdbc-url=
spring.shardingsphere.datasource.ds_1.username=
spring.shardingsphere.datasource.ds_1.password=
spring.shardingsphere.datasource.ds_1.max-active=16
# default library splitting policy
spring.shardingsphere.sharding.default-database-strategy.inline.sharding-column=user_id
spring.shardingsphere.sharding.default-database-strategy.inline.algorithm-expression=ds_$->{user_id % 2}
spring.shardingsphere.sharding.binding-tables=t_order,t_order_item
spring.shardingsphere.sharding.broadcast-tables=t_address
# Tables that do not meet the library splitting policy are placed on ds_0
spring.shardingsphere.sharding.default-data-source-name=ds_0
# t_order table splitting policy
spring.shardingsphere.sharding.tables.t_order.actual-data-nodes=ds_$->{0..1}.t_order_$->{0..1}
spring.shardingsphere.sharding.tables.t_order.table-strategy.inline.sharding-column=order_id
spring.shardingsphere.sharding.tables.t_order.table-strategy.inline.algorithm-expression=t_order_$->{order_id % 2}
spring.shardingsphere.sharding.tables.t_order.key-generator.column=order_id
spring.shardingsphere.sharding.tables.t_order.key-generator.type=SNOWFLAKE
spring.shardingsphere.sharding.tables.t_order.key-generator.props.worker.id=123
# t_order_item table splitting policy
spring.shardingsphere.sharding.tables.t_order_item.actual-data-nodes=ds_$->{0..1}.t_order_item_$->{0..1}
spring.shardingsphere.sharding.tables.t_order_item.table-strategy.inline.sharding-column=order_id
spring.shardingsphere.sharding.tables.t_order_item.table-strategy.inline.algorithm-expression=t_order_item_$->{order_id % 2}
spring.shardingsphere.sharding.tables.t_order_item.key-generator.column=order_item_id
spring.shardingsphere.sharding.tables.t_order_item.key-generator.type=SNOWFLAKE
spring.shardingsphere.sharding.tables.t_order_item.key-generator.props.worker.id=123
# sharding-jdbc mdoe
spring.shardingsphere.mode.type=Memory
# start shardingsphere log
spring.shardingsphere.props.sql.show=true
1. DDL operation
JPA automatically creates tables for testing. When Sharding-JDBC’s sharding and routing rules are configured, the client
executes DDL, and Sharding-JDBC will automatically create corresponding tables according to table splitting rules. If t_address
is a broadcast table, t_address
will be created on both ds_0
and ds_1
. The three tables, t_address
, t_order
and t_order_item
will be created on ds_0
and ds_1
respectively.
2. Write operation
For the broadcast table t_address
, each record written will also be written to the t_address
tables of ds_0
and ds_1
.
The tables t_order
and t_order_item
of the sub-library are written to the table on the corresponding instance according to the slave library field and routing policy.
3. Read operation
The read operation is similar to the library split function verification described in section2.4.3.
The following figure shows the physical table of the created database instance.
application.properties spring boot
master profile description
# Jpa automatically creates and drops data tables based on entities
spring.jpa.properties.hibernate.hbm2ddl.auto=create
spring.jpa.properties.hibernate.dialect=org.hibernate.dialect.MySQL5Dialect
spring.jpa.properties.hibernate.show_sql=true
# activate sharding-databases-tables configuration items
#spring.profiles.active=sharding-databases
#spring.profiles.active=sharding-tables
#spring.profiles.active=sharding-databases-tables
#spring.profiles.active=master-slave
spring.profiles.active=sharding-master-slave
application-sharding-master-slave.properties sharding-jdbc
profile description
The url, name and password of the database need to be changed to your own database parameters.
spring.shardingsphere.datasource.names=ds_master_0,ds_master_1,ds_master_0_slave_0,ds_master_0_slave_1,ds_master_1_slave_0,ds_master_1_slave_1
spring.shardingsphere.datasource.ds_master_0.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds_master_0.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds_master_0.jdbc-url= spring.shardingsphere.datasource.ds_master_0.username=
spring.shardingsphere.datasource.ds_master_0.password=
spring.shardingsphere.datasource.ds_master_0.max-active=16
spring.shardingsphere.datasource.ds_master_0_slave_0.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds_master_0_slave_0.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds_master_0_slave_0.jdbc-url= spring.shardingsphere.datasource.ds_master_0_slave_0.username=
spring.shardingsphere.datasource.ds_master_0_slave_0.password=
spring.shardingsphere.datasource.ds_master_0_slave_0.max-active=16
spring.shardingsphere.datasource.ds_master_0_slave_1.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds_master_0_slave_1.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds_master_0_slave_1.jdbc-url= spring.shardingsphere.datasource.ds_master_0_slave_1.username=
spring.shardingsphere.datasource.ds_master_0_slave_1.password=
spring.shardingsphere.datasource.ds_master_0_slave_1.max-active=16
spring.shardingsphere.datasource.ds_master_1.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds_master_1.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds_master_1.jdbc-url=
spring.shardingsphere.datasource.ds_master_1.username=
spring.shardingsphere.datasource.ds_master_1.password=
spring.shardingsphere.datasource.ds_master_1.max-active=16
spring.shardingsphere.datasource.ds_master_1_slave_0.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds_master_1_slave_0.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds_master_1_slave_0.jdbc-url=
spring.shardingsphere.datasource.ds_master_1_slave_0.username=
spring.shardingsphere.datasource.ds_master_1_slave_0.password=
spring.shardingsphere.datasource.ds_master_1_slave_0.max-active=16
spring.shardingsphere.datasource.ds_master_1_slave_1.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds_master_1_slave_1.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds_master_1_slave_1.jdbc-url= spring.shardingsphere.datasource.ds_master_1_slave_1.username=admin
spring.shardingsphere.datasource.ds_master_1_slave_1.password=
spring.shardingsphere.datasource.ds_master_1_slave_1.max-active=16
spring.shardingsphere.sharding.default-database-strategy.inline.sharding-column=user_id
spring.shardingsphere.sharding.default-database-strategy.inline.algorithm-expression=ds_$->{user_id % 2}
spring.shardingsphere.sharding.binding-tables=t_order,t_order_item
spring.shardingsphere.sharding.broadcast-tables=t_address
spring.shardingsphere.sharding.default-data-source-name=ds_master_0
spring.shardingsphere.sharding.tables.t_order.actual-data-nodes=ds_$->{0..1}.t_order_$->{0..1}
spring.shardingsphere.sharding.tables.t_order.table-strategy.inline.sharding-column=order_id
spring.shardingsphere.sharding.tables.t_order.table-strategy.inline.algorithm-expression=t_order_$->{order_id % 2}
spring.shardingsphere.sharding.tables.t_order.key-generator.column=order_id
spring.shardingsphere.sharding.tables.t_order.key-generator.type=SNOWFLAKE
spring.shardingsphere.sharding.tables.t_order.key-generator.props.worker.id=123
spring.shardingsphere.sharding.tables.t_order_item.actual-data-nodes=ds_$->{0..1}.t_order_item_$->{0..1}
spring.shardingsphere.sharding.tables.t_order_item.table-strategy.inline.sharding-column=order_id
spring.shardingsphere.sharding.tables.t_order_item.table-strategy.inline.algorithm-expression=t_order_item_$->{order_id % 2}
spring.shardingsphere.sharding.tables.t_order_item.key-generator.column=order_item_id
spring.shardingsphere.sharding.tables.t_order_item.key-generator.type=SNOWFLAKE
spring.shardingsphere.sharding.tables.t_order_item.key-generator.props.worker.id=123
# master/slave data source and slave data source configuration
spring.shardingsphere.sharding.master-slave-rules.ds_0.master-data-source-name=ds_master_0
spring.shardingsphere.sharding.master-slave-rules.ds_0.slave-data-source-names=ds_master_0_slave_0, ds_master_0_slave_1
spring.shardingsphere.sharding.master-slave-rules.ds_1.master-data-source-name=ds_master_1
spring.shardingsphere.sharding.master-slave-rules.ds_1.slave-data-source-names=ds_master_1_slave_0, ds_master_1_slave_1
# sharding-jdbc mode
spring.shardingsphere.mode.type=Memory
# start shardingsphere log
spring.shardingsphere.props.sql.show=true
1. DDL operation
JPA automatically creates tables for testing. When Sharding-JDBC’s library splitting and routing rules are configured, the client
executes DDL, and Sharding-JDBC will automatically create corresponding tables according to table splitting rules. If t_address
is a broadcast table, t_address
will be created on both ds_0
and ds_1
. The three tables, t_address
, t_order
and t_order_item
will be created on ds_0
and ds_1
respectively.
2. Write operation
For the broadcast table t_address
, each record written will also be written to the t_address
tables of ds_0
and ds_1
.
The tables t_order
and t_order_item
of the slave library are written to the table on the corresponding instance according to the slave library field and routing policy.
3. Read operation
The join
query operations on order
and order_item
under the binding table are shown below.
3. Conclusion
As an open source product focusing on database enhancement, ShardingSphere is pretty good in terms of its community activitiy, product maturity and documentation richness.
Among its products, ShardingSphere-JDBC is a sharding solution based on the client-side, which supports all sharding scenarios. And there’s no need to introduce an intermediate layer like Proxy, so the complexity of operation and maintenance is reduced. Its latency is theoretically lower than Proxy due to the lack of intermediate layer. In addition, ShardingSphere-JDBC can support a variety of relational databases based on SQL standards such as MySQL/PostgreSQL/Oracle/SQL Server, etc.
However, due to the integration of Sharding-JDBC with the application program, it only supports Java language for now, and is strongly dependent on the application programs. Nevertheless, Sharding-JDBC separates all sharding configuration from the application program, which brings relatively small changes when switching to other middleware.
In conclusion, Sharding-JDBC is a good choice if you use a Java-based system and have to to interconnect with different relational databases — and don’t want to bother with introducing an intermediate layer.
Author
Sun Jinhua
A senior solution architect at AWS, Sun is responsible for the design and consult on cloud architecture. for providing customers with cloud-related design and consulting services. Before joining AWS, he ran his own business, specializing in building e-commerce platforms and designing the overall architecture for e-commerce platforms of automotive companies. He worked in a global leading communication equipment company as a senior engineer, responsible for the development and architecture design of multiple subsystems of LTE equipment system. He has rich experience in architecture design with high concurrency and high availability system, microservice architecture design, database, middleware, IOT etc.
1677668905
Mocking library for TypeScript inspired by http://mockito.org/
mock
) (also abstract classes) #examplespy
) #examplewhen
) via:verify
)reset
, resetCalls
) #example, #examplecapture
) #example'Expected "convertNumberToString(strictEqual(3))" to be called 2 time(s). But has been called 1 time(s).'
)npm install ts-mockito --save-dev
// Creating mock
let mockedFoo:Foo = mock(Foo);
// Getting instance from mock
let foo:Foo = instance(mockedFoo);
// Using instance in source code
foo.getBar(3);
foo.getBar(5);
// Explicit, readable verification
verify(mockedFoo.getBar(3)).called();
verify(mockedFoo.getBar(anything())).called();
// Creating mock
let mockedFoo:Foo = mock(Foo);
// stub method before execution
when(mockedFoo.getBar(3)).thenReturn('three');
// Getting instance
let foo:Foo = instance(mockedFoo);
// prints three
console.log(foo.getBar(3));
// prints null, because "getBar(999)" was not stubbed
console.log(foo.getBar(999));
// Creating mock
let mockedFoo:Foo = mock(Foo);
// stub getter before execution
when(mockedFoo.sampleGetter).thenReturn('three');
// Getting instance
let foo:Foo = instance(mockedFoo);
// prints three
console.log(foo.sampleGetter);
Syntax is the same as with getter values.
Please note, that stubbing properties that don't have getters only works if Proxy object is available (ES6).
// Creating mock
let mockedFoo:Foo = mock(Foo);
// Getting instance
let foo:Foo = instance(mockedFoo);
// Some calls
foo.getBar(1);
foo.getBar(2);
foo.getBar(2);
foo.getBar(3);
// Call count verification
verify(mockedFoo.getBar(1)).once(); // was called with arg === 1 only once
verify(mockedFoo.getBar(2)).twice(); // was called with arg === 2 exactly two times
verify(mockedFoo.getBar(between(2, 3))).thrice(); // was called with arg between 2-3 exactly three times
verify(mockedFoo.getBar(anyNumber()).times(4); // was called with any number arg exactly four times
verify(mockedFoo.getBar(2)).atLeast(2); // was called with arg === 2 min two times
verify(mockedFoo.getBar(anything())).atMost(4); // was called with any argument max four times
verify(mockedFoo.getBar(4)).never(); // was never called with arg === 4
// Creating mock
let mockedFoo:Foo = mock(Foo);
let mockedBar:Bar = mock(Bar);
// Getting instance
let foo:Foo = instance(mockedFoo);
let bar:Bar = instance(mockedBar);
// Some calls
foo.getBar(1);
bar.getFoo(2);
// Call order verification
verify(mockedFoo.getBar(1)).calledBefore(mockedBar.getFoo(2)); // foo.getBar(1) has been called before bar.getFoo(2)
verify(mockedBar.getFoo(2)).calledAfter(mockedFoo.getBar(1)); // bar.getFoo(2) has been called before foo.getBar(1)
verify(mockedFoo.getBar(1)).calledBefore(mockedBar.getFoo(999999)); // throws error (mockedBar.getFoo(999999) has never been called)
let mockedFoo:Foo = mock(Foo);
when(mockedFoo.getBar(10)).thenThrow(new Error('fatal error'));
let foo:Foo = instance(mockedFoo);
try {
foo.getBar(10);
} catch (error:Error) {
console.log(error.message); // 'fatal error'
}
You can also stub method with your own implementation
let mockedFoo:Foo = mock(Foo);
let foo:Foo = instance(mockedFoo);
when(mockedFoo.sumTwoNumbers(anyNumber(), anyNumber())).thenCall((arg1:number, arg2:number) => {
return arg1 * arg2;
});
// prints '50' because we've changed sum method implementation to multiply!
console.log(foo.sumTwoNumbers(5, 10));
You can also stub method to resolve / reject promise
let mockedFoo:Foo = mock(Foo);
when(mockedFoo.fetchData("a")).thenResolve({id: "a", value: "Hello world"});
when(mockedFoo.fetchData("b")).thenReject(new Error("b does not exist"));
You can reset just mock call counter
// Creating mock
let mockedFoo:Foo = mock(Foo);
// Getting instance
let foo:Foo = instance(mockedFoo);
// Some calls
foo.getBar(1);
foo.getBar(1);
verify(mockedFoo.getBar(1)).twice(); // getBar with arg "1" has been called twice
// Reset mock
resetCalls(mockedFoo);
// Call count verification
verify(mockedFoo.getBar(1)).never(); // has never been called after reset
You can also reset calls of multiple mocks at once resetCalls(firstMock, secondMock, thirdMock)
Or reset mock call counter with all stubs
// Creating mock
let mockedFoo:Foo = mock(Foo);
when(mockedFoo.getBar(1)).thenReturn("one").
// Getting instance
let foo:Foo = instance(mockedFoo);
// Some calls
console.log(foo.getBar(1)); // "one" - as defined in stub
console.log(foo.getBar(1)); // "one" - as defined in stub
verify(mockedFoo.getBar(1)).twice(); // getBar with arg "1" has been called twice
// Reset mock
reset(mockedFoo);
// Call count verification
verify(mockedFoo.getBar(1)).never(); // has never been called after reset
console.log(foo.getBar(1)); // null - previously added stub has been removed
You can also reset multiple mocks at once reset(firstMock, secondMock, thirdMock)
let mockedFoo:Foo = mock(Foo);
let foo:Foo = instance(mockedFoo);
// Call method
foo.sumTwoNumbers(1, 2);
// Check first arg captor values
const [firstArg, secondArg] = capture(mockedFoo.sumTwoNumbers).last();
console.log(firstArg); // prints 1
console.log(secondArg); // prints 2
You can also get other calls using first()
, second()
, byCallIndex(3)
and more...
You can set multiple returning values for same matching values
const mockedFoo:Foo = mock(Foo);
when(mockedFoo.getBar(anyNumber())).thenReturn('one').thenReturn('two').thenReturn('three');
const foo:Foo = instance(mockedFoo);
console.log(foo.getBar(1)); // one
console.log(foo.getBar(1)); // two
console.log(foo.getBar(1)); // three
console.log(foo.getBar(1)); // three - last defined behavior will be repeated infinitely
Another example with specific values
let mockedFoo:Foo = mock(Foo);
when(mockedFoo.getBar(1)).thenReturn('one').thenReturn('another one');
when(mockedFoo.getBar(2)).thenReturn('two');
let foo:Foo = instance(mockedFoo);
console.log(foo.getBar(1)); // one
console.log(foo.getBar(2)); // two
console.log(foo.getBar(1)); // another one
console.log(foo.getBar(1)); // another one - this is last defined behavior for arg '1' so it will be repeated
console.log(foo.getBar(2)); // two
console.log(foo.getBar(2)); // two - this is last defined behavior for arg '2' so it will be repeated
Short notation:
const mockedFoo:Foo = mock(Foo);
// You can specify return values as multiple thenReturn args
when(mockedFoo.getBar(anyNumber())).thenReturn('one', 'two', 'three');
const foo:Foo = instance(mockedFoo);
console.log(foo.getBar(1)); // one
console.log(foo.getBar(1)); // two
console.log(foo.getBar(1)); // three
console.log(foo.getBar(1)); // three - last defined behavior will be repeated infinity
Possible errors:
const mockedFoo:Foo = mock(Foo);
// When multiple matchers, matches same result:
when(mockedFoo.getBar(anyNumber())).thenReturn('one');
when(mockedFoo.getBar(3)).thenReturn('one');
const foo:Foo = instance(mockedFoo);
foo.getBar(3); // MultipleMatchersMatchSameStubError will be thrown, two matchers match same method call
You can mock interfaces too, just instead of passing type to mock
function, set mock
function generic type Mocking interfaces requires Proxy
implementation
let mockedFoo:Foo = mock<FooInterface>(); // instead of mock(FooInterface)
const foo: SampleGeneric<FooInterface> = instance(mockedFoo);
You can mock abstract classes
const mockedFoo: SampleAbstractClass = mock(SampleAbstractClass);
const foo: SampleAbstractClass = instance(mockedFoo);
You can also mock generic classes, but note that generic type is just needed by mock type definition
const mockedFoo: SampleGeneric<SampleInterface> = mock(SampleGeneric);
const foo: SampleGeneric<SampleInterface> = instance(mockedFoo);
You can partially mock an existing instance:
const foo: Foo = new Foo();
const spiedFoo = spy(foo);
when(spiedFoo.getBar(3)).thenReturn('one');
console.log(foo.getBar(3)); // 'one'
console.log(foo.getBaz()); // call to a real method
You can spy on plain objects too:
const foo = { bar: () => 42 };
const spiedFoo = spy(foo);
foo.bar();
console.log(capture(spiedFoo.bar).last()); // [42]
Author: NagRock
Source Code: https://github.com/NagRock/ts-mockito
License: MIT license
1659714540
ruby-kafka
A Ruby client library for Apache Kafka, a distributed log and message bus. The focus of this library will be operational simplicity, with good logging and metrics that can make debugging issues easier.
Add this line to your application's Gemfile:
gem 'ruby-kafka'
And then execute:
$ bundle
Or install it yourself as:
$ gem install ruby-kafka
Producer API | Consumer API | |
---|---|---|
Kafka 0.8 | Full support in v0.4.x | Unsupported |
Kafka 0.9 | Full support in v0.4.x | Full support in v0.4.x |
Kafka 0.10 | Full support in v0.5.x | Full support in v0.5.x |
Kafka 0.11 | Full support in v0.7.x | Limited support |
Kafka 1.0 | Limited support | Limited support |
Kafka 2.0 | Limited support | Limited support |
Kafka 2.1 | Limited support | Limited support |
Kafka 2.2 | Limited support | Limited support |
Kafka 2.3 | Limited support | Limited support |
Kafka 2.4 | Limited support | Limited support |
Kafka 2.5 | Limited support | Limited support |
Kafka 2.6 | Limited support | Limited support |
Kafka 2.7 | Limited support | Limited support |
This library is targeting Kafka 0.9 with the v0.4.x series and Kafka 0.10 with the v0.5.x series. There's limited support for Kafka 0.8, and things should work with Kafka 0.11, although there may be performance issues due to changes in the protocol.
This library requires Ruby 2.1 or higher.
Please see the documentation site for detailed documentation on the latest release. Note that the documentation on GitHub may not match the version of the library you're using – there are still being made many changes to the API.
A client must be initialized with at least one Kafka broker, from which the entire Kafka cluster will be discovered. Each client keeps a separate pool of broker connections. Don't use the same client from more than one thread.
require "kafka"
# The first argument is a list of "seed brokers" that will be queried for the full
# cluster topology. At least one of these *must* be available. `client_id` is
# used to identify this client in logs and metrics. It's optional but recommended.
kafka = Kafka.new(["kafka1:9092", "kafka2:9092"], client_id: "my-application")
You can also use a hostname with seed brokers' IP addresses:
kafka = Kafka.new("seed-brokers:9092", client_id: "my-application", resolve_seed_brokers: true)
The simplest way to write a message to a Kafka topic is to call #deliver_message
:
kafka = Kafka.new(...)
kafka.deliver_message("Hello, World!", topic: "greetings")
This will write the message to a random partition in the greetings
topic. If you want to write to a specific partition, pass the partition
parameter:
# Will write to partition 42.
kafka.deliver_message("Hello, World!", topic: "greetings", partition: 42)
If you don't know exactly how many partitions are in the topic, or if you'd rather have some level of indirection, you can pass in partition_key
instead. Two messages with the same partition key will always be assigned to the same partition. This is useful if you want to make sure all messages with a given attribute are always written to the same partition, e.g. all purchase events for a given customer id.
# Partition keys assign a partition deterministically.
kafka.deliver_message("Hello, World!", topic: "greetings", partition_key: "hello")
Kafka also supports message keys. When passed, a message key can be used instead of a partition key. The message key is written alongside the message value and can be read by consumers. Message keys in Kafka can be used for interesting things such as Log Compaction. See Partitioning for more information.
# Set a message key; the key will be used for partitioning since no explicit
# `partition_key` is set.
kafka.deliver_message("Hello, World!", key: "hello", topic: "greetings")
While #deliver_message
works fine for infrequent writes, there are a number of downsides:
The Producer API solves all these problems and more:
# Instantiate a new producer.
producer = kafka.producer
# Add a message to the producer buffer.
producer.produce("hello1", topic: "test-messages")
# Deliver the messages to Kafka.
producer.deliver_messages
#produce
will buffer the message in the producer but will not actually send it to the Kafka cluster. Buffered messages are only delivered to the Kafka cluster once #deliver_messages
is called. Since messages may be destined for different partitions, this could involve writing to more than one Kafka broker. Note that a failure to send all buffered messages after the configured number of retries will result in Kafka::DeliveryFailed
being raised. This can be rescued and ignored; the messages will be kept in the buffer until the next attempt.
Read the docs for Kafka::Producer for more details.
A normal producer will block while #deliver_messages
is sending messages to Kafka, possibly for tens of seconds or even minutes at a time, depending on your timeout and retry settings. Furthermore, you have to call #deliver_messages
manually, with a frequency that balances batch size with message delay.
In order to avoid blocking during message deliveries you can use the asynchronous producer API. It is mostly similar to the synchronous API, with calls to #produce
and #deliver_messages
. The main difference is that rather than blocking, these calls will return immediately. The actual work will be done in a background thread, with the messages and operations being sent from the caller over a thread safe queue.
# `#async_producer` will create a new asynchronous producer.
producer = kafka.async_producer
# The `#produce` API works as normal.
producer.produce("hello", topic: "greetings")
# `#deliver_messages` will return immediately.
producer.deliver_messages
# Make sure to call `#shutdown` on the producer in order to avoid leaking
# resources. `#shutdown` will wait for any pending messages to be delivered
# before returning.
producer.shutdown
By default, the delivery policy will be the same as for a synchronous producer: only when #deliver_messages
is called will the messages be delivered. However, the asynchronous producer offers two complementary policies for automatic delivery:
These policies can be used alone or in combination.
# `async_producer` will create a new asynchronous producer.
producer = kafka.async_producer(
# Trigger a delivery once 100 messages have been buffered.
delivery_threshold: 100,
# Trigger a delivery every 30 seconds.
delivery_interval: 30,
)
producer.produce("hello", topic: "greetings")
# ...
When calling #shutdown
, the producer will attempt to deliver the messages and the method call will block until that has happened. Note that there's no guarantee that the messages will be delivered.
Note: if the calling thread produces messages faster than the producer can write them to Kafka, you'll eventually run into problems. The internal queue used for sending messages from the calling thread to the background worker has a size limit; once this limit is reached, a call to #produce
will raise Kafka::BufferOverflow
.
This library is agnostic to which serialization format you prefer. Both the value and key of a message is treated as a binary string of data. This makes it easier to use whatever serialization format you want, since you don't have to do anything special to make it work with ruby-kafka. Here's an example of encoding data with JSON:
require "json"
# ...
event = {
"name" => "pageview",
"url" => "https://example.com/posts/123",
# ...
}
data = JSON.dump(event)
producer.produce(data, topic: "events")
There's also an example of encoding messages with Apache Avro.
Kafka topics are partitioned, with messages being assigned to a partition by the client. This allows a great deal of flexibility for the users. This section describes several strategies for partitioning and how they impact performance, data locality, etc.
Load Balanced Partitioning
When optimizing for efficiency, we either distribute messages as evenly as possible to all partitions, or make sure each producer always writes to a single partition. The former ensures an even load for downstream consumers; the latter ensures the highest producer performance, since message batching is done per partition.
If no explicit partition is specified, the producer will look to the partition key or the message key for a value that can be used to deterministically assign the message to a partition. If there is a big number of different keys, the resulting distribution will be pretty even. If no keys are passed, the producer will randomly assign a partition. Random partitioning can be achieved even if you use message keys by passing a random partition key, e.g. partition_key: rand(100)
.
If you wish to have the producer write all messages to a single partition, simply generate a random value and re-use that as the partition key:
partition_key = rand(100)
producer.produce(msg1, topic: "messages", partition_key: partition_key)
producer.produce(msg2, topic: "messages", partition_key: partition_key)
# ...
You can also base the partition key on some property of the producer, for example the host name.
Semantic Partitioning
By assigning messages to a partition based on some property of the message, e.g. making sure all events tracked in a user session are assigned to the same partition, downstream consumers can make simplifying assumptions about data locality. In this example, a consumer can keep process local state pertaining to a user session knowing that all events for the session will be read from a single partition. This is also called semantic partitioning, since the partition assignment is part of the application behavior.
Typically it's sufficient to simply pass a partition key in order to guarantee that a set of messages will be assigned to the same partition, e.g.
# All messages with the same `session_id` will be assigned to the same partition.
producer.produce(event, topic: "user-events", partition_key: session_id)
However, sometimes it's necessary to select a specific partition. When doing this, make sure that you don't pick a partition number outside the range of partitions for the topic:
partitions = kafka.partitions_for("events")
# Make sure that we don't exceed the partition count!
partition = some_number % partitions
producer.produce(event, topic: "events", partition: partition)
Compatibility with Other Clients
There's no standardized way to assign messages to partitions across different Kafka client implementations. If you have a heterogeneous set of clients producing messages to the same topics it may be important to ensure a consistent partitioning scheme. This library doesn't try to implement all schemes, so you'll have to figure out which scheme the other client is using and replicate it. An example:
partitions = kafka.partitions_for("events")
# Insert your custom partitioning scheme here:
partition = PartitioningScheme.assign(partitions, event)
producer.produce(event, topic: "events", partition: partition)
Another option is to configure a custom client partitioner that implements call(partition_count, message)
and uses the same schema as the other client. For example:
class CustomPartitioner
def call(partition_count, message)
...
end
end
partitioner = CustomPartitioner.new
Kafka.new(partitioner: partitioner, ...)
Or, simply create a Proc handling the partitioning logic instead of having to add a new class. For example:
partitioner = -> (partition_count, message) { ... }
Kafka.new(partitioner: partitioner, ...)
Supported partitioning schemes
In order for semantic partitioning to work a partition_key
must map to the same partition number every time. The general approach, and the one used by this library, is to hash the key and mod it by the number of partitions. There are many different algorithms that can be used to calculate a hash. By default crc32
is used. murmur2
is also supported for compatibility with Java based Kafka producers.
To use murmur2
hashing pass it as an argument to Partitioner
. For example:
Kafka.new(partitioner: Kafka::Partitioner.new(hash_function: :murmur2))
The producer is designed for resilience in the face of temporary network errors, Kafka broker failovers, and other issues that prevent the client from writing messages to the destination topics. It does this by employing local, in-memory buffers. Only when messages are acknowledged by a Kafka broker will they be removed from the buffer.
Typically, you'd configure the producer to retry failed attempts at sending messages, but sometimes all retries are exhausted. In that case, Kafka::DeliveryFailed
is raised from Kafka::Producer#deliver_messages
. If you wish to have your application be resilient to this happening (e.g. if you're logging to Kafka from a web application) you can rescue this exception. The failed messages are still retained in the buffer, so a subsequent call to #deliver_messages
will still attempt to send them.
Note that there's a maximum buffer size; by default, it's set to 1,000 messages and 10MB. It's possible to configure both these numbers:
producer = kafka.producer(
max_buffer_size: 5_000, # Allow at most 5K messages to be buffered.
max_buffer_bytesize: 100_000_000, # Allow at most 100MB to be buffered.
...
)
A final note on buffers: local buffers give resilience against broker and network failures, and allow higher throughput due to message batching, but they also trade off consistency guarantees for higher availability and resilience. If your local process dies while messages are buffered, those messages will be lost. If you require high levels of consistency, you should call #deliver_messages
immediately after #produce
.
Once the client has delivered a set of messages to a Kafka broker the broker will forward them to its replicas, thus ensuring that a single broker failure will not result in message loss. However, the client can choose when the leader acknowledges the write. At one extreme, the client can choose fire-and-forget delivery, not even bothering to check whether the messages have been acknowledged. At the other end, the client can ask the broker to wait until all its replicas have acknowledged the write before returning. This is the safest option, and the default. It's also possible to have the broker return as soon as it has written the messages to its own log but before the replicas have done so. This leaves a window of time where a failure of the leader will result in the messages being lost, although this should not be a common occurrence.
Write latency and throughput are negatively impacted by having more replicas acknowledge a write, so if you require low-latency, high throughput writes you may want to accept lower durability.
This behavior is controlled by the required_acks
option to #producer
and #async_producer
:
# This is the default: all replicas must acknowledge.
producer = kafka.producer(required_acks: :all)
# This is fire-and-forget: messages can easily be lost.
producer = kafka.producer(required_acks: 0)
# This only waits for the leader to acknowledge.
producer = kafka.producer(required_acks: 1)
Unless you absolutely need lower latency it's highly recommended to use the default setting (:all
).
There are basically two different and incompatible guarantees that can be made in a message delivery system such as Kafka:
Of these two options, ruby-kafka implements the second one: when in doubt about whether a message has been delivered, a producer will try to deliver it again.
The guarantee is made only for the synchronous producer and boils down to this:
producer = kafka.producer
producer.produce("hello", topic: "greetings")
# If this line fails with Kafka::DeliveryFailed we *may* have succeeded in delivering
# the message to Kafka but won't know for sure.
producer.deliver_messages
# If we get to this line we can be sure that the message has been delivered to Kafka!
That is, once #deliver_messages
returns we can be sure that Kafka has received the message. Note that there are some big caveats here:
required_acks
to zero there is no guarantee that the message will ever make it to a Kafka broker.#deliver_messages
returns. A way of blocking until a message has been delivered with the asynchronous producer may be implemented in the future.It's possible to improve your chances of success when calling #deliver_messages
, at the price of a longer max latency:
producer = kafka.producer(
# The number of retries when attempting to deliver messages. The default is
# 2, so 3 attempts in total, but you can configure a higher or lower number:
max_retries: 5,
# The number of seconds to wait between retries. In order to handle longer
# periods of Kafka being unavailable, increase this number. The default is
# 1 second.
retry_backoff: 5,
)
Note that these values affect the max latency of the operation; see Understanding Timeouts for an explanation of the various timeouts and latencies.
If you use the asynchronous producer you typically don't have to worry too much about this, as retries will be done in the background.
Depending on what kind of data you produce, enabling compression may yield improved bandwidth and space usage. Compression in Kafka is done on entire messages sets rather than on individual messages. This improves the compression rate and generally means that compressions works better the larger your buffers get, since the message sets will be larger by the time they're compressed.
Since many workloads have variations in throughput and distribution across partitions, it's possible to configure a threshold for when to enable compression by setting compression_threshold
. Only if the defined number of messages are buffered for a partition will the messages be compressed.
Compression is enabled by passing the compression_codec
parameter to #producer
with the name of one of the algorithms allowed by Kafka:
:snappy
for Snappy compression.:gzip
for gzip compression.:lz4
for LZ4 compression.:zstd
for zstd compression.By default, all message sets will be compressed if you specify a compression codec. To increase the compression threshold, set compression_threshold
to an integer value higher than one.
producer = kafka.producer(
compression_codec: :snappy,
compression_threshold: 10,
)
A typical use case for Kafka is tracking events that occur in web applications. Oftentimes it's advisable to avoid having a hard dependency on Kafka being available, allowing your application to survive a Kafka outage. By using an asynchronous producer, you can avoid doing IO within the individual request/response cycles, instead pushing that to the producer's internal background thread.
In this example, a producer is configured in a Rails initializer:
# config/initializers/kafka_producer.rb
require "kafka"
# Configure the Kafka client with the broker hosts and the Rails
# logger.
$kafka = Kafka.new(["kafka1:9092", "kafka2:9092"], logger: Rails.logger)
# Set up an asynchronous producer that delivers its buffered messages
# every ten seconds:
$kafka_producer = $kafka.async_producer(
delivery_interval: 10,
)
# Make sure to shut down the producer when exiting.
at_exit { $kafka_producer.shutdown }
In your controllers, simply call the producer directly:
# app/controllers/orders_controller.rb
class OrdersController
def create
@order = Order.create!(params[:order])
event = {
order_id: @order.id,
amount: @order.amount,
timestamp: Time.now,
}
$kafka_producer.produce(event.to_json, topic: "order_events")
end
end
Note: If you're just looking to get started with Kafka consumers, you might be interested in visiting the Higher level libraries section that lists ruby-kafka based frameworks. Read on, if you're interested in either rolling your own executable consumers or if you want to learn more about how consumers work in Kafka.
Consuming messages from a Kafka topic with ruby-kafka is simple:
require "kafka"
kafka = Kafka.new(["kafka1:9092", "kafka2:9092"])
kafka.each_message(topic: "greetings") do |message|
puts message.offset, message.key, message.value
end
While this is great for extremely simple use cases, there are a number of downsides:
The Consumer API solves all of the above issues, and more. It uses the Consumer Groups feature released in Kafka 0.9 to allow multiple consumer processes to coordinate access to a topic, assigning each partition to a single consumer. When a consumer fails, the partitions that were assigned to it are re-assigned to other members of the group.
Using the API is simple:
require "kafka"
kafka = Kafka.new(["kafka1:9092", "kafka2:9092"])
# Consumers with the same group id will form a Consumer Group together.
consumer = kafka.consumer(group_id: "my-consumer")
# It's possible to subscribe to multiple topics by calling `subscribe`
# repeatedly.
consumer.subscribe("greetings")
# Stop the consumer when the SIGTERM signal is sent to the process.
# It's better to shut down gracefully than to kill the process.
trap("TERM") { consumer.stop }
# This will loop indefinitely, yielding each message in turn.
consumer.each_message do |message|
puts message.topic, message.partition
puts message.offset, message.key, message.value
end
Each consumer process will be assigned one or more partitions from each topic that the group subscribes to. In order to handle more messages, simply start more processes.
In order to be able to resume processing after a consumer crashes, each consumer will periodically checkpoint its position within each partition it reads from. Since each partition has a monotonically increasing sequence of message offsets, this works by committing the offset of the last message that was processed in a given partition. Kafka handles these commits and allows another consumer in a group to resume from the last commit when a member crashes or becomes unresponsive.
By default, offsets are committed every 10 seconds. You can increase the frequency, known as the offset commit interval, to limit the duration of double-processing scenarios, at the cost of a lower throughput due to the added coordination. If you want to improve throughput, and double-processing is of less concern to you, then you can decrease the frequency. Set the commit interval to zero in order to disable the timer-based commit trigger entirely.
In addition to the time based trigger it's possible to trigger checkpointing in response to n messages having been processed, known as the offset commit threshold. This puts a bound on the number of messages that can be double-processed before the problem is detected. Setting this to 1 will cause an offset commit to take place every time a message has been processed. By default this trigger is disabled (set to zero).
It is possible to trigger an immediate offset commit by calling Consumer#commit_offsets
. This blocks the caller until the Kafka cluster has acknowledged the commit.
Stale offsets are periodically purged by the broker. The broker setting offsets.retention.minutes
controls the retention window for committed offsets, and defaults to 1 day. The length of the retention window, known as offset retention time, can be changed for the consumer.
Previously committed offsets are re-committed, to reset the retention window, at the first commit and periodically at an interval of half the offset retention time.
consumer = kafka.consumer(
group_id: "some-group",
# Increase offset commit frequency to once every 5 seconds.
offset_commit_interval: 5,
# Commit offsets when 100 messages have been processed.
offset_commit_threshold: 100,
# Increase the length of time that committed offsets are kept.
offset_retention_time: 7 * 60 * 60
)
For some use cases it may be necessary to control when messages are marked as processed. Note that since only the consumer position within each partition can be saved, marking a message as processed implies that all messages in the partition with a lower offset should also be considered as having been processed.
The method Consumer#mark_message_as_processed
marks a message (and all those that precede it in a partition) as having been processed. This is an advanced API that you should only use if you know what you're doing.
# Manually controlling checkpointing:
# Typically you want to use this API in order to buffer messages until some
# special "commit" message is received, e.g. in order to group together
# transactions consisting of several items.
buffer = []
# Messages will not be marked as processed automatically. If you shut down the
# consumer without calling `#mark_message_as_processed` first, the consumer will
# not resume where you left off!
consumer.each_message(automatically_mark_as_processed: false) do |message|
# Our messages are JSON with a `type` field and other stuff.
event = JSON.parse(message.value)
case event.fetch("type")
when "add_to_cart"
buffer << event
when "complete_purchase"
# We've received all the messages we need, time to save the transaction.
save_transaction(buffer)
# Now we can set the checkpoint by marking the last message as processed.
consumer.mark_message_as_processed(message)
# We can optionally trigger an immediate, blocking offset commit in order
# to minimize the risk of crashing before the automatic triggers have
# kicked in.
consumer.commit_offsets
# Make the buffer ready for the next transaction.
buffer.clear
end
end
For each topic subscription it's possible to decide whether to consume messages starting at the beginning of the topic or to just consume new messages that are produced to the topic. This policy is configured by setting the start_from_beginning
argument when calling #subscribe
:
# Consume messages from the very beginning of the topic. This is the default.
consumer.subscribe("users", start_from_beginning: true)
# Only consume new messages.
consumer.subscribe("notifications", start_from_beginning: false)
Once the consumer group has checkpointed its progress in the topic's partitions, the consumers will always start from the checkpointed offsets, regardless of start_from_beginning
. As such, this setting only applies when the consumer initially starts consuming from a topic.
In order to shut down a running consumer process cleanly, call #stop
on it. A common pattern is to trap a process signal and initiate the shutdown from there:
consumer = kafka.consumer(...)
# The consumer can be stopped from the command line by executing
# `kill -s TERM <process-id>`.
trap("TERM") { consumer.stop }
consumer.each_message do |message|
...
end
Sometimes it is easier to deal with messages in batches rather than individually. A batch is a sequence of one or more Kafka messages that all belong to the same topic and partition. One common reason to want to use batches is when some external system has a batch or transactional API.
# A mock search index that we'll be keeping up to date with new Kafka messages.
index = SearchIndex.new
consumer.subscribe("posts")
consumer.each_batch do |batch|
puts "Received batch: #{batch.topic}/#{batch.partition}"
transaction = index.transaction
batch.messages.each do |message|
# Let's assume that adding a document is idempotent.
transaction.add(id: message.key, body: message.value)
end
# Once this method returns, the messages have been successfully written to the
# search index. The consumer will only checkpoint a batch *after* the block
# has completed without an exception.
transaction.commit!
end
One important thing to note is that the client commits the offset of the batch's messages only after the entire batch has been processed.
There are two performance properties that can at times be at odds: throughput and latency. Throughput is the number of messages that can be processed in a given timespan; latency is the time it takes from a message is written to a topic until it has been processed.
In order to optimize for throughput, you want to make sure to fetch as many messages as possible every time you do a round trip to the Kafka cluster. This minimizes network overhead and allows processing data in big chunks.
In order to optimize for low latency, you want to process a message as soon as possible, even if that means fetching a smaller batch of messages.
There are three values that can be tuned in order to balance these two concerns.
min_bytes
is the minimum number of bytes to return from a single message fetch. By setting this to a high value you can increase the processing throughput. The default value is one byte.max_wait_time
is the maximum number of seconds to wait before returning data from a single message fetch. By setting this high you also increase the processing throughput – and by setting it low you set a bound on latency. This configuration overrides min_bytes
, so you'll always get data back within the time specified. The default value is one second. If you want to have at most five seconds of latency, set max_wait_time
to 5. You should make sure max_wait_time
* num brokers + heartbeat_interval
is less than session_timeout
.max_bytes_per_partition
is the maximum amount of data a broker will return for a single partition when fetching new messages. The default is 1MB, but increasing this number may lead to better throughtput since you'll need to fetch less frequently. Setting it to a lower value is not recommended unless you have so many partitions that it's causing network and latency issues to transfer a fetch response from a broker to a client. Setting the number too high may result in instability, so be careful.The first two settings can be passed to either #each_message
or #each_batch
, e.g.
# Waits for data for up to 5 seconds on each broker, preferring to fetch at least 5KB at a time.
# This can wait up to num brokers * 5 seconds.
consumer.each_message(min_bytes: 1024 * 5, max_wait_time: 5) do |message|
# ...
end
The last setting is configured when subscribing to a topic, and can vary between topics:
# Fetches up to 5MB per partition at a time for better throughput.
consumer.subscribe("greetings", max_bytes_per_partition: 5 * 1024 * 1024)
consumer.each_message do |message|
# ...
end
In some cases, you might want to assign more partitions to some consumers. For example, in applications inserting some records to a database, the consumers running on hosts nearby the database can process more messages than the consumers running on other hosts. You can use a custom assignment strategy by passing an object that implements #call
as the argument assignment_strategy
like below:
class CustomAssignmentStrategy
def initialize(user_data)
@user_data = user_data
end
# Assign the topic partitions to the group members.
#
# @param cluster [Kafka::Cluster]
# @param members [Hash<String, Kafka::Protocol::JoinGroupResponse::Metadata>] a hash
# mapping member ids to metadata
# @param partitions [Array<Kafka::ConsumerGroup::Assignor::Partition>] a list of
# partitions the consumer group processes
# @return [Hash<String, Array<Kafka::ConsumerGroup::Assignor::Partition>] a hash
# mapping member ids to partitions.
def call(cluster:, members:, partitions:)
...
end
end
strategy = CustomAssignmentStrategy.new("some-host-information")
consumer = kafka.consumer(group_id: "some-group", assignment_strategy: strategy)
members
is a hash mapping member IDs to metadata, and partitions is a list of partitions the consumer group processes. The method call
must return a hash mapping member IDs to partitions. For example, the following strategy assigns partitions randomly:
class RandomAssignmentStrategy
def call(cluster:, members:, partitions:)
member_ids = members.keys
partitions.each_with_object(Hash.new {|h, k| h[k] = [] }) do |partition, partitions_per_member|
partitions_per_member[member_ids[rand(member_ids.count)]] << partition
end
end
end
If the strategy needs user data, you should define the method user_data
that returns user data on each consumer. For example, the following strategy uses the consumers' IP addresses as user data:
class NetworkTopologyAssignmentStrategy
def user_data
Socket.ip_address_list.find(&:ipv4_private?).ip_address
end
def call(cluster:, members:, partitions:)
# Display the pair of the member ID and IP address
members.each do |id, metadata|
puts "#{id}: #{metadata.user_data}"
end
# Assign partitions considering the network topology
...
end
end
Note that the strategy uses the class name as the default protocol name. You can change it by defining the method protocol_name
:
class NetworkTopologyAssignmentStrategy
def protocol_name
"networktopology"
end
def user_data
Socket.ip_address_list.find(&:ipv4_private?).ip_address
end
def call(cluster:, members:, partitions:)
...
end
end
As the method call
might receive different user data from what it expects, you should avoid using the same protocol name as another strategy that uses different user data.
You typically don't want to share a Kafka client object between threads, since the network communication is not synchronized. Furthermore, you should avoid using threads in a consumer unless you're very careful about waiting for all work to complete before returning from the #each_message
or #each_batch
block. This is because checkpointing assumes that returning from the block means that the messages that have been yielded have been successfully processed.
You should also avoid sharing a synchronous producer between threads, as the internal buffers are not thread safe. However, the asynchronous producer should be safe to use in a multi-threaded environment. This is because producers, when instantiated, get their own copy of any non-thread-safe data such as network sockets. Furthermore, the asynchronous producer has been designed in such a way to only a single background thread operates on this data while any foreground thread with a reference to the producer object can only send messages to that background thread over a safe queue. Therefore it is safe to share an async producer object between many threads.
It's a very good idea to configure the Kafka client with a logger. All important operations and errors are logged. When instantiating your client, simply pass in a valid logger:
logger = Logger.new("log/kafka.log")
kafka = Kafka.new(logger: logger, ...)
By default, nothing is logged.
Most operations are instrumented using Active Support Notifications. In order to subscribe to notifications, make sure to require the notifications library:
require "active_support/notifications"
require "kafka"
The notifications are namespaced based on their origin, with separate namespaces for the producer and the consumer.
In order to receive notifications you can either subscribe to individual notification names or use regular expressions to subscribe to entire namespaces. This example will subscribe to all notifications sent by ruby-kafka:
ActiveSupport::Notifications.subscribe(/.*\.kafka$/) do |*args|
event = ActiveSupport::Notifications::Event.new(*args)
puts "Received notification `#{event.name}` with payload: #{event.payload.inspect}"
end
All notification events have the client_id
key in the payload, referring to the Kafka client id.
produce_message.producer.kafka
is sent whenever a message is produced to a buffer. It includes the following payload:
value
is the message value.key
is the message key.topic
is the topic that the message was produced to.buffer_size
is the size of the producer buffer after adding the message.max_buffer_size
is the maximum size of the producer buffer.deliver_messages.producer.kafka
is sent whenever a producer attempts to deliver its buffered messages to the Kafka brokers. It includes the following payload:
attempts
is the number of times delivery was attempted.message_count
is the number of messages for which delivery was attempted.delivered_message_count
is the number of messages that were acknowledged by the brokers - if this number is smaller than message_count
not all messages were successfully delivered.All notifications have group_id
in the payload, referring to the Kafka consumer group id.
process_message.consumer.kafka
is sent whenever a message is processed by a consumer. It includes the following payload:
value
is the message value.key
is the message key.topic
is the topic that the message was consumed from.partition
is the topic partition that the message was consumed from.offset
is the message's offset within the topic partition.offset_lag
is the number of messages within the topic partition that have not yet been consumed.start_process_message.consumer.kafka
is sent before process_message.consumer.kafka
, and contains the same payload. It is delivered before the message is processed, rather than after.
process_batch.consumer.kafka
is sent whenever a message batch is processed by a consumer. It includes the following payload:
message_count
is the number of messages in the batch.topic
is the topic that the message batch was consumed from.partition
is the topic partition that the message batch was consumed from.highwater_mark_offset
is the message batch's highest offset within the topic partition.offset_lag
is the number of messages within the topic partition that have not yet been consumed.start_process_batch.consumer.kafka
is sent before process_batch.consumer.kafka
, and contains the same payload. It is delivered before the batch is processed, rather than after.
join_group.consumer.kafka
is sent whenever a consumer joins a consumer group. It includes the following payload:
group_id
is the consumer group id.sync_group.consumer.kafka
is sent whenever a consumer is assigned topic partitions within a consumer group. It includes the following payload:
group_id
is the consumer group id.leave_group.consumer.kafka
is sent whenever a consumer leaves a consumer group. It includes the following payload:
group_id
is the consumer group id.seek.consumer.kafka
is sent when a consumer first seeks to an offset. It includes the following payload:
group_id
is the consumer group id.topic
is the topic we are seeking in.partition
is the partition we are seeking in.offset
is the offset we have seeked to.heartbeat.consumer.kafka
is sent when a consumer group completes a heartbeat. It includes the following payload:
group_id
is the consumer group id.topic_partitions
is a hash of { topic_name => array of assigned partition IDs }request.connection.kafka
is sent whenever a network request is sent to a Kafka broker. It includes the following payload:api
is the name of the API that was called, e.g. produce
or fetch
.request_size
is the number of bytes in the request.response_size
is the number of bytes in the response.It is highly recommended that you monitor your Kafka client applications in production. Typical problems you'll see are:
You can quite easily build monitoring on top of the provided instrumentation hooks. In order to further help with monitoring, a prebuilt Statsd and Datadog reporter is included with ruby-kafka.
We recommend monitoring the following:
The Statsd reporter is automatically enabled when the kafka/statsd
library is required. You can optionally change the configuration.
require "kafka/statsd"
# Default is "ruby_kafka".
Kafka::Statsd.namespace = "custom-namespace"
# Default is "127.0.0.1".
Kafka::Statsd.host = "statsd.something.com"
# Default is 8125.
Kafka::Statsd.port = 1234
The Datadog reporter is automatically enabled when the kafka/datadog
library is required. You can optionally change the configuration.
# This enables the reporter:
require "kafka/datadog"
# Default is "ruby_kafka".
Kafka::Datadog.namespace = "custom-namespace"
# Default is "127.0.0.1".
Kafka::Datadog.host = "statsd.something.com"
# Default is 8125.
Kafka::Datadog.port = 1234
It's important to understand how timeouts work if you have a latency sensitive application. This library allows configuring timeouts on different levels:
Network timeouts apply to network connections to individual Kafka brokers. There are two config keys here, each passed to Kafka.new
:
connect_timeout
sets the number of seconds to wait while connecting to a broker for the first time. When ruby-kafka initializes, it needs to connect to at least one host in seed_brokers
in order to discover the Kafka cluster. Each host is tried until there's one that works. Usually that means the first one, but if your entire cluster is down, or there's a network partition, you could wait up to n * connect_timeout
seconds, where n
is the number of seed brokers.socket_timeout
sets the number of seconds to wait when reading from or writing to a socket connection to a broker. After this timeout expires the connection will be killed. Note that some Kafka operations are by definition long-running, such as waiting for new messages to arrive in a partition, so don't set this value too low. When configuring timeouts relating to specific Kafka operations, make sure to make them shorter than this one.Producer timeouts can be configured when calling #producer
on a client instance:
ack_timeout
is a timeout executed by a broker when the client is sending messages to it. It defines the number of seconds the broker should wait for replicas to acknowledge the write before responding to the client with an error. As such, it relates to the required_acks
setting. It should be set lower than socket_timeout
.retry_backoff
configures the number of seconds to wait after a failed attempt to send messages to a Kafka broker before retrying. The max_retries
setting defines the maximum number of retries to attempt, and so the total duration could be up to max_retries * retry_backoff
seconds. The timeout can be arbitrarily long, and shouldn't be too short: if a broker goes down its partitions will be handed off to another broker, and that can take tens of seconds.When sending many messages, it's likely that the client needs to send some messages to each broker in the cluster. Given n
brokers in the cluster, the total wait time when calling Kafka::Producer#deliver_messages
can be up to
n * (connect_timeout + socket_timeout + retry_backoff) * max_retries
Make sure your application can survive being blocked for so long.
By default, communication between Kafka clients and brokers is unencrypted and unauthenticated. Kafka 0.9 added optional support for encryption and client authentication and authorization. There are two layers of security made possible by this:
Encryption of Communication
By enabling SSL encryption you can have some confidence that messages can be sent to Kafka over an untrusted network without being intercepted.
In this case you just need to pass a valid CA certificate as a string when configuring your Kafka
client:
kafka = Kafka.new(["kafka1:9092"], ssl_ca_cert: File.read('my_ca_cert.pem'))
Without passing the CA certificate to the client it would be impossible to protect against man-in-the-middle attacks.
Using your system's CA cert store
If you want to use the CA certs from your system's default certificate store, you can use:
kafka = Kafka.new(["kafka1:9092"], ssl_ca_certs_from_system: true)
This configures the store to look up CA certificates from the system default certificate store on an as needed basis. The location of the store can usually be determined by: OpenSSL::X509::DEFAULT_CERT_FILE
Client Authentication
In order to authenticate the client to the cluster, you need to pass in a certificate and key created for the client and trusted by the brokers.
NOTE: You can disable hostname validation by passing ssl_verify_hostname: false
.
kafka = Kafka.new(
["kafka1:9092"],
ssl_ca_cert: File.read('my_ca_cert.pem'),
ssl_client_cert: File.read('my_client_cert.pem'),
ssl_client_cert_key: File.read('my_client_cert_key.pem'),
ssl_client_cert_key_password: 'my_client_cert_key_password',
ssl_verify_hostname: false,
# ...
)
Once client authentication is set up, it is possible to configure the Kafka cluster to authorize client requests.
Using JKS Certificates
Typically, Kafka certificates come in the JKS format, which isn't supported by ruby-kafka. There's a wiki page that describes how to generate valid X509 certificates from JKS certificates.
Kafka has support for using SASL to authenticate clients. Currently GSSAPI, SCRAM and PLAIN mechanisms are supported by ruby-kafka.
NOTE: With SASL for authentication, it is highly recommended to use SSL encryption. The default behavior of ruby-kafka enforces you to use SSL and you need to configure SSL encryption by passing ssl_ca_cert
or enabling ssl_ca_certs_from_system
. However, this strict SSL mode check can be disabled by setting sasl_over_ssl
to false
while initializing the client.
GSSAPI
In order to authenticate using GSSAPI, set your principal and optionally your keytab when initializing the Kafka client:
kafka = Kafka.new(
["kafka1:9092"],
sasl_gssapi_principal: 'kafka/kafka.example.com@EXAMPLE.COM',
sasl_gssapi_keytab: '/etc/keytabs/kafka.keytab',
# ...
)
AWS MSK (IAM)
In order to authenticate using IAM w/ an AWS MSK cluster, set your access key, secret key, and region when initializing the Kafka client:
k = Kafka.new(
["kafka1:9092"],
sasl_aws_msk_iam_access_key_id: 'iam_access_key',
sasl_aws_msk_iam_secret_key_id: 'iam_secret_key',
sasl_aws_msk_iam_aws_region: 'us-west-2',
ssl_ca_certs_from_system: true,
# ...
)
PLAIN
In order to authenticate using PLAIN, you must set your username and password when initializing the Kafka client:
kafka = Kafka.new(
["kafka1:9092"],
ssl_ca_cert: File.read('/etc/openssl/cert.pem'),
sasl_plain_username: 'username',
sasl_plain_password: 'password'
# ...
)
SCRAM
Since 0.11 kafka supports SCRAM.
kafka = Kafka.new(
["kafka1:9092"],
sasl_scram_username: 'username',
sasl_scram_password: 'password',
sasl_scram_mechanism: 'sha256',
# ...
)
OAUTHBEARER
This mechanism is supported in kafka >= 2.0.0 as of KIP-255
In order to authenticate using OAUTHBEARER, you must set the client with an instance of a class that implements a token
method (the interface is described in Kafka::Sasl::OAuth) which returns an ID/Access token.
Optionally, the client may implement an extensions
method that returns a map of key-value pairs. These can be sent with the SASL/OAUTHBEARER initial client response. This is only supported in kafka >= 2.1.0.
class TokenProvider
def token
"some_id_token"
end
end
# ...
client = Kafka.new(
["kafka1:9092"],
sasl_oauth_token_provider: TokenProvider.new
)
In addition to producing and consuming messages, ruby-kafka supports managing Kafka topics and their configurations. See the Kafka documentation for a full list of topic configuration keys.
Return an array of topic names.
kafka = Kafka.new(["kafka:9092"])
kafka.topics
# => ["topic1", "topic2", "topic3"]
kafka = Kafka.new(["kafka:9092"])
kafka.create_topic("topic")
By default, the new topic has 1 partition, replication factor 1 and default configs from the brokers. Those configurations are customizable:
kafka = Kafka.new(["kafka:9092"])
kafka.create_topic("topic",
num_partitions: 3,
replication_factor: 2,
config: {
"max.message.bytes" => 100000
}
)
After a topic is created, you can increase the number of partitions for the topic. The new number of partitions must be greater than the current one.
kafka = Kafka.new(["kafka:9092"])
kafka.create_partitions_for("topic", num_partitions: 10)
kafka = Kafka.new(["kafka:9092"])
kafka.describe_topic("topic", ["max.message.bytes", "retention.ms"])
# => {"max.message.bytes"=>"100000", "retention.ms"=>"604800000"}
Update the topic configurations.
NOTE: This feature is for advanced usage. Only use this if you know what you're doing.
kafka = Kafka.new(["kafka:9092"])
kafka.alter_topic("topic", "max.message.bytes" => 100000, "retention.ms" => 604800000)
kafka = Kafka.new(["kafka:9092"])
kafka.delete_topic("topic")
After a topic is marked as deleted, Kafka only hides it from clients. It would take a while before a topic is completely deleted.
The library has been designed as a layered system, with each layer having a clear responsibility:
Kafka::Connection
for more details.Kafka::Protocol
for more details.Kafka::Cluster
, which represents an entire cluster, while simpler ones are only available through Kafka::Broker
, which represents a single Kafka broker. In general, Kafka::Cluster
is the high-level API, with more polish.Kafka::Consumer
while the Producer API is implemented in Kafka::Producer
and Kafka::AsyncProducer
.Kafka::Client
implements the public APIs. For convenience, the method Kafka.new
can instantiate the class for you.Note that only the API and configuration layers have any backwards compatibility guarantees – the other layers are considered internal and may change without warning. Don't use them directly.
The producer is designed with resilience and operational ease of use in mind, sometimes at the cost of raw performance. For instance, the operation is heavily instrumented, allowing operators to monitor the producer at a very granular level.
The producer has two main internal data structures: a list of pending messages and a message buffer. When the user calls Kafka::Producer#produce
, a message is appended to the pending message list, but no network communication takes place. This means that the call site does not have to handle the broad range of errors that can happen at the network or protocol level. Instead, those errors will only happen once Kafka::Producer#deliver_messages
is called. This method will go through the pending messages one by one, making sure they're assigned a partition. This may fail for some messages, as it could require knowing the current configuration for the message's topic, necessitating API calls to Kafka. Messages that cannot be assigned a partition are kept in the list, while the others are written into the message buffer. The producer then figures out which topic partitions are led by which Kafka brokers so that messages can be sent to the right place – in Kafka, it is the responsibility of the client to do this routing. A separate produce API request will be sent to each broker; the response will be inspected; and messages that were acknowledged by the broker will be removed from the message buffer. Any messages that were not acknowledged will be kept in the buffer.
If there are any messages left in either the pending message list or the message buffer after this operation, Kafka::DeliveryFailed
will be raised. This exception must be rescued and handled by the user, possibly by calling #deliver_messages
at a later time.
The synchronous producer allows the user fine-grained control over when network activity and the possible errors arising from that will take place, but it requires the user to handle the errors nonetheless. The async producer provides a more hands-off approach that trades off control for ease of use and resilience.
Instead of writing directly into the pending message list, Kafka::AsyncProducer
writes the message to an internal thread-safe queue, returning immediately. A background thread reads messages off the queue and passes them to a synchronous producer.
Rather than triggering message deliveries directly, users of the async producer will typically set up automatic triggers, such as a timer.
The Consumer API is designed for flexibility and stability. The first is accomplished by not dictating any high-level object model, instead opting for a simple loop-based approach. The second is accomplished by handling group membership, heartbeats, and checkpointing automatically. Messages are marked as processed as soon as they've been successfully yielded to the user-supplied processing block, minimizing the cost of processing errors.
After checking out the repo, run bin/setup
to install dependencies. Then, run rake spec
to run the tests. You can also run bin/console
for an interactive prompt that will allow you to experiment.
Note: the specs require a working Docker instance, but should work out of the box if you have Docker installed. Please create an issue if that's not the case.
If you would like to contribute to ruby-kafka, please join our Slack team and ask how best to do it.
If you've discovered a bug, please file a Github issue, and make sure to include all the relevant information, including the version of ruby-kafka and Kafka that you're using.
If you have other questions, or would like to discuss best practises, how to contribute to the project, or any other ruby-kafka related topic, join our Slack team!
Version 0.4 will be the last minor release with support for the Kafka 0.9 protocol. It is recommended that you pin your dependency on ruby-kafka to ~> 0.4.0
in order to receive bugfixes and security updates. New features will only target version 0.5 and up, which will be incompatible with the Kafka 0.9 protocol.
Last stable release with support for the Kafka 0.9 protocol. Bug and security fixes will be released in patch updates.
Latest stable release, with native support for the Kafka 0.10 protocol and eventually newer protocol versions. Kafka 0.9 is no longer supported by this release series.
Currently, there are three actively developed frameworks based on ruby-kafka, that provide higher level API that can be used to work with Kafka messages and two libraries for publishing messages.
Racecar - A simple framework that integrates with Ruby on Rails to provide a seamless way to write, test, configure, and run Kafka consumers. It comes with sensible defaults and conventions.
Karafka - Framework used to simplify Apache Kafka based Ruby and Rails applications development. Karafka provides higher abstraction layers, including Capistrano, Docker and Heroku support.
Phobos - Micro framework and library for applications dealing with Apache Kafka. It wraps common behaviors needed by consumers and producers in an easy and convenient API.
DeliveryBoy – A library that integrates with Ruby on Rails, making it easy to publish Kafka messages from any Rails application.
WaterDrop – A library for Ruby and Ruby on Rails applications, to easy publish Kafka messages in both sync and async way.
There are a few existing Kafka clients in Ruby:
We needed a robust client that could be used from our existing Ruby apps, allowed our Ops to monitor operation, and provided flexible error handling. There didn't exist such a client, hence this project.
Bug reports and pull requests are welcome on GitHub at https://github.com/zendesk/ruby-kafka.
Author: Zendesk
Source Code: https://github.com/zendesk/ruby-kafka
License: Apache-2.0 license
1598289240
In this video, You will learn how to create Kafka Producer and Consumer with Spring Boot in Java.
You can start the Zookeeper and Kafka servers by using the below commands.
Start Zookeeper:
$ zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
Start Kafka server:
$ kafka-server-start /usr/local/etc/kafka/server.properties
GitHub Link: https://github.com/shameed1910/springboot-kafka
Subscribe : https://www.youtube.com/channel/UC3hc8Ql0_OzMY7XI1gH9g8g
#spring #spring-boot #kafka
1604402755
#spring-framework #spring #kafka