Apache Kafka using Keys for Partition

29/12/2020
Chưa phân loại
Apache Kafka is a data streaming platform responsible for streaming data from a number of sources to a lot of targets. The sources are also called producers. The data produced is needed by a completely different group called consumers for various purposes. Kafka is the layer that sits between the producers and consumers and aggregates the data into a usable pipeline. Also Kafka itself is a distributed platform, so the Kafka layer is composed of various servers running a kafka, these servers or nodes are hence known as Kafka Brokers.

That overview is a bit in the abstract so let’s ground it in a real-world scenario, imagine you need to monitor several web servers. Each running its own website, and new logs are constantly being generated in each one of them every second of the day. On top of that there are a number of email servers that you need to monitor as well.

You may need to store that data for record keeping and billing purposes, which is a batch job that doesn’t require immediate attention. You might want to run analytics on the data to make decisions in real-time which requires accurate and immediate input of data.  Suddenly you find yourself in the need for streamlining the data in a sensible way for all the various needs. Kafka acts as that layer of abstraction to which multiple sources can publish different streams of data and a given consumer can subscribe to the streams it finds relevant. Kafka will make sure that the data is well-ordered. It is the internals of Kafka that we need to understand before we get to the topic of Partitioning and Keys.

Kafka Topics, Broker and Partitions

Kafka Topics are like tables of a database. Each topic consists of data from a particular source of a particular type. For example, your cluster’s health can be a topic consisting of CPU and memory utilization information. Similarly, incoming traffic to across the cluster can be another topic.

Kafka is designed to be horizontally scalable. That is to say, a single instance of Kafka consists of multiple Kafka brokers running across multiple nodes, each can handle streams of data parallel to the other. Even if a few of the nodes fail your data pipeline can continue to function. A particular topic can then be split into a number of partitions. This partitioning is one of the crucial factors behind the horizontal scalability of Kafka.

Multiple producers, data sources for a given topic, can write to that topic simultaneously because each writes to a different partition, at any given point. Now, usually data is assigned to a partition randomly, unless we provide it with a key.

Partitioning and Ordering

Just to recap, producers are writing data to a given topic. That topic is actually split into multiple partitions. And each partition lives independently of the others, even for a given topic. This can lead to a lot of confusion when the ordering to data matters. Maybe you need your data in a chronological order but having multiple partitions for your datastream doesn’t guarantee perfect ordering.

You can use only a single partition per topic, but that defeats the whole purpose of Kafka’s distributed architecture. So we need some other solution.

Keys for Partitions

Data from a producer are sent to partitions randomly, as we mentioned before. Messages being the actual chunks of data. What producers can do besides just sending messages is to add a key that goes along with it.

All the messages that come with the specific key will go to the same partition. So, for example, a user’s activity can be tracked chronologically if that user’s data is tagged with a key and so it always end up in one partition. Let’s call this partition p0 and the user u0.

Partition p0 will always pick up the u0 related messages because that key tie them together. But that doesn’t mean that p0 is only tied up with that. It can also take up messages from u1 and u2 if it has the capacity to do so. Similarly, other partitions can consume data from other users.

The point that a given user’s data isn’t spread across different partition ensuring chronological ordering for that user. However, the overall topic of user data, can still leverage the distributed architecture of Apache Kafka.

Conclusion

While distributed systems like Kafka solve some older problems like lack of scalability or having single a point of failure. They come with a set of problems that are unique to their own design. Anticipating these problems is an essential job of any system architect. Not only that, sometimes you really have to do a cost-benefit analysis to determine whether the new problems are a worthy trade-off for getting rid of the older ones. Ordering and synchronization are just the tip of the iceberg.

Hopefully, articles like these and the official documentation can help you along the way.

ONET IDC thành lập vào năm 2012, là công ty chuyên nghiệp tại Việt Nam trong lĩnh vực cung cấp dịch vụ Hosting, VPS, máy chủ vật lý, dịch vụ Firewall Anti DDoS, SSL… Với 10 năm xây dựng và phát triển, ứng dụng nhiều công nghệ hiện đại, ONET IDC đã giúp hàng ngàn khách hàng tin tưởng lựa chọn, mang lại sự ổn định tuyệt đối cho website của khách hàng để thúc đẩy việc kinh doanh đạt được hiệu quả và thành công.
Bài viết liên quan

[Ubuntu] Hướng dẫn cài đặt LAMP Stack trên Ubuntu server 20.04 Develop version

LAMP là một hệ thống các phần mềm để tạo dựng môi trường máy chủ web được viết bằng...
30/12/2020

How to Undo Last Commit in Git

A common workflow in Git is that you make changes to your project, add the changes to the staging area, commit the changes,...
29/12/2020

Check if your Linux System is 32 or 64 Bit?

In this post, we will quickly see how we can check if our Linux based system is either 32-bit or 64-bit. Although, most...
28/12/2020
Bài Viết

Bài Viết Mới Cập Nhật

Reliable IPv4 and IPv6 Subnet Rental Services: The Perfect Solution for Global Businesses
23/12/2024

Tìm Hiểu Về Thuê Proxy US – Lợi Ích và Cách Sử Dụng Hiệu Quả
11/12/2024

Mua Proxy V6 Nuôi Facebook Spam Hiệu Quả Tại Onetcomvn
03/06/2024

Hướng dẫn cách sử dụng ProxyDroid để duyệt web ẩn danh
03/06/2024

Mua proxy Onet uy tín tại Onet.com.vn
03/06/2024