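I saw "Kafka at the low end: how bad can it get? (broot.ca)" on Hacker News; the original article is "Kafka at the low end: how bad can it get?". The author argues that the most serious problem with using Kafka as a job queue is uneven distribution of work when volume is low... That is certainly one of the problems, but I think the bigger one is that Kafka lacks many job queue features, which you have to implement yourself.
The article mentions KIP-932: Queues for Kafka, although its link points to the article "Queues in Apache Kafka®: Enhancing Message Processing and Scalability". So upstream Kafka (btw, a lot of the upstream Kafka people are from Confluent) is already working on a queue mechanism.
On the wiki page, the Motivation section notes that Kafka users habitually over-partition up front: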
Users of Kafka often have to “over-partition” simply to ensure they can have sufficient parallel consumption to cope with peak loads.
But partitions also tend to cause problems such as uneven load, so the idea is to implement the queue concept through share groups:
This is much easier to achieve using a queue rather than a partitioned topic with a consumer group.
This KIP introduces the concept of a share group as a way of enabling cooperative consumption using Kafka topics. It does not add the concept of a “queue” to Kafka per se, but rather that introduces cooperative consumption to accommodate these queuing use-cases using regular Kafka topics. Share groups make this possible. You can think of a share group as roughly equivalent to a “durable shared subscription” in existing systems.
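To make the over-partitioning point concrete, here is a rough sketch of why partition count caps parallelism in a classic consumer group while cooperative consumption does not. Everything in it is hypothetical: the function names are made up for illustration, and the round-robin assignment is a simplification rather than any of Kafka's actual assignors.

```python
def assign_partitions(num_partitions, consumers):
    """Classic consumer group (simplified): each partition has exactly one owner.

    Round-robin here is an assumption for illustration, not a real Kafka assignor.
    """
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        owner = consumers[p % len(consumers)]
        assignment[owner].append(p)
    return assignment


def busy_consumers_classic(num_partitions, consumers):
    """Consumers that own at least one partition; the rest sit idle."""
    assignment = assign_partitions(num_partitions, consumers)
    return [c for c, parts in assignment.items() if parts]


def busy_consumers_shared(num_available_records, consumers):
    """Share-group style (simplified): any consumer may take any available record."""
    return consumers[:min(num_available_records, len(consumers))]


workers = [f"c{i}" for i in range(6)]
classic = busy_consumers_classic(2, workers)   # topic with only 2 partitions
shared = busy_consumers_shared(6, workers)     # 6 records waiting in the topic
print(len(classic), "busy with partition ownership")
print(len(shared), "busy with cooperative consumption")
```

With two partitions and six workers, only two workers get anything to do under partition ownership, which is the reason people create far more partitions than they need on an average day.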
Then in the Proposed Changes section you can see the ADT; this passage alone makes it clear that the design is a lock with a timeout:
When a consumer in a share-group fetches records, it receives available records from any of the topic-partitions that match its subscriptions. Records are acquired for delivery to this consumer with a time-limited acquisition lock. While a record is acquired, it is not available for another consumer. By default, the lock duration is 30s, but it can also be controlled using the group share.record.lock.duration.ms configuration parameter.
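The consumer (worker) then has three explicit operations: declare the record done, give it up, or reject it. The fourth item in the list is the implicit fallback of doing nothing and letting the lock time out: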
- The consumer can acknowledge successful processing of the record
- The consumer can release the record, which makes the record available for another delivery attempt
- The consumer can reject the record, which indicates that the record is unprocessable and does not make the record available for another delivery attempt
- The consumer can do nothing, in which case the lock is automatically released when the lock duration has elapsed
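The lock-with-timeout behavior and the four outcomes above can be sketched as a toy in-memory model. This is only my reading of the quoted text, not how the broker implements it: `ToySharePartition` and its methods are hypothetical names, time is passed in explicitly as `now` so the behavior is easy to follow, and details the KIP also covers (such as limiting the number of delivery attempts) are left out.

```python
import enum


class State(enum.Enum):
    AVAILABLE = "available"
    ACQUIRED = "acquired"
    ACKNOWLEDGED = "acknowledged"
    REJECTED = "rejected"


class ToySharePartition:
    """Hypothetical toy model of per-record, time-limited acquisition locks."""

    def __init__(self, records, lock_duration_s=30.0):
        self.records = list(records)
        self.lock_duration_s = lock_duration_s  # 30s mirrors the quoted default
        self.state = [State.AVAILABLE] * len(self.records)
        self.owner = [None] * len(self.records)
        self.deadline = [0.0] * len(self.records)

    def _expire_locks(self, now):
        # "Do nothing": once the lock duration has elapsed, the record
        # becomes available for another delivery attempt.
        for i, st in enumerate(self.state):
            if st is State.ACQUIRED and now >= self.deadline[i]:
                self.state[i] = State.AVAILABLE
                self.owner[i] = None

    def fetch(self, consumer, now, max_records=1):
        """Acquire up to max_records available records for this consumer."""
        self._expire_locks(now)
        got = []
        for i, st in enumerate(self.state):
            if len(got) >= max_records:
                break
            if st is State.AVAILABLE:
                self.state[i] = State.ACQUIRED
                self.owner[i] = consumer
                self.deadline[i] = now + self.lock_duration_s
                got.append((i, self.records[i]))
        return got

    def _finish(self, consumer, offset, now, new_state):
        self._expire_locks(now)
        if self.state[offset] is not State.ACQUIRED or self.owner[offset] != consumer:
            raise RuntimeError("record is not locked by this consumer")
        self.state[offset] = new_state
        self.owner[offset] = None

    def acknowledge(self, consumer, offset, now):
        self._finish(consumer, offset, now, State.ACKNOWLEDGED)  # done for good

    def release(self, consumer, offset, now):
        self._finish(consumer, offset, now, State.AVAILABLE)  # retry by anyone

    def reject(self, consumer, offset, now):
        self._finish(consumer, offset, now, State.REJECTED)  # never redelivered


part = ToySharePartition(["a", "b", "c"])
first = part.fetch("w1", now=0)          # w1 locks offset 0
part.acknowledge("w1", first[0][0], now=2)
stuck = part.fetch("w2", now=3)          # w2 locks offset 1, then goes silent
retry = part.fetch("w1", now=3 + 30, max_records=2)
print(retry)                             # offset 1 is handed out again after the timeout
```

Note that every state change goes through a per-record lock held on the broker side, which is why this reads like a conventional lock-based work queue layered on top of a topic.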
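So algorithmically this still looks like the old lock-based approach. It is worth watching how it develops, but we probably should not expect any major breakthrough in scalability compared with earlier solutions.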