
millions of topics, no ZooKeeper, etc. Kafka is addressing these shortcomings on its roadmap.



The Pulsar documentation says it requires Zookeeper: https://pulsar.apache.org/docs/en/administration-zk-bk/


Oh sorry, I meant that storing topic info in ZooKeeper is what limits Kafka to a certain number of topics.


Nope, Pulsar also stores topic metadata in ZK - it's not exactly going to store that in the (near-)stateless brokers, or in BookKeeper - and BK also relies on ZK, though it's common to reuse the same ZK quorum between the brokers and the bookies.

It also needs an additional ZK for cluster replication.


For a lot of projects this is hardly a problem. On the other hand Kafka is more mature and has a huge ecosystem (Kafka Connect, Kafka Streams, KSQL, ...).


Kafka also has fewer moving parts, even today before the ZooKeeper removal is complete (2 vs. Pulsar's 3).


But one of those moving parts of Pulsar, BookKeeper, means that you're no longer storing data on message brokers. Worth the extra puzzle piece for a lot of use cases.


Pulsar is less mature but does provide functional equivalents to all of the above: Pulsar IO (Kafka Connect), Pulsar SQL (KSQL), Pulsar Functions (Kafka Streams).


Nah, Pulsar Functions is nowhere near Kafka Streams - it's more like AWS Lambda.

Example off the top of my head: in Pulsar Functions you can't write the equivalent of "aggregate this stream across a 10-minute window and emit the results on window close".
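
For anyone who hasn't used it, that's roughly this in the Kafka Streams DSL - the topic names, the count aggregation, and the grace period below are placeholders I made up for the sketch, not anything from upthread:

    import java.time.Duration;
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.*;

    public class WindowedCounts {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Count events per key over 10-minute windows and only emit a final
            // result once each window closes (plus a short grace period).
            builder.stream("events", Consumed.with(Serdes.String(), Serdes.String())) // hypothetical input topic
                .groupByKey()
                .windowedBy(TimeWindows.of(Duration.ofMinutes(10)).grace(Duration.ofSeconds(30)))
                .count()
                .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
                .toStream((windowedKey, count) -> windowedKey.key() + "@" + windowedKey.window().start())
                .to("counts-10min", Produced.with(Serdes.String(), Serdes.Long()));  // hypothetical output topic

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-counts");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            new KafkaStreams(builder.build(), props).start();
        }
    }

suppress() is the bit that gives you "only emit on window close"; without it Streams emits every intermediate update for the window.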

But that's fine, it doesn't need to be like Kafka Streams, you can use Flink or Spark or Storm etc. to fill the same niche. In fact one of the founders of StreamNative (Pulsar's equivalent of Confluent) is a core committer on Flink.


Kafka is not more mature, just more hyped. I just wish Aphyr's Jepsen tests would also cover more scenarios like:

- what happens to your data if X+1 servers permanently fail in a cluster with a replication factor of X
- what happens if a single partition's data size or request rate becomes 90% of the cluster capacity
- what happens in a multi-tenancy scenario to other users' throughput and latency when one user tries to use all the capacity of the cluster
- ...


It's way more mature. I just spent a week evaluating Pulsar vs Kafka for a client and the fact Kafka has been open sourced for 10 years vs. Pulsar's 1.5 really shows in documentation, community support etc.

> what happens to your data if X+1 servers permanently fail in a cluster with a replication factor of X

It depends on how many in-sync replica sets existed entirely within those X+1 servers. Their partitions will go offline, other ISRs will have under-replicated partitions, and the alerting you've set up as a good engineer will have told you this was happening.
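
(If you don't have a metrics stack handy, that same check is easy to script against the AdminClient - topic name and bootstrap address below are made up:)

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class PartitionHealthCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin.describeTopics(Collections.singletonList("events"))
                    .all().get().get("events");
                for (TopicPartitionInfo p : desc.partitions()) {
                    if (p.leader() == null) {
                        // no leader elected -> partition is offline
                        System.out.println("partition " + p.partition() + " is OFFLINE");
                    } else if (p.isr().size() < p.replicas().size()) {
                        System.out.println("partition " + p.partition() + " is under-replicated");
                    }
                }
            }
        }
    }

In practice you'd alert on the UnderReplicatedPartitions and OfflinePartitionsCount broker metrics rather than polling like this.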

> what happens in a multi-tenancy scenario to other users' throughput and latency when one user tries to use all the capacity of the cluster

Nothing, because you're using ACLs and have configured quotas appropriately.

Bad things otherwise.
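
For the quota part, this is roughly what it looks like with the newer AdminClient quota API (client id and byte rates are invented for the sketch, and I'm going from memory on the classes; on older clusters you'd do the same thing with kafka-configs.sh):

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.common.quota.ClientQuotaAlteration;
    import org.apache.kafka.common.quota.ClientQuotaEntity;

    public class TenantQuotas {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Cap a single client id so one tenant can't saturate the cluster.
                ClientQuotaEntity tenant = new ClientQuotaEntity(
                    Collections.singletonMap(ClientQuotaEntity.CLIENT_ID, "tenant-a"));
                ClientQuotaAlteration caps = new ClientQuotaAlteration(tenant, Arrays.asList(
                    new ClientQuotaAlteration.Op("producer_byte_rate", 10_485_760.0),    // ~10 MB/s in
                    new ClientQuotaAlteration.Op("consumer_byte_rate", 20_971_520.0)));  // ~20 MB/s out
                admin.alterClientQuotas(Collections.singletonList(caps)).all().get();
            }
        }
    }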

PS, also been running Kafka since 0.8.


If the replication factor is 3 and 3 servers go down in the span of 1 or 2 hours, no alert will save you.


Yes, but this is true of any system offering N - 1 safety, e.g., HDFS, Vertica, Pulsar. It's not specific to Kafka.

And you can switch to your warm replicated cluster in this scenario, if you have one; MirrorMaker 2 supports replicated offsets, so consumers can switch without losing state.
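
The offset part of that failover looks roughly like this with the MM2 client utils - cluster alias, group id and bootstrap address are made up for the sketch, and from memory the translation relies on MM2's checkpoints topic already being replicated to the backup cluster:

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.connect.mirror.RemoteClusterUtils;

    public class Failover {
        public static void main(String[] args) throws Exception {
            // Connection details for the *backup* cluster we're failing over to.
            Map<String, Object> backup = new HashMap<>();
            backup.put("bootstrap.servers", "backup-kafka:9092");

            // Translate the group's committed offsets from the primary cluster
            // ("primary" is the MM2 cluster alias) into offsets on the backup.
            Map<TopicPartition, OffsetAndMetadata> offsets = RemoteClusterUtils.translateOffsets(
                backup, "primary", "my-consumer-group", Duration.ofSeconds(30));

            // Seed the group on the backup cluster with these offsets (e.g. via
            // AdminClient.alterConsumerGroupOffsets) and point consumers at it.
            offsets.forEach((tp, om) -> System.out.println(tp + " -> " + om.offset()));
        }
    }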

But what you're describing is going to shaft any replicated system.


Not true for HDFS, Cassandra, Pulsar, and most distributed file systems.

As soon as a segment is under-replicated, its replication factor is restored in less than 2 minutes by selecting a new machine as a replica.

Kafka tries to do this with Cruise Control, but adding a replica to the in-sync replica list takes several hours if partitions are 300GB and servers are already busy handling regular live traffic.


> adding a replica to the in-sync replica list takes several hours if partitions are 300GB

I'd be curious to hear more about this, because I run several topics with similar partition sizes, and haven't encountered several hours for one replica, and I've routinely shifted 350GB partition replicas as part of routine maintenance.

I have encountered 2 hours to restore a broker that was shut down improperly, but yeah, assuming your replica fetchers aren't throttled to shit and your brokers aren't overloaded (what's the request handler avg idle? 20% or lower means it's time to add another broker; 10% means add one right now), that's really extreme.
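
(For what it's worth, "shifting a replica" just means a partition reassignment. A minimal sketch with the 2.4+ AdminClient API - topic, partition and broker ids are made up, and in real life you'd usually go through kafka-reassign-partitions.sh or Cruise Control with a replication throttle set:)

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Optional;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewPartitionReassignment;
    import org.apache.kafka.common.TopicPartition;

    public class MoveReplica {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Move events-7 from brokers [1, 2, 3] to [1, 2, 4]: broker 4 has to
                // fetch the full partition (e.g. ~300GB) before it joins the ISR.
                TopicPartition tp = new TopicPartition("events", 7);
                NewPartitionReassignment target =
                    new NewPartitionReassignment(Arrays.asList(1, 2, 4));

                admin.alterPartitionReassignments(
                    Collections.singletonMap(tp, Optional.of(target))).all().get();
            }
        }
    }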



