KafkaHQ (github.com/tchiotludo)
219 points by rickette on Dec 30, 2019 | 33 comments


If you're considering Kafka, also check out Pulsar (https://pulsar.apache.org/), along with its dashboard: https://github.com/apache/pulsar-manager


Why should I consider Pulsar instead of Kafka?


It has all the bits missing from Kafka that people want and are now trying to bolt onto Kafka:

Tiered storage, geo-replication, multi-tenancy, massive partition counts, and lower latency in benchmarks.

Really the question should be why Kafka over Pulsar, and the only real answer is that it's been around longer and Confluent are very aggressive with their marketing. As an example, you might often hear "why Kafka is more ACID than your database" from Confluent, or "Kafka does exactly once" when it's really effectively-once via idempotent operations.


Separates storage from brokers for better scaling and performance. Millions of topics without a problem and built-in tenant/namespace/topic hierarchy. Kubernetes-native. Per-message acknowledgement instead of just an offset. Ephemeral pub/sub or persistent data. Built-in functions/lambda platform. Long-term/tiered storage into S3/object storage. Geo-replication across clusters.
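A minimal sketch of the hierarchy and per-message acks with the Python client (not from this thread; broker URL, tenant/namespace/topic names and payloads are all placeholders):

    # Assumes a local broker and the `pulsar-client` package.
    import pulsar

    client = pulsar.Client('pulsar://localhost:6650')

    # Topics live under a tenant/namespace hierarchy; "persistent://" means
    # durable storage, "non-persistent://" gives ephemeral pub/sub.
    topic = 'persistent://my-tenant/my-namespace/orders'

    consumer = client.subscribe(topic, subscription_name='order-workers')
    producer = client.create_producer(topic)
    producer.send(b'order-123')

    msg = consumer.receive()
    try:
        print(msg.data())
        consumer.acknowledge(msg)           # ack this one message, not an offset
    except Exception:
        consumer.negative_acknowledge(msg)  # redeliver just this message later

    client.close()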


I'd looked into Kafka before and really liked partitioned topics and the ability to specify partition keys, which ensures only one consumer handles all messages for a given key. Is there a way to do this in Pulsar? I see they have partitioned topics, but it seems to be only for performance reasons; I don't see how you'd set a partition key or make sure a specific key always goes to the same consumer.

EDIT: Also wanted to note that, for Python, there is some support for async clients in Kafka (aiokafka, asynckafka) but I don't see any async libs for Pulsar yet. Maybe I'm missing something.
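For reference, the Kafka behaviour described above looks roughly like this with aiokafka (broker address, topic and key are made up; messages sharing a key hash to the same partition, so one consumer sees them all):

    import asyncio
    from aiokafka import AIOKafkaProducer

    async def main():
        producer = AIOKafkaProducer(bootstrap_servers='localhost:9092')
        await producer.start()
        try:
            # Same key -> same partition -> same (single) consumer of that partition.
            await producer.send_and_wait('orders', b'payload-1', key=b'customer-42')
            await producer.send_and_wait('orders', b'payload-2', key=b'customer-42')
        finally:
            await producer.stop()

    asyncio.run(main())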


Yes, you can tag messages with a 'key' for compaction (maintaining the latest message for a given key) and there's now a beta feature for key-ordered consumers on the same topic: https://pulsar.apache.org/docs/en/concepts-messaging/#key_sh...

You can also just use separate topics since they're very lightweight and can be nested under the tenant/namespace hierarchy.
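A rough sketch of both with the Python client (assumes pulsar-client >= 2.4 for the beta Key_Shared subscription; broker URL and names are placeholders):

    import pulsar

    client = pulsar.Client('pulsar://localhost:6650')
    topic = 'persistent://public/default/orders'

    # Several consumers can share one Key_Shared subscription; each key is
    # delivered to exactly one of them at a time.
    consumer = client.subscribe(
        topic,
        subscription_name='order-workers',
        consumer_type=pulsar.ConsumerType.KeyShared,
    )

    producer = client.create_producer(topic)
    # partition_key tags the message; compaction keeps the latest value per key.
    producer.send(b'payload-1', partition_key='customer-42')

    msg = consumer.receive()
    print(msg.partition_key(), msg.data())
    consumer.acknowledge(msg)
    client.close()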


Wow, yeah that does sound way better than Kafka, and very useful for what I'm doing. Thanks.


How does it compare to Liftbridge, besides being a lot heavier weight?

https://github.com/liftbridge-io/liftbridge

https://liftbridge.io/


In Pulsar, storage can be added at runtime. Storage is decoupled from ingest. That’s what makes Pulsar unique.


To be fair, there's a bunch of prior art here: hasn't FB's Scribe been doing this since 2008, for example?


That is a really neat operational benefit. Thanks!


If this is the plugging subthread: gazette is another project in this space which plays particularly well with batch workflows (https://gazette.dev). Its streaming consumer framework is Go-only today, however.


It is interesting that you made the decision to commit to all replicas of a journal before acknowledging a write and rely on etcd for explicit failure monitoring and re-configuration. FoundationDB also does this, relying on an internal metadata store called the "coordinators", which store some small configuration state.

Have you ever found the 20s failure detection window to be too long? FoundationDB's failure monitor pings hosts approximately once a second, so failures are detected very quickly. If you were willing to dedicate a process in the cluster to failure monitoring, you might be able to shrink that window.


What you describe is how etcd leases work, just inverted: clients instead check in with the etcd leader at prescribed intervals or face lease revocation. The leader does some additional clever things, like scaling check-in intervals to avoid DoSing itself.

20s is configurable per-client, and gazette lifts this default from etcd. This one number is trying to summarize the tension and trade-off inherent in the (famously hard) problem of distributed failure detection. Network flakes or partitions of a few seconds are all too common and you don't necessarily want to assume failure if one occurs (particularly given the compounding effects of resend timers, etc).

In practice, no, it's not been a problem since clients typically buffer writes anyway, in order to batch many small writes into few larger ones.
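To illustrate the lease mechanism being described (gazette's own client is Go; this is just the same idea sketched with the python-etcd3 package, with a made-up key name and the 20s TTL discussed above):

    import time
    import etcd3

    client = etcd3.client(host='localhost', port=2379)

    lease = client.lease(ttl=20)                  # revoked if not refreshed in time
    client.put('/members/broker-1', 'alive', lease=lease)

    try:
        while True:
            lease.refresh()                       # the periodic "check-in"
            time.sleep(5)                         # comfortably under the TTL
    except KeyboardInterrupt:
        lease.revoke()                            # key disappears immediately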


I didn’t mean to imply they were different approaches (they aren’t really), only that you could offload the job from etcd to somewhere else to rely on it less, which would hopefully help scalability or allow you to lower the ping time.

And yes, FDB separates the act of detecting a failure from reacting to a detected failure for that reason. In FDB’s case, a 20s network hiccup on a specific process could mean it is better to initiate a recovery anyway. I think for your use case that trade off makes sense as there are lots of journals, whereas FDB only runs a single database for clients to access.


Interesting. How does the system recover from corrupt "journals"? Do you have to rely on e.g. the blob store's data-recovery function to do a restore... is that even possible? (since the journal sounds like it's immutable)


By "corruption" do you mean "whoopsie, I wrote bad data"?

Same way as always - there's some kind of delimiter in the message serialization, and you can "re-sync" by finding that delimiter and hoping to continue reading valid messages. For example, a bad CSV stream can be recovered by finding the next newline and hoping you read correct CSV again.

The problem is just moved to the client library which is layering a notion of messages atop journals, and is not a direct broker concern.
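As a toy illustration of that re-sync idea (pure sketch; the file name is made up, and real message framing would typically be binary rather than CSV):

    import csv
    import io

    def recover_rows(stream: io.BufferedReader):
        """Yield CSV rows, skipping any line that fails to decode or parse."""
        for raw in stream:                      # file iteration re-syncs at each newline
            try:
                text = raw.decode('utf-8')
                yield next(csv.reader([text]))  # parse one line as CSV
            except (UnicodeDecodeError, csv.Error, StopIteration):
                continue                        # corrupt span: skip to the next newline

    with open('journal-fragment.csv', 'rb') as f:
        for row in recover_rows(f):
            print(row)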


Yep, thanks. Reading more about it, the "immutability" of the append log seems like a convention rather than something that is enforced programmatically (i.e. you can always write over the persisted fragments since they're objects in blob storage), which is what I was concerned about.


Just looked at the GitHub page, and I'm surprised by the small number of contributors (for a project that's been in production at a big company for years).

I find it even more worrying knowing that this project has a lot of very advanced features that look like they’d require a lot of people to maintain...


Don't judge a book by its cover. Pulsar is built on top of another Apache project: BookKeeper. You can use BookKeeper directly.


Ha, indeed that's an interesting thing to know, and it explains the situation much better. I think they should emphasize that in the Pulsar documentation (I read "BookKeeper" in their docs but didn't figure out that it was an independent Apache project).


We're using AWS's managed Kafka (MSK), which doesn't have a dashboard, so we're using KafkaHQ. I wouldn't say it's perfect, but it does the job pretty well. In retrospect, I think we should have gone with Confluent Cloud, which has all this functionality plus a schema registry, KSQL, and Kafka Connect built in. It's also pay-as-you-go, instead of >500 euro per month.


https://operatr.io is also expanding into this space.


We are currently using Kafka Manager (https://github.com/yahoo/kafka-manager). It seems KafkaHQ has the same features (it's not clear from the docs whether you can actually manage the partition arrangement or only view it), plus the ability to view the actual messages streaming through Kafka, which could be very useful.


Viewing messages inside Kafka was my main goal when building KafkaHQ! Unfortunately, you can't reassign partitions for now, but I may be able to add that feature when this KIP is ready: https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A... (targeted for Kafka 2.4, but I don't think it has been released yet).


You can only set the number of partitions when you create a topic in KafkaHQ, not after the fact. You can view messages, even if they are Avro-encoded and backed by a schema registry.


SMM is another option and has some really cool features for alerting as well. It runs in any cloud or on-premises.

https://blog.cloudera.com/smm-1-2-released-with-powerful-new...


How does this compare to Kafdrop which hit the HN front page a couple weeks ago? Looks like Kafdrop may be read-only?

https://news.ycombinator.com/item?id=21788863


I use Kafka with SSL/SASL on K8s, and last time I checked it wasn't easy or fast to get any of these dashboards working with it.

I think I will try again; maybe it's easier now.


I confirm, this works! KafkaHQ uses the official Kafka Java client under the hood, so you can connect to any cluster with any kind of security that client supports.


Security is almost never easy, but KafkaHQ works fine with TLS and SASL in my experience.
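For example (not KafkaHQ's own config file, but the same underlying Kafka client settings it forwards; all values here are placeholders), a kafka-python consumer over SASL_SSL looks like:

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        'my-topic',
        bootstrap_servers='broker.example.com:9094',
        security_protocol='SASL_SSL',
        sasl_mechanism='SCRAM-SHA-512',
        sasl_plain_username='user',
        sasl_plain_password='secret',
        ssl_cafile='/etc/kafka/ca.pem',
    )
    for record in consumer:
        print(record.value)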


I used Kafka at my last company. One of the places that it failed us was trying to use it as a queue. Don’t do that.


How did it fail you as a queue?

Did you use it successfully as something else?



