Open-Sourcing Yahoo's Pulsar, Pub-Sub Messaging at Scale

yarapavan · on Sept 8, 2016

GitHub page: https://github.com/yahoo/pulsar

Pulsar backs major Yahoo applications like Mail, Finance, Sports, Gemini Ads, and Sherpa, Yahoo’s distributed key-value service.

On the scale front:

- Deployed globally, in 10+ data-centers, with full mesh replication capability

- Greater than 100 billion messages/day published

- More than 1.4 million topics

- Average publish latency across the service of less than 5 ms
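To put those figures in perspective, here is a rough back-of-the-envelope calculation. It treats the quoted numbers as lower bounds and assumes traffic is spread evenly across the day and across topics, which real workloads are not, so read the results as averages only.

```python
# Back-of-the-envelope averages from the figures quoted in the post.
# Assumption (not from the post): load is uniform over time and topics.
MESSAGES_PER_DAY = 100_000_000_000  # "greater than 100 billion", used as a lower bound
TOPICS = 1_400_000                  # "more than 1.4 million", used as a lower bound
SECONDS_PER_DAY = 24 * 60 * 60

avg_msgs_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
avg_msgs_per_topic_per_day = MESSAGES_PER_DAY / TOPICS

print(f"{avg_msgs_per_second:,.0f} messages/s on average")
print(f"{avg_msgs_per_topic_per_day:,.0f} messages/topic/day on average")
```

That works out to roughly 1.16 million messages per second overall, but only about 71 thousand messages per topic per day, so the typical topic is fairly quiet and the challenge is the topic count as much as the raw throughput.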

perryh2 · on Sept 8, 2016

I worked at Yahoo but have never heard of "Pulsar" before. Was this known as "CMS" internally?

witchking · on Sept 8, 2016

bimbam1024 · on Sept 8, 2016

jaytaylor · on Sept 8, 2016

I wonder how this compares to Kafka and what tradeoffs were made.

mmerli · on Sept 8, 2016

Pulsar dev here. There are certainly many similarities with what Kafka provides, though we made some different choices, especially for the storage layer.

The design has been heavily influenced by these traits: multi-tenancy, millions of topics, strong durability, geo-replication, low publish latency (even when reading at max IO capacity), maintaining consumer position and, finally, operability (we need to replace machines, shift traffic and increase capacity seamlessly, without any user impact).

davidpelaez · on Sept 8, 2016

Storage aside, how would you compare it to NATS? Is anything else as latency-sensitive as Pulsar, or would you say it leads in this area? We see things like NSQ and NATS and are never sure whether their latency is low enough to respond within an HTTP API call that has a 1 s timeout.

nehanarkhede · on Sept 8, 2016

I'm one of the Kafka authors, so admittedly my view might be slightly biased.

Here is a quick comparison of Kafka and Pulsar:

- Kafka is a complete streaming platform, whereas Pulsar is a messaging system. Through Kafka Connect (http://www.confluent.io/blog/announcing-kafka-connect-buildi...), it supports connectors that stream data between various sources and systems. Through Kafka Streams (http://www.confluent.io/blog/introducing-kafka-streams-strea...), it supports stream processing and transformations over Kafka topics.

- Broad adoption base: Kafka is very widely adopted across thousands of companies worldwide. https://cwiki.apache.org/confluence/display/KAFKA/Powered+By

- Tunable durability and consistency knobs on the producer: The Kafka producer API allows the application to either wait until a message is fully committed across all replicas or just the leader. This allows applications to make the right tradeoffs for throughput vs durability. One size does not fit all.

- Performance and efficiency: Kafka supports zero-copy consumption, allowing consumers to read large amounts of data at high throughput. To the extent that I understand, Pulsar with its ledger-broker model does not support zero-copy consumption.

- A lot of the reasons quoted for creating Pulsar are features that exist in Kafka and are used in production:

-- Kafka has multi-tenancy support through user-defined quotas (See this http://www.confluent.io/blog/sharing-is-caring-multi-tenancy...)

-- Kafka has support for authentication, authorization, user-defined ACLs (See this http://www.confluent.io/blog/apache-kafka-security-authoriza...)

-- Kafka has support for geo replication. In fact, that is the most common use case for Kafka in several companies. (See this https://engineering.linkedin.com/kafka/running-kafka-scale)

-- Latency: The end-to-end latency from publish to consume can be very low in Kafka (<10ms).

- Support for millions of topics: To the extent that I understand, both Pulsar and Kafka use ZooKeeper for metadata management. That is the main bottleneck for supporting a large number of topics and likely the same tradeoffs apply to both Kafka and Pulsar as a result.

- Storage model: The length of a partition in BookKeeper and hence in Pulsar is not bounded by the capacity of a server. So you have the ability to add servers to accommodate a workload spike.

This is merely a quick overview. There might be more aspects of this comparison that I'm missing.
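The producer-side durability knob mentioned in the comparison above can be illustrated with a toy model. This is only a sketch and does not use any Kafka client: `Replica` and `publish` are hypothetical names invented here, and replication is modeled as a synchronous loop. The `acks` values `"1"` and `"all"` mirror the real Kafka producer setting, where `acks=all` waits for the in-sync replica set rather than literally every replica.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for one broker's copy of a partition log."""
    name: str
    log: list = field(default_factory=list)

    def append(self, message: str) -> None:
        self.log.append(message)


def publish(message: str, leader: Replica, followers: list, acks: str) -> int:
    """Return how many replicas hold the message when the producer is acked.

    acks="1":   ack as soon as the leader has written the message.
    acks="all": ack only after every replica in this toy cluster has it.
    """
    if acks not in ("1", "all"):
        raise ValueError("this sketch only models acks='1' and acks='all'")
    leader.append(message)
    confirmed = 1
    if acks == "all":
        for follower in followers:
            follower.append(message)
            confirmed += 1
    return confirmed


leader = Replica("leader")
followers = [Replica("follower-1"), Replica("follower-2")]

fast = publish("m1", leader, followers, acks="1")    # lower latency, weaker durability
safe = publish("m2", leader, followers, acks="all")  # higher latency, stronger durability
print(fast, safe)
```

In this model a message published with `acks="1"` exists on a single replica at acknowledgment time, so losing the leader at that moment would lose the message. That is the throughput-versus-durability tradeoff the comment describes.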

BoorishBears · on Sept 8, 2016

Why does this read less like a comparison and more like a "this shouldn't even exist, they should have just gone with our thing"-type pitch? (With proper consideration given to the possibility of bias you mentioned.)

itaifrenkel · on Sept 9, 2016

Could you (Kafka/Pulsar devs) add more details about low-latency guarantees? With an in-memory solution it is easier to talk about 99.9% or more of messages being under 10 ms latency. This also requires a client-side push protocol (like Redis SUBSCRIBE). Assuming there are no slow consumers involved, is there a configuration that guarantees 10-20 ms latency? At what percentile?

babo · on Sept 9, 2016

How do you compare the geo-replication capabilities of Kafka vs. Pulsar?

mmerli · on Sept 13, 2016

Added some info on geo-replication in Pulsar at https://github.com/yahoo/pulsar/blob/master/docs/GeoReplicat...

johnlon · on Sept 12, 2016

I'd like to see a blow-by-blow comparison of Twitter DistributedLog vs Kafka vs Pulsar, particularly focusing on what each solution means by "geo-replication", what geo-consistency guarantees are made, how operable the geo-replication features are in practice, etc. At first glance, those interested in not losing data and in geo-replication and consistency are better off going with DistributedLog rather than either of the other two solutions.

However, if all you care about is a single DC, then Kafka is simpler, as long as you are happy for consumers to track offsets.

But if you want broker-side delivery tracking, then perhaps go with Yahoo's Pulsar, or Kafka with lazy commits of offsets to ZooKeeper.
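The difference between consumer-tracked offsets and broker-side delivery tracking can be sketched in a few lines. This is a simplified illustration, not the API of Kafka or Pulsar: `Log`, `OffsetConsumer`, and `CursorBroker` are hypothetical names, and everything lives in memory.

```python
class Log:
    """Append-only list of messages shared by both tracking styles."""

    def __init__(self, messages):
        self.messages = list(messages)


class OffsetConsumer:
    """Consumer-side tracking: the client remembers its own position."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # lost if the client forgets it

    def poll(self):
        batch = self.log.messages[self.offset:]
        self.offset = len(self.log.messages)
        return batch


class CursorBroker:
    """Broker-side tracking: the broker keeps one position per subscription."""

    def __init__(self, log):
        self.log = log
        self.cursors = {}  # subscription name -> next unacknowledged index

    def receive(self, subscription):
        position = self.cursors.get(subscription, 0)
        if position >= len(self.log.messages):
            return None
        return self.log.messages[position]

    def acknowledge(self, subscription):
        self.cursors[subscription] = self.cursors.get(subscription, 0) + 1


log = Log(["a", "b", "c"])

consumer = OffsetConsumer(log)
first_poll = consumer.poll()   # everything from offset 0
second_poll = consumer.poll()  # nothing new

broker = CursorBroker(log)
before_ack = broker.receive("sub")  # delivered but not yet acknowledged
redelivered = broker.receive("sub")  # same message again until acked
broker.acknowledge("sub")
after_ack = broker.receive("sub")
```

With the cursor on the broker, a consumer that crashes before acknowledging simply sees the same message again on reconnect, at the cost of the broker holding per-subscription state.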

withinrafael · on Sept 8, 2016

Interested to find/read any comparisons to TIBCO EMS, software rolling out for large scale use in government circles.

NikolaeVarius · on Sept 8, 2016

Getting in before the obligatory "What is so different about this from Kafka?"

Edit: got the wrong product.

From the looks of it, it just seems to be a slightly different take on Kafka. From what I gather, it looks like Pulsar allows for scaling of producers/brokers independently?

eropple · on Sept 8, 2016

That doesn't look to be Yahoo's Pulsar, but rather some analytics thing. Yahoo's documentation is on GitHub: https://github.com/yahoo/pulsar/blob/master/docs/GettingStar...

The documentation I've read suggests that it's very similar to Kafka (and I'll have you all know that I considered "Kafkaesque" there but I have some self-restraint).

edit: an acquaintance just suggested that maybe it's more related to Twitter's DistributedLog than Kafka: https://github.com/twitter/distributedlog

crudbug · on Sept 8, 2016

There is another system from eBay named Pulsar [0], for analytics. They should do some research before selecting a project name. Quick idea: Quesar?

[0] http://gopulsar.io/index.html

nine_k · on Sept 8, 2016

Quasar probably? Quesar is cheesy :)