Can Kafka be thought of as a non-hosted version of Kinesis? I thought Kinesis would be a good solution to dump logfile data to for processing and ingestion. Could you explain some technical reasons to use Kafka vs. Kinesis? Thanks.
I believe Kafka can still lose some data if all the active machines fail. It's a deliberate design decision (it's the right thing to do if you want to remain available and can tolerate some data loss). I believe the Kafka team is working on it, but it's non-trivial to fix.
I believe the ability to configure this behavior is being tracked at the link below. It seems like it's a switch between consistency and availability. By default Kafka prefers availability, and the possible inconsistency results in data loss (because Kafka just discards some inconsistent data it can't resolve). But the JIRA linked below should make that behavior optional, so if a majority of machines fail the cluster will become unavailable rather than inconsistent.
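To make that trade-off concrete, here is a toy sketch of the choice being described. It is not Kafka's actual election code; `elect_leader`, its parameters, and the replica names are all made up for illustration. Later Kafka releases do expose a setting named `unclean.leader.election.enable` for this kind of choice, though whether that is what the JIRA above turned into is my assumption.

```python
def elect_leader(log_end, in_sync, alive, committed, allow_unclean):
    """Toy model of picking a new partition leader after failures.

    log_end:       replica name -> last offset that replica has stored
    in_sync:       replicas known to hold every committed message
    alive:         replicas that are still running
    committed:     highest offset acknowledged to producers
    allow_unclean: hypothetical switch, True = prefer availability

    Returns (leader, lost_messages), or (None, 0) if the partition
    goes offline instead of electing anyone.
    """
    # Clean election: any surviving in-sync replica has all committed data.
    clean = sorted(r for r in in_sync if r in alive)
    if clean:
        return clean[0], 0

    # Unclean election: stay available by promoting a stale replica,
    # discarding whatever committed messages it never received.
    if allow_unclean and alive:
        leader = max(sorted(alive), key=lambda r: log_end[r])
        return leader, max(0, committed - log_end[leader])

    # Consistency preferred: refuse to elect, partition is unavailable.
    return None, 0


log_end = {"a": 100, "b": 100, "c": 90}
in_sync = {"a", "b"}
alive = {"c"}  # both in-sync replicas have failed

print(elect_leader(log_end, in_sync, alive, 100, allow_unclean=True))   # ('c', 10)
print(elect_leader(log_end, in_sync, alive, 100, allow_unclean=False))  # (None, 0)
```

With the switch on, the partition keeps serving traffic but 10 acknowledged messages are gone; with it off, nothing is lost but the partition stays down until an in-sync replica comes back.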
Yes, Kafka can be thought of as a non-hosted version of Kinesis. To me, using either of them is a cost-versus-benefit trade-off: either you pay for Kinesis and get a hosted solution where the operational burden is greatly reduced, or you run Kafka yourself and carry that burden.
One main advantage is that Kinesis is elastic -- it scales automatically based on load. With Kinesis available, you don't have to manage a Kafka cluster yourself, which saves quite a bit of headache.
Ehh - this is just my two cents from working with Kafka. When they say it's high performance, they really, really mean it. I have gotten very high throughput on just 2 medium machines.
If you process that much data, Kafka is one of the last things which you'll need to scale out.