> we set out to build and deliver Google’s infrastructure to everyone else
This statement rings pretty true as Kubernetes (also known as k8s) has some Google biases. Not all cloud providers will have such an easy time providing all the infrastructure necessary to run CoreOS + k8s smoothly.
For example, Kubernetes assigns each Pod (k8s unit of computation) an IP address, which is only simple to do if your cloud provider supplies something like a /24 private block to your nodes. CoreOS came up with the VXLAN-based Flannel project to make this model more portable[0], but Layer 2 over Layer 3 isn't something I'd like to throw haphazardly into my production environments. Google Compute Engine conveniently provides this setup as an option.
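A rough sketch of the model, with made-up CIDRs and node IPs rather than anything Kubernetes or Flannel actually does in code: each node owns a /24 slice of a cluster-wide pod range, and something -- GCE's programmable routes, or an overlay like Flannel -- has to map each slice back to the node that owns it.

    package main

    import (
    	"fmt"
    	"net"
    )

    // Illustration only: hypothetical node IPs and an example 10.244.0.0/16 pod range.
    func main() {
    	nodeIPs := []string{"10.0.0.11", "10.0.0.12", "10.0.0.13"}

    	for i, hostIP := range nodeIPs {
    		// Node i serves its pods out of 10.244.i.0/24.
    		podSubnet := net.IPNet{
    			IP:   net.IPv4(10, 244, byte(i), 0),
    			Mask: net.CIDRMask(24, 32),
    		}
    		// One route per node, however many pods it runs. If the provider
    		// can't carry these routes, an overlay has to encapsulate instead.
    		fmt.Printf("dst %-16s -> via node %s\n", podSubnet.String(), hostIP)
    	}
    }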
Another example of Google-favoritism is the strong preference for centralized storage--particularly GCEPersistentDisk. At first I was concerned about centralized storage by default, since we know disk locality is a Good Thing (TM), but after reading a paper that claimed networking is improving faster than disks are[2], I felt somewhat better about this. However, it's still pretty obvious that a Google Persistent Disk is the way to go with k8s[3].
That said, I'm really happy that Google has open-sourced this project because it is indeed a functioning, tested, and easy-to-use distributed system. I'm sure that the devs aren't aggressively shutting out other cloud providers and that these biases are probably just a side-effect of their resource allocation process and the problems that they intend to solve (e.g. GCEPersistentDisk used to be a core type instead of a module--it has since gotten better). It's still important to evaluate a technology's biases and potential evolution before throwing your product on it.
I'm working on adding AWS support for Kubernetes. Just last week I finished Load-Balancer (ELB) & Persistent Storage (EBS) support, and they're currently going through the pull-request review process. Once they merge (I'd guess a week or two?), AWS will be on par with Google Compute Engine feature-wise.
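To give a flavor of what that support wraps, these are roughly the raw AWS calls involved -- shown with aws-sdk-go and made-up names, ports, and IDs, not the actual Kubernetes provider code:

    package main

    import (
    	"fmt"

    	"github.com/aws/aws-sdk-go/aws"
    	"github.com/aws/aws-sdk-go/aws/session"
    	"github.com/aws/aws-sdk-go/service/ec2"
    	"github.com/aws/aws-sdk-go/service/elb"
    )

    func main() {
    	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))

    	// A Service exposed externally needs (roughly) a classic ELB that
    	// forwards the service port to a port on the worker nodes.
    	elbClient := elb.New(sess)
    	_, err := elbClient.CreateLoadBalancer(&elb.CreateLoadBalancerInput{
    		LoadBalancerName:  aws.String("k8s-example-service"), // hypothetical name
    		AvailabilityZones: aws.StringSlice([]string{"us-east-1a"}),
    		Listeners: []*elb.Listener{{
    			Protocol:         aws.String("TCP"),
    			LoadBalancerPort: aws.Int64(80),
    			InstanceProtocol: aws.String("TCP"),
    			InstancePort:     aws.Int64(30080), // hypothetical node port
    		}},
    	})
    	fmt.Println("create ELB:", err)

    	// Persistent storage needs (roughly) an EBS volume created in the
    	// node's AZ and attached to the instance that will run the pod.
    	ec2Client := ec2.New(sess)
    	vol, err := ec2Client.CreateVolume(&ec2.CreateVolumeInput{
    		AvailabilityZone: aws.String("us-east-1a"),
    		Size:             aws.Int64(10), // GiB
    	})
    	if err == nil {
    		_, err = ec2Client.AttachVolume(&ec2.AttachVolumeInput{
    			VolumeId:   vol.VolumeId,
    			InstanceId: aws.String("i-0123456789abcdef0"), // hypothetical worker node
    			Device:     aws.String("/dev/xvdf"),
    		})
    	}
    	fmt.Println("create/attach EBS:", err)
    }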
I have found the Kubernetes team to be nothing other than extremely supportive of efforts to support AWS & non-Google platforms. It takes a little longer to translate some of the Google-isms to other platforms, but I'm glad there's real thinking behind those decisions, versus just adopting the lowest common denominator.
I'm mostly concerned about feature lag and vendor lock-in, so I'm happy to hear that this will be out so soon. I'm excited to try it out.
> I have found the Kubernetes team to be nothing other than extremely supportive of efforts to support AWS & non-Google platforms.
I don't doubt it one bit; in my experience, people on the Kubernetes IRC channel have always been really helpful and supportive. I just tend to be a little more pessimistic when it comes to resource allocation: a Google team probably prioritizes support for Google platforms, and that's no one's fault or foul play.
About disk locality -- I've read that paper and know that Google increasingly has the philosophy that disk locality is irrelevant.
However, I don't buy it for 2 reasons:
1. Highly available distributed services need to have geographical diversity, i.e. they should be "multihomed". This is true on AWS or in Google's internal data centers. That means you have WAN latency, in which case locality again becomes the primary design concern for performance (see the first sketch after this list).
Pre-Spanner, Google's solution was to use application-specific logic to be multihomed -- i.e. nearly rewrite your application, depending on how stateful it is. Spanner isn't a silver bullet either. You still have to solve latency problems, just within the ontology of Spanner rather than the application.
It's bad for your code to ignore latency within the data center, and then later add (incorrect) hacks to work around latency between data centers. If you pay attention to network boundaries from the beginning, it will be easier to multi-home.
2. A single machine is still your domain of failure. Even if it doesn't matter for performance, you still have to think about individual machines in order to handle failures.
The interfaces between machines should be idempotent so that failures can be handled gracefully, and many distributed storage services have complicated performance vs. durability knobs controlling how many machines/processes must have accepted a write (the second sketch after this list gestures at that knob).
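To put rough, entirely illustrative numbers on point 1: a request that makes a chain of dependent reads is fine at intra-DC round-trip times and hopeless at WAN round-trip times.

    package main

    import (
    	"fmt"
    	"time"
    )

    // Back-of-the-envelope latency math for point 1. The RTTs and the number
    // of dependent reads are made up, but representative.
    func main() {
    	const sequentialReads = 20 // dependent lookups one user request might make

    	intraDC := 500 * time.Microsecond // same data center
    	crossDC := 70 * time.Millisecond  // cross-region / WAN

    	fmt.Println("same DC: ", time.Duration(sequentialReads)*intraDC) // ~10ms, easy to ignore
    	fmt.Println("cross DC:", time.Duration(sequentialReads)*crossDC) // ~1.4s, dominates everything
    }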
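And for point 2, the kind of knob I mean, as a toy sketch rather than any real storage system's API: the write path decides how many replica acks count as "durable", and retries only stay safe if the write is idempotent.

    package main

    import "fmt"

    // Toy model only: a write is reported durable once writeQuorum of the
    // replicas have acknowledged it. Tuning that number is the usual
    // performance-vs-durability trade-off; keying the write makes retries
    // after a failure idempotent.
    type replica struct{ name string }

    func (r replica) ack(key, value string) bool {
    	// In a real system this is an RPC that can fail or time out.
    	return true
    }

    func write(replicas []replica, writeQuorum int, key, value string) bool {
    	acks := 0
    	for _, r := range replicas {
    		if r.ack(key, value) {
    			acks++
    		}
    		if acks >= writeQuorum {
    			return true // "durable enough" by this configuration
    		}
    	}
    	return false
    }

    func main() {
    	replicas := []replica{{"a"}, {"b"}, {"c"}}
    	// writeQuorum=1 is fast but loses data if that one machine dies;
    	// writeQuorum=3 is safest but as slow as the slowest replica.
    	fmt.Println(write(replicas, 2, "user:42", "hello"))
    }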
So I think Google does have a "single system image" bias, and you are right that Kubernetes has these Google-isms in its architecture.
I have serious trouble with [2]. Disks not evolving as fast as the network? Under what rock have they been living?
The paper seems to peg local disk bandwidth at 150 MB/s, and then compare it to remote network disk access at... 150 MB/s. NVMe is going to grant us 2.2 GB/s of bandwidth and 450K IOPS (from a single consumer-grade product), so that paper is off by more than an order of magnitude. Local disk is non-volatile storage sitting a PCIe lane away from your CPU. I just don't see how disk locality is not going to be crucial for many workloads, for decades to come.
In 2020 a flash-only SAN isn't going to deliver 20 Gbit/s to each of 100 blades in the rack. A 4TB NVMe card on each blade will, though...
Look at Intel's latest Xeon-D SoC: yeah, it's got dual 10GbE, but you're not going to get 7.7 GB/s over that... [1]
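Putting the numbers in this thread side by side, as back-of-the-envelope arithmetic rather than measurements:

    package main

    import "fmt"

    // Figures taken from this thread; arithmetic only, not a benchmark.
    func main() {
    	const (
    		blades         = 100
    		perBladeGbit   = 20.0 // 20 Gbit/s promised to every blade
    		localNVMeGBps  = 2.2  // one consumer NVMe card, GB/s
    		dualTenGbEGBps = 2.5  // 2 x 10GbE NIC, theoretical max in GB/s
    	)

    	aggregate := blades * perBladeGbit / 8 // Gbit/s -> GB/s
    	fmt.Printf("SAN + fabric must sustain ~%.0f GB/s for the rack\n", aggregate)
    	fmt.Printf("local NVMe: %.1f GB/s per blade, ~%.0f GB/s aggregate, no shared fabric\n",
    		localNVMeGBps, localNVMeGBps*blades)
    	fmt.Printf("dual 10GbE tops out around %.1f GB/s per blade\n", dualTenGbEGBps)
    }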
You should look at their measurements and assumptions in context. As you can see from the URL, it was written in 2011 when the NVMe working group was first formed. It was also written in the context of cluster-based applications in a data center and specifically mentions SSD and cost effectiveness. Storage cost effectiveness is critical at these scales because your data is growing by terabytes per day.
You also mention blades, which brings up the next point of context: operations like Google and Facebook don't use blades the way your average enterprise does, because they aren't leasing rack space or working within a limited physical footprint. They don't need the same U-to-performance ratio, so they can save money by using commodity hardware. Their applications also scale readily, so losing entire boxes is meaningless within a certain threshold.
Each pod has its own IP address that is routable anywhere in the cluster. This makes life much easier because you don't have to do port forwarding onto the host node.
In all current k8s setups, each Minion/Worker node has a subnet that it allocates these Pod IP addresses out of. This isn't necessarily a hard requirement, but it tends to be much easier to make work, since you only have O(Workers) routes to configure instead of O(Pods). Long term, though, I think we'd rather do away with per-node subnets and simply allocate an IP address for each Pod individually.
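To make the O(Workers) vs. O(Pods) difference concrete (cluster size is just illustrative):

    package main

    import "fmt"

    // Illustrative numbers only: how many routes the fabric has to carry
    // under each scheme.
    func main() {
    	const (
    		workers     = 100
    		podsPerNode = 30
    	)
    	fmt.Println("subnet per worker:", workers, "routes")             // O(Workers)
    	fmt.Println("address per pod:  ", workers*podsPerNode, "routes") // O(Pods)
    }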
[0] https://github.com/coreos/flannel
[1] https://github.com/GoogleCloudPlatform/kubernetes/blob/maste...
[2] http://www.eecs.berkeley.edu/~ganesha/disk-irrelevant_hotos2...
[3] Do you see any other providers here? https://github.com/GoogleCloudPlatform/kubernetes/tree/maste...