How We Failed at OpenStack

jpgvm · on Jan 19, 2015

Doesn't surprise me the slightest to be honest. Having worked on a customized fork of OpenStack that used a pure L3 networking model I know that you are set for pain the moment you don't want to run everything on a single Ethernet segment.

It doens't help that the Neutron data-model at the time that I was working on in (say 12 months ago or so) was terrible and basically impossible to scale/make to perform.

Inevitably you were then stuck with the deprecated and janky nova-network interface. Which while efficient and fast was also old and missing tons of stuff - meaning more monkey patching and janking around. Not to mention the fact that because of it's deprecation many completely ridiculous bugs befell it in later releases. (Grizzly and onwards basically)

TBH I am so disillusioned with the project I hope I don't have to work in or around it again.

ewindisch · on Jan 19, 2015

> TBH I am so disillusioned with the project I hope I don't have to work in or around it again.

You're not the first I've heard this from, nor do I suspect the last.

The problem isn't that the code is bad as much as it is that the climate often makes it impossible to fix it. Review queues are weeks or months long. The article makes a good point about the necessary man hours to work on OpenStack. I've seen code removed not because it didn't have a maintainer, but because 200 lines of code didn't have 3-5 full time developers. Insanity persists and money talks.

Looking back, I'd say that OpenStack Nova in the beginning was never this bad. It may not have been the best thing ever, because it wasn't, but no code needs to be terribly great in the beginning. The beginning of a project needs good process more than it needs good code, and OpenStack didn't establish this well enough, early enough.

OpenStack never had a solid, centralized architectural vision. Anyone that attempted to contribute architecturally was essentially ejected. Those that flushed millions into controlling the process and millions more into building adhoc features got their way. I mistakenly advocated early for wrangling control from Rackspace. The increased influence gained by individual contributors was quickly dwarfed by large corporate influences.

I'm still involved with OpenStack, but far less than I had been in the past. Mostly, I prefer to see myself peripherally involved where I might improve the lives of those trapped in that ecosystem, either to help them deal with the pains they've inflicted upon themselves, or to escape them entirely.

nl · on Jan 20, 2015

The problem isn't that the code is bad as much as it is that the climate often makes it impossible to fix it.

Lots of the code isn't great either.

cdent · on Jan 19, 2015

The networking situation has improved in Juno and Kilo, but yeah: One often gets the impression with OpenStack that there's so much attention being paid to new stuff that none of the existing stuff (even recently new) is ever brought to a state of stability and usefulness.

hueving · on Jan 20, 2015

Considering that the neutron core is entirely L2 abstractions with an L3 plugin on top of it, it's not really surprising that you had issue with a pure L3 model.

bhaisaab · on Jan 19, 2015

Not sure if people know about Apache CloudStack or not, it has all those IaaS feature and it just works with various basic to advance networking models.

andyidsinga · on Jan 19, 2015

Apparently CloudStack "lost" to OpenStack ( see: http://www.infoworld.com/article/2608995/openstack/cloudstac... ). I've also heard this sentiment in my admittedly openstack biased circles.

That said, I'm not suggesting it should not be considered.

nl · on Jan 20, 2015

It lost out in getting vendor support from lots of different big vendors. OTOH, many hosting environments find it more useful. (It used to be a Citrix product which might explain the more unified architecture it has).

retrack · on Jan 19, 2015

There are case studies also: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Case+...

iwince · on Jan 19, 2015

It looks like they have a pretty good user list going:

http://cloudstack.apache.org/users.html

pm90 · on Jan 19, 2015

I agree with the Author's observation that there are a lot of vendor specific changes on top of openstack before making it production ready. This struck me as slightly alarming when I first started working with the project: my previous experience with FOSS had been linux, GCC and the like, which were good to go from the start. To his credit, it does seem like the author made a serious effort to understand how to get Neutron to do what he wanted...

I'm guessing that a lot of people make the same mistake of thinking openstack is just as easy as linux to get running. Its really not. But it does provide 95% of the groundwork to get you started; often that remaining 5% is either your secret sauce or security overheads. And unfortunately, the details of how to do that are not open to the public...yet.

Also, a the slow pace of adding changes to openstack makes many projects add their changes to their custom patches and once its working there really isn't much incentive to push it upstream.

marktangotango · on Jan 19, 2015

Sounds like these guys are doubling down on the IAAS model, 'premium bare metal'? Certainly there are a lot of people who'd like to run on bare metal, with a more configurable network, but how realistic is it at this time?

>>You see, physical switch operating systems leave a lot to be desired in terms of supporting modern automation and API interaction (Juniper’s forthcoming 14.2 JUNOS updates offer some refreshing REST API’s!).

This. Network hardware vendors have no incentive to make their devices more easily automated, and in fact face disincentive not to.

Doe anyone remember the excitement and promise around Google App engine when it was first announced, and before they changed the pricing model to per instance? The ability to put your app on the cloud, and scale up to the free tier, then out from the free tier on a paid plan if that's what you needed.

That model entirely disappeared. I miss it. Is anyone doing that now?

thomseddon · on Jan 19, 2015

>> >> You see, physical switch operating systems leave a lot to be desired in terms of supporting modern automation and API interaction (Juniper’s forthcoming 14.2 JUNOS updates offer some refreshing REST API’s!).

>> This. Network hardware vendors have no incentive to make their devices more easily automated, and in fact face disincentive not to.

There is actually a relatively established roadmap for the solution to this in "bare metal" / "white box" switches that essentially just talk OpenFlow to a controller. Google moved their entire international internal backbone (more traffic than public facing) to this model[1].

The issue at the moment is that there isn't lots of OS options and subsequently very little hardware support. Google developed their own hardware (despite preferring to have bought it[2]) and my understanding is they wrote their own software too.

[1] http://www.opennetsummit.org/archives/apr12/hoelzle-tue-open... [2] http://youtu.be/VLHJUfgxEO4?t=39m20s

donavanm · on Jan 19, 2015

+1, talking about rest endpoints hosted by junos is missing the forest for the trees. Protocols like Openflow, and whatever the contrail version was called, seems o be where network automation is headed. Centralized state & modelling and pushing specific paths/updates out to the edge.

With regards to "bare metal" virtualization I'd expect to see a lot more in the next 12-18 months. On the network you need dynamic path configuration and traffic encapsulation/isolation. That should be "openflow" and vxlan/nvgre. On the host hardware youll want io virtualization (sr/mr iov) and possibly hardware encap as well. Substantial progress is being made on both fronts.

Edit: although its great to have two encap options I think theyre incomplete at best. All of the hard work has been punted to the centralized controllers and the rfcs have nothing useful to contribute there. Some of the rfc behavior is also insane/laughable; multicast for broadcast and mac/tunnel endpoint discovery ORLY? Ill be very surprised if there are any large vxlan/nvgre deployments which arent bespoke.

thomseddon · on Jan 19, 2015

The opportunity extends beyond support for x86 devices in that more traditional hardware switches etc. should also work with openflow.

I've found there's lots of OSS around the controller and virtual switches for testing/lab but the only serious openflow agent designed for hardware switches I've found is Big Switch's Indigo[1] it has very limited hardware support.

I see experimental support in OpenWRT - this is very interesting as it opens up a shed load of hardware options.

[1] http://www.projectfloodlight.org/indigo/

superbaconman · on Jan 19, 2015

As someone who works with OpenFlow (a lot) I have my doubts if the tech will pan out. Look at what Facebook did to their network using traditional technologies. Look at Cisco's ACI, and Jupier's Contrail. About the only thing OpenFlow has going for it is that it runs on multiple vendor platforms (assuming you ignore all the switch-to-controller interop problems).

nl · on Jan 19, 2015

Doe anyone remember the excitement and promise around Google App engine when it was first announced, and before they changed the pricing model to per instance? The ability to put your app on the cloud, and scale up to the free tier, then out from the free tier on a paid plan if that's what you needed.

That model entirely disappeared. I miss it. Is anyone doing that now?

Just about every PAAS (including AppEngine) does this[1][2][3][4]. What am I missing?

[1] https://www.heroku.com/pricing

[2] https://cloud.google.com/appengine/docs/quotas

[3] https://www.openshift.com/products/pricing

[4] https://appharbor.com/pricing

marktangotango · on Jan 19, 2015

The difference in my estimation is: pricing based on per instance (what ever an instance is) and pricing per resource used, past the free tier. And auto scaling. I could have been clearer, apologies.

Edit; further, in the original GAE pricing model, the customer paid for specific services, usually by volume. Maybe accounting was prohibitive?

mietek · on Jan 19, 2015

You can do auto-scaling on OpenShift, as part of the platform, and on Heroku, with HireFire:

http://hirefire.io

nl · on Jan 20, 2015

GAE still uses the same model - the prices just are more than they were previously. See my previous link and [1] for details. It even explicitly gives you an instance/hour cost (above the free quota).

Plenty of PAASs do autoscaling.

[1] https://cloud.google.com/appengine/pricing

jambox888 · on Jan 19, 2015

Don't forget BlueMix!

andyidsinga · on Jan 19, 2015

It might not be a bad area at all to be solving problems in.

I was a little incredulous when I read that they started writing ip manager code, but then I remembered this article about amazon AWS scale:

see section titled "The Network Is A Bigger Pain Point Than Servers" in this article about AWS scale: http://www.enterprisetech.com/2014/11/14/rare-peek-massive-s...

fragmede · on Jan 19, 2015

> how realistic is it at this time?

Quite realistic, as there are many bare-metal offerings at this time, as a quick Google search will attest to. What exactly the packet.net people mean when they say 'premium', however, is unclear.

> That model entirely disappeared. I miss it. Is anyone doing that now?

Heroku still offers a free-tier to start.

parasubvert · on Jan 19, 2015

This seems to be more of a lesson on the failures of community building vs. secrecy in the face of a presumed pile of money and the resulting vendor politics, than "OpenStack sucks".

An OSS project isn't really supposed to be about "I can freeload on the works of others for some investment in comprehension and customization", which is how I felt the author framed his situation at times.

The underlying failure seems to be that the author decided it was easier to maintain his own proprietary platform than modify OpenStack for their needs and contributing back to the community. This would lead to others to pick up their stuff down the road and potentially reducing the maintenance burden (at the expense of exposing any secret sauce you feel you might have).

This deeper failure is In the incentives for Rackspace to withhold key commits on Ironic from the community because they feel it is secret sauce. (I am taking the OP's version of the tale at face value). They're one of the flagship supporters of OpenStack, and their behavior is perceptably a big reason for its failures to date.

The limitations of Neurtron without a product like VMware NSX underneath are well known. Production grade virtual networking at scale is hard and also mostly a secret sauce (for now).

OpenStack seems to effectively have become the OMG and CORBA 1.0 with a reference implementation - it's cloud vendor kabuki instead of distributed objects square dancing. You need vendor help to get going, and the portability is very limited, you'll get some value out of what's been done but at great effort. It seems to also be a useful commons for network and storage vendors to help drive interperability with the side modules (Cinder and Neutron). If anything, OpenStack is how the industry is desperately brute-force learning what Amazon Web Services has accomplished before they swallow the universe, which is valuable but messy.

OpenStack seems the only "I want to run a general purpose cloud" game in town today - CloudStack exists but doesn't seem to have a lot of momentum. Google, Azure and Digital Ocean are the only competitors to AWS of note and they don't open source their stuff. CoreOS on PXE or Ubuntu MaaS might work but needs a much more mature cluster schedulers, network and volume management. Or perhaps the real next generation will be "none of the above".

jroll · on Jan 19, 2015

> This deeper failure is In the incentives for Rackspace to withhold key commits on Ironic from the community because they feel it is secret sauce. (I am taking the OP's version of the tale at face value). They're one of the flagship supporters of OpenStack, and their behavior is perceptably a big reason for its failures to date.

I'm an Ironic core reviewer and work on OnMetal at Rackspace.

At Rackspace, we run ahead of Ironic trunk. It's true that we haven't been super vigilant about upstreaming our patches into Ironic; this is not because it's "secret sauce", not because we don't care. Priorities are hard, both upstream and downstream.

OpenStack moves slowly compared to a team developing proprietary software. This is a well-known fact. We do our best to upstream our patches as quickly as the project allows, but they often need to be improved to work with other hardware/drivers/etc.

For example, when we launched in July, we already had support for "cleaning" a server - erasing disks, flashing firmware, etc. The "spec" for the new feature was first posted upstream June 25, 2014.[0] This spec finally landed last January 16, 2015.

Our work on improving network support in Ironic has been similar; the project hasn't been ready for it (again, priorities). It's been done in the open[1], but the code is not in Ironic trunk yet.

We've been extremely open about what we're doing since we joined the Ironic project almost a year ago; I'm curious which patches the article has in mind.

As an Ironic developer, this article bums me out a bit, but it's a good pointer as to what we're doing poorly. /me starts writing better docs

[0] https://review.openstack.org/#/c/102685/ [1] https://etherpad.openstack.org/p/ironic-neutron-bonding

dlaube · on Jan 19, 2015

I just wanted to chime in here to say that although there were several situations where our questions couldn't be answered, we probably wouldn't have made it as far with our testing if it wasn't for the answers that were received from the openstack ironic developers. I should also point out that I've always found the openstack ironic devs to be kind and professional. So be it as it may, it is unfortunate that there are some conflicting priorities but I certainly do not blame the devs.

parasubvert · on Jan 19, 2015

This is excellent information, thanks for sharing it, and correcting my assumptions above.

hyperliner · on Jan 19, 2015

" 'I can freeload on the works of others for some investment in comprehension and customization', which is how I felt the author framed his situation at times. "

That was a huge, unnecessary leap from your part. I did not read it like that. To me, it was more like "we were going to leverage the existing projects and add to that and give back (as they said they will do), but we could not because the underlying projects are not mature so right now it is more work to fix than to start from scratch."

I don't know if what they did is advisable or not. All I am saying yours is an unnecessarily aggressive conclusion attacking someone who just spent a lot of time warning the community about a lot of the issues under discussion.

parasubvert · on Jan 19, 2015

I wasn't making any conclusions. What's aggressive or unnecessary about explaining how I felt when I read his article? You felt differently than I did reading it.

Also, freeloading is, by far, how most people and organizations use open source, so it's not exactly a unique situation.

xj9 · on Jan 19, 2015

SmartDataCenter[1] was recently open sourced and is built on a lot of cool tech (ZFS, Dtrace, node.js, Illumos).

[1]: https://github.com/joyent/sdc/

parasubvert · on Jan 19, 2015

Good point, I have to remember that. I wish the world was more open to Illumos.

tjradcliffe · on Jan 19, 2015

> it's cloud vendor kabuki instead of distributed objects square dancing

This is a beautiful and evocative metaphor.

_ondq · on Jan 19, 2015

My experience agrees with the general tone of the article (although I didn't dig as deep into the code as OP). I implemented an OpenStack private cloud for testing/QA purposes but never felt comfortable enough with it to migrate production (this was Icehouse, so pretty recent).

It was too easy to break core functionality--for example, I literally never saw resizing an instance work properly. It does this crazy hack where under the hood it SCPs the VM image to another host and then tries to bring it up. It could have been a quirk of our installation but it would break every time. I saw similar breakage with Cinder operations where volumes would get "stuck" on VMs. Again, it could be a bad installation but it goes to show you how easy it is to break OpenStack if you aren't an expert in the codebase.

My current thinking is that a container-centric (as opposed to VM-centric) infrastructure is the way to go--that way I can just throw CoreOS or whatever on the bare metal nodes and migrate containers as needed.

hueving · on Jan 20, 2015

Sure, container-centric is great if you are running containers. Until it is shown that containers are actually secure, people are going to push for virtualization.

Luyt · on Jan 19, 2015

The author writes:

"As we finalize our installation setup for CoreOS this next week (after plowing through Ubuntu, Debian and CentOS)"

Pity he doesn't elaborate on that. I understand that CoreOS is his choice, but it would be nice to know why the other distros aren't.

toast0 · on Jan 19, 2015

I think the author is saying they already finished setup fur the other three.

mratzloff · on Jan 19, 2015

Our guys tried CentOS for awhile with OpenStack, but gave up because so much of it is clearly maintained with Ubuntu in mind. So they switched to Ubuntu and things have been mostly smooth sailing since then.

AndyNemmity · on Jan 19, 2015

Same for us. So much of the documentation is really only for Ubuntu, and you quickly run into gotchas where you have to decide.

Do I want to work out and fix all the issues to get this distro working, or do I want it to work and move on to the other gotchas?

snarfy · on Jan 19, 2015

Can someone explain 'bare metal' to me? Is it a better hypervisor or something? Why would it be better than all development effort put into something like linux? Doesn't the linux kernel run on 'bare metal'?

A fellow developer tried to get me into openstack a little over three years ago, and when I looked, it was far too enterprise for my tastes, but I care more about code than the devops and managing servers.

andrewstuart2 · on Jan 19, 2015

Bare metal just means not virtualized. There's no hypervisor between the OS and the hardware (the hardware is made of metal, therefore "bare metal"). If you buy a computer with Windows on it, Windows is running on bare metal. Hypervisors run on bare metal.

So yeah, as you suggested, if you install Linux on your computer, the Linux Kernel is running on the bare metal.

chris_wot · on Jan 20, 2015

So really, this is still virtualisation - but with less overhead.

detaro · on Jan 19, 2015

If you want to give your users instances that have full hardware access, bare metal instances allow you to still manage that using OpenStack. This might be to allow access to GPUs or other specialized hardware.

Another example is deploying hypervisors: Test suites for OpenStack are run against many different versions by using OpenStack to deploy the systems to test. HP's OpenStack distribution uses it as a deployment mechanism, taking over and managing the nodes of the OpenStack cluster from a small initial cluster.

_ondq · on Jan 19, 2015

It means owning and controlling your own hardware. It's really, really nice to be able to reason about real hardware when doing performance optimization or deep bug analysis. If you're stuck in the public cloud ghetto there's only so deep you can go before throwing up your hands and saying, "eh, I guess Amazon is having a bad day..."

chris_wot · on Jan 19, 2015

I'm pretty certain that's the point - you get a dedicated server. They seem to guaranteeing one to be setup for you in four minutes!

Someone tell me if I'm wrong :-)

This would mean he could give two hoots about virtualisation, I guess his concern would be automated deployment and network allocation, along with monitoring.

andyidsinga · on Jan 19, 2015

I tend to define it more like andrewstuart2 above > "not virtualized" ...as far as I know it doesn't really mean dedicated though. (although maybe it should ?)

edit : why not dedicated ? ...because you can have containers running on "bare metal"

mburns · on Jan 19, 2015

Containers are a form of virtualization, as are zones or jails.

Bare metal implies dedicated hardware.

andyidsinga · on Jan 19, 2015

It would be cool if the author could elaborate on this conversation: "As the conversation developed, I eventually agreed that many of the public cloud services were not user friendly and had an overly high barrier to usage"

...as I read through the article is sounds like it was probably around bare-metal needs - still, elaboration would be nice here :)

AndyNemmity · on Jan 19, 2015

I am doing this same work for the company I work for. I think it's going well, but it's taken far longer than any estimates I have planned.

The Ironic guys are amazing, really great people to work with. The guys in IRC are good at working with us.

Just hope we can provide some value to the project as well to return the favor.

zsmith928 · on Jan 19, 2015

@AndyNemmity I totally agree - the Ironic guys (particularly jroll) are awesome and were very helpful.

Rapzid · on Jan 19, 2015

Shouldn't take more than an evening for somebody experienced with hosting to pick up these red flags reading through the OpenStack documenation/source.

From what I recall the documentation left a ton to be desired. Just trying to figure out how Neutron and their "VPC" equivalent was supposed to be implemented left more questions than answers :|

mdekkers · on Jan 20, 2015

Yeah, I looked at openstack for a few days - including a small test deployment, and came away with "not for a few more years"

kordless · on Jan 19, 2015

> Premium Bare Metal

Given that's the offering, it doesn't surprise me a bit they didn't go with OpenStack. That said, I guess they think running containers on bare metal is a better way to roll.

thinkingkong · on Jan 19, 2015

It would be, if the docker containers were actually root safe. Currently, youll probably have to rent an entire physical machine to run your containers on, which will be fast, but not necessarily great for packet as they grow.

Openstack really isnt appropriate for this type of scenario, unless their original goal was to use KVM machines to add some extra security / multi-tenancy.

ewindisch · on Jan 19, 2015

They were looking to use Ironic, which is specifically built for renting entire physical machines. I agree, however, that contributing and working upstream in OpenStack is challenging. I do not doubt that they would find it easier to build new infrastructure than contribute to OpenStack.

OpenStack no longer behaves like a nimble startup and may no longer be the right option for someone looking for a quick, iterative development process. I'd question if any startup should really be a consumer of OpenStack at this point.

Daviey · on Jan 19, 2015

Eric, I think that is bit of a leap. If we look back to the mission statement, it still fulfills that role IMO. Ironic is without doubt the immature stepchild, which to me really only makes sense if you wanted to do visualization - but also offer bare metal under the same API.

ewindisch · on Jan 19, 2015

To put it another way, I question if any startup should be using OpenStack today if OpenStack does not immediately solve the needs that startup expects to solve in the future. That's especially true for DIY. I'm speaking as consumers of OpenStack, of course, not for companies building value on it.

If OpenStack doesn't solve the startup's future needs right now, the startup's future need will come sooner than the features needed in OpenStack. Contributing upstream will have too great an opportunity cost. The only legitimate options for such companies are not to use OpenStack or maintain their own fork.

Right now, at the rate of innovation and improvement currently in OpenStack and the processes necessary for participating in the community, I'd argue that if a startup consuming OpenStack has resources to dedicate toward upstream development and baby-sitting that process, that they're either A) Not a startup, or B) a failing startup.

geoffarnold · on Jan 19, 2015

s/OpenStack/customLinuxKernel/

No startup should be rolling their own cloud, any more than they should be putting together their own Linux kernel. Go public, or if you MUST be on your own metal, use a turnkey solution like Metacloud or Nebula (and let them manage it for you).