When EC2 Hardware Changes Underneath You (picloud.com)
153 points by usaar333 on Jan 9, 2013 | 47 comments



I see no evidence here that EC2 hardware is "changing underneath" anyone; rather, some EC2 instances are running on different hardware from other EC2 instances, and PiCloud is moving the customer between different types of hardware.


(I'm the post author.)

We did not intend for readers to think that hardware configurations were changing while the instance (virtual machine) was running. As EC2 does not support live migration, this is not possible.

When you use EC2 (or any cloud infrastructure) a lot, you stop thinking within the constraints of physical servers. The very nature of EC2 is the first letter of its acronym: Elasticity. You are able to bring up functionally equivalent duplicate servers when demand arises and terminate them when the demand no longer exists. This applies not only to HPC applications (our domain), but even to web servers scaling horizontally with demand.

EC2 allows you to request different "instance types". A given instance type offers the same CPU speed[1], memory, and disk capacity. To make scaling tractable, you need to make the assumption that if your program operates correctly on a single (or even a few) instances of a given type, say m2.4xlarge, it will do so on all of them.

Customers do not use us solely because we provide an excellent distributed computing interface. They also use PiCloud because we (thanks to the capabilities of Amazon EC2) are highly elastic. This customer doesn't want to continually run several dozen m2.4xlarge-equivalent instances, because their utilization doesn't justify doing so. Instead, PiCloud allows them to run massive computations for a short time and not incur charges when the servers are no longer needed. (This results in far lower average costs than having their own on-site cluster.)

Think about how this customer would have used EC2 directly within these constraints. Their workflow would be something like: deploy a few dozen servers, run the computation on those servers, and terminate them when complete. After getting this application to work correctly, they may conclude they are "done". It might work for months, and then one day they would find that some "jobs" raised Illegal Instruction errors. There would have been no way to anticipate this happening. And from their perspective, where "requesting dozens of servers" is part of the application, the hardware had changed underneath them.

[1] Yes, performance will differ a bit across different instances. However, you expect that functionality is identical across all instances of the same type (e.g. m2.4xlarge).
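
To make that direct-EC2 workflow concrete, here's a minimal sketch using boto's EC2 API (the AMI id, instance count, and region are made up, and error handling is omitted):

  import boto.ec2

  conn = boto.ec2.connect_to_region('us-east-1')

  # Bring up a few dozen "identical" workers. The whole model rests on the
  # assumption that every m2.4xlarge behaves like every other m2.4xlarge.
  reservation = conn.run_instances('ami-12345678',
                                   min_count=24, max_count=24,
                                   instance_type='m2.4xlarge')

  # ... dispatch the computation across reservation.instances ...

  # Terminate everything once the burst of work is done.
  conn.terminate_instances([i.id for i in reservation.instances])

If one of those 24 nodes advertises an instruction set it cannot actually execute, nothing in this code will tell you; the failure only shows up when a job hits the bad instruction.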


Well yeah. Except you can't ask for a given type of hardware, you can only ask for a certain instance type, right?

Basically the picloud model is you do a "cloud" call, and that runs on some instance (usually shared, picture a single instance doing some number of picloud calls in parallel) at some point in the future.

The sticking point for them was that the same instance type was backed not only by different CPUs, but by some subset of instances whose CPU capability advertising was broken.

It does actually break "instance type equivalence" for people running real code. I think Amazon will fix it though.


I thought the "changing underneath" in the title was referring to "hot" VM migration too (i.e. no downtime, but different physical hardware, like http://wiki.xen.org/wiki/Migration) but I think it's a bit picky to object to the way it's used in this blog post, even if it's slightly confusing... What other short phrase would you use?


Fair enough, it is a bit misleading. I would say " 'instance type' equivalence is broken" rather than "hardware changes underneath".

So picloud stripes (simplifying; I imagine it's a pull model with some load balancing smarts) python "cloud" calls across a variety of instances they have warmed up, but there is an assumption that, say, arbitrary Cython code will work the same way between one m1 instance and another.

That turns out not to be the case. Some (say m1 for argument's sake) instances on Amazon are broken because they advertise the wrong CPU capabilities. So not all e.g. m1s are alike...


An 'instance' is not tied to one deployment of a virtual machine on a particular box. Over its lifetime an 'instance' can sit on top of multiple machines. For example, if you stop your instance and start it again, you might find that it is now on another box with a different CPU. This, I think, is the problem for PiCloud. One moment they shut down their instance and the next, whee, no AVX.


That is a very different thing than 'changing under you', which implies that a running instance is being shuttled around to different underlying hardware, which though I think is possible, isn't what is being described in TFA.

We don't even know that the case you are talking about, stopping and then starting an instance, leads to it living on different hardware, though I would suspect that is true.

TFA is saying that when you boot NEW instances you sometimes get subtly different hardware, which when deploying binary code can lead to issues.

That does seem like a valid EC2 bug, but also not one that most of us will ever run into unless you are doing cool massive dynamic scaling like they are.


I don't know if AWS does it, but I remember going to a VMWare-sponsored industry day several years ago, and they described that for enterprise systems, the hardware was pooled and an abstraction layer sat between them and the guests. Guests could be migrated between machines while they were running and active. It allowed a system where in quiet times you could move your guest load to only part of your hardware group and put the rest into low-power mode (to save on power, it seems).

The best part of that day was a guy from the government who gave a great talk on his experiences virtualising from running separate physical servers. While he was very much in favour of it, he mentioned a few drawbacks that the vendor talks obviously played down, but probably the best thing he said was "when doing something major with infrastructure like this, get your boss at the top on board early, because everyone above you will try to scuttle it to CYA if it goes wrong."


VMWare calls it vMotion [1]. You define the host cluster and it handles the rest. Xen supports live migration [2] but EC2 does not.

[1] http://www.vmware.com/products/datacenter-virtualization/vsp...

[2] http://sysadmin.wikia.com/wiki/Live_migration_xen


FWIW, vMotion checks the supported CPU flags for compatibility before migrating. Pretty sure the CPUs must all be of the same family to be clustered together for this purpose.


VMware EVC simplified the CPU requirements that you mention. It can still be very restrictive, or more lax, depending on your requirements.

http://kb.vmware.com/selfservice/microsites/search.do?langua...

The original article just struck me as yet another company that doesn't know how to properly run applications in virtualized environments. Yes, if your hosting provider supports hot or cold migrations, you should be aware of that and develop accordingly. I do see this quite often: even today, people are surprised that their VM is not always running on the same physical box.


I think you're being overly harsh. Picloud is a super-awesome abstraction layer for running Python over AWS.

The problem they have is that they are trying to support arbitrary Python (which includes underlying libraries like LAPACK which depend on runtime CPU detection working correctly) and expecting code will work the same across identical instance types.

I think normal people would consider this a fair assumption, and if AWS's advertised CPU capabilities weren't broken for some instances, it would hold.

I don't think it's fair to say that they're ignorant of how virtualisation works. Disclaimer: I am a huge picloud fan and they have saved me a lot of time.


Did you even read the article?

This has nothing to do with not running on the same physical box; it's that some physical boxes are advertising incorrect instruction sets to the OS.


While EC2's hypervisor, Xen, does support live migration, EC2 has no support for it.

With that said, you can "migrate" EBS-backed instances by stopping and starting them, which may place them on different hardware. Indeed, if you are large enough, you'll occasionally get emails from Amazon directing you to do just that due to scheduled maintenance on the physical machines.
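
For the curious, a sketch of that stop/start "migration" with boto (the instance id is hypothetical; EBS-backed instances only, since instance-store instances cannot be stopped):

  import time
  import boto.ec2

  conn = boto.ec2.connect_to_region('us-east-1')
  instance = conn.get_all_instances(instance_ids=['i-0abc1234'])[0].instances[0]

  # Stopping releases the physical host; the EBS root volume survives.
  conn.stop_instances(instance_ids=[instance.id])
  while instance.update() != 'stopped':
      time.sleep(5)

  # Starting again lands the instance on whatever hardware is free,
  # which may well be a different CPU generation.
  conn.start_instances(instance_ids=[instance.id])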


It's not so much "no avx". It's "we said we had avx, but we don't".


Point being, that's only if you stop the instance. Since VMs can be migrated live on some vm ecosystems, "changing underneath" is very misleading in this case since it strongly implies a live migration.

If you stop your EC2 instance then you are effectively creating a whole new VM instance when you start it. That's not exactly a secret or something unexpected. It's why ephemeral storage is wiped clean and why you can only retain data via EBS (which is not local storage).

If you restart an instance (rather than start/stop) then you stay on the same hardware.


Describe a problem many people might have. Write well. Slide your solution in at the end.

I like it (no sarcasm). Good technique for writing a company blog.


Also, the post advertises their technical know-how. Being able to debug this kind of problem is not trivial.


Not only that, it introduced me to a company that somehow had completely slipped under my radar and that I really, really wish I'd known about a few months ago.

I will almost definitely be using them in the near future.


I'm a very happy PiCloud customer. They have "Environments", which let you log into an instance and install whatever you want. You can then run your code on that customized instance in the future. So your Python code can access any weird compiled packages you want. You can also "publish" Python routines, which can then be called through a RESTful interface. I use this to completely decouple my website code from my computationally intensive code. (The Django website just calls the published functions.) I've had good support on the few issues I've had.


Yep. Cloud stuff is still immature.

While guarantees that would remove all such issues ("leave me on system 'x' with no shifting sands forever!") are costly and therefore undesirable within a cloud context, this describes but one example of an edge case that could be elegantly handled by informing the node that it must go down and come up again a while later, after which its swapped-out guts with the potential to cause issue might be more easily resolved.

I would generalize this into a broader statement: "Environment-related guarantees do need to be further specified on commercial cloud providers, and a better interface given to clients when changes are scheduled".

Other areas of cloud APIs (particularly cross-cloud) that are presently missing: legal jurisdiction, site/availability zone enumeration, available hardware configuration enumeration (including network bandwidth policies), resource guarantees (including network bandwidth).

I recently posted these as bugs to CIMI @ deltacloud's teambox - https://teambox.com/#!/projects/deltacloud/task_lists


I don't think the hardware on individual instances is changing; it's that Amazon is putting new hardware online in some of the older instance classes, and you can't predict which you'll get when you spin up an instance.


I wouldn't call this immaturity on the part of the cloud, especially considering how unusual a case this is. Even at maturity, the cloud isn't going to be a 100% perfect platform for every individual use case. These kinds of checks/validations have to be handled by the client who relies on them, and it seems like the cpuinfo check does just that perfectly well.
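
Something like the following, presumably; a minimal sketch of the kind of /proc/cpuinfo gate a client could run before handing AVX-compiled work to a freshly booted node (Linux-specific, and purely illustrative):

  def cpu_flags():
      # Parse the feature flags the kernel reports for the first CPU.
      with open('/proc/cpuinfo') as f:
          for line in f:
              if line.startswith('flags'):
                  return set(line.split(':', 1)[1].split())
      return set()

  # Refuse to schedule AVX-dependent work on a node whose kernel does not
  # report the avx flag, whatever the instance type is supposed to offer.
  if 'avx' not in cpu_flags():
      raise RuntimeError('node does not expose AVX; reject it or fall back')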


Would you run a service on EC2 that you plan to keep running over, say, the next decade? And want it to run in a set-up-and-forget mode?


Would you use a hammer to drive in a screw?

Edit: The promise of cloud hosting is to be highly dynamic and let you scale up or down at a moment's notice. In order to achieve that, there are tradeoffs. It's silly to live with the cons if you don't need the pros. Each tool for its own job.


Have you tried maintaining some infrastructure over a decade? I have. And I sincerely don't know which way is better.

There are tradeoffs. Physical hardware doesn't change underneath you, but it can fail, requires expensive maintenance, and needs reliable infrastructure around it (colo). Hosting providers tend to phase out services or even go out of business completely. PaaS limits you, and providers change the APIs all the time.

Having been an EC2 user since its announcement and first open beta, I'm actually more and more inclined to think that it IS mature enough to be considered. I'm pretty sure that if I tried, I could find an AMI from 2007 and it would run perfectly well.

On the other hand, from a cost perspective, reserved instances are not that expensive, and, unlike with regular hosting providers, costs are pretty much guaranteed to go down in the long term.


I would actually argue that much of high performance computing (which benefits hugely from advanced processor instructions like AVX and SSE) is at least as good a fit for cloud computing as a web backend. Much of HPC is simulation-related work, which tends to be ad hoc and bursty and, generally speaking, is more horizontally distributable than web services (i.e. double the # of machines and the simulation will run twice as fast). Thus a pool of shared resources (read: the cloud) is far more efficient (and thus cheaper) than individual companies maintaining siloed compute resources.


Doing that is probably fine for 80% of applications.

Even internally, we have a few crusty applications that have been running in virtual environments since sometime around 2006 without any modifications other than security patching and VM migration as the underlying hardware is refreshed.

That said, if you have a real business requirement for a system that's expected to sit and spin for a decade without care and feeding, that's different. You probably want to use physical hardware that is going to be static OR use a PaaS where the provider is obliged to support your app in its as-is state for the time that it needs to exist.

In the olden days, companies that provided things like accounting and billing solutions to SMBs and vertical markets (doctors, auto dealers, etc) would put it on an AS/400 or HP 9000 that was reliable and would phone home to call a tech in to replace failed components.


I agree. I think that if I created an AMI image now, I'd be able to boot it in 2018. But how about 2023?


Actually I am building one such service now. The key factor has been realizing how bad the cross-provider toolkits are at really exposing their feature sets, in the places that they don't like to talk about: some of those listed above. Essentially, right now, you have to build it yourself. There are toolkits, but none of them are effective. I emailed an academic working in this area circa new year to compare notes, but am yet to receive a response.

ie: The assumption is that any given provider may fail, any given provider site may fail, any given provider host may fail. You have to have a programmatic, automated response for that.


>this describes but one example of an edge case that could be elegantly handled by informing the node that it must go down and come up again a while later, after which its swapped-out guts with the potential to cause issue might be more easily resolved.

Huh? They manually and explicitly stopped their own instance and then started it up again. Given that you don't pay for stopped instances it's sort of obvious that they don't really exist but are recreated upon being started again. What does amazon have to do with this?

To me, it's not an edge case so much as missing the meaning of "stopping" and "starting" an instance on ec2.


Amazon has to do with it because their hypervisor is set to disable a feature but still report it as available. THAT's the bug here.


I wrote about hardware change issue and a couple of related issues a while ago: http://www.rotanovs.com/cloud/amazon-ec2-failures/

Also, to clarify, hardware change occurs when you stop an instance (which frees it, so it can be taken by another customer), and then start a new one using the same EBS volumes.


Can distributed systems ever be fully transparent? It seems they are susceptible to subtle bugs that make it hard for them to be.

That said, with my undergraduate CS studies drawing to a close, I doubt I can do such thorough debugging. Are there any useful guides/resources one can use to understand and debug the various hardware architectures?


Hi!

> I doubt I can do such thorough debugging. Are there any useful guides/resources one can use to understand and debug the various hardware architectures?

I'm not aware of any good asm or CPU arch books which cover avx more accessibly than the Intel manuals but I imagine someone will correct me if a good reference is available. I'd never heard of avx until today since those instructions are new since the last time I had to look at x86 SIMD.

I would actually just recommend a general text like "Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems" as a first shot. I can't really answer your question in the way you want :( There is no "magic book" which will explain everything you need to know.

I'm an "intermediate level" debugger. I've spent perhaps a few hours a week doing low-level debugging for the last couple of years on x86/x64 Windows and Linux plus maybe 2 embedded archs. I can give some general advice as to what might be a good use of your time if you want to get good at low-level debugging. Obviously, anyone doing stuff like this as their career will need to go a little bit further.

Mainly I am writing this because I think I can answer your question better than saying "Read several thousand pages here: http://www.intel.com/content/www/us/en/processors/architectu.... It's not really a good use of time for most people. It is not an entirely wasted investment in your early 20's, but 5 years down the track you won't remember much of it unless you live in the debugger or work on a code generator.

The way everyone I know does it is to have a good handle on the general case and research (google :) everything else as needed. You can learn the basics in about 40 hours. I know many people who are talented at reversing and debugging, and I can tell you that they do not have (or need) detailed mental models of the SIMD extensions or (speaking for myself) even the FPU. It's cool to know, but with the exception of domain-specific work (codecs, fast math) it is not necessary to keep that information "swapped in". When you are working low-level you will find there is an enormous amount of detail to the world. That's the whole problem, and I can explain this with an analogy: imagine trying to diagnose a cracked engine block with an electron microscope.

The general idea is to step back, know the basics and be prepared to apply detailed analysis as necessary.

Priorities:

  0. Determination, patience and "can do". Common sense and knowledge of general debugging strategies.
  1. How to operate the debugger
  2. Top 50 mnemonics and references on hand for the rest
  3. Calling conventions and ABI (http://en.wikipedia.org/wiki/X86_calling_conventions)
  4. Basic OS internals (location of key data structures, heap layout, syscall conventions)

  ... (everything else)
0. The number one predictor of success is to have a kind of "I can do this" attitude even though you might not necessarily know what you are doing. The confidence that you can figure it out and the willingness to spend the time to do it. You won't always be right, but you won't get far without it. You also need the general principles of divide and conquer, basic logic and how not to fool yourself.

1. Knowing your way around the debugger really well is more useful than knowing reams about, say, the specifics of CPU architecture. So, how to set up symbols and sources, inspect the state of your process/threads. Most crashes can be resolved with a backtrace and looking at a few locals (assuming you have source).

2. If you need to read asm, you only need to know the top 50 or 100 mnemonics (if that). If you look at instruction frequencies you'll find the top 50 mnemonics make up more than 95% of all code by frequency, so you can work quickly knowing just these and look up the remainder as required. I had a reference for that (which I can't find) but a quick-and-dirty analysis (I did this on x64 Ubuntu 10.04) goes:

  $ find /usr/bin -type f | xargs file | grep ELF | cut -d: -f1 | xargs -n1 objdump --no-show-raw-insn -d > /var/tmp/allops.lst   # dump all instructions from binaries in /usr/bin/ to /var/tmp
  $ egrep -i '  [0-9a-f]+:' /var/tmp/allops.lst | awk '{ print $2 }' | sort | uniq -c | sort -rn > /tmp/opfreq.list               # get sorted list of mnemonic frequency (highest at the top)
  $ head -100 /tmp/opfreq.list | awk '{sum += $0} END {print sum}'                                                                # accumulate frequency of top 100 mnemonics
  30229337     
  $  awk '{sum += $0} END {print sum}' /tmp/opfreq.list                                                                           # accumulate frequency of all mnemonics
  30356097
    
Top 50 is 97%. Top 100 is 99.6%. If you do a more granular analysis (involving addressing forms etc) you'll find a similar conclusion holds.

3. The ABI comes next; basically because you're not going to be able to make sense of function prolog/epilog or the state of your stack without it.

4. Knowing your memory map and OS specifics really help too (so e.g. on Windows how to read the PEB/TIB, syscall convention for your OS, roughly how the heap is laid out, whether a pointer is pointing towards a local, a heap address or a library function). Again, only to a high level really.

---

The normal way to debug something like this (after googling your error, of course) would be to repro the crash, check the call stack, look at the source code for the library and figure out what path takes you to where you crashed. In this case you would work out reasonably quickly that the crashing eip is in some AVX-optimised LAPACK code and that LAPACK chooses this code path at runtime based on the advertised CPU capabilities. Then you would be confused for a bit. Eventually you would figure out that you're faulting because AVX instructions don't work, and you only reach them because they're advertised. Hence Amazon's bug. The whole process is pretty slow, but it's the standard and obvious way of doing it.

However in this case the problem they had really was that the crash was intermittent.

Based on the narrative given, the picloud guys took a more "cloud-like" approach to diagnosing the issue: they ran the "unreliable" code (plus some environment scraping, I'm guessing) across a whole bunch of instances and worked out by google and eyeball what was different about the crashy instances. This is a practical way of doing it :) It's almost a kind of "statistical debugging", if you want to put things into buckets. Most major software vendors now get minidumps when their apps crash, and this (statistical debugging) is actually an interesting field of study in its own right. It could use some more postgrad attention. See e.g. https://crash-stats.mozilla.com/topcrasher/byversion/Firefox...

---

I'm going to finish this off by explaining what would be better than reading books and references: finding excuses to do it. It turns out that debugging is mostly thankless and only buys you credit in a very limited social circle. Truthfully it's not a good use of your life unless you're the kind of person who enjoys it. Think of it like chess or go problems. If you want to be good at it, you have to find an excuse. Some motivating activities people find for doing low-level work (in no particular order) are:

  1) Cracking commercial software, writing game trainers, hacking online games (Download trials, or your game of choice)
  2) Writing exploits (Say, check CVEs, figure out if you can repro, debug until your eyes bleed, write an exploit) 
  3) Improving open source software (find a bug tracker, repro crashes, isolate the bugs)
  4) Doing crackmes (see e.g. http://crackmes.de/)
  5) Commercial reasons (work on a toolchain, compiler, embedded system ports, your $software)
---

P.S: your complete problem solving breakfast should include repro first, understanding your target, a bit of reasoning and guessing, dynamic analysis (tracing first: Process Monitor/strace/ltrace, debuggers: gdb/ddd/WinDbg/Immunity/OllyDbg, instrumentation: dynamorio/pin), static analysis (objdump/IDA pro) and copious amounts of whatever will make your life easier.

---

If you are interested I can tell you some war stories about debugging problems in distributed systems but this post is already too long.


Thanks! Especially for the part of finding an excuse for doing this stuff. Many times I start learning something, only to find I have zero motivation to continue.


The old version does AVX and the new version doesn't? That's crazy! AVX can result in a large speedup [1] in code that would otherwise not be vectorized. For existing code it can be 20%.

The best strategy may be to work with EC2, or to reject the AVX-non-compliant instances.

[1] http://www.behardware.com/medias/photos_news/00/30/IMG003051...


Neither version does AVX: the old ones because it's not supported in hardware, and the new ones because it's disabled in the hypervisor. But some packages apparently don't fully check AVX support: they check that the hardware is AVX-capable, but they don't check whether it's been disabled.
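
A blunter check, given that advertisement clearly can't be trusted: exercise the suspect code path in a throwaway process and look for SIGILL before putting a node into service. A sketch, assuming your deployment ships a LAPACK/BLAS build that dispatches to AVX at runtime (the canary expression is hypothetical):

  import signal
  import subprocess
  import sys

  # Any call that reaches the AVX-optimised kernels will do as a canary.
  CANARY = "import numpy; numpy.linalg.svd(numpy.random.rand(200, 200))"

  def node_executes_advertised_avx():
      # A node that advertises AVX but refuses to execute it kills the
      # child with SIGILL (illegal instruction), i.e. a negative returncode.
      rc = subprocess.call([sys.executable, '-c', CANARY])
      return rc != -signal.SIGILL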


The older hardware (Intel Xeon X5550) for m2.* instances did not support AVX. The newer hardware supports AVX, but to maintain compatibility with the old hardware, it is disabled (but was still advertised, which is what caused the issue).

AVX is turned on in newer instances like the cc2.8xlarge, which maps to the f2 core on PiCloud.


This is a true story, which happened to me last month:

I am at work. I log into an EC2 instance via ssh. I establish a screen session. I do some work inside of screen. Go home after work, leaving screen running.

I arrive at work the next day. Log into EC2. I type "screen -ls" and I am told that there are no screen sockets. (In my experience, this usually means the server has been restarted.) I am annoyed. I create a new screen session and proceed to get some work done. That evening, I leave the screen session running, and head home.

I arrive the next day at work. I log into the server. I type "screen -ls". I am again told that there are no screen sockets. I am now very annoyed. I start a new screen session and proceed to get some work done. That evening, as before, I leave the screen session running, and I head home.

I arrive the next day at work. Once again, I log into the EC2 instance via ssh. Once again I type "screen -ls". Once again I am told that there are no screen sessions.

This happened 4 days in a row.

I was left feeling angry, and feeling like no EC2 instance could be trusted. It also damages my productivity that I cannot rely on screen (I have in the past, on regular Linux servers, had screen sessions that lasted for many months).

Right now I have all of my personal sites on the Rackspace cloud, which I think was taken over from Slicehost. Although this is called a "cloud" service, the "slices" feel like real computers to me -- I can have a screen session that lasts for months.

The EC2 instances are strangely insubstantial, even when compared to other services that promote themselves as cloud services. Personally, I prefer to work with services that are at least solid enough that I can rely on screen sessions.


I'm confused, what exactly do you think is happening? Obviously individual EC2 instances run for years without being rebooted or having processes die or nobody would use it. (I've run services on EC2 since they first launched and have never had such issues)

What's your theory on why your screen instances are dying and how would EC2 be responsible for it?


I really do not know. I have not had the time to investigate how and why this particular service might suffer so much on EC2. I do not know if our EC2 services were suffering something that was unique to us, or whether this is a general problem with EC2. I do know that I was annoyed as hell. And I know I have not had this problem with other cloud services, such as the one offered by Rackspace.


Are you sure it isn't just a difference in the configuration or distribution you are using? Hard to imagine how screen would misbehave uniquely on EC2.


You never considered checking the uptime or dmesg?


Also:

The contents of /var/run/screen/ and the "S-<username>" subdir that should be there...

The output of "ps u" and "ps -ef | grep s[c]reen"


I've never had any issues with ec2 instances mysteriously restarting and I've dealt with a lot of them (personal and business). In fact I don't think I've ever seen amazon restart instances or mess with them in any way without warning.

Sounds like you've got a configuration problem somewhere (OS on your instances, software you're running, etc.) on your end and are blaming Amazon for it when they had nothing to do with it.

Now, potentially the instance is on bad hardware (stop/start it if you've got no ephemeral storage to lose, which will put you on new hardware), but that can happen even if you're running your own hardware. However, you've done so little investigation that blaming Amazon is downright bad IT (is this how you deal with other bugs, blame the first thing you can think of and rant about it?).


Agreed, the one time Amazon needed me to migrate an instance they told me when they would forcibly stop/start the instance and allowed me to do it beforehand. How did you set up screen on EC2? I've had problems with it in the past but got it to work pretty well recently.





