Patching Is Hard

dpiers · on May 13, 2017

One of my favorite stories involving patching and security vulnerabilities comes from Jonathan Garrett at Insomniac Games:

"Ratchet and Clank: Up Your Arsenal was an online title that shipped without the ability to patch either code or data. Which was unfortunate.

The game downloads and displays an End User License Agreement each time it's launched. This is an ascii string stored in a static buffer. This buffer is filled from the server without checking that the size is within the buffer's capacity.

We exploited this fact to cause the EULA download to overflow the static buffer far enough to also overwrite a known global variable. This variable happened to be the function callback handler for a specific network packet. Once this handler was installed, we could send the network packet to cause a jump to the address in the overwritten global. The address was a pointer to some payload code that was stored earlier in the EULA data.

Valuable data existed between the real end of the EULA buffer and the overwritten global, so the first job of the payload code was to restore this trashed data. Once that was done things were back to normal and the actual patching work could be done.

One complication is that the EULA text is copied with strcpy. And strcpy ends when it finds a 0 byte (which is usually the end of the string). Our string contained code which often contains 0 bytes. So we mutated the compiled code such that it contained no zero bytes and had a carefully crafted piece of bootstrap asm to un-mutate it.

By the end, the hack looked like this:

1. Send oversized EULA

2. Overflow EULA buffer, miscellaneous data, callback handler pointer

3. Send packet to trigger handler

4. Game jumps to bootstrap code pointed to by handler

5. Bootstrap decodes payload data

6. Payload downloads and restores stomped miscellaneous data

7. Patch executes

Takeaways: Include patching code in your shipped game, and don't use unbounded strcpy."

source: http://www.gamasutra.com/view/feature/194772/dirty_game_deve...

Godel_unicode · on May 13, 2017

This is my favorite comment. I would like to propose an additional takeaway; validate all content downloaded from the internet.

solatic · on May 13, 2017

> we have to get updated database software from a vendor, and to install it we have to update the API the billing software uses

And so why haven't you updated the API the billing software uses, long before now? Why haven't you updated the database software before now?

CIOs create risk when they don't prioritize keeping their products up to date. When you can't even install security updates without breaking your installations, you have a problem. And your problem is more than some technical problem, it's a cultural problem.

Yes it's risky to update a large number of machines. But as a CIO, it is your job to mitigate that risk. There's no such real thing as "unexpected security fixes" in this day and age. They are entirely expected, and if you cannot deal with predictable occurrences then you are quite simply incompetent.

korethr · on May 13, 2017

I think the reason that the billing software's API hasn't been updated is right there in it's name: "billing". Billing is one of those things in an organization that can't suffer much downtime. Nobody wants to be that guy who who endangered revenue because there was no possible way that the security patch should have been able to break the billing software.

Does billing work? Yes? Then don't mess with it. Billing is only to be messed with when the cost of not messing with it is that it will never work again and all prior data will be lost forever. Nothing short of that risk is sufficient justification to touch anything related to billing.

I'm being hyperbolic there, but conservatism around systems that presently work shouldn't be terribly surprising, something like billing especially.

solatic · on May 13, 2017

> Does [a service with high uptime requirements] work? Yes? Then don't mess with it

Ah yes, the "don't fix what ain't broken" canard. And it would be completely understandable if we didn't understand that, in general, the best way to ensure overall uptime is to encourage small, frequent updates over large, infrequent updates, because change is inevitable and the risk of the update is proportional to its size.

I understand that you're being hyperbolic, but that kind of conservatism is born of ignorance. Expecting CIOs to be educated about mitigating risk in the systems they are in control of is not a high expectation for someone in a CIO role.

csydas · on May 13, 2017

A lot of this is a social issue too - it's IT professionals over-committing with SLAs and being too passive when it comes to discussing terms to set realistic RPOs for fragile systems when the resources aren't available for proper patch testing.

It's very difficult to explain to end-operators of systems the importance of having things like redundancies, test systems, and the ability to have downtime for patching, but it's something that IT Professionals need to be way better about. It's very tempting to throw out goals like 99.9% uptime, but many operations run with an employee bandwidth that in no way can support such a goal for the number of systems they need to deal with.

To be fair, sometimes the end-operator needs require some absolutely antiquated pieces of technology that rely on voodoo like rituals to keep the systems running, and trying to shift organizations off this technology is the diplomatic equivalent of a land war in Russian during winter, and IT administrators [1]want to avoid getting into such a battle.

Hopefully, this Ransomware outbreak will help provide disruption to such pieces of technology that are stuck in the past, but part of it is going to require that the new technology makers be willing to respect why so many organizations hang on to older technology. (This tends to revolve around pricing models) I think there is going to be a lot of opportunity to review major systems that have big restrictions on legacy software and hardware and overtake the incumbents that aren't willing to shore up their products.

[1] Edit: removed too many mixed metaphors from one sentence O.O

tannhaeuser · on May 13, 2017

> (This tends to revolve around pricing models)

Care to elaborate? Do you mean that newer software comes with mandatory maintenance costs that users are unable to unwilling to bear? In that case, paying for security patches and maintenance should be palatable to customers in this context, shouldn't it? Or did you mean something else?

joncampbelldev · on May 13, 2017

All of that thinking is just a bet, against the future possibility of an essential need to patch/upgrade for security reasons. They are reaping the short-term benefit of guaranteed uptime, in exchange for being able to do anything easily when you really have to in future. Because catastrophic events are rare, many companies think they're just being really clever and there are no such consequences.

sasas · on May 13, 2017

The vulnerability used by this worm could have been mitigated by disabling SMBv1, hardening the machine, network segmentation .. the list goes on. If you couldn't patch, you could have still prevented this worm from impacting you organisation.

It's a case of inadequate security management and negligence.

KaiserPro · on May 13, 2017

"We've got to install MS17-010; these are serious holes."

"We can't just yet. We've been testing it for the last two weeks; it breaks the shipping label software in 25% of our stores."

But in this case its bollocks. The patch is easy and doesn't fuck too many things.

at $work, we've had the same issue. We have three estates Windows, Linux and some solaris. The Linux estate is patched within hours of upstream fixes. Staged, starting in dev, and bubbling up to prod.

Windows, I've discovered has auto updates turned off. The servers are not in config management, or monitored.

Its not because patching is hard, its because its not seen as important, despite being repeatedly hit with cryptolocker malware.

Its just utterly pathetic.

equalunique · on May 13, 2017

Any insight on how this is going with Solaris? The boot environment feature seems like it should ease the uncertainty of patching a bit.

KaiserPro · on May 13, 2017

Its not new solaris, so I have no idea what the mechanism is, we just run the patch and reboot...

equalunique · on May 17, 2017

Ah. I believe it's Solaris 11+ where this is best supported.

spydum · on May 13, 2017

He is right, but patching is a single layer in your defenses. Where was network segmentation? What about IPS/IDS?

You would hope one layer of failure doesn't take out an entire business.

scottLobster · on May 13, 2017

The issue is non-technical people don't understand what those are and care even less what some IT nerd is complaining about. The more intelligent ones with more sociable IT staff might even be brought to agree in principle, but talk about taking the network offline for serious upgrades (and the money to buy equipment/software/extra staff for said upgrades) and the door slams shut.

On a micro-scale, I got my sister (a political science major now working for the UN) an SSD for her aging HD-based MacBook two years ago and offered to install it for her, do a complete system transfer. She was all gung-ho until I said I'd need to take her laptop offline for a few hours to do the transfer. She'd rather deal with firefox taking 30 seconds to load, and waiting minutes to switch tabs, instead of taking a few hours on a weekend for an upgrade. She simply thinks she can't be out of contact with her coworkers for that period of time.

Sadly I think she's more representative of the general population than anyone here is.

AdamN · on May 13, 2017

If you upgrade early and often, life will be tough but consistently so. If the operation is big, roll the patches. If you know there's a problem ... it should never take 3 months to fix. If it takes you 3 months to deploy a software update you have 2010 procedures in 2017.

discreditable · on May 13, 2017

> Upgrade early and often.

This is my motto. I have the suspicion that Microsoft and other companies don't QA older software as hard as their latest & greatest.

Some folks say I'm crazy but I auto-approve all security updates in WSUS. In 5 years at my company, updates have only broken important software twice. In both cases I just uninstalled the updates with WSUS and everything was okay. In today's environment, I feel that the risk of not patching is greater than the risk of patching—even when you consider buggy software.

rbc · on May 13, 2017

I think the problem is more that budgets don't take into account for enough integration testing, packaging, or automated patching to support enterprise environments.

There was a time when there were fewer threats to Internet based infrastructures and skipping patches didn't matter so much. Those times have passed and there are now significant threats arrayed against Internet systems. Additionally, commerce and government services are now predominately Internet based, putting far more at risk than before.

It's time to start putting more money into fixing the patching problem. I'm really tired of fixing broken systems that I was trying to patch. No good deed goes unpunished...

stonogo · on May 13, 2017

Ransomware is crippling hospitals. People's lives are on the line. And the tech community is in a frenzy of excuses, whining, and hypothetical bullshit about shipping labels. Sagely pointing out to each other that hospitals aren't tech companies, like only tech companies know how to use computers.

Sometimes this industry disgusts me. "X is hard" -- what the fuck is your profession? Easy shit? Fine, step aside.

We have let our civilization down. Whining that X is hard is not going to fix anything. Take the week off and put in some pro bono consulting time with any nearby organization that got hit. Make things better. Fuck your blog posts.

dasil003 · on May 13, 2017

Sorry, I didn't catch where your volunteer shifts are happening?

In the very first sentence you shame commentators for presuming to be more competent than hospital staff, then you go on to suggest riding in on a silver steed to bless them with your powerful expertise.

Frankly, a bunch of startup hackers showing up and trying to play hero is not going to be any more effective than this blog post. This doesn't get solved with arrogant cowboy antics, and it doesn't get solved in a week even by seasoned experts. You have no knowledge of the ecosystem of devices operating on their network, and the constellation of concerns they must balance, and thus you can't offer anything but the most general of advice with which their IT staff is certainly already familiar.

If you really want to make a difference, go apply for a job there and put in a few years of work—that is likely to have impact. Short of that, there's worse things you can do than write a blog post; at least a few of them are probably providing useful perspectives to those actually tasked with solving this problem long-term.

stonogo · on May 13, 2017

My shifts are happening in two elementary schools, three urgent care clinics, and a local nonprofit's office. Why did you think I was speaking hypothetically? I've already taken the next two weeks off. When I'm done with this batch I've got more to do.

Your entire second paragraph is bullshit. The affected are being affected because their existing technology deployments are broken. This isn't just some nightmare that happens to everyone, and it isn't some abstract structural issue that only affects large organizations. There are a lot of groups getting screwed here. They need help. If they didn't need help, they wouldn't have got hit. I am helping.

You can sit by and armchair-quarterback the incident response, that's fine. We don't need you anyway. Useful perspectives can go pound sand. There is work to be done.

dasil003 · on May 13, 2017

The OP is talking about big orgs, so it's kind of a dick move to hijack the conversation and proclaim your agenda to be morally superior. You don't have to put other people down to make your point. It's childish, counter-productive, and you leave yourself open to criticism that your work is also The Wrong Thing because you're not volunteering for some greater cause like helping sick and dying people in developing nations. But I guess if a desire to be better than everyone else motivates you to do some good then I guess it's not a total loss?

chronid · on May 13, 2017

Patching is not a technical problem, 90% of the time. It's a political one (in medium to big operations). We have let our civilization down? No, our civilization is still asleep believing than when we say that patching is important, that rebooting your pc to install upgrades is important, we are spewing shit.

You need to patch, but big management wants X features implemented before the fix that will actually allow you to patch stuff without creating problems. And the only thing you can do is fix stuff after it breaks and look at your manager(s) when they leave in the day shit hits the fan at 6PM, as usual. And then stay in the night restoring data and praying it's not corrupted or infected too.

And in that moment you are sure of one thing: it will happen again, next time. Our civilization does not care.

dredmorbius · on May 13, 2017

Respecting that much of what you say here is true, and acknowledging that I've got a foul mouth and attitude problem myself at times, as well as an increasing abiding contempt, bred from far too much familiarity, with information technology and the infotech commercial world, the reality is that much of this stuff is, in aggregate and in practice, hard.

For ... various reasons, including a great deal of failure-to-anticipate (or more often: Choosing To Deny) Reality and Consequences. Which Bellovin details, and has been detailing for damned near five decades now.

But that still leaves us Where We Are Now:

* With hundreds of millions of Broken Systems.

* More being shipped daily.

* Strapped organisations and staffs.

After fighting the present fire, some sensible approaches to reducing future threats, including cranking up consequential liability on the firms, and backers, and financers, and stockholders, which produced the present mess, such that future risks are properly costed in to decisionmaking, might be a useful activity.

Thank you for your volunteer service.

joncampbelldev · on May 13, 2017

"Patching is Hard", yes but the brick wall of a massive cyberattack is harder. It's not an excuse, not patching is a bet, its people saying "This is difficult so I'm going to skip it, in exchange for increasing the chance of a huge amount of difficulty in the future"

I bet all the hospitals are really glad their managers/IT staff avoided the difficulty and small uptime impact of patching now ...

Hopefully budgets will be allocated in the NHS at least to prevent future incidents like this.

rietta · on May 12, 2017

I would say that patching is maddeningly difficult in the absence of automated end-to-end tests. It's much easier with those in place.

akkartik · on May 13, 2017

Absolutely. And for a flaw in your OS you need an OS with automated end-to-end tests. Good luck!

(Speaking as someone who's trying to imagine OSs with automated end-to-end tests: https://github.com/akkartik/mu)

rkeene2 · on May 13, 2017

I have automated end-to-end OS testing. It's basically a small bash script which spins up a number of VMs in various network configurations (since my OS is only useful in a cluster, as it provides Ceph and a cloud API to KVM). One of the fully automated tests is even an upgrade test where the previous release is installed and then the currently built version is constructed into a patch, installed, and verified that all the VMs it started pre-upgrade are still running, then builds some more VMs (these are nested VMs since the test systems themselves are VMs). It's pretty simple, it starts by installing the ISO and driving the systems over their console.

w4tson · on May 13, 2017

Woah! That's no small side project. Impressive

RachelF · on May 12, 2017

Microsoft used to do decent testing of their updates.

In the last two years, they seem to be saving money by letting their users to the testing.

When a patch comes out we need to weigh the unpatched security risk vs the risk of the patch breaking things.

cryptarch · on May 13, 2017

Microsoft, training the next generation of operations teams to use staging environments.

monksy · on May 13, 2017

"It's agile."

tempodox · on May 13, 2017

Nice to see a sober reaction to all the victim blaming that's been going on. Reality is eventually more complex than some commenters would like to acknowledge, even on HN.

based2 · on May 13, 2017

http://marc.info/?l=patchmanagement

jimmcslim · on May 13, 2017

It's certainly an argument in favour of adopting PaaS wherever possible, but then you're making the assumption that the people running your PaaS are competent...

dredmorbius · on May 13, 2017

Operational compentency is something which a sufficiently well-funded and -capitalised organisation can accomplish.

Though there are systemic risks involved ehre as well.

I'm quite torn, myself.