Moved a server from one building to another with zero downtime (reddit.com)
1043 points by huhtenberg on Aug 5, 2020 | hide | past | favorite | 400 comments



I had to search the reddit comments for 'vmotion'. They have it covered.

This anecdote is an amazingly good story for telling at the pub over a few beers. It's a terrible story for a strategy.

If this is a mountain, my molehill is that one night in the late 90s, I got paged because the SMTP outbound server was overheating. At midnight I drove across sleepy NH backroads, and stopped at a Wendy's to get a chicken sandwich and iced tea, for the caffeine.

When I got to the server room, I pulled the 2U Dell server out of the rack and discovered the CPU cooling fan had seized up. Mind you, this is a New Hampshire data center in 1999, and it has a filing cabinet with manila folders, and carpeted floors. This thing was never prepared for any disasters.

A half hour later, the SMTP server was up and running cool again.

I greased the fan with the mayonnaise from my sandwich.


The real lesson is the one the teller of the tale sort of acted on initially -- fire the customer.

If the story is true, the client is a stereotypical know-it-all small business owner who gets by on bullying. You see them frequently in businesses that pay low-skill workers a small premium, which makes those jobs hard to replace. (ex: cleaning services, pool guys, mechanical contractors that do low-end maintenance work, etc)

As a contracted SME, taking a job like this is dumb. The chance of failure (where "failure" == the server going down) is high, and the customer will just stiff you.


Key is to set the price. If the job is a hassle, you're not charging enough. That will also filter out bozos. Sometimes people really do have extreme requirements. And when they do, they're willing to pay 10x for it.

Agreed though, this particular customer disqualified himself as soon as he said he won't pay if the server goes down. He should have offered a big bonus if the move succeeds without downtime.


When I did consulting, we always got unreasonable but technically not impossible asks like this. We never "fired" the customer, because that's just bad business and customer service. What we did instead was tell them their options, our recommendation, and "appropriate" billing estimates. Your job is to consult them to the best of your ability, not stop them from bad decisions despite your advice.

So, 1 hour billed for 5 minutes of downtime, or 40 if you want absolutely none. Happy to do either, but highly recommend the former. 99% of people pick the cheaper recommended option.

In this case, I would've tried to put the server on WiFi, which would seem like less of a hassle for me. Equipment acquisition cost billed to the customer.


> Your job is to consult them to the best of your ability, not stop them from bad decisions despite your advice.

If you offer a client an option with a 1% chance of a very bad outcome and they understand and accept the risk, sure.

But if they have not understood and accepted the risk? Implementing something you already know the client misunderstands ain't always a smart move.


That's consulting for you.


Yep. Then it's extra billable hours for you to fix the mistake you knew was going to happen, and/or the lawyers get involved and point out that the mistake was requested in whatever statement of work. Even better if you called out the exact failure scenario in the risks section that you suspect they didn't read.

At the end of the day, it doesn't really matter if they listen/understand or not, as long as you documented telling them or it makes it into some legal document and they still pay the bill.


> Key is to set the price. If the job is a hassle, you're not charging enough. That will also filter out bozos.

Sure. Just, there are some verticals where charging a "positive-ROI" amount gets you no business at all, because all the potential clients in that vertical are businesses that operate on such razor-thin margins that they don't actually have the cash-flow to pay for the extreme requirements they also have. They've been getting along until now purely by begging/tricking/manipulating people into doing negative-ROI one-off tasks for them. If forced to contract all the services they need out on the free market, their business would cease to exist.

(Therefore, you say, they should cease to exist. I'm not arguing!)


> there are some verticals where charging a "positive-ROI" amount gets you no business at all,

if you do, you're just selling a dollar bill for 80c. You may be drinking the growth kool-aid, hoping for a someday-monopoly, or counting on VC to subsidize the business.

In the end, somebody pays for it either from stupidity or hope.


To be clear (and I realize this wasn't quite clear), I wasn't talking about pure monetary ROI, but rather "return" in terms of satisfaction of your preferences.

I.e., a job paying a lot of money is "return" on investment; but so is a job that pays a lot of experience, or satisfaction. Meanwhile, a job that takes a lot of time to do requires a larger "investment" (in terms of the set of BATNA jobs you could otherwise be doing with your skills); but so do jobs that don't take very long, but are more physically demanding/emotionally draining, or where the client keeps jerking you around by changing the requirements. If working the job makes your utility go up (i.e. makes your life better), it's a positive-ROI job. If it makes your life worse, it's a negative-ROI job.

So, let me restate: in some verticals, while the clients do pay a livable time-and-materials wage for the region and experience level... they expect far more labor (and far more highly-skilled labor) for their dollar than that dollar "should" be able to get them, such that the net utility from anyone taking that job is negative, even if they're "making money."

These businesses get by mostly on exploiting young IT ingenues who don't yet realize their worth, and so are willing to give a try to e.g. "program a social-networking app" for $2000; or to e.g. "set up a PXE Linux automatic memory-testing boot env and Windows system-image deployment pipeline" for $300.

(These are both things I naively got myself into when I was younger. Amusingly, I was in over my head on the social-networking app, but managed to complete the hardware qualification+deployment env just fine. And my only major obstacle for the social-networking app, was that the client insisted that it needed to be hosted on some free ISP hosting they had, and thus needed to be written in PHP4. Which I didn't even complain about... it was just a language I wasn't very familiar with, so I tried to learn it while working. Sigh.)

In other words, these companies try to get newbies to do senior-level work, paying only newbie costs. Sometimes, surprisingly, that works! And they survive entirely because it sometimes works.

It's basically the "my nephew knows computers, he can certainly repair mine" mindset applied to corporate planning.


Evidenced precisely by all these ransomware attacks where victims have to pay up because they didn't invest in backup resources (maybe they had some tapes, but never had the staff or process to test any of it... because they hadn't needed them so far).


> Agreed though, this particular customer disqualified himself as soon as he said he won't pay if the server goes down. He should have offered a big bonus if the move succeeds without downtime.

I don't know, that sounds too close to encouraging the attitude of "it's not worth it, I'll just take my normal pay". 10x vs. 0x is a significantly stronger incentive than 10x vs. 1x.


This is probably the best compromise. Sure if you're moving a server with, idk, top secret government spy files or something then you charge 10x or some multiple. But that would obviously be the exception. The vast majority of these people are just self-entitled dipshits who have an inflated notion of self-worth.

See my comment elsewhere on "the customer is always right."


I deal with a fair number of customers like this (not via my business), enough that firing them all isn’t an option because they’re a large portion of the market where I am. It’s something. They’ll have low end servers with no redundancy, terrible or no backups, and no contingency plans for anything. They won’t spend a nickel and are the most likely to lose their minds if anything goes wrong.

It’s so frustrating and stressful.


This is why services like Google Apps or managed exchange hosting exist. Most people are terrible at IT management. So bad they're far, far from realizing how bad they even are.

When you consider you're getting like 1000 of the smartest tech people in the world to manage your infrastructure for you, for $5/month per user, it's really such a no-brainer. If people are too stubborn to see that, or want to waste time trying to do it better themselves "because it's cheaper", with redundant power, OS patching, zero-downtime changes/deploys, proper capacity planning, proper redundant connectivity, provisioning the right network around it, physically securing the server room, ensuring things are properly cooled, not wet... I could go on forever, and this isn't even the main focus of the business... I'm sorry, but they deserve to go out of business.

I had a very stubborn client once who ran a hotel chain. Won't say what or where, but I wasn't surprised when their random "security through obscurity" VNC server got compromised. I hadn't finished migrating to the new PCI-DSS compliant system we built, either, so there went 5000 credit cards "encrypted" with some sweet rot13-level bullshit in Turbo Pascal I cracked in about 30 minutes with no code access.


Seems like some dedicated hardware here would make this go faster, if there's a business for it. For example, if the bandwidth isn't high, you could set up a wireless mesh from point A to point B and connect via some appliance to the NIC.

Walk the length with the appliance and verify there's no dead spots, then just hook up to a power supply and get things done.


In the 90's wifi was even more of a trash fire than it is today.


Presumably at least part of the value of the job from his POV was the excitement of it. I've definitely done suboptimal jobs where I just enjoyed them a little and they were a break from the routine tasks.

Since the customer is never going to know the awesomeness of it, it's really just for yourself.


There’s a market for everything. The “shit customer” would have hired someone. Sure things could have gone to shit and they wouldn’t be paid, but they didn’t, and there is a living to be had serving this market segment. Every contracting business has difficulty with AR (accounts receivable); stereotyping often isn’t even particularly accurate. Suffice it to say there is a business to be had in catering primarily to difficult customers.


> stereotyping often isn’t even particularly accurate.

While I agree that it's difficult to predict exact credit risk based upon customer personality, explicit threats not to pay you, like the one in the story, -are- a bit of a signal that there may be a risk of nonpayment.


Sure, I think the point was that there is a market for customers with subprime credit. Sometimes it is fairly profitable too.


Depends on what you mean by business. The nature of that type of business is that it's not very predictable or repeatable, so you get one small chunk of business, but ultimately it's not the kind of business you really want (even for a consultancy whose job it is to throw hours into the fire). Scaling that out would be death by 1000 papercuts.


No. If anyone is going to serve the subprime market effectively and profitably, it is someone that can scale it out. That’s the whole point.


You misunderstand what I am saying. I meant that the nature of this type of business is that you can't really scale it out.


The (probably soybean) oil is a fine lubricant, but the constant motion should cause the egg proteins to coagulate. How long did it operate before you replaced the fan properly?


I can't help smiling at this analysis. It feels like something you'd hear from an engineer in a sci-fi story, working on a rundown ship that just keeps going no matter what ...


That book is Expeditionary Force


books 1 to 3 are awesome. after that the author forgets how to advance the plot while still churning out a new book more than once/year.

I gave up on book 7, so yeah, I tried sticking with it.


So you're saying the plot seized up and the author wasn't able to engineer a solution?


In the foreword of one of the books the author mentions he quit his day job after the first book was a hit, so I can understand his financial need to churn out more books.

Unfortunately there is little plot advancement, the perilous situations are more contrived, and needless exposition/filler is the norm.


Expeditionary is my cleaning audiobook.

As in, if I’m cleaning and have nothing else interesting to listen to.

Some occasional good laughs. Dinosaur holding a plunger badge. :)


That or he too ran out of mayonnaise.


That feels like the Honor Harrington series to me. :-( I thoroughly enjoyed the first N books I read (5? 6?), but the next one or two seemed like watching a serialized TV show that never resolves tension points, because if they did there'd be no reason for Season N+1.


I read the first 100 pages of the first book and then literally threw it into a fire.

I believe the "Nope" sentence was something akin to "{character} thought that {thing} because {thing}."

Jesus Christ. Would it kill you a little to show instead of tell?

(And lest people believe I'm not a fan of some good * opera, I'm not ashamed to admit I've read my fair share of BattleTech, Barsoom, and even Lost Fleet, among less highbrow works)


Ethbro thought that the book was not good because of the writing.


You laugh, but that's exactly how jarring it was.

It sticks in my memory because I remember being somewhat annoyed at the writing level hitherto, reading through that sentence, realizing what I'd just read a few sentences later, going back to double-check, then chucking the book.

And don't get me wrong, I've got space for some crappy writing in my sci-fi (looking at you, Foundation).


You literally threw it into a fire?


I stand by my decision. The world is a better place.


Well thanks for that recommendation. Added to my reading list


I'm reading the series to the end no matter what.


Reminds me of the foosball table in our old engineering students' society room. You can buy special lube for foosball bearings. OR you can just rub popcorn butter on the bars. Guess which one we had in ample supply.


The egg proteins are already quite coagulated. I'd be more worried about the vinegar component. You need to neutralize that acid with something.


The egg proteins are coagulated but dispersed in the colloidal solution. The motion brings them out of solution. You can try it at home by warming up some mayo in the microwave and then rubbing it between your hands: you'll get a stringy oily mess.


To play this discussion out further: it depends on the heat and it depends on the motion. I’ve made plenty of Hollandaise (which is a sibling of Mayonnaise) in the blender with “boiling” butter poured in, and it stays quite hot... especially when it continues to warm on the stove top.

If I microwaved it from cold, it would break almost instantly.


Haha :-), I want this dialogue performed in Space Janitors or something of the sort


It's also got corn syrup. Would that cause any problems for this application?

Here's the ingredient list for Wendy's mayo: Soybean Oil, Water, Egg Yolks, Corn Syrup, Distilled Vinegar, Salt, Mustard Seed, Calcium Disodium EDTA (To Protect Flavor)


> It's also got corn syrup. Would that cause any problems for this application?

Over time, the server would expand to be 4U rather than 2U


That's the problem with the FAT filesystem, it grows over time.


> corn syrup

That’s how you get ants.


Maybe it was vegan mayo? Oh wait, 1999.


It probably lasted long enough to get a replacement fan installed the next morning.


Why would it need to be replaced now that it's working again? ;-)


That reminds me of the time I found an appropriately shaped bolt installed in a fuse-holder - presumably someone did not have a replacement fuse and improvised.

Except that it had been 5 years since the last maintenance in this place and it was a protection panel for a large synchronous generator in a power plant.

After you make a heroic temporary fix, please, ensure the permanent fix is applied later!


> After you make a heroic temporary fix, please, ensure the permanent fix is applied later!

I've known people who would, depending on exactly how bad the failure was, outright refuse to apply temporary fixes precisely because they didn't believe that the business would fix things properly if the issue wasn't forced. And having watched how that particular company handled things, I can't say that they were wrong.


This chart seems relevant in this case.

https://images.app.goo.gl/db84Dmv3sqyVEwfz5


I am not sure how or why, but your link expanded to this:

https://www.google.com/imgres?imgurl=http://i.imgur.com/TtFo...

...it looks like Google is referring to imgur by way of reddit? In any case, here are the links that it goes to.

https://old.reddit.com/r/funny/comments/26vx0x/handy_fuse_re...

https://i.imgur.com/TtFotWu.jpg


I've seen it said on this forum before and it also aligns with my experience: Most fixes that are 'just for now' are actually 'forever'


All temporary fixes are, in practice, not temporary.

I also think the converse holds (all permanent fixes are temporary).


There's nothing more permanent than a temporary solution.


It's a slow blow


It's 2008. A manager that just doesn't care anymore tells the new IT person to replenish the fan mayo.

-"Why? I don't know why. It just works."

-"Is Hellmann's ok?"

IT person documents that Hellmann's is preferred.


And then later on Hellmann's is discontinued, so the company solicits quotes for a mayo supplier.


Surely it would be an RFP that insists on SLA guarantees, generating profound confusion at Hellmann’s and Kraft.


I'm a young developer, so I've never had the chance to work with on prem servers (and the chances that I will are looking slim), but I've always loved these "war" stories.


At my former services company, there was the story of the server room that had become much, much more important and reliable thanks to massive investment (like a diesel generator), but the teams hadn’t grown enough in maturity. One day they had a problem with a server. A system admin was granted permission to go to the bay, since remote desktop didn’t work. They discussed the problem in front of the bay. One of them leaned on another bay that was just sitting between two locations. The wheels weren’t locked. It just flew across the room.

It was ok, just the power cord and a few RJ45s torn. No serious damage besides downtime.


You can buy yourself a pair of old Dell servers from craigslist or eBay for a few hundred bucks. With a $200 membership to VMUG Advantage you'll get all the licences you need to build an enterprise grade cluster.

Build yourself a home lab and learn how systems work. Figure out what is really running your code. Learn how to resource optimize.

Don't end up only being able to work on webapps and small datasets that fit comfortably in the cloud.


I'm not sure if this is still valid advice. I'd just stick 64GB of memory in a desktop PC and have the home lab running on VMs.

5 years ago I left a company running loads of enterprise software and web applications for other companies in a couple of data centers. We had well over 10,000 physical machines. When I left, about 90% of the workloads were virtualized. Running on bare metal is the exception today.

Sure, it's cool to have hot-swap HDDs, hot-swap PSUs, redundant network cards and a remote management board, but none of that is rocket science. You can learn it in a week if needed.

Networking is a much deeper topic, yet almost nobody would recommend you set up a couple of switches and hardware firewalls in your home network. Today you expect that the hardware just works. In the real world it is extremely complicated to correctly design and run networks, but you aren't going to experience any of these problems in a home lab.


> a couple of switches and hardware firewalls in your home network. Today you expect that the hardware just works.

At the end of the day, you need the guy who makes it "just work". That is a valuable skill as well.

Also knowing how your application works when you just yank the power cord on half your VMs teaches you a ton.


Being able to operate server-grade hardware is a nice skill. There are features in server hardware (like lights-out management, dedicated RAID controllers, etc) that you don't always get in a desktop PC.


That's been a plan of mine. Right now, my home server is just a Pi running SMB and Jellyfin, but the plan is to expand into some used hardware. Seems like used server hardware is one hell of a deal.


I miss the days when the servers came with seating (Cray).


I’ve once used cooking oil as a thermal paste substitute. Worked well enough and nothing went wrong.


That sounds too close to a fire hazard


I doubt there's any component in a PC getting remotely close to the ignition point of oil (which for canola is 424°C). Plus, it's going to be a minimal amount of oil.

I'm more worried that the oil won't get to a high enough temperature and thus won't polymerize, so it'll flow out and ruin some other component, or go rancid, or something. Thermal paste won't move on you. Oil will.


Oil will certainly move on you, but it might not destroy your components, depending perhaps on the specific oil chosen: you can actually buy or build fully oil-cooled PCs.

https://www.pugetsystems.com/submerged.php as an example.


That's mineral oil, not cooking oil. Mineral oil doesn't go rancid.


Presumably there’s enough surface tension from both sides to hold it between the gaps and resist gravity (if it’s a vertical CPU).

Pretty much any Solid-liquid-solid or solid-solid-solid interface will be better than solid-roomTemp&Pressure gas-solid.

The whole point is to conduct heat better than air, and most things will.


Mayo is around 1 part water to 1 part oil ratio..


Mayo usually has around 80% fat content.


Sort of on the subject, I've seen a brochure for a specialty product marketed to law enforcement. It's meant for use with the seizure of live, powered-on desktop PCs and similar that have a high likelihood of full disk encryption.

Essentially it's a medium-sized double-conversion UPS, with a really high quality sine wave inverter, and some electronics that can match phase with a live 120 VAC 60 Hz circuit. And a tool kit which consists of the insulated electrical hand tools needed to do a midspan removal of the cable jacket and splice into the wires of an ordinary PC power cable. The person using it is of course supposed to be trained in advance, and competent at the process of attaching the UPS to the live circuit.


In a similar vein, there are USB gadgets that emulate a mouse that keeps on jiggling, to prevent the machine from locking out on user inactivity.

However, there are anti-jigglers too that lock the machine when any new human input device is plugged in.

http://codefromthe70s.org/antijiggler.aspx


That's interesting.

You could have a list of known USB device IDs you trust, and if a newly plugged in USB device wasn't on that list you could lock or power down.
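A minimal sketch of that idea on Linux, assuming pyudev is installed and that locking every session via "loginctl lock-sessions" is the reaction you want (the allow-list entries below are made-up examples, not a real policy):

    # Hypothetical USB allow-list watcher: lock the session when an unknown
    # device is plugged in. Assumes Linux + pyudev; adjust the reaction to taste.
    import subprocess
    import pyudev

    ALLOWED = {("046d", "c52b"), ("05ac", "024f")}  # (vendor_id, product_id) examples

    context = pyudev.Context()
    monitor = pyudev.Monitor.from_netlink(context)
    monitor.filter_by(subsystem="usb", device_type="usb_device")
    monitor.start()

    for device in iter(monitor.poll, None):  # blocks until the next udev event
        if device.action != "add":
            continue
        ids = (device.get("ID_VENDOR_ID"), device.get("ID_MODEL_ID"))
        if ids not in ALLOWED:
            # Unknown device: lock all sessions (powering off is the harsher option)
            subprocess.run(["loginctl", "lock-sessions"])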


That is a policy I've heard being used even in environments that are already not extremely secure, like software development at a bank (completely isolated from the production environment).

They didn't go so far as to cause alarms on unknown device ids, but devices would just not be mounted if they were not whitelisted.


About 13-14 years ago some parts of the US DoD resorted to filling all the USB ports on desktop PCs with hot glue, except for the two ports required for the keyboard and mouse.

This was during the Windows XP era when it seemed there were an endless number of security problems related to USB devices, no matter how good the group policy and registry settings pushed via Active Directory membership were.


My company stayed on NT4 until 2008 because it didn't have USB support. Network was fully locked down and any unknown MAC would cause an immediate search by IT.


Did they also remove the MAC address info off the back of everything? Because spoofing a MAC is fairly trivial.


They probably did. The sort of IT folks that would run a decade old OS are the same kind that would resort to this sort of security theater to "lock down" their network. Capturing MAC addresses off a device is pretty simple if you don't mind a little bit of connectivity loss during the process.


Also, performance must have been amazing using Office '97 on current day desktops.


>About 13-14 years ago some parts of the US DoD resorted to filling all the USB ports on desktop PCs with hot glue, except for the two ports required for the keyboard and mouse.

Here's a current story:

Someone ordered the wrong desk phones at your large company?

1.) Assemble your crew. Go to various departments and recruit non-technical people.

2.) Task them with disassembling 1000 desk phones.

3.) Hot glue USB port on phone shut.

4.) Reassemble 1000 desk phones.


Is the disassembly and reassembly just for more billable hours? Seems to me you could fill user-accessible USB ports with hot glue without it, same as a user could fill it with an unauthorized USB device.


The procedure was done to meet an audit, less about hours and more about mitigating a mistake (I guess).


What does that solve though? I don't NEED a mouse to copy data.


It solves two problems: it stops someone covertly or foolishly plugging in an untrusted USB device (which might be easily missed on, say, the back of a desktop), and it means that checking that only a keyboard and mouse are attached is as simple as putting tamper-evident seals on those cables.

Attempting to authenticate USB devices is a very hard problem — a sufficiently advanced attacker can spoof manufacturer and device IDs, even if you lock things down to prevent anything other than a keyboard or mouse it's possible to send keystrokes to open the wrong website, there's always a chance of an exploitable flaw in your USB stack, etc. — but anyone diligent can be paid to walk around every week checking to make sure that a seal is solid and the tamper-evident stickers have the same serial number as listed on the inventory. There is a real value in having things where the failure modes are obvious and intuitive.


I'd think guardrails like this also serve at a psychological level - as in "this is a secure machine, don't try to break rules".

While these second order effects are immeasurable, they are quite tangible in my personal experience.


It solves the "I found this USB stick in the parking lot—let me plug it in to see what's on it" problem.


Sure, if they don't have a USB hub sitting around.


The closest thing to a USB hub I've got is one of my external drives for my Mac Mini has a built in USB hub so I can plug stuff into that as well as directly into the computer. The last time I worried about such things was back when desktop computers only had one or two USB ports. Plus, in a DoD situation, I'd imagine that having your own USB hub plugged into a DoD computer would be the kind of thing that could put your job at risk. A friend who teaches at the Naval War College often laments the unusability of DoD IT because of the level of locking down, but any "Why don't you do X?" suggestions have a response of "I'd get fired."

The safeguard doesn't need to be perfect, it just has to be good enough.


If my experience with users holds true, they'll abandon the quest at the first obstacle and the USB will harmlessly sit in a desk drawer for the rest of time.


They'll just unplug the mouse and plug in the drive to see what happens!


It doesn’t solve for an outsider or malicious employee getting access to a machine. What it does solve for is an employee plugging in a compromised USB device by accident, since they probably won’t unplug their keyboard or mouse for it.


They could've glued ALL usb ports and simply plugged mice and keyboards into PS/2 sockets.


That's what my alma mater, the University of Waterloo, did for some of our labs when I attended. Then at some point something must have happened and they moved all the electronics into the PC case and only the wires of the mouse, keyboard, and monitor came out of these little openings.


Reminds me of my school when someone booted Ophcrack to recover cached network passwords - they removed the CD drives. Given the machines didn't support booting from USB (IIRC), it wasn't a terrible solution.


There was a virus directed at DoD machines going around via USB devices. PITA to get rid of too...


I have not yet seen this implemented anywhere in banks. HID devices are fine, but anything else USB (esp. storage) is locked out completely. One of those banks wouldn't even let temp staff send emails out of the bank from their work account.

(Due to various disability acts they can't really do it either, as the employer must provide their staff with hardware they require, e.g. ergonomic keyboards and mice)


That sounds really the wrong way around - the worst offenders in USB malware surely are flash drives that declare themselves as keyboards and input preprogrammed keyboard events (like the USB Rubber Ducky [0])!

(For your parenthetical I should clarify - it wasn't the case that it was impossible to whitelist other devices, it just had to be done on a case-by-case basis. I.e. you would call IT and say "Jen from accounting at machine foo123 needs her new ergonomic mouse to be recognized" and they would remote in, tell Jen to unplug and replug the device and whitelist that exact USB device id on that exact machine.)

[0] https://shop.hak5.org/products/usb-rubber-ducky-deluxe


It may be so, but I'm talking from experience - as a keyboard geek I have, over the past ten years, taken all sorts of weird keyboards (and mice) into various big banks with not a hint of trouble. USB storage, on the other hand, qualifies for an instant termination.


You can do that sort of thing on Linux using USBGuard:

https://usbguard.github.io/


this is a pretty common practice on many (if not all) government networked devices

that...or the USB port is permanently blocked (saw that when I was at a finserv years back: all USB ports were epoxied, except the one the mouse plugged into)


I have seen security-minded IT go so far as requiring laptops with PS/2 mice and epoxying all the USB ports.


So are they keeping a stock of laptops from the 90's? Basically no modern laptops have PS/2 ports.


Order enough of them, and manufacturers will give you whatever you want.


If you're big enough, sure. Why not order laptops without USB ports, instead of epoxying them then?


easy enough to fake the device/vendor ID, then abuse bugs in the driver/implementation


Yes, if your attacker knows which device/vendor IDs you have on your list it won't work.


That's where the analog mouse jiggler comes in. Apparently watch faces work quite well for optical mice.


I've read that on HN before, and tried it a few months ago. It didn't work. At least not with an Apple Magic Mouse and my wife's desk clock.


You will be seen as active (including on comms software, at least the ones I've tried) if you have any sort of video playing, e.g. YouTube in an active tab. Quite handy.


If you ever come across a jiggle-and-click gadget, let me know. Some of the computer activity trackers I've seen lately require the user to click every so often, so plain jigglers are no longer effective.


Get a USB Rubber Ducky and script it to send something like Mouse Button 7. The click event registers but it isn't associated with an action except in super advanced CAD software.


They should call it the Jiggle-No


For the purposes of preventing locking out, I've had some success installing AutoHotkey and sending the MouseMove event every minute or so in a loop. No need for plugging in additional USB devices.
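The comment above describes AutoHotkey; purely as an illustration, a rough Python equivalent using pyautogui (an assumption, not what the commenter actually used) looks something like this:

    # Nudge the pointer one pixel right and back every minute so the session
    # never looks idle. Assumes pyautogui is installed; use at your own risk.
    import time
    import pyautogui

    while True:
        pyautogui.moveRel(1, 0)
        pyautogui.moveRel(-1, 0)  # move back so the cursor ends up where it was
        time.sleep(60)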


HotPlug Field Kit https://www.cru-inc.com/products/wiebetech/hotplug_field_kit...

"With the CRU WiebeTech HotPlug you can transport a computer without shutting it down.

"The HotPlug allows hot seizure and removal of computers from the field to anywhere else. The HotPlug's patented technology keeps power flowing to the computer while transferring the computer's power input from one A/C source (such as a wall outlet or power strip) to another (a portable UPS) and back again.

"We created this product for our Government/Forensic customers, but it has IT uses as well. Need to move a server without powering it down? The HotPlug can do it.

"It's great for digital forensic investigators and techs who can't risk losing access to data on a running computer. With many computers now employing full-disk encryption, shutting them down poses the risk of having to crack a password after moving the computer to a lab for analysis, which can greatly increase the time and expense of an investigation. When combined with a WiebeTech Mouse Jiggler, you also won't have to worry about the computer entering password-protected screensaver or sleep modes."


Time to geo-fence the servers with an external GPS antenna (often useful for time-sync). Or maybe FM signal strength locks?


Wouldn't an accelerometer be a bit easier? Guess it can't hurt to have multiple defenses.


If the police are seizing your PC (presumably following an investigation and a warrant) and you have put an accelerometer to shut it down (or unmount an encrypted volume) when moved in order to deny them access to the encrypted data, would this not count as tampering with evidence?

If I were to do this, I would try and find a secondhand server that already has similar protection built in, so if anyone asks I could say "I did not even know it came with this feature".


You can always claim it was designed to protect sensitive data in case the hardware is stolen.


FIPS 140-2 is often used in the private sector as a source of security process inspiration even when there are no legal or contractual requirements to follow it.

Having a good security architecture is not obstruction of justice. Doubly so if the data is still accessible to you after the failsafe is tripped. All you've done is prevent their ability to access the data before informing you of the existence of the warrant, using access mechanisms that - to you - are indistinguishable from unauthorized access attempts.

> "I did not even know it came with this feature".

A documented threat model and security policy that justifies physical tamper protection and pulls inspiration from FIPS is a much smarter legal strategy than perjury. Consult a lawyer.


How do they deal with the loss of network connectivity?

I could pretty easily write a script that forces my machine to reboot and do all manner of other things if some sort of network change is detected.
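As a rough illustration only (the gateway address, ping as the probe, and systemctl poweroff as the reaction are all assumptions), such a dead-man's script might look like this:

    # Hypothetical dead-man's switch: if the usual gateway stops answering for a
    # few probes in a row, assume the box has been cut from its network and halt.
    # Needs root; all values are placeholders.
    import subprocess
    import time

    GATEWAY = "192.168.1.1"   # placeholder: the LAN's default gateway
    MAX_MISSES = 3            # consecutive failed probes before reacting
    INTERVAL = 10             # seconds between probes

    misses = 0
    while True:
        reachable = subprocess.run(
            ["ping", "-c", "1", "-W", "2", GATEWAY],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        ).returncode == 0
        misses = 0 if reachable else misses + 1
        if misses >= MAX_MISSES:
            # could unmount/lock encrypted volumes here before halting
            subprocess.run(["systemctl", "poweroff"])
            break
        time.sleep(INTERVAL)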


You could, but what % of running servers actually have such safeguards in place? I'd say almost none of them.


The next Dread Pirate Roberts would be interested in this safeguard


I don't believe that specific product addresses it at all. Undoubtedly the persons operating the kits have put some thought into it, but given the myriad of possible LAN configurations and types of software dead-man's switches, it must be a difficult problem to solve.


Or motion, inactivity, vibrations in the room, etc. But that’s for another product/specialist I guess?


There used to be an OS X program that would lock the computer if it detected motion. As long as a trusted Bluetooth device was paired, the computer was fine. But if the device left range and someone touched the computer, it locked.

There was also one that would use the motion detector to try to detect if the device was falling, and park the hard drive heads before impact.


linked below is an old advertisement/demo video of a similar device or maybe even the one you mentioned :)

https://www.youtube.com/watch?v=-G8sEYCOv-o


Very similar, yes


I'd have thought plugging something into the outlet and unscrewing the outlet to take with you would be more convenient than carefully splicing wires just enough not to disconnect them. All the easier if it's on a power strip.


Technically you don't need to touch the naked wire, you only need to remove (carefully) the outer insulation and have the two (still insulated) wires separated for a few centimeters.

Then there are splitter clamps.

The most common ones are used (at low voltage) on cars and motorcycles; they look like these:

https://www.mesconnettori.it/index.cfm/it/ricerca/?flbr=ruba...

But there are professional ones, suitable for 110 or 240 V, for example:

https://www.techno.it/en/products/all/thb-370-a2a/


Sometimes they are on different circuits.


Wouldn't it be safer to open the case and connect some kind of battery + adapter after the power supply?


Safer for the operator? Sure. But certainly not for the device, if you're trying to keep it operating. An ATX power supply has 24 pins at 5 different voltage levels (plus any auxiliary power connectors for the GPU and drives, etc...), and motherboards are a lot less tolerant of spikes and transients than the PS on the other side.

Dealing with AC power isn't really that dangerous if you're careful.


Even high voltage and high amperage AC isn’t dangerous.

So long as you’re not earthed https://imgur.com/gallery/B2c5FfD


We had an electrician of questionable licensing do some minor work for us (replacing some switches and outlets). I asked him to tell me when I should go down to the circuit breaker to turn off the electricity and he told me not to bother. He did all the work with hot current running through the wires. I stayed close enough to be able to tell if I needed to call 911 but no closer while he worked.


I was once working for a small company building electrical equipment. We mostly worked on "medium voltage" equipment, you know 2400 to 69000 VAC.

For one project we had large banks of ultracapacitors in a cabinet. Fully charged it was around 1200 VDC. This thing was in the prototyping stage, and we were testing a control system on a Saturday morning.

So we charge it using a large AC/DC converter, fully charged, everything worked beautifully. We start a discharge cycle converting the DC back to AC. Uh oh, it starts pulling way too much current. Flames start to shoot out of the AC/DC converter. Fuck. BANG. Fuse blown.

We assess the damage... the AC/DC unit is totally shot. And someone (me) is going to have to analyze what caused the failure. Otherwise everything with the capacitor cabinet seems okay, but the thing is still charged to 1090 VDC and the fuse is blown. Check with the mechanical engineer that designed the cabinet. Turns out the fuse can't be changed (can't be accessed) while the cabinet is charged and the cabinet can't be discharged because the fuse is blown. Well that isn't good.

The only thing we could do was discharge it into a load bank (think large toaster) by connecting something directly to the copper busbar live at 1090 VDC. So one of the commissioning guys volunteered. He put on some high voltage gloves, stood on a plastic mat, and connected some jumper cables someone had in their car to the bus bar. He stepped back and someone else threw the switch on the load bank and it discharged without incident.

There were some design revisions after that.


You would think if you guys were working on those AC voltages, you'd have an arc flash suit on hand and he would have also put on an arc flash suit to do that.


Ffffuuuuuuuu....


I've done a ton of electrical work for my own benefit over the years and I'm perfectly comfortable doing things like swapping switches with live wires. I've never once had a problem. The one and only time I've fucked up was when I cut a run of romex cable that I thought had been turned off.

Lesson learned: electrical wiring is like a gun. Always treat it like it's on, and if you have to do something that would be unsafe if the wiring is energized, make damn sure it's de-energized before proceeding. When you're working in that mindset already anyway, flipping the breaker for something as simple as swapping a switch/outlet hardly has any benefit.


I apprenticed with my Dad. The first two rules he taught me have stuck with me my whole life:

1) Treat every wire as if it was hot. Even if you know it's not.

2) A good electrical connection must first have a good physical connection.

Not sure why that second rule sticks with me :) but there has been more than one occasion when I'm fairly sure the first rule has saved me from a bad shock. And you're right - treating the wires as if hot means you can actually work with hot wires for a lot of simple things.

I still turn off the breaker though :)


The second rule is a great one that so many people doing their own work miss.

The wire nut is only there to stop the wires loosening over time and provide some basic insulation. It is not there to actually attach the wires. When you twist your wires together, they should be attached well enough on their own that you'd be comfortable throwing a piece of electrical tape over them to stop them shorting to the box and leaving it as-is (but don't do that). If the only thing keeping them together is the wire nut and you being very gentle when you manipulate them back into the box, they're not actually connected.

The poor physical connection creates a poor electrical connection. A poor electrical connection has resistance which creates heat. Heat creates fires. Even better after a few years when enough traffic has driven past your house and enough people have moved around inside of it and the wires have wiggled to just barely in contact so occasionally when someone walks down the hallway the lights will all flicker as the wires create some pretty electrical arc light shows, adding carbon buildup to the wires and further increasing the resistance and heat concentrated in the one tiny point of the copper where they're still sometimes connected.

No reason at all for this rant. Definitely not a real example at all. Definitely didn't waste an afternoon with a toner, a drill with a pilot bit, and a borescope to hunt down the six octagon boxes someone had sealed into the basement ceiling hiding away some of the shoddiest wiring I'd ever seen. Nope.


100% confirm on the wire nut thing. It's possible to get a good twisted connection with a wire-nut without pre-twisting, but conditions have to be just right, and must result in a properly twisted wire pair in the end, or it's just trouble waiting to happen.


One packet of wire nuts I bought came with a drill bit made to twist them on. I found it works way better than twisting the wires by hand, it creates a tight twist that's very hard to undo.


This makes me feel bad. As a kid, I remember holding light switches at just the right point to hear the buzzing (arcing?) inside. At least if the contacts were carbonizing, there wasn’t a lot of current flowing through them when closed.


Shit! Same.


In Germany this is called "Arbeiten unter Spannung" and perfectly legal if qualified (https://de.wikipedia.org/wiki/Arbeiten_unter_Spannung).


The electrician was Croatian and, I presume, learned his trade there. It still terrified me.


I had an electrician add a breaker to the main panel while it was still live, no protection or gloves, nothing. I was also terrified.


Sometimes you do what you've gotta do.

I'm not a nut that does everything with the power on--I kill any branch I'm working on and double and triple check with a non-contact voltage detector before I stick my fingers into anything (which saved my bacon the one time when the hot from a different branch of the same phase ended up connected to a neutral wire for a plug with no connected ground leaving it showing 0V on a multimeter in any configuration and still being live with the breaker off; that house was a mess). However our current dwelling has no main cut-off for the power. If we wanted to turn off power to the panel we'd need to get the power company out to pull the meter from the socket.

In a mostly full panel the bus bars are pretty much completely covered by the breakers anyway. You'd have to work pretty hard to come in contact with them. And the wires you're working with (besides the ground) are insulated anyway so no issue if they brush up against something.

The only thing that's _slightly_ butthole puckering is chasing the uninsulated ground wire through the panel down to the neutral bus.

And yeah, done without gloves because weighing "safety when I make a mistake" versus "greater dexterity so I'm much less likely to make a mistake" I prefer the latter. The protection is rubber soled shoes and keeping one hand tied behind my back so the electricity has no path through me.


Ha, that's nothing. I once watched a stubborn guy replace the bus bars in the input panel of a house. He did wear rubber gloves and boots and stand on a plastic stool. But, this is a kind of job where you are operating a socket wrench on the clamps holding down the bare ends of the thick direct-burial power cables, then wrestling the ends of the cable out of the way to unscrew and remove the bus-work from the panel chassis.

He did this without notifying the power company, so those supply lines were hot with 240V residential service. The weather shifted and a light mist started falling before he was done. Like another poster above, I was thinking I need to be ready to call 911, but wanting to be far enough away not to be hit by splattering metal or any surprise voltage gradients in the soil.


I accidentally replaced an outlet and added a switch to a circuit that was still energized. I had turned off the wrong breaker, and failed to confirm it before I started work.

But, careful work habits and some tools that happened to be insulated anyway, meant that I was never bridging two different potentials. The job went flawlessly and I only noticed when I plugged the outlet tester into it at the end, expecting to go turn the breaker on and come back and look at the lights... but the lights were already lit up.


Working on hot wires is no problem. Ground wires scare me and I'll turn off the main breaker before I touch them. You can never be sure what ground is really at.


In more than a few dilapidated rentals I've been in...

Ground is ... all the metal bits in the bathroom and there's earth leakage happening somewhere.

The safe work procedure is then: get the shower to the desired pressure and temperature before you get in / while you're still wearing your shoes then try not to touch the taps while you're in there.

But don't tell the guests cos hearing them yell "FUUUUCK!" is amusing.

Bonus points if they pass out from the shock and knock their head on the way down.

Caring bunch us Aussies.


Cases can have "case open" switches that tell the machine to switch off. You can't necessarily tell beforehand.


Splicing into the many wires that is an atx+12v power connector, between the output of the power supply and the motherboard is way more fiddly than just dealing with the hot and neutral on an ordinary $5 PC power cord. You could also never be certain what weird ziptie and cable management system (or lack thereof) might exist in a home built x86 PC case, or if there's any room for hands to work at all...

I think the thing I saw is also meant to deal equally well with a commodity x86 PC built from parts, or an Intel NUC size thing, or a corporate desktop machine with proprietary internal wiring like a slimline Dell, Lenovo, HP, etc.


Case intrusion alarms (built in or Homebrew)


Search for "HotPlug" on YouTube.


I thought about HotPlug too. And the obligatory Seinfeld Frogger scene (which has become much less familiar to younger folks).

HotPlug must only work in countries with terribly designed plug outlets like the US and Canada. Our NEMA 5-15 plugs are live when the plug's hot (electrons be here) and neutral (return to sender) blades are still visible. I don't think this device could work in the UK; I'm not from there, but I think their plugs can't be live with exposed plug blades.

https://www.cru-inc.com/products/wiebetech/hotplug_field_kit...


Just need to carefully expose the wiring in the cable itself then. Or yank out the socket and connect to the wires there before snipping and shipping.

Unplugging just enough to expose the prongs is risky because the point where contact is lost will vary from receptacle to receptacle.

Chances are things are plugged into a multi-plug hub anyway. European homes are especially lacking in sockets in my experience.


I recall sometime in the mid 2000s there was a fever for achieving five-9s (99.999% uptime, I think -- it became fodder for a few episodes of Mr. Robot). Not that the metric ever went away, but back then a lot of BigIron(TM) vendors advertised achieving five-9s by replacing hardware while the OS remained running and continuing service. Sun 15K and 25K series (Gilfoyle had a used one in the garage running his network) were behemoths whose mem/cpu boards you could swap out wholesale while the entire frame and backplane was powered on, and while the OS the board came out of remained functioning. There were many caveats around the procedure but it worked. Execs and sales guys loved those demos. These monsters were expensive and banks and energy conglomerates were buying them by the dozens. There was also a big to-do about hot swappable drives. The idea that you could be doing hardware maintenance while the machine was still running was a novelty, something like brain surgery while the patient was not only awake, but awake and eating, driving his car, talking on the phone, etc.

A decade later I look back with deep surprise that we didn't think to abstract out the service instead of the hardware. I don't know how many of those behemoths are still being bought; now I work almost exclusively with small server instances that can come and go on the fly. Microservices and AWS have taken five-9s in a different direction. I frequently think of Sun as a failed Hephaestus: in a Christopher Nolan film he would be brilliant, but could only turn out clumsy tools because of his deformity; he hates the things he makes, so he throws them away before completion. Men find these cast-offs and temper and refine them.


> [In the mid-2000's], a lot of BigIron(TM) vendors advertised achieving five-9s by replacing hardware while the OS remained running and continuing service... A decade later I look back with deep surprise that we didn't think to abstract out the service instead of the hardware.... Micro services and AWS have taken five-9s in a different direction.

In the mid-2000s, enterprises were (and in many cases, still are) running proprietary software with proprietary RPC protocols that had no available source code or other means of modification, and most had no support for application-level high availability, access control, or any other operational quality-of-life feature that people take for granted today. Rather, that functionality was handled at the infrastructure level, through things like the aforementioned Big Iron.

The world looks different today, but those machines made sense for the environment at the time.


I think it kind of makes sense in general, and the only question is whether it could be achieved at lower cost. The complexity of today's commodity machines is comparable to the big iron of yesteryear.


> a lot of BigIron(TM) vendors advertised achieving five-9s by replacing hardware while the OS remained running and continuing service.

AS/400's were capable of that in the 90's (possibly the 80's as well). Heck, they'd call IBM for replacement parts on their own. You'd show up for work and there'd be an IBM guy waiting to be let in. He'd swap out a part with no downtime, and be gone. I've seen machines with uptimes of over a decade with zero on-site IT.


We had one of these at an old office of mine. I actually think it's really cool.


I recall working on some Sun machines with hot-swappable CPUs (and, I assume, disks and other peripherals). If they somehow made memory hot swappable (I'm sure it's possible, just uncommon and/or verrry expensive), with hot swap CPUs and disks, and redundant power supplies, you could tear the machine half apart and it would still keep running. Of course, at that point, once everything is hot swappable, there are generally multiples of everything, so your one machine is really more like multiple machines inside one box than a single discrete machine.


AWS single-region SLA isn't 5 9s though. If you want 5 9s in the cloud, not even multiple AZs are enough - you need to go multi-region or even multi-cloud.
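For intuition only, with made-up SLA figures and the (big) assumption that failures in different deployments are independent, the arithmetic works out like this:

    # Illustrative only: how availability composes across independent deployments.
    # The figures are placeholders; real failures are rarely independent
    # (shared control planes, DNS, deploy pipelines, ...).
    single = 0.999                  # one region at "three nines"
    both_down = (1 - single) ** 2   # probability both are down at once
    combined = 1 - both_down
    print(f"{combined:.6%}")        # ~99.9999% on paper, much less in practice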


> Multicloud

This sounds like something they'd make up on NCIS.


It means using aws+azure+do+gCloud


Well as I recall there were a few reasons that people focused on reliability in hardware in the late 90s:

1. Shared state storage systems that supported replication were rare (I think Oracle and Informix maybe?)

2. Virtualization software was in its infancy (did SunOS have something before Solaris?)

3. RAM and hardware were waaaaay more expensive, meaning you often had to buy more pure metal just to answer questions fast enough

At least that's my take on it based on my dim faded memories


I remember those. A friend of mine was a network engineer at a local datacenter (UUNET, then MCI pre-scandal) and said companies were buying Suns for everything no matter how trivial.

He worked a night shift and I used to go hang out with him in the NOC and download movies (residential bandwidth was not what it is today). Odd that neither he nor I ever got in any trouble for that, heh.


> a novelty

not a novelty. the true bigiron vendors (not sun) had been doing this for decades.

mainframe reliability puts the upstart unix systems to shame.


Up to that moment, hardware maintenance meant having to power cycle the server. True HPC systems like those that ran at the US National Labs didn't find their way to the general market, and still haven't as far as I know.


even the AS/400 didn’t require power down.

this is actually so ancient it’s hard to find docs. here’s something from 1976. (the report is 1990 but the hardware dates to ‘76). https://www.hpl.hp.com/techreports/tandem/TR-90.5.pdf


No downtime is acceptable, but they have only one server?

What if a technical failure happens? What if there's a fire in the server room? What if there is an earthquake and the building collapses? What if... many things can happen that can result in a long, long downtime with these tactics.

If uptime is so crucial, the system should be set up in such a way that moving one server is a piece of cake, not a spec-ops mission.


> Should have been a 5 minute job if done correctly. Owner ended up paying for over 10 hours of work. Stupidest thing I've ever had to do.

You can see the common sense ship has sailed.


You’d be shocked how rare downtime is with modern hardware. A redundant power supply and SSDs in the right RAID configuration typically will not have any issues for years until it can be replaced by a newer model. Also, hardware monitoring is significantly improved to the point where you’ll typically know if something will fail and can schedule the maintenance.

In the past power supplies and spinning disc hard drives would fail much more often.

It’s basically a solved problem, outside of extremely mission critical, 5 nines kind of stuff, that we all forgot because of AWS.

HN ran, and may still run, on a single bare metal server.


> HN ran, and may still run, on a single bare metal server.

> I bet HN wouldn't do a 10-hour high-risk operation to move their servers because they can't afford an outage. (But well, running stuff on a single bare-metal server is expensive enough that even if they could, I expect they don't.)

What would that company do if a pipe broke inside the datacenter? Besides, if you never restart your servers, you are guaranteeing that the one time the power goes out across the entire city, they won't come back online.


> I bet HN wouldn't do a 10-hour high-risk operation to move their servers because they can't afford an outage.

HN is probably not business-critical and could probably afford a 10-hour downtime without much hassle.


The point is that they probably also wouldn't then insist on a consultant doing an unreasonable migration and threatening to not pay them if there was downtime. And they probably wouldn't call around to other consultants with the same requirements, apparently telling them that the first consultant refused to do the job.


> apparently telling them that the first consultant refused to do the job.

While I don't think they informed them of this in good faith, it is a nice heads-up. In this case, it meant Consultant2 consulting RefusingConsultant, who probably knew the IT setup better.


It would be legitimately interesting if a 10 hour downtime of HN was at all correlated to an increase in github commits.

I hope there wouldn't be a correlation, but I wouldn't be all that surprised if a somewhat loose one was found.


Quality hardware has existed for years. At a Ford motor plant they were doing an inventory and couldn't locate a 10-ton mainframe. It had been working so well for 15 or so years that the tribal knowledge of where it was physically located was lost.


Wow, that's impressive losing that big a piece of hardware.

Though it was likely easier to find than that Novell Netware server that was sealed behind some drywall, with only a stray network cable leaving any clue as to where it was.


Depends on how big the building is that houses it – manufacturing IT can deal with impressive floor spaces.

I once only half jokingly suggested finding a missing data closet in a two million square foot distribution center by pinging a known IP from three or four aggregator switches across the building and triangulating the location on a floor plan. Sadly the people crawling around the ceiling found it before I could put my idea into practice.


2M sqft is c. 430m x 430m for a square floorplan. Ping resolution is 1us (microsecond). The speed of an electrical signal in copper is about 0.8c. That gives a max resolution of ~240m by my reckoning. If there are variances in the switch+network delay, it seems like you're going to struggle to even say which side of the building it is on.
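For anyone who wants to redo the arithmetic, it's a one-liner (the 1 us timer resolution and the 0.8c velocity factor are the assumptions here; Python just for illustration):

    # Rough distance resolution from ping-timestamp resolution.
    C = 299_792_458          # speed of light, m/s
    VELOCITY_FACTOR = 0.8    # assumed signal speed in copper, as a fraction of c
    TIMER_RESOLUTION = 1e-6  # assumed ping timestamp resolution, seconds

    resolution_m = C * VELOCITY_FACTOR * TIMER_RESOLUTION
    print(f"~{resolution_m:.0f} m per microsecond of timing resolution")  # ~240 m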

Good job they found it!


Hah! Good math. Based on the switch placement and the building being more of a rectangle, I figured "north side or south side" would be as close as I could get. And when we really dug in it was a classic last-mile problem: the first several core switches were well known, we just needed to figure out where the last aggregate switch went.

Turns out a door had been closed off and a new one built from one hallway to another, and it wasn't properly labeled on the updated drawings. Had one of the boxes running a conveyor belt not died, we'd never have looked.


This is all true, but you still can't rely on increased hardware quality if you can't afford any downtime due to moving a server (a one-time event).

Also, that doesn't cover other problems mentioned here, like natural disasters, ISP problems, etc.


Often these kinds of SLAs are decided upon based on blame rather than what is reasonably required by the customers of that system. In this case, moving offices means the downtime is due to internal reasons. But if an ISP goes down or there is a natural disaster, then that isn't in their control.

Also, cost does come into play as well. Multiple physical links in would be very expensive for what sounds like internal services. Likewise, a natural disaster might cause bigger issues for the company than those internal services going down. They might still have offsite backups (I'd hope they would!) so at least they can recover the services, but the cost of having a live redundant system off site might not justify those risk factors.

The customer's requirements are definitely unreasonable though. I'd hope those systems are regularly patched, in which case when is downtime for that scheduled, and why is that acceptable but not when you're physically moving the server? It doesn't really make much sense; but then "not making much sense" is also quite a common problem when providing IT services for others.


You are right, their SLA can be a bit different from what we're talking about here (and expect).

In general, we don't know much about this case. It's a post on Reddit, might not even be true. As is, it doesn't make much sense, but we don't know all the details, so maybe we jumped to conclusions.


> can't rely on increased hardware quality if you can't afford any downtime due to moving (a one-time event) a server.

Mainframe is not just a server. You can hot plug RAM on these things.


Still, sooner or later, the data center will be hit by a natural disaster, a DoS attack, a network problem, or the like, and you'll have to be ready to move to a different one to get your service back online. Or you'll have to reboot your server to apply a critical kernel security update, in which case you need to be ready to fail over to a hot standby. So, since relying on a single server with high-uptime hardware is penny-wise and pound-foolish, you might as well go with a cloud-style architecture with commodity hardware.


I used to be fascinated with datacenters and would masquerade as a prospective customer to get a tour and see all the cool gear. I was asking one engineer what their plan was for a tornado (this was at ThePlanet in Dallas, TX, way back when) and they basically scoffed at the question. A week or so later one briefly touched down about 1/4 mile from them; I wonder if they thought about me when the sirens were going off, hah.


Even in modern hardware there are plenty of single points of failure.

Single server and "can't tolerate any downtime" are mutually exclusive.


AWS and older hardware are no different. Set it up once and it keeps running for many years.

I've come across an old AWS account (the startup had been using AWS for the longest time). All the network traffic and VPN goes through a single instance with 3 years of uptime.


AWS EC2 instances or their host machines can fail at any time and it’s out of your hands.


True fact! I recently had EC2 migrate my VM when the physical server it was on reached EOL. If they had fired my VM up again, I wouldn't have even noticed. They didn't. Fortunately it had an EBS volume and I was able to manually restart it without data loss.


Physical servers can fail at any time and it's out of your hands. ;)


Human error is a bigger cause of downtime than technical failure or natural disasters. And in practice, a single server like this tends to be a hand managed one-off which only exasperates the human error component.


s/exasperates/exacerbates/


It's probably a bit of both, TBH. ;)


Unfortunately, complacency about how reliable modern hardware is can lead to neglecting things like off-site backups. And other issues. Yeah, your one big critical on-premises server may be super reliable. But what happens when the building is flooded with 6 ft of water, catches fire, is leveled in an earthquake, or anything else?

If a function is super critical to business, it also deserves to have some thought put into the blast radius of its failure.

The sort of places that would insist on rolling a live server 700 ft across a parking lot probably don't have any real disaster recovery plan.


>hardware monitoring is significantly improved to the point where you’ll typically know if something will fail and can schedule the maintenance.

There's SMART for disks... what else?


And multiple power supplies. I have been running a single physical server like this for ~10 years and the only downtimes were me restarting to boot a new kernel and when people at the datacenter messed up BGP routing (their fault). HW is really very reliable now, especially in a datacenter environment. But still not 100%, of course. There is still a low probability of it failing, though lower than most think. IC chips most likely won't break; only some capacitors degrade over time, and the flash memory holding the BIOS is normally only guaranteed for 10 years. A BIOS upgrade (a fresh write) would prolong that, though. I had one disk fail in a RAID. Changed the drive without any downtime.


ECC for RAM is the other big one. A single-bit error will trigger warnings, so that you can replace the faulty DIMM before it progresses into uncorrectable errors.
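On Linux those counters are visible through the EDAC sysfs interface, so catching a DIMM going bad can be as simple as a periodic check. A minimal sketch, assuming the edac driver is loaded and the usual /sys layout (paths can vary by kernel and platform):

    # Check ECC error counters exposed by the Linux EDAC driver.
    import glob

    for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
        with open(mc + "/ce_count") as f:   # corrected (single-bit) errors
            ce = int(f.read().strip())
        with open(mc + "/ue_count") as f:   # uncorrected errors
            ue = int(f.read().strip())
        status = "OK" if ce == 0 and ue == 0 else "schedule a DIMM swap"
        print(f"{mc.rsplit('/', 1)[-1]}: corrected={ce} uncorrected={ue} -> {status}")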


Is there a tool that can randomly take 128mb chunks of memory out of the pool and test them around the clock?


>HN ran, and may still run, on a single bare metal server.

HN also has downtime fairly often.


Yeah, that's how you end up with 3 years of uptime on some forgotten servers... :)


Which is why AWS instances should be no more than minions in a load balancer pool, and any permanent state on an EBS volume or a managed storage service.


What's the current advice on SSD RAIDs?


From an ISP perspective this seems like the sort of company that orders one $250-a-month business DIA circuit (at a price point where there is no ISP ROI for building a true ring topology to feed a stub customer) and has no backup circuit. Then the inevitable happens, like a dump truck 2 km away driving through aerial fiber with its bed raised and causing an 18-hour outage.

Some circuits might average 5 to 7 nines of uptime over a year, but the next year is dump truck time... You can never truly be certain.


My last job was at a place with a single rack-mounted set of Windows servers at a data center - no backup power supply, no backups of any kind for that matter, no UPS, and no redundancy in any system; plus they didn't even have an admin for 6 months. The CEO refused to spend money on a second anything. The company has 2,000 employees. One server held all of the company's photos (which are basically the core of the business) and of course was not backed up.


This is the kind of company that could benefit immensely from a ransomware attack.


My boss refused to use UPSs for years because he bought one once and couldn't get it to stop beeping.


Of course it can work, you can get far with one server and no spending on anything like backups, UPS, etc.

Whether it's smart and good for your business/reputation is a different question.


You wrote "one server" but describe the failure modes of having one data center. I think it is very, very uncommon and hard to plan for a data-center-level issue. After all, Instagram and 100 other sites failed when one AWS data center went down. I would be interested to know how/whether anyone's backend would keep working if any data center and its databases completely failed due to fire/earthquake/networking etc.

The second thing is having multiple machines per service. In theory it might help increase availability, but in practice I haven't seen many random machine issues that occur purely based on probability. I think almost all failure modes that exist are correlated between machines. E.g., suppose you have data loss on one machine: more likely than not you could blame it on code, and it would be similar across machines.


Re: single datacenter. At the basic level, you need a second datacenter with enough machines to provide your service (or an emergency version of it at least), replication of data, and a way to switch traffic. It's doable, but expensive in capital and development. If you're dependent on outsourced services, they also need to be available from both datacenters and not served from only one. In an ideal world, your two datacenters would be managed by different companies, so you would avoid any one company's global routing failure (IBM had one recently).

Re: multiple servers. Power supplies fail, memory modules fail, CPUs fail, fans fail, storage drives fail. Sometimes those are correlated --- the HP SSDs that failed when the power-on hours hit a limit (two separate models) are going to be pretty correlated if they were purchased new, stuck into servers at a similar time, and then left on 24/7. Most of those failures aren't that correlated though. Software failures would be more likely to be correlated, of course.

The key thing is to really think about what the cost for being down is, how long is acceptable/desirable to be down, and how much you're willing to spend to hit those goals.


> In an ideal world, your two datacenters would be managed by different companies, so you would avoid any one company's global routing failure

I can't understand this. I think transferring servers would be the least of the problems. It's the transferring of the database and maintaining consistent versions of the database in both locations. Moving snapshots every X minutes doesn't maintain consistency. I would like to read about any company that is able to do this, as honestly it sounds really hard to me. Is there any writeup of the IBM thing you mentioned?


Re: IBM outage

https://news.ycombinator.com/item?id=23471698

TLDR is connectivity to and from the IBM cloud datacenters (which includes softlayer) was generally unavailable, globally, for a couple hours. If you were in multiple IBM datacenters, you were as down as if you were in only one (mostly, I was poking around when it was wrapping up, and some datacenters came back earlier than others).

> Its the transferring of database and maintaining consistent version of databases in both the locations. Moving the snapshots after every X minutes doesn't maintain consistency. I would like to read about any company that is able to do this, as honestly it sounds really hard to me

The gold standard here is two-phase commit. Of course, that subjects every transaction to delay, so people tend not to do that. The close-enough version is MySQL (or other DB) replication: monitor that the replication stream is pretty current and hope not a lot is lost when a datacenter dies. There's room to fiddle with failover and reconciliation; I recommend against automatic failover for writes, because it gets really messy if you get a split-brain situation --- some of your hosts see one write server available and others see another, and you may accept conflicting writes. A few minutes running like that can mean days or weeks of reconciliation, if you didn't build for reconciliation.
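The "monitor that the replication stream is pretty current" part is usually just a cron job polling the replica. A minimal sketch with pymysql (the host, credentials, and the 30-second threshold are made-up examples; newer MySQL spells it SHOW REPLICA STATUS / Seconds_Behind_Source):

    # Alert if a MySQL replica falls too far behind its primary.
    import pymysql

    MAX_LAG = 30  # seconds; arbitrary example threshold

    conn = pymysql.connect(host="replica.example.internal", user="monitor",
                           password="...", cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")   # SHOW REPLICA STATUS on newer versions
        row = cur.fetchone()

    lag = row["Seconds_Behind_Master"] if row else None
    if lag is None:
        print("replication is not running -- page someone")
    elif lag > MAX_LAG:
        print(f"replica is {lag}s behind -- a failover now would lose recent writes")
    else:
        print(f"replica healthy, lag={lag}s")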


He should have taken it offline without notifying this brain-dead manager. Probably wouldn't have noticed lol.

And then charge for those 5 hours for good measure.

In general, this stupid trend of wanting 0 downtime makes no sense to me. If you're not NASA, the police, or another emergency service, you 100% can afford a few hours of downtime by scheduling it beforehand.


We used to have one server for a website I was a content guy on - it was in a standard PC case, plugged into a switch in the IT team's office (this was not a tech-centered org).

The main IT guy went on holiday and one of the cover guys from another office decided to tidy up. He unplugged the server and thought (and told me his thought process afterwards) "if anyone was using it, they'll let us know".

This was the one, single box for the whole website - no one else was monitoring it (even though the central office had a proper, dedicated web team) and the assumption was that I was the sysadmin.

An hour later I'm sprinting down the corridor to find out what the hell happened and why I can't even SSH into the box.

We put a sticker on the case saying not to unplug it after that...


Reminds me of how IBM positions mainframes: they are so highly available that you simply never let them shut down.


IBM mainframes are designed to be serviced while running, so if you have multiple CPUs you can offline them one at a time to upgrade them without the whole mainframe going down. Big Sun Solaris boxes were built like that as well.

If your mainframe had only one CPU, you did have to turn it off in order to service it. But you could upgrade the OS without turning it off. While they aren't cool tech now, mainframes are a marvel of hardware engineering.


Plus, I would imagine turning them on and bringing them online isn't just a press of a button.


It's not. https://web.archive.org/web/20190324191654/https://www.ibm.c...

(archive.org link because ibm.com apparently isn't hosted on a mainframe.)


Never mind these less common scenarios... What do they do about Windows updates?


or even better, how do they apply OS patches?


I once basically spent a summer doing this, not over a parking lot but to consolidate the remaining equipment in a large number of racks into a few new ones - this was a former sales office of a megacorporation that had been built to have its 1970s-era computer room proudly displayed through windows into the main conference room, a very weird setup without the context that in said '70s that conference room was used to pitch prospective customers on business automation.

Anyway, by the time I was there it was still a '70s-vintage large computer room but now massively overprovisioned on space, cooling, etc, particularly with most IT functions having moved to corporate. A decision was made to repurpose part of it as a test lab and move all the actual remaining equipment to three racks in the corner.

I'd do about two servers a day in between other things, taking advantage of redundant power supplies to transfer the PSUs one at a time to extension cords, swap to a long network cable fast enough that TCP sessions probably didn't time out, and then unrack onto a hydraulic lift cart and do the same procedure the other way.

I presented this at the start as far from a guaranteed strategy - that it would minimize downtime but there would inevitably be some due to mistakes. None of this was really that critical. There were a few devices that were pretty old and poorly maintained; we agreed up front that if these lost power for some reason and then failed to boot, we would just say they'd lived long lives and purchase replacements.

I guess the point is that this whole situation was kind of unusual and I would generally not recommend doing this, we were lucky that all the equipment left had stakeholders that acknowledged it was legacy stuff and they could tolerate losing it.

The irony is, of course, that it went perfectly. So far as I know there was not a single problem experienced through the whole thing. I even managed to swap the phone lines to the (surprisingly busy!) legacy fax server when each was out of use.


> There were a few devices that were pretty old and poorly maintained [...] if these lost power for some reason and then failed [... we'd] purchase replacements.

And the sysadmin let that opportunity pass??


Heh, I know of a couple of people who would have been far happier if the fax server specifically was somehow accidentally tipped down a stairwell... by this time there were enough dirt cheap "cloud" fax services that it really didn't make sense to keep the on-prem system, particularly since the office had been almost entirely migrated to VoIP except for a few oddball devices like that fax setup. But that whole thing just becomes a story of the internal woes of that particular megacorporation, from the computer room to the front office there was a whole lot of stuff just being kept over from the '90s.


> swap to a long network cable fast enough that TCP sessions probably didn't time out

I'm just picturing the networking equivalent of Indiana Jones swapping the idol with the bag of sand in Raiders of the Lost Ark.


I am scared to imagine what would have happened if there had been any issue during the move (very likely when dragging live network and power cables over hundreds of meters).

The client would immediately refuse to pay anything because he was very clear he wouldn't pay a thing if there is downtime.

Then, the next contractor would be super quick to judge you and the situation, reinforcing that you were an incompetent idiot and the client was right to kick you away on the spot and not pay a dime.

Glad it went well in the end. There is so much to lose for the person trying to help.


This is a junior sysadmin, I suspect. With a bit more experience you'd learn to say something along the lines of "no downtime, sure, that will be 30 grand" and the ability to tolerate downtime will suddenly materialize. He and his friend did this big song and dance, took a huge risk, and only got paid for ten hours' worth of work in the end.


> sure, that will be 30 grand

I am having trouble finding a reference to it now, but I've heard patio11 refer to this as "the Japanese no". Don't ever say "no" directly, just quote an astronomical price.


There is an art to this. These situations come up because you want to continue an ongoing relationship into the future.

So you quote a price that is high, but not so high as to destroy the relationship. I call it "plausibly-deniably-high".

You also have to gauge the context of the other party in the negotiation. This technique works best when you accompany the quote with some kind of description matching the personalities. Some people are swayed by a description of the additional time it takes (the billable hours mentality). Others are swayed by a description of the additional risks you are bearing on their behalf to deliver the outcome. Still others are swayed by a description of the de novo technical challenges that no one else has ever attempted before. The list goes on, and is a fascinating study into people.

This is where a real salesman (as opposed to an order-taker) earns their keep, where they know how to read a room and craft a response, messaging and after-meeting socializing that takes into account all those perspectives simultaneously from the point of view of the other party.


I don't believe I've said that.

For what it is worth, if a customer of my previous (salaryman-heavy) employer asked for this, we'd tell them an actual no, which is extremely rare in client relationships in Japan. A contextually appropriate "no" for something which is less absurdly wasteful of engineering time to no purpose would be "That sounds difficult. We could explore options to do it, but perhaps you could accept an hour of downtime in the dead of night" then bargain down to 15 minutes.


I am very sorry, I must have been mistaken.


People in the trades world do it too. If a job won't provide the margin they are seeking, or the job is more difficult than it's worth, they will up the price. If the consumer chooses them for the job, it's at a price point that's worth the trouble, but they are really hoping to be passed over.


The Rolling Stones tried that with Microsoft and "Start Me Up": they quoted what they thought was a ridiculous $10M. Microsoft said sure, no problem.


That's a fun story. Looking more into it, it seems that $10M is based on rumors and it was more likely $3M. [1] Doesn't change the point of the story though.

--

[1] https://www.networkworld.com/article/2220097/what-microsoft-...


$10M is a debunked urban legend; the actual figure is only $3M, which is pretty standard. Microsoft's whole ad campaign for Win95 cost about $200M after all.


My dad's in construction and frequently gives out "fuck off" quotes. It isn't so rare that the client accepts them.


That works.

A friend of mine with a consulting biz was requested by IBM to handle a job in Turkey. He didn't want the gig & told them so repeatedly. He finally decided to tell them the most ridiculous price he could think of (like appending two zeros to the number). He said they didn't even flinch and he was on the plane to Turkey the next week for six months. But he did say that it was pretty much worth it in the end (but only because of the pricing).


Welcome to the market economy!

Seriously, this sort of dynamic is why the world works as well as it does.


Yup, market economies are fantastic for rapid resource allocation!

Yet they are not a panacea.

They suck at preventing problems related to:

* tragedy of the commons - tend to create & magnify it

* long-term disaster planning / tail risk - e.g., stockpiling resources for natural disasters, pandemics, etc.,

* preventing foolish development, e.g., on cheap land subject to flooding

* self-creating safety systems for workers, consumers, environments, etc. - left to their own devices, markets always do too little, too late

Market systems often literally need to be saved from themselves, e.g., when overfishing will literally kill an industry by driving the very thing it depends upon extinct.


Actually, stockpiling does happen if there are no laws against price gouging. Because that's how the capital bound up in the stockpile gets its ROI.


I hope you are not seriously suggesting making price gouging in disasters legal as a method of preparation.

Price gouging is nowhere near as reliable a method of disaster preparation as actual expert planning.

The stockpiles you speak of are usually just ordinary current inventory marked up by an order (or orders) of magnitude.

Also, stockpiling goods is not the only thing needed for disaster preparation. One must also stockpile services, i.e., have the right people recruited, trained, equipped, and ready to respond. Prime examples are the military and firefighters, who spend much of their time & resources training, and little time actually fighting wars or fires.


Unregulated price gouging will likely end very badly, yes. I'm aware, and just didn't mention it in detail for brevity's sake.

Yes, but funding allocation is hard.

That'd be the case only if it was sudden. If entrepreneurs had the time to think and plan, they'd build up stockpiles whose sale price they raise when the time comes, counting on that future price increase to offset the increased bound capital and storage expenses of their large(r) inventory.

Military is a bad example, but firefighters do train a lot. But that's also due to them needing to respond within hours at best, instead of weeks/months for most wars. I'm referring to the majority/bulk of them, not the leadership hierarchy.


I had a client who wanted me to write some code in Adobe ColdFusion of all things. Not wanting to say no to an otherwise good client, I quoted some insane hourly rate.

And now I know that Coldfusion is absolutely miserable to code in (and the client tried to dodge their bills!).


I've heard people say that the right way to say 'no' in Japanese is more along the lines of "it is very difficult." I have no idea how much linguistic truth there is to that, but it definitely rings true culturally.


My grandfather called this the "asshole quote".


$30k, 80% up front, strict liability waiver that says I’m not responsible for loss of business or anything else if there is downtime.


You can't get paid upfront and at the same time get a liability waiver. For a 100% guarantee with full liability $30k doesn't actually sound ridiculous because it would require obtaining 100% identical hardware and doing at least one test run on that hardware before actually doing it on the production hardware. What the contractor did is basically "wing it", explain a way to get zero downtime to the client and then not actually offer a guarantee by doing the operation straight on the production hardware. Really this was more about convincing (ie bullshitting your way through) the client to let you do the work than actually doing it properly and for a huge sum of money. It wouldn't surprise me if there was actual downtime for a few seconds and the client simply didn't notice it.


Now you're over-charging massively. If you have no liability and are guaranteed pay, charging for just double hourly rate is more than enough as a "stupid and non-standard requirements" kind of thing.


4.5 hours of a consultant's billing rate can be much more than 10 hours of your regular hourly rate working a similar job. A good consultant will have a contract. The client saying "I won't pay if XX happens" doesn't mean anything unless it was in the contract.

Networking/spanning-tree loops, ARP table mismatch/corruption, or the switches at the destination being misconfigured are all realistic problems that would result in downtime here. The normal way you do this is with live migration from Hyper-V or vMotion from ESXi. If the initial migration is not successful, you just leave the server powered on while you address the issues. Once the VM has been migrated you can do whatever you want with the original server without worrying about downtime.


This reminds me so much of when I joined VMware in 2006. vMotion had already been around for a few years - but I believe this was the first release of vCenter with DRS.

A couple of months after I joined, a room full of customers chewed us out for not publishing our vMotion compatibility tables. After 4 hours of chewing us out, they then told us they had reverse-engineered the compatibility tables and reorganized their entire data center to conform to VMware vMotion. Then (of course) we worked with Intel to make sure the compatibility matrix worked in the future.

I realized at that point that I joined the right company.


If there was downtime during the move and the client was there and declaring that they would not pay, you just walk away. You'd be surprised at how fast they can cut a check in that situation.


Also, moving a server with spinning disks? What could possibly go wrong.


Wasn't there a story about Sun (or HP or someone like that) where they moved a bunch of disk servers across a parking lot to another building and found that many of them had died from the vibrations of the trolley cart used to transport them?


It was Yahoo, IIRC.


I had a spinning disk in my car back before we had all these cool embedded PCs. The disk was never an issue; these things can take a lot of abuse (even New England roads). I had it mounted sideways so a large pothole wouldn't push the heads into the platter.


>a spinning disk in my car

https://www.consumerreports.org/cro/news/2014/04/record-play...

"The stylus did not jump the grooves even when the car was moving at various speeds over broken pavement, cobblestones, and deep holes."


Reminds me of the video where they yell at hard drives and measure disk latency. https://youtu.be/tDacjrSCeq4


Disks aren't that sensitive to motion.

At my last job, we had 2 airplanes, each with 5 computers, each computer with 6 disks, mounted in the aircraft. These were regular servers from Dell, not specially hardened or resilient hardware or anything. So 60 or so hard disks flying around. Takeoffs, landings, turbulence. Two flights per day, 3 hours each flight, 6 days per week. So 626 landings per year.

Disk failures were not particularly common.


As a counterpoint, I worked for a place that used Mac Minis inside spinning displays and the hard disks absolutely did not like it one bit.

(They also tried spinning disk machines on buses which also failed quickly but that was more the grime and electrical noise than the motion, IIRC. Then they tried mini-servers running from CF and the motion would slowly work the CF cards out of their sockets. The company did not last long.)


Spinning disks can take a surprising amount of shock and vibration before they fail.


I don't know. If the "boss" was charged "4.5 hours of work, 2 hours of consultancy, and 4.5 hours of consultant", and assuming he would have been charged half of that with downtime, maybe the boss did get a good deal. We don't know the cost of downtime for him.

I mean, he had access to technical resources who were willing and able to do this for him, and he chose to do it.


It's also possible that "downtime" has different meanings to different people. The client may be seeing "downtime" as the net result of what happened the last few times the server was "down," which could have been for any number of reasons (potentially even unrelated to the server itself).

When you get clients describing things like this, it's possible they've been promised things about this server before by other consultants that didn't pan out. They don't want to give you the full details because then you'll recommend a different route that they don't want to take (justifiably or not).

It's easier for them to frame the problem to a consultant in a way that allows for only one potential solution, even if perhaps better ones exist, because the guy in charge of making the decision isn't technically skilled enough to assess whether others proposed by consultants are as viable.

And, of course, one might read a little into why there exists a "boss" with such a highly-critical IT need that is hiring a consultant to do work like this, and thinks that threatening to not pay at all if there is any downtime is the best way to do it.

I mean, what if they opened the door to this closet and it grazed a power cable on the floor and the machine just shut off? Why even bother staying around to bring things back up? It wasn't your fault and there's already downtime: you're not getting paid.


Someone upthread was talking about how, as a salesman, you have to read the room and know how to talk to clients. I did that for a while, and always got a lot of mileage out of asking the customer what they ultimately wanted to accomplish, which usually revealed that what they were asking for was a solution to a self-made problem, and there was a better alternative altogether.


I personally find it hard to believe that a rough estimate of $450 for the job (spitballing $45/hr for 10 hours) is less than the cost of 5 minutes of downtime, given they only have 1 server.

Then again, could easily be wrong


You cannot compare it to zero. You have to compare it to the cost of doing it with the downtime. There would be cost to that as well. It will not be free.


> Stupidest thing I've ever had to do.

I don't really understand the "ranty" tone. The client had very specific requirements and the author came up with an effective solution and was fully paid to deliver it. Sounds like a win for everyone.


Reddit (for reasons related to user demographics and feedback loops) rewards certain types of writing and implied viewpoints. Following best practices and rules is one of those things. This server migration clearly runs counter to established wisdom so OP using a writing style of "look how terrible and asinine this was" will be rewarded and gain traction much more than a "look how interesting this was" writing style.


It's Reddit's /r/sysadmin; the subreddit is dedicated to rants and horrible experiences from sysadmin and helpdesk folks.

It's quite sad IMO; I don't recommend going there unless you want to have a bad day reading about the most horrific work environments and bad practices in the world.


Perhaps somewhat similarly, r/TalesFromRetail is devoted to kvetching about your job in the retail sector, but it's really not a depressing place. There are a lot of rules and expectations about how you tell your story. You aren't supposed to outright dox anyone or veer into genuine trash talk.

It's not supposed to be negative per se. It's supposed to be entertaining.

It's an art form. It's not everyone's cup of tea, just like horror isn't everyone's cup of tea. But people often watch horror movies for catharsis, not because they want to be depressed and wallowing in self pity.

Storytelling is often about educating people about things you can't speak about more directly. It's often a way of sharing wisdom in an inoffensive manner and one that will stick because people will actually pay attention, unlike when you are giving them some dry lecture about some problem they haven't yet had and don't yet care about.

But if you entertain them, they will read it anyway and that story may stick with them. And then six months or a year later when they have the same problem, they will actually remember how someone else handled the same issue and it will turn a potentially nightmarish scenario into "Meh, I just did the same thing that guy on Reddit did to his shitty boss/customer/coworker. Worked like a charm. Moving on."


<any group of 2 or more people> (for reasons related to user demographics and feedback loops) rewards certain types of writing and implied viewpoints.

This is literally the basis of human interaction; that's how we humans work at every scale to form friendships/families/societies/nations.


Correct, though I don't think the comment you replied to intended to imply that such pressures & rewards didn't exist elsewhere or that this particular outcome was either general or not.

It just stated that the specific pressures and rewards present in most reddit communities tend to encourage this specific style of writing.


Maybe you are right and it was just a plain statement. But it sounded quite snarky to me. As if it was condescending toward Reddit for the biases a given subreddit might have, as if HN has none.


^

The flair for the piece is "Rant." That's an official category for the sub. There are going to be expectations surrounding how you write when using a tag like that.


It's funny you mention the rewards for certain types of writing within Reddit. I was thinking about it the other day & couldn't quite put my finger on why I dislike a lot of the stuff on there - even across subreddits. I think this is probably the cause...


> Reddit (for reasons related to user demographics and feedback loops) rewards certain types of writing and implied viewpoints.

Just like Hacker News! Here's a clue--it's in a subreddit where these types of stories are welcome.


So true. Reddit loves these "pro revenge" type stories, all one-sided and unverifiable, where the author is a lone hero, toiling against an uncaring world.


Probably because with proper architecture (clustering, HA, etc.) and planning, this would never have been an issue. This is still an extremely risky operation: hot-swapping power and switching interfaces on the fly, all while sitting on a cart in a corridor. In any disruptive work there is never a guarantee of no downtime for affected assets. I know the OP came in as a consultant, but if I was the MSP tech, I would have demanded a paper trail a mile long to cover my ass if this went sideways, and if I was the account manager for the client, I would have refused the work. It's not good business to agree to do work where you know there is a better-than-good chance there will be an outage and your client is saying they won't pay if there is an outage. Even agreeing to it puts you in a bad spot for future work. I guess as an outside consultant, bewilderment is a better reaction than ranting, but this is the kind of shit that drives ops folks crazy.


My SOWs leave zero room for 'and it will go flawlessly or you won't get paid at all'. If you are occupying my time in a way that makes me unable to serve other clients, you will definitely pay for it.


> planning would have never made this an issue

Hard work is wonderful stuff. Days and weeks of it can save you whole hours of planning.


> Me: You didn't notify them of scheduled maintenance like we discussed on Friday?

It appears that the client didn't have the specific requirements on initial consult.


The client was an asshole who demanded 100% uptime and stated that they wouldn't pay if there was any downtime at all. The rant is entirely justified.


I believe the expectation of having 0 downtime was not expressed until the day of the transfer.


In addition to that, the customer runs a single server but expects the guys to maintain a property that isn't even feasible at Google scale: zero downtime. Overall the whole thing was just ridiculous, but luckily the customer got a nice bill in the end.


To be fair, Google maintains zero downtime for small time scales like this a lot. Most of the time, actually.


I’m having a hard time following what “zero downtime most of the time, actually” really means.

https://m.youtube.com/watch?v=IKiSPUc2Jck&t=81s


> zero downtime for small time scales... Most of the time

I read it as "if you take small enough discrete time intervals they won't overlap with any downtime". Or in other words "no downtime between downtimes". Yes, it's very in line with your video.


It means that downtime is chunky.


Are you talking about Google Compute Engine? In that case yes, because by default VMs are live-migrated between physical hosts. This can be done for scheduled maintenance or upon signs that the machine is likely to fail. Furthermore there are no physical disks for a GCE VM, which is one of the more common failure points. The result of this is that GCE VMs often survive for months or years without downtime. Note that the SLA allows more than 3 hours of downtime per month. https://cloud.google.com/compute/sla

For physical servers the uptime is typically quite small. Of course Google isn't optimizing for server uptime so it isn't fair to say "well even Google can't do it".


I see Reddit, so I assume this is the sysadmin subreddit?

They're famous for not being a cheery bunch. Because Reddit's demographic does swing younger, the sub used to be filled with endless posts from people being socially incompetent or possessing zero business craft.

Does anyone know if it improved?


Reminds me of this - https://www.youtube.com/watch?v=vQ5MA685ApE

'Moving online webserver using public transport'


The Indiana Bell building move is pretty impressive. http://www.paul-f.com/ibmove.html



Wow this is literally the exact same thing as the OP but for an entire building. Insane.


In the rain!


had the same thought :)


That reminds me of the Pixar incident where Toy Story 2 was accidentally deleted while in production and had no working backups.

Luckily one employee was working from home (rare at the time!) and had a copy of the entire movie on her desktop computer, which they very carefully moved back to the office and were able to restore from.

https://www.youtube.com/watch?v=7MAedEXri7c


It was done 7 years ago, even using public transport.

https://www.reddit.com/r/uptimeporn/comments/1kf26r/moving_a...


The most dangerous part is them expecting the 3G to be available during the subway ride.


In Germany mobile networks work just fine in the subway, as ISPs have deployed hardware there. I actually have more issues with the network when using conventional rail...


I'm surprised they weren't stopped by police investigating a very suspicious, heavily loaded cart on the subway. It easily could have been 300 lbs of explosives on that cart.


I really thought this post on HN was going to be that story. Thanks for digging it up.


Good thing the server had two power supplies. There was a YouTube video (which I can't immediately find) of people moving a server across town, on the train, without powering it off, and, IIRC, they had to splice the UPS into the power cable.

When it's done for pay rather than for fun, and payment is conditioned on zero downtime, I hope they charged a premium to make up for the risk of no pay. Offhand, I don't know what's a good way to do that -- I've never had a consulting client demand terms like that for billed-by-the-hour work.


Effective hourly rate = base hourly rate * risk.

Risk = client risk * task risk.

Client risk is based on your past experience with the same client. If they're prone to demand last-minute changes or stupid stuff, they get charged a higher rate on every project afterward. Jacking up the client risk factor is also a nice way to fire a client you don't want.
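Spelled out, the whole thing is just two multipliers on top of your normal rate (all numbers below are made-up examples, not a recommendation):

    # Toy quote calculator: base rate scaled by client- and task-risk factors.
    BASE_RATE = 150.0  # normal hourly rate

    def effective_rate(client_risk: float, task_risk: float) -> float:
        return BASE_RATE * client_risk * task_risk

    print(effective_rate(client_risk=1.0, task_risk=1.0))  # 150.0: nice client, routine work
    print(effective_rate(client_risk=2.0, task_risk=5.0))  # 1500.0: "zero downtime or I don't pay"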



At my first job we were starting up the company and didn’t really know what we were doing; one early server was sitting on a folding table and its power cord was wrapped around a leg, so just to replace the table with something more robust involved downtime.


the careful application of a saw or an angle grinder would have made it possible to remove the folding table without unplugging the power cord. :-)


I've been there; solidarity for cheap-furniture-based maintenance windows.


I heard an anecdote about a company splicing some fiber cable in the middle of a utility van and having to cut the van apart at the end.


Reminds me of the OpenVMS clusters... In 2007 the police in Amsterdam celebrated an uptime of 10 years on their cluster. In that period, all the hardware was replaced, and half of it was moved to another location 7 km away. All data was moved from DAS disks to a SAN without a single application needing to be stopped. Also, VMS was upgraded from 6.2 to 7.3-2. The VMS cluster did not go down during any of these changes. I <3 OpenVMS


During Y2K I also had to shut down various OpenVMS servers with uptimes of over 10 years... only because of company policies, not because OpenVMS required the reboot.


Would be interesting to know if it is still up and running?


I'm picturing that Seinfeld episode where George tries to move the Frogger arcade from a restaurant that is shutting down but doesn't want to lose his high score.


HOLES! I need HOLES! :)


I'm surprised that part of the story wasn't to drill down into the requirements. No downtime ever? Not even at 3 AM on a Saturday?

I've found that when people are being unreasonable it is because they haven't split out their true needs from their first idea of how to meet those needs. In this case the true need is zero impact to users. The owner translated that to "zero downtime", and then didn't accept alternative solutions that still would have met his true business need.


I needed to restart a server where I worked. My boss was complaining about the revenue loss during the down time. I knew the revenue loss (if there even was any, as opposed to a couple of minutes of revenue simply shifting to a few minutes later...) would be well under a dollar.

So I listened to him whine for a couple minutes, then tossed a dollar on his desk, told him that would cover it so he could shut up now, and rebooted the server.

Warning: you should probably only try this if you are good friends with your boss. That boss had been my best friend for years before I came to work for his company.


I didn't want to lose my many months of uptime for a LAN party back in 1999/2000, so we used the UPS to migrate my Linux box across town for some Quake 3 Arena action.

Things were so much simpler back then.


Lower stakes, but ~15 years ago a friend had a Linux box in the corner that had huge uptime. I want to say the uptime started shortly after the kernel patch that fixed the 400-ish day overflow of the uptime counter. He moved to a new home and very carefully moved the running server using its UPS. He didn't have to worry about keeping networking up though.

I used to be all about long uptimes. I eventually started seeing long uptimes as a negative though. A long uptime probably means patches have not been applied.


I also did that once, about the same timeframe, specifically to preserve an uptime.

I think the cult of uptime came about simply because it was impressive that a personal computer could stay running for more than a few days when most of the world ran Win95. And because development cycles were longer and there weren't a lot of network threats.


I haven't read the article, but I'm reminded of that episode of Seinfeld and the Frogger arcade game.


I have pondered this exact scenario (server move w/0 downtime) - because of watching that episode - wouldn't have thought about it otherwise.

..It's interesting how pop-culture and your chosen profession intersect, at times.


Reminds me of the time where IT at a previous employer told us that due to a "new IT strategy", our production cluster that had been sitting comfortably in the basement for years had to be moved to an "approved IT hub facility"... in another office 500 km away and across the North Sea.

There was downtime.

Promptly after our cluster settled into this wonderful new facility, a cooling pipe in the ceiling leaked on it, frying 1/3 of our nodes.


On a personal, selfish level I was quite happy to see our workloads moving to datacenters that we couldn't (reasonably) physically access, because it replaced "can you go drive to the DC and replace a failing disk" with "we put in the request for smart hands to replace the failing disk". Of course, there are some notable tradeoffs, but it makes me feel better when the business decides to do such things...


When I was younger (read: 20 years ago), I did crazy things like that - not over that long a distance, but moving live servers between racks.

Now that I am older, I don't think I would do it anymore; too much stress for a small reward. Also, today I am able, most of the time, to talk customers out of crazy requirements, while I would just have said "OK, let's do it" in my younger years.


The moving-the-server-on-a-cart part made me nervous. If there was any rotating rust in there, bouncing across the parking lot would make things difficult for the flying heads. I'd have hand-carried it from stage to stage, setting it on a padded cart at each stop, treating it like sweating TNT.


Slight topic drift - any thoughts on how the pandemic might materially change assumptions about on-site/on-prem being better than cloud or a managed data center, now that the code people are actually remote from the "local" infrastructure? Something specific to the reality of the pandemic strikes me as something that would make the die-hard local-only folks have to start rethinking the position.

(Not to suggest it's bad, just different now that a primary assumption about people working in the office is less true.)


As someone who works in a very anti-cloud company culture (which I happen to agree with), this incident has had no effect whatsoever on that mindset. We don't dislike cloud because it is accessed remotely, we dislike cloud because of the lack of control we have over everything running there. If something happens and our local systems have a problem, there are people here, like myself, whose highest priority will be fixing it and whose second-highest priority will be communicating the status of that. Your problems are never a priority to a cloud vendor, and communicating with you is even less of a priority. That's before we even get into the absurd expenses and reliance on big fat pipes.


I feel a lot safer knowing I'm controlling all the variables during a global crisis, actually.

This article provides an example of how when you operate on prem, literally any crazy option remains on the table for you. If you asked your cloud provider to do this, it'd be a no.


Sorry, but this is ridiculous. It's a great story of a feat of sysadminery, but the client should have just accepted some downtime, even a few hours. The level of entitlement some clients have is just infuriating. Even down to calling him back after he refused to help - what an infuriating person.

That was my main takeaway from this. Endeavor to be the sort of person who can refuse clients; the entire idea that "the customer is always right" enables so much ridiculous behavior.


Decades ago working in a sysadmin role at a hosting company I had a similar situation.

The solution I came up with was to fashion a custom male<->male power cord, like a gender changer, from some broken ATX PSU scraps we had laying around. By rearranging the power sockets from multiple donors, two male power cords could be connected on a single enclosure. Internally the sockets were simply bridged, otherwise the PSU was basically gutted.

With this goofy metal box having two male power cords dangling from it in hand, I just used a very long extension cord plugged into an outlet on the same AC phase as the existing server's power source. The extension cord powered one of the bridge cords. The other bridge cord plugged into the server's existing - and hot - power strip, forming a redundant power source. Now the power strip could be unplugged from the primary power source without losing power, and we just moved the server to the new location with the bridge box and power strip in tow.

If memory serves the only tricky part was determining which outlet at the new home was on a compatible circuit. We didn't have much in the way of electronics tools, no oscilloscopes or anything. Even the soldering involved to make the bridge box was done using my personal soldering iron, which just happened to be in the office because some of us raced RC cars there after hours.

I think I just used an incandescent desk lamp to verify a normal brightness on the bridged circuit before proceeding with the server, but it's been a while.

I wonder how many people have fashioned AC power cord gender changers throughout history... :)


I always remember this post by the Amsterdam Police who managed to maintain their uptime on a VMS cluster despite moving data centres in the middle: http://web.archive.org/web/20120229042903/http://www.openvms...


Interesting read. It makes me wonder, as a thought experiment: does it count as downtime if the latency of commands on the machine rises to 5 minutes?

You could clone the VM to another instance and record commands going to VM1 and replay them to VM2 after 5 minutes.

This whole brain fart of mine doesn't make much sense, but if you play along with it, does it still count as downtime or just very high latency?
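Playing along: the record-and-replay half of the idea is basically a time-shifted queue. A toy sketch (the forwarding function and the delay are whatever you want them to be; a real system would also have to capture responses, handle ordering, and so on):

    # Toy delayed replay: commands recorded against VM1 are replayed to the
    # clone after a fixed delay, so the clone is always `delay` seconds behind.
    import queue, threading, time

    def make_delayed_replayer(forward, delay):
        buf = queue.Queue()

        def record(command):
            buf.put((time.monotonic(), command))

        def loop():
            while True:
                ts, command = buf.get()
                wait = delay - (time.monotonic() - ts)
                if wait > 0:
                    time.sleep(wait)
                forward(command)

        threading.Thread(target=loop, daemon=True).start()
        return record

    # Demo with a 2-second delay instead of 5 minutes:
    record = make_delayed_replayer(lambda cmd: print("replaying:", cmd), delay=2)
    record("UPDATE accounts SET ...")
    time.sleep(3)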


It depends on how downtime is defined in the contract.

That sounds like I'm being snarky but I mean it - whether an actual legal contract or just the documentation given to users, any system where downtime matters should have some discussion of what impacts downtime can have and how it's measured and managed.

That documentation is what defines "downtime".

I'll add that what you've described is a sort of low-fi manual version of DB replication (https://en.m.wikipedia.org/wiki/Replication_(computing)).


Wouldn't requests time out on the client side long before five minutes?


I don't know whether it's the software in general, but ever since I started using Three 4G broadband in the UK, all of my software started behaving really weirdly (lots of lockups, hangs, etc.). Apps often need to be restarted.

If you do a ping during "bad weather", you can see that they buffer up to 5 minutes of packets (i.e. there will be no communication for some time, then you'll receive a bunch of them with huge latency but with the sequence numbers intact).

So I would assume a lot of software could even work that way. I think a lot of software doesn't set any (TCP) timeouts at all.


That works where you have control over all of the timeouts and failure detection at every level and layer. TCP keepalives, for example, could thwart you. Or client side timeouts, or firewall connection state tables, etc.

5 minutes of unplanned downtime in a pub/sub setup could easily go unnoticed, since that setup is typically tuned for long timeouts and/or repeated retries.
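For reference, tightening those knobs from the application side looks roughly like this on Linux (the TCP_KEEP* option names are Linux-specific, and the values are just examples):

    # A socket with an explicit timeout and aggressive keepalives, so a silent
    # multi-minute stall gets noticed instead of hanging forever.
    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(30)  # fail blocking reads/writes after 30s

    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before probing
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before giving up

    # s.connect(("example.com", 443))  # then use the socket as usual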



Decades ago an ISP I was colocated at did the same thing. I don't remember the exact details, but it was a DNS server and they either couldn't log in or were relying on the zone files cached in memory or something but for some reason they couldn't power it off.

It was already plugged into a UPS, but they had to cut one of the posts off the rack to get the server out without unplugging it, then they plugged that UPS into a bigger UPS on a cart and wheeled it to the new data center they built out in the building next door.

The world was much different at the time -- this coloc provider had a good reputation, yet.... they had a keg of beer in the corner of the server room and a stack of adult magazines in the men's room.


I once shut down a PC, moved it to another desk, and it wouldn't power back on. Another time I moved a server to another rack. It had 2 years uptime. Had to power it down, and it wouldn't power back on. Both required PSU replacements. Had I moved them while powered on I can only imagine the fun times.

Perhaps they should have just told the customer they couldn't find it: https://www.theregister.com/2001/04/12/missing_novell_server...


This is the kind of content I've only ever seen previously in TDWTF (which is entirely this sort of content...)

https://thedailywtf.com/


Really disappointed they didn't use a wireless network of some kind.


My first thought as well. Set up WiFi along the path, basically turning the machine into a laptop. But I think there might be a disconnect when you change base stations? At least when I move my laptop between rooms in my house there's often a momentary problem while on a video call.

The other way I'd do it is more similar to what was described: create redundant network paths to the server, then cut one.


I wouldn't, the risk of disconnects is high.


But the risk of someone tripping over your looong Cat6 and breaking the network is not negligible either...


This reminds me of a small company I joined many years ago that did deployments by RAID - find a working server (possibly at a customer site), swap in a blank HD, wait for it to rebuild, then take it, put it in a new server, and repeat the process.

Like finding people who argue against revision control systems, it's really quite a challenge convincing people why things like this are a bad idea - after all "it works!".


That's... actually fascinating, if in a slightly insane way. There's pets, there's cattle... and apparently there's a herd of cloned pets, which I'd somehow never considered before:)



I once was called in to export data from a DOS program that had no export option. Single Author died of heart issues and the company needed the data for the migration.

After several attempts to understand the binary format I gave up and ended up printing tabular reports to LPT1 which I connected my laptop to, extracting it and rebuilding CSV files.

Luckily, printing was the most important feature of a business app in those days.
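
The extraction itself is mundane once the reports are captured as text files; a minimal sketch of the kind of fixed-width slicing involved, with the file names and column offsets being hypothetical:

    import csv

    # Hypothetical column offsets, eyeballed from the printed report layout.
    FIELDS = [("customer", 0, 20), ("invoice", 20, 30), ("amount", 30, 42)]

    with open("report_capture.txt") as src, open("extracted.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow([name for name, _, _ in FIELDS])
        for line in src:
            # Skip blank lines, page headers and separator rows.
            if not line.strip() or line.startswith(("PAGE", "---")):
                continue
            writer.writerow([line[start:end].strip() for _, start, end in FIELDS])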


Interesting story, and one that has played out a few times; I'm aware of a couple that went almost exactly like that. Another variant: use power extension leads to cover the power side. The key is having systems with dual power supplies (most servers do) and dual networking, so you can switch from one run to the other.

But I have known some large companies that have, at some point in their history, done things like this and come up with other creative solutions to impossible problems.


The consultants really should have told the client that if all you have is a single server, then there is no such thing as "zero downtime".


Why wouldn't cloning the VMs to a second server, then splitting the traffic between the primary and secondary servers, work? Once the second server is confirmed to be handling traffic, you could shut it off and haul it to the new location.

I would probably still charge a much higher rate since the owner was an arse, but at least you would get back your 7-8 hours.


You're making assumptions about what's running on the servers. Let's say it's a VoIP conference server with a shared dedicated room - effectively you have an ongoing session shared between multiple connections, and you cannot stop it. Or you have stateful local processing, so you can't "split the traffic". Or a number of other limitations...


Not all services can be load balanced in this way.

Live migration of the VMs would have been a better option; it was brought up in the reddit comments and dismissed because Hyper-V live migration is spotty. While I'd have to agree with that assessment, it isn't so spotty that what they actually did was the less risky option.


It sounds like there was no second server.


Database inconsistency, for one thing. This works for frontend web services, but how do you reconcile the writes between the two servers?


Dude has ONE server and talks about having 0 downtime for his clients? What the hell?!

In a way, this is Darwinism for the IT industry, and I'm happy the people involved got paid well. Dude probably paid as much as a new server would have cost him. I bet he'll never forget this lesson.


Setting up a new server at the new location and moving the VMs one by one to the new server as they become idle should be possible without downtime. But maybe there were other requirements (like no new/additional hardware) that weren't mentioned in the article.


A real classic comes to mind: https://www.youtube.com/watch?v=vQ5MA685ApE

Moving a running server about 7km through public transport without downtime.


Sometimes it's better to seek forgiveness than to seek permission.

A Saturday 3 AM shift with 5 minutes of downtime would work just as well. Unless this server has historically had 100% uptime, this would go unnoticed.


A 10-hour investment for no downtime seems like a good deal for the owner.


Depends on if he really has customers accessing the system "all the time."

Besides, as pretty much everyone has noted, running a zero-downtime system on a single physical machine in what sounds like a normal cable room is kind of nuts. Those 10 hours would have been much better spent moving that puppy to someone else's data center and getting some redundancy.

Although, reading between the lines, maybe the lease was up and they were waiting until the last minute to move it.


I've done something like this - a server running off a UPS, moved from one building in Manhattan to another about a quarter mile away, in the snow... Not for someone with weak arms.


You realize the client is now condescendingly mocking the guy for saying it couldn't be done, and will expect the same the next time they run updates on the server, which is to say never.


Seems very risky. Not something I'd want to do if minimum downtime was the goal. One wrong piece of gravel and you end up with catastrophic failure instead of 5 minutes of downtime.


But the goal was zero downtime, not minimum downtime. The client made it clear that 5 minutes of downtime was equivalent to catastrophic failure. So they correctly found a solution that reduced the chance of "5 minutes of downtime", at the expense of an increased risk of catastrophic failure.


I understand that. I just doubt that the risk was worth it, if downtime is such a big deal.


I wonder if it's really possible to do the initial setup of the Ethernet failover without interruption. I have never done this, but I would expect the interfaces themselves to become unavailable for direct use, with a completely fresh virtual Ethernet interface representing whichever physical interface is currently active... at least that's what happens when you add an Ethernet interface to a bridge in Linux...


I guess servers have gotten a lot more robust in the last decade...there's no way any server I ever managed would survive something like that.


A lot of servers are SSD-only these days, which makes them less fragile. Still, I really wouldn't see myself pushing a running server around in a cart.


Yeah, there's certainly still things like riser cards and connectors that could come unseated due to vibration.


That's probably a problem for the next guy that takes an ops job there. Loose pieces often don't disconnect right at the same instant, and even when they do, memory caches usually postpone the failures.


In a parking lot, no less! Let's hope it doesn't rain on the way!


Umbrellas over the switches and the cart...

(which means extra billable hours for the extra hands needed to hold the umbrellas)


I wonder if there was any legit reason to require no downtime. Otherwise the owner doesn't understand what downtime means for his business.


This reminds me of the Seinfeld episode with Frogger


When I was younger, I was super proud that I could replace my disk while I kept working on the device. I would add the new disk to my LVM volume group, move all the extents over to the new disk, and drop the old disk out of the VG afterwards. When it was done I could just unplug it, without ever halting work except to kick off the process.
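
For anyone who hasn't done this, the workflow maps onto the standard LVM commands. A minimal sketch wrapped in Python subprocess calls, with the device and volume group names being hypothetical:

    import subprocess

    OLD_DISK = "/dev/sdb1"   # hypothetical devices and VG
    NEW_DISK = "/dev/sdc1"
    VG = "vg0"

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run(["pvcreate", NEW_DISK])          # label the new disk as an LVM physical volume
    run(["vgextend", VG, NEW_DISK])      # add it to the volume group
    run(["pvmove", OLD_DISK, NEW_DISK])  # migrate all extents online; can take hours
    run(["vgreduce", VG, OLD_DISK])      # drop the now-empty old disk from the VG
    run(["pvremove", OLD_DISK])          # wipe its PV label; safe to unplug

The filesystem stays mounted the whole time; pvmove just shuffles extents underneath it.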


I seem to remember a similar story from another site. I'm thinking it was thedailywtf.com, but I can't find it.


Fun reading. But my advice is to never accept a job like this. It could easily turn into two weeks of downtime.


Meanwhile in Germany, German Telekom has had their Connect IP lines (leased lines, company internet) down since Tuesday morning. So that's over 48 hours of downtime, despite an SLA promising that no outage will last longer than 8 hours, with 99.9% availability.

What a crazy world.


Reminds me of the "hot slide" technique used for old telephone switches


They did some crazy stuff in the old days. Like when they moved a telephone exchange live... the whole building.


Could've been cheaper to buy or rent another server, put it in the new location, set up redundancy/replication, power off the old server, move it to the new location, and return the rented server. Or just keep it, for sanity's sake.


Incredible. Five minutes of VM downtime was not acceptable, and yet they had a single VM host. Should it catch fire (hardware FAILS!!!), what then?


It's called vMotion


Pictures please!


This reminds me of a famously obtuse and obdurate boss who asked for things that were utterly impossible. He had delusions of grandeur which left him convinced that he and only he was qualified to challenge the “cheap, fast, good - pick any two” triangle.

Naturally, I did my best to explain the laws of physics to him, but he wouldn’t hear it. In a spectacular display of Stockholm syndrome I did my best to appease him for four years, but, as many of you can surely predict by this point in the story, I failed in every possible way and eventually gave up. Just wish I could have my four years back.

I was glad to read that OP at least got paid well for his efforts.


I applaud you for being able to stand four years of this.

I usually get fired from such positions in less than two.


I usually walk out about six months in if not sooner. Maybe it's just because I spent so much time freelancing that I had enough experience to recognize a no-win situation.



