> Although Autobahn contained all the item data, the 404 responses were interpreted by the SDM Proxy as an indicator that the item was missing in Autobahn and the SDM Proxy retried the request to the central ILS API in the data centers.
This is why I never design web APIs to use the HTTP status code to indicate the application response. Always embed the application response within the HTTP payload. It should be independent of the transport mechanism. I’m ok with it not being a proper REST/RESTful service.
    {
      "status": 1000,
      "message": "Item not found"
    }
And intentionally don’t use the same status numbers as HTTP (i.e., don’t use 404 as not found, because someone will mix them up!)
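A minimal sketch of a client under that convention (Python; the constants and URL handling are hypothetical): the application outcome is read from the body, while transport-level surprises such as a proxy's stray 404 surface separately as HTTP errors.

    import json
    import urllib.request

    # Hypothetical application-level codes -- deliberately not HTTP numbers,
    # so nobody confuses them with transport status.
    APP_OK = 0
    APP_ITEM_NOT_FOUND = 1000

    def lookup_item(url):
        # Transport problems (DNS failure, 5xx, a misrouted 404 from a proxy)
        # raise urllib.error.HTTPError/URLError and can be retried or alerted on.
        with urllib.request.urlopen(url) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
        # The application outcome lives entirely in the body.
        if payload.get("status") == APP_ITEM_NOT_FOUND:
            return None
        return payload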
Sounds more like the server was misusing the 404 status and/or the clients were mishandling it.
I am inclined to agree that for this particular usage, an "in-body" response makes sense. 404 should be reserved for when the actual HTTP endpoint is unavailable. But in REST semantics, you would only return 404 for an endpoint like /users/12345 when user 12345 doesn't exist. So the two usages line up. Returning 200 with a body that says "user 12345 does not exist" makes a lot less sense to me.
A good example of overdoing it is when GraphQL servers return a 200 HTTP response that contains nothing but an error message, instead of returning a suitable HTTP status like 400.
The problem is that most REST services, because they are fundamentally HTTP services, are subject to how the underlying HTTP application server/proxy/middleware handles HTTP requests, and those layers can legitimately return 404 in many more cases than REST would allow. In Target's case, it should probably have returned 400 Bad Request, since what the client tried to access wasn't a REST endpoint.
The real problem with REST and HTTP is that it’s too easy to put middleware in between that doesn’t understand REST, just HTTP. As a software engineer or architect you can design the API to be perfect to your needs, but, when deployed you often lose control of how the client and server actually connect to each other. Proxies, caches, IDS, WAF, all can get in the way and don’t respect REST semantics.
As someone who tries to steer teams toward designing for middleboxes up front, instead of trying to spackle caching into the system once it's too baroque to implement properly, I find that a little time up front avoids a lot of pain and anger later on. 400 Bad Request is a much better way to signal that there is no REST endpoint for that request, versus there is one but we didn't find any data.
Also worrying about cache expiry is often a red herring. It's not fixing a problem, unless the problem is that the Product Owner keeps noticing that our code doesn't actually work as advertised. If you can generate a useful etag for a response, you can add client or middlebox caching any time it becomes useful, or take it away when it doesn't. But if you can't generate a useful etag, then any bespoke caching mechanism you build is unsound, because correct etag and correct cache invalidation are isomorphic.
I'd rather know we're on the road to unsound sooner rather than after customers rely on a bunch of misfeatures.
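The ETag test is cheap to apply up front. A rough, framework-free sketch (Python; the status/headers/body return shape is made up for illustration):

    import hashlib

    def make_etag(body):
        # A strong ETag derived from the representation itself. If you can't
        # compute something like this for a response, any bespoke caching you
        # bolt on later is already unsound.
        return '"' + hashlib.sha256(body).hexdigest() + '"'

    def respond(body, if_none_match=None):
        etag = make_etag(body)
        if if_none_match == etag:
            # The client's or middlebox's copy is still valid: revalidate, send no body.
            return 304, {"ETag": etag}, b""
        return 200, {"ETag": etag}, body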
Do you feel like it's a mistake to conflate URL paths with resources?
Because it sounds like you take issue with the conflation of "404 because user 12345 doesn't exist" and "404 because this url path is malformed", and (if I understand your post) are suggesting that the latter should be 400 so as to allow the former to be 404.
What about instead using 204 to signal "does not exist in database but request is otherwise valid"?
Well I think 404 is only the first problem you have to solve and people don't even do that very well, then declare success and move on. I've seen so many developers try to reinvent things that are in the HTTP 1.0 specification.
I've definitely had to deal with problems of people returning a 200 "not found" and those leaking up into higher layers of the system, and I agree with the people who don't like that solution. 204 is probably better, but I've seen too much code that looks for 200<=status<300, so you're still dealing with the same problem, but at least that's a client bug, not a roll-your-own issue.
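Roughly the client-side branching that a blanket 200<=status<300 check papers over (a sketch; parse_item stands in for whatever deserialization the real client does):

    import json

    def parse_item(body):
        # Stand-in deserializer for whatever the real client does with the payload.
        return json.loads(body)

    def handle_item_response(status, body):
        # Treating every 2xx as "we got data" is exactly the client bug above:
        # 204 carries no body by definition and needs its own branch.
        if status == 204:
            return None                     # valid request, nothing found
        if 200 <= status < 300:
            return parse_item(body)
        if status == 404:
            raise LookupError("no such endpoint or resource")
        raise RuntimeError("unexpected status %d" % status)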
On one project, I put the kibosh on someone trying to negotiate clock skew detection between the client and server. I don't recall what they were going to do about it but I pointed out that information is already in the request headers, and has been from the start. So if your timing decisions are tied to an HTTP request, you don't need to do anything out of band to negotiate a common notion of time.
I don't recall when I stopped seeing timestamps of January 2, 1970, but I know I saw enough of them that it stopped being a novelty and I stopped looking for them. In 1995 when they were still working on the spec, nobody used NTP yet, and many, many people had a dead or dying CR2032 battery on their motherboard and either hadn't noticed or didn't know where to buy a replacement. The dying ones were the worst because sometimes they would hold the time but not increment it properly, so the clock skew was proportional to how many hours a day they turned the machine off. "Expires on March 1" didn't mean anything, unless you knew that the server thought it was February 28th. I suspect Tim was thinking of time zones between scientists in Switzerland, France, England and Germany, but it worked really well for crap hardware too.
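Concretely, it's the Date header, which HTTP/1.0 already defined; a minimal sketch of measuring skew from it (Python, standard library only):

    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime

    def clock_skew_seconds(date_header):
        # e.g. clock_skew_seconds("Tue, 15 Nov 1994 08:12:31 GMT")
        # Positive result: the other side's clock is behind ours.
        their_now = parsedate_to_datetime(date_header)
        our_now = datetime.now(timezone.utc)
        return (our_now - their_now).total_seconds()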
> The real problem with REST and HTTP is that it’s too easy to put middleware in between that doesn’t understand REST, just HTTP.
REST isn't a protocol, but a major point of REST's insistence on using the underlying protocol as specified is that if you are doing REST, middleware doesn't need to understand REST, only the underlying protocol.
And the problem here wasn't middleware that wasn't aware of REST, it was middleware that was misconfigured and requesting data from the wrong remapped URLs and relaying the responses faithfully. There was nothing failing to respect REST semantics involved.
> In Target's case, it should probably have returned 400 Bad Request, since what the client tried to access wasn't a REST endpoint.
The trouble with that approach is that it puts the onus on the Server to respond accordingly. In the case of a Client misconfiguration you might point the Client at a valid HTTP Server, just not the one you anticipated.
I agree with nerdponx that they were misusing 404. Instead they should have used a different HTTP Response Status Code to indicate something was removed from the system. Perhaps 410 Gone or 417 Expectation Failed.
417 is specifically for use with the Expect header, so would be a misuse as well. 410 would be fine, but is technically a sub-case of 404, so if 410 is fine 404 should arguably be too.
I think the important thing is to be consistent. I worked with a microservice system for years (as a front end engineer) where some teams would use 404 to indicate record not found, some teams would use your system, and a couple of teams sent back the response in the header! Of the teams using your system, it would be quite an ordeal to find out the meaning of "status: 1000", especially if the system was 10+ years old and the original team no longer around.
Fair points. My APIs have associated constants classes/header files to define all these values.
So long as you have the source you are fine - and if you (or the third party/maintainer) don't have the source, then the approach doesn't matter, because there will be bugs you can't fix throughout the service.
Reminds me of an online ordering app at my university - when the API went down one day the app started reporting to everyone that the wait time for their food was “503 minutes”
That's just awful programming, and the parent's suggestion wouldn't have addressed that. It would have instead resulted in "1000 minutes" or whatever their status code was.
You do have to consider middlebox and client caching when you do this. Returning a 200 with a 'not found' would be cached, and that may or may not be desired for the use case.
That's, I guess, an issue for cache header instructions to solve.
With 200 codes being cached you will still have problems like stale data. Wouldn't want the Target registers using yesterday's prices today (especially if yesterday was a super sale day like Black Friday, etc.).
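A sketch of the header discipline that implies (the 60-second TTL is an arbitrary placeholder, not a value from the article):

    def cacheability_headers(found):
        if not found:
            # A 200 whose body says "not found" must not be cached, or a
            # transient miss gets pinned in every middlebox along the path.
            return {"Cache-Control": "no-store"}
        # Prices go stale quickly, so keep any caching short and revalidatable.
        return {"Cache-Control": "max-age=60, must-revalidate"}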
Also: maybe count any sort of retry or fallback as a trackable and alertable failure.
Auto/silent fallbacks seem like clever ways of avoiding beeps and remaining resilient against failures in supporting systems, but in practice they always tend to just cover up real issues until it's too late.
I think the ideal is to have a nice, easily included retry library that includes reporting/alerting, configurable backoff schemes, and logging, and that can be used everywhere on things that can have transient failures.
Yeah, this. If they just had a custom counter for "ILS datacenter fetch" retries after receiving a 404 from the local cache and that spiked to, say, 50% of the "every scan counter" value, something would already be seriously wrong: How does the local Target store cache have less than 50% of the needed data in it??
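A sketch of what such a counted, alertable fallback might look like (Python; the metric name, backoff values, and lookup callables are hypothetical, and prometheus_client is just one way to expose the counter):

    import time
    from prometheus_client import Counter

    # Every fallback to the datacenter is counted, so a spike relative to the
    # overall scan rate is visible and alertable instead of silently absorbed.
    FALLBACK_RETRIES = Counter(
        "ils_datacenter_retries_total",
        "Retries sent to the central ILS API after a local cache miss")

    def lookup_with_fallback(item_id, local_lookup, central_lookup,
                             attempts=3, backoff=0.2):
        result = local_lookup(item_id)
        if result is not None:
            return result
        for attempt in range(attempts):
            FALLBACK_RETRIES.inc()
            try:
                return central_lookup(item_id)
            except Exception:
                time.sleep(backoff * (2 ** attempt))   # configurable backoff
        raise LookupError("item %s unavailable locally and centrally" % item_id)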
I think a better solution would be to use the 404 code but include a body with the detailed error. That way the response to an invalid URL looks different from an actual item not being found.
The proper response should have been 400 Bad Request, not 404 Not Found, because the client's request contained bad request data, not because the item wasn't found.
If you place a standard web server in default configuration (serving an empty directory tree of static files) on that host, it will also return 404 for any GET request. By that I mean that the client, in principle, has to expect that 404 means "I asked the wrong server" or "the server is misconfigured" or "my endpoint is gone".
This was really interesting both in exploring the architecture of a retail system and looking at how systems fail. Better to read about it and learn than to live it.
I'd call it a 4 hour outage because the initial "recovery" was a result of cashiers manually typing in prices for items. Then when load decreased and they discovered that scanning items worked again the problem came right back.
Maybe returning 404 for both a cache miss and a "there's no endpoint at this path" error is an issue too. For other status codes there's a distinction between temporary and permanent failure; e.g. 301 versus 302. It would've been good to use HTTP 400 Bad Request for the misconfigured URL and 404 for a cache miss.
In the 10% of stores with the early roll out of the config change the cache hit rate went to 0 right away, and that started 12 days before the outage. Alerts on cache hit rates and per-store alerts would've caught that.
Then there were 4 days where traffic to the main inventory micro-service in the data center jumped 3x which took it to what appears to be 80% of capacity. Load testing to know your capacity limits and alerts when you near that limit would've called out the danger.
Then during the outage when services slowed down due to too many requests they were taken out of rotation for failing health checks. Applying back pressure/load shedding could have kept those servers in active use so that the system could keep up.
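A minimal sketch of that kind of load shedding (the in-flight cap of 200 is a stand-in for a limit learned from load testing, not anything from the article):

    import threading

    MAX_IN_FLIGHT = 200                      # stand-in for a load-tested limit
    _slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

    def handle_request(process):
        # Reject early with a cheap 503 instead of queueing work until the whole
        # instance slows down and gets pulled out by its health checks.
        if not _slots.acquire(blocking=False):
            return 503, b"overloaded, retry later"
        try:
            return 200, process()
        finally:
            _slots.release()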
204 No Content is an underused HTTP status. 404 should be monitored as an error; 204 as, well, no content available. If a status code has two responsibilities, that's a monitoring issue waiting to happen.
It seems absolutely insane to me that a system was designed and developed that allows taking down all registers in the country at once. I would have thought it would be designed to be much more "batch" oriented, and the worst that could happen is you lose price updates and sales info until the batches can get through again.
If the local system doesn't have the item data (or thinks it doesn't have the item data, because it's looking in the wrong place) where exactly is it supposed to get the item data from if not the central system?
The total size of all item data for a store like Target can't be much more than what, a few gigabytes? Or at least the "UPC -> Price" dataset. So download the whole dataset each night, if you can't get delta changes to work.
And if that had failed somehow, it would have been noticed immediately upon the new code roll-out.
The internet was designed to be extremely resilient to host/route losses, we've made it so reliable we assume all machines are reachable at all times.
(To be fair, apparently they "do" have this but the dataset is printed on the items and the cashiers had to enter it by hand)
Their system was basically designed the way you're saying, with a fallback to grab the data from the central location if it's missing locally. What you're asking for is the same system without a fallback, which doesn't make any sense.
The fallback was the problem - design it without it, or with a manual window that pops up saying "ITEM NOT FOUND, QUERY TARGET ORACLE" or something, and the fault wouldn't have taken down the whole company.
If suddenly every cashier is being forced to hit OK on every item, people would hear about it immediately from the test rollout instead of when it hit everything (of course, assuming you have good methods for detecting things like this and don't just completely ignore associates' complaints).
It's very interesting that by building a system that's more resilient and reliable:
> high profile processes (such as POS) implement their own fallback processes to handle the possibility of issues with the SDM system in store. In the case of item data, the POS software on each register is capable of bypassing the SDM Proxy and retrying its request directly to the ILS API in the data centers.
... the system as a whole became much more complex and difficult to observe. The system was running in a degraded, abnormal, less-tested, fallback mode for days without anyone caring.
This is also a point about the normalization of deviance. When there is a background rate of the POS using the fallback path, who is to say how important an increase in that rate might be?
Fallback is not always necessary (sometimes it is, you can't just say "whelp the engines on this plane went out, time to die") but when you have a fallback system you should think about why you have it and how bad it is to fail, and if it could be worse to succeed.
A few years ago, the guys who built Chick-fil-a's POS fog were on HN talking about their fault-tolerance and transaction queueing. It was quite interesting.
There's a lot that you can learn from high-volume POS system design that applies to just bog-standard every day programming.
Glad to see that a big US retailer like Target is using the same types of "de-facto" observability tools that I've been using for a while at all of my various employers over the last 5+ years - which are Grafana, Prometheus and Elastic Stack (specifically the Kibana UI for the logging analysis screenshot).
"It’s not enough to implement redundant systems and failovers, we must monitor and alert when those systems are being exercised."
My air conditioner in my house has a secondary drain pan under it. The outlet for that drain pan is right above a main window outside. If the primary condensate drain gets plugged/fails and the water overflows into the backup pan there would be a stream of water in front of a window that shouldn't otherwise be there. They want you to be able to readily notice it as you are now at risk for significant water damage if that secondary drain manages to plug up too.
Always something worth considering when designing any system - how to make it fail in a way that is noticeable!
I don't work at that level, or even want to, but I did detect a dark pattern that I often complain about, but have never managed to get people to pay attention to: do not collect data unless you have attached to it a decision with two or more distinct outcomes based on that data.
So when I ran the university website, the homepage naturally had links to other sites. One guy had this inflated sense of importance. If there weren't a lot of clicks over to his site, we should MAKE THE LINK BIGGER because people weren't seeing it. If clicks to his site went up, we should MAKE THE LINK BIGGER because it is that important. His flowchart had only one distinct outcome: MAKE THE LINK BIGGER.
All of the effort that went into collecting the information was for nought, because the outcome was always the same. That was collection with a flowchart, but without two or more distinct outcomes.
A second example would be search engine logs. Nobody wanted to make decisions on them, but "we could always trawl them for data later." A decade on, this had never occurred. That was collection with no flow chart. Offloading the logs, parsing them out, making the data available, week after week, month after month, year after year. Wasted effort.
So part of it is "don't waste effort," but the other part is, if there is decent information to collect, you should be doing something with it.
I'd like to see staff training for major outages in retail like this.
For example, if the shop loses power, do they have the ability to sell goods still?
One approach is to let staff members estimate the value of goods - for example at the register, the staff member looks at the cart contents, estimates that it's about $120 worth of goods, charges the customer $120, and hand writes a receipt saying "$120 of goods sold, Date, store name, signature". The staff member then uses a phone to photograph the cart and the receipt.
At the end of the shift, the staff member drops all the photos into a big store-wide Dropbox account that the accounts department can use to pay taxes.
You'd probably want to practice this process ahead of time with every staff member.
I imagine it might actually be a good process to use on very busy days too - it is probably quicker than scanning every item at the register.
>>For example, if the shop loses power, do they have the ability to sell goods still
To some degree yes, we can check people out with a handheld (which has swappable batteries) and the self check registers are on the emergency power circuit.
Couple years ago when the system went down nationwide we just told people to put their name on their cart and we gave them 10% off if they came back the next day.
Target has 250k SKUs total - why is their inventory system so complicated? Why the hybrid on-prem store + data center cloud model - isn't it easier if there is one source of truth? Seems like it would reduce the need for even dealing with all this eventual-consistency cache syncing and whatnot.
I ofc don't know what I don't know, but super curious if anyone has insight into why such a complex system is required.
Also, if this microservice is used for brick and mortar, I can't imagine more than a couple hundred requests per second (2000 stores, 5 registers a store, and humans manually scanning items) - so why did that overload the microservice (guessing it wasn't an endless exponential backoff)?
> I ofc don't know what I don't know, but super curious if anyone has insight into why such a complex system is required
Because it's much more efficient, which allows them to use simpler tech that doesn't need to scale as well.
You are also underestimating the throughput the system needs to handle: 2000 stores * 10 registers per store * 1000 scans per register per hour ≈ 5,500 scans per second.
I'm not sure the throughput is that high - scans take quite a bit of time, and I doubt that a register scans an item every 3.6 seconds. I don't have data on this, but I would easily triple that estimate as an average (so in the hundreds).
Also, I get the simpler tech, but complexity breeds failure - if you have a hybrid on-prem/cloud model, especially with only 250k SKUs, at that point doesn't it make sense to keep that exclusively in the cloud?
It’s a system that scans a barcode and returns an item at its core - this is still well under the limits of using an off the shelf system like Redis behind an endpoint
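A sketch of that shape with redis-py (hostname, key scheme, and the nightly bulk load are assumptions for illustration, not Target's actual design):

    import redis

    # Assumes the full UPC -> price dataset (a few hundred thousand keys) is
    # loaded into one Redis instance, e.g. by a nightly batch job:
    #   r.mset({"price:%s" % upc: str(p) for upc, p in full_dataset})
    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def price_for_upc(upc):
        price = r.get("price:%s" % upc)
        return None if price is None else float(price)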
"I would doubt that a register scans an item every 3.6 seconds"
Indeed, that sounds WAY too slow to me - traffic like this is bursty. Ever try to scan five of the same thing at some self-checkout registers? On some it's instantaneous (an awesome customer experience); on others there are delays of a second or more (a horrible customer experience).
Latency = friction and friction is the ultimate deal killer.
Not being able to check out customers is a really bad customer experience. It's a double whammy of wasting their time and they don't even get what they needed so it's worth investing to make that less likely. Things are probably better now but when I worked at a Sears our network connection to HQ wasn't reliable enough to depend on completely for checkout operations.
Wasn't Target where hackers were running loose in their POS ("point of sale", not the other meaning) system for months or years? Was that before or after this incident?
Why don't the ILS services have their own cache in front of them? Supporting a per-store cache already requires good discipline on timeouts and invalidation, so adding an additional caching layer in the datacenter between the inbound requests and ILS itself seems like it would provide for a cheap extra layer of scalability in case the per-store caches become unavailable.
Yes. Same way they call their employees "associates". I don't quite understand the rationale, but if I had to guess, "customer" and "employee" are a bit too on the nose, and they wish to cultivate a more human-feeling relationship between the customers, employees, and corporation in the minds of the former two groups.
Not just in the minds of the former two groups, but in the minds of their staff and leadership as well.
If you refer to your team members or employees as "associates" you're much more likely to treat them as equals.
Similarly, if you refer to your customers as "guests", you are much more likely to treat them as such rather than simply treating them as people in your store looking to spend money. It gets to the whole sense of trying to create an experience. As a store that sells a significant amount of home goods and goods for the home, referring to customers as guests instills the sense that employees are creating a home like experience for the customer.
Neurolinguistic programming isn't just for hippies. It's a very popular pseudoscience in corporate America.
It is for this reason that I doggedly push back on the use of "resources" when talking specifically about people; I semi-frequently correct this mis-use (IMO) of language.
If you ask "do we have enough resources to compete in segment X?" and you mean resources of all types [including people], that's fine. If you ask "could I have two additional resources on this project" and you mean exactly people, I'll speak up every time.
Not all associates are at the same level. Some people unfamiliar with this American Business Vocabulary might jump to conclusions.
Some associates are the customers of the systems that you are responsible for and you are the customer for services other associates maintain.
Unfortunately rather than talk about the importance of respect and what happens when respect between members of groups within the organization is violated, these sorts of neurolinguistic fashions are used.
Ugh, some retailers still call their employees "partners." Kroger would write that every check was 'brought to you by customers' on every paper and digital pay stub. The rosy language is always used to obfuscate the exploitation going on. It's fascinating to see Target slightly improve security over the years after multiple hacks and problems with register security.
I'm not necessarily against that in general, but if it's a technical article for a technical audience, which this looks to be intended as, they really need to drop the marketing jargon.
I always assumed that "associates" was created to encode the idea that people's salary was mostly commission based. But with you talking about those giant corporations that call everybody by that name, this is either anachronistic or plain wrong.
My wife worked for Darden Restaurants for a while and the corporate training materials always referred to customers as "guests", too.
On one level I suppose it's just silly terminology, but it grates with me. I guess it's supposed to imply some kind of familiar relationship, free of the gauche trappings of economics. To me a customer demands more attention than a "guest".
It shocks me how many people don't recognize that their employer wouldn't exist if not for customers. That should be front-and-center in the minds of anyone working for a for-profit entity. I don't think there's anything gauche about economics.
The history of food service goes hand-in-hand with the hospitality industry, so referring to a customer as a "guest" is very traditional and common amongst almost all restaurants.
Some 90's thing that a couple retail stores started doing. Must have been popularized by whatever executives took advice from before Gary Vee and Seth Godin.