> Although Autobahn contained all the item data, the 404 responses were interpre...

nerdponx · on June 23, 2022

Sounds more like the server was misusing the 404 status and/or the clients were mishandling it.

I am inclined to agree that for this particular usage, an "in-body" response makes sense. 404 should be reserved for when the actual HTTP endpoint is unavailable. But in REST semantics, you would only return 404 for an endpoint like /users/12345 when user 12345 doesn't exist. So the two usages line up. Returning 200 with a body that says "user 12345 does not exist" makes a lot less sense to me.

A good example of overdoing it is when GraphQL servers return a 200 HTTP response that contains nothing but an error message, instead of returning a suitable HTTP status like 400.

yardstick · on June 23, 2022

The problem is that most REST services, because they are fundamentally HTTP services, are subject to how the underlying HTTP application server/proxy/middleware handles HTTP requests. Those layers can validly return 404 in many more cases than REST would allow. In Target's case, it should probably have returned 400 Bad Request, since what the client tried to access wasn’t a REST endpoint.

The real problem with REST and HTTP is that it’s too easy to put middleware in between that doesn’t understand REST, just HTTP. As a software engineer or architect you can design the API to be perfect to your needs, but, when deployed you often lose control of how the client and server actually connect to each other. Proxies, caches, IDS, WAF, all can get in the way and don’t respect REST semantics.

hinkley · on June 23, 2022

As someone who tries to steer teams toward designing for middleboxes up front instead of trying to spackle caching into the system once it's too baroque to implement properly, I find that a little time up front avoids a lot of pain and anger later on. 400 Bad Request is a much better way to signal that there is no REST endpoint for that request, as opposed to there being one that didn't find any data.
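A minimal sketch of that split, assuming a single hypothetical `/users/<id>` route and an in-memory `USERS` table (neither comes from the thread): an unrecognized path gets 400, a recognized path with no data gets 404.

```python
import re
from typing import Dict, Tuple

# Hypothetical data and route pattern, for illustration only.
USERS: Dict[str, dict] = {"12345": {"name": "Ada"}}
USER_ROUTE = re.compile(r"^/users/(\d+)$")


def handle_get(path: str) -> Tuple[int, dict]:
    """Return (status, body) for a GET on the given path."""
    match = USER_ROUTE.match(path)
    if match is None:
        # No REST endpoint has this shape: the request itself is wrong.
        return 400, {"error": "no_such_endpoint", "path": path}
    user_id = match.group(1)
    user = USERS.get(user_id)
    if user is None:
        # The endpoint exists, but the resource does not.
        return 404, {"error": "user_not_found", "id": user_id}
    return 200, user
```

With this shape, a client that sees 400 knows it is talking to the wrong URL, while 404 only ever means "no such user".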

Also worrying about cache expiry is often a red herring. It's not fixing a problem, unless the problem is that the Product Owner keeps noticing that our code doesn't actually work as advertised. If you can generate a useful etag for a response, you can add client or middlebox caching any time it becomes useful, or take it away when it doesn't. But if you can't generate a useful etag, then any bespoke caching mechanism you build is unsound, because correct etag and correct cache invalidation are isomorphic.
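As a rough illustration of "generate a useful etag", here is a sketch that hashes a canonical JSON encoding of the response and answers a conditional request. The function names are hypothetical, and it simplifies `If-None-Match` to a single strong validator (the real header can carry a list and weak validators).

```python
import hashlib
import json
from typing import Optional, Tuple


def make_etag(payload: dict) -> str:
    """Derive a strong ETag from a canonical encoding of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return '"' + digest + '"'  # ETag values are quoted strings


def conditional_get(payload: dict, if_none_match: Optional[str]) -> Tuple[int, dict, Optional[dict]]:
    """Return (status, headers, body); 304 when the client's copy is current."""
    etag = make_etag(payload)
    if if_none_match == etag:
        return 304, {"ETag": etag}, None
    return 200, {"ETag": etag}, payload
```

If the payload cannot be reduced to a stable value like this, that is the early warning that any hand-rolled invalidation scheme will have the same gap.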

I'd rather know we're on the road to unsound sooner rather than after customers rely on a bunch of misfeatures.

nerdponx · on June 23, 2022

Do you feel like it's a mistake to conflate URL paths with resources?

Because it sounds like you take issue with the conflation of "404 because user 12345 doesn't exist" and "404 because this url path is malformed", and (if I understand your post) are suggesting that the latter should be 400 so as to allow the former to be 404.

What about instead using 204 to signal "does not exist in database but request is otherwise valid"?

hinkley · on June 24, 2022

Well I think 404 is only the first problem you have to solve and people don't even do that very well, then declare success and move on. I've seen so many developers try to reinvent things that are in the HTTP 1.0 specification.

I've definitely had to deal with problems of people returning "200 not found" and those leaking up into higher layers of the system, and I agree with the people who don't like that solution. 204 is probably better, but I've seen too much code that looks for 200<=status<300, so you're still dealing with the same problem. At least that's a client bug rather than a roll-your-own issue.
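To make the client-side bug concrete, here is a hedged sketch of a client that treats 200 and 204 differently instead of lumping every 2xx together. The function name and the "found"/"absent" labels are made up for illustration.

```python
from typing import Optional, Tuple


def interpret_lookup(status: int, body: Optional[dict]) -> Tuple[str, Optional[dict]]:
    """Map a lookup response to an explicit outcome instead of 'any 2xx is success'."""
    if status == 200:
        return "found", body
    if status == 204:
        # Request was valid, but there is nothing to return.
        return "absent", None
    if 200 <= status < 300:
        # Other 2xx codes are unexpected for a lookup; fail loudly rather than guess.
        raise ValueError("unexpected success status: %d" % status)
    raise RuntimeError("lookup failed with status %d" % status)
```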

On one project, I put the kibosh on someone trying to negotiate clock skew detection between the client and server. I don't recall what they were going to do about it but I pointed out that information is already in the request headers, and has been from the start. So if your timing decisions are tied to an HTTP request, you don't need to do anything out of band to negotiate a common notion of time.
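For example, assuming the header in question is the standard `Date` header, the skew can be estimated with nothing but the standard library. The function name is hypothetical, and this sketch ignores network latency.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def clock_skew_seconds(date_header: str, local_now: datetime) -> float:
    """Seconds by which the peer's clock is ahead of ours (negative if behind)."""
    remote = parsedate_to_datetime(date_header)  # parses HTTP-date strings like "... GMT"
    return (remote - local_now).total_seconds()


# Usage: compare a received Date header against the local clock.
skew = clock_skew_seconds("Thu, 23 Jun 2022 12:00:00 GMT", datetime.now(timezone.utc))
```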

I don't recall when I stopped seeing timestamps of January 2, 1970, but I know I saw enough of them that it stopped being a novelty and I stopped looking for them. In 1995 when they were still working on the spec, nobody used NTP yet, and many, many people had a dead or dying CR2032 battery on their motherboard and either hadn't noticed or didn't know where to buy a replacement. The dying ones were the worst because sometimes they would hold the time but not increment it properly, so the clock skew was proportional to how many hours a day they turned the machine off. "Expires on March 1" didn't mean anything, unless you knew that the server thought it was February 28th. I suspect Tim was thinking of time zones between scientists in Switzerland, France, England and Germany, but it worked really well for crap hardware too.

dragonwriter · on June 23, 2022

> The real problem with REST and HTTP is that it’s too easy to put middleware in between that doesn’t understand REST, just HTTP.

REST isn't a protocol, but a major point of REST’s direction to use the underlying protocol as specified is that if you are doing REST, middleware doesn't need to understand REST, only the underlying protocol.

And the problem here wasn't middleware that wasn't aware of REST, it was middleware that was misconfigured and requesting data from the wrong remapped URLs and relaying the responses faithfully. There was nothing failing to respect REST semantics involved.

cptskippy · on June 23, 2022

> In Target's case, it should have probably returned 400 Bad Request, since what the client tried to access wasn’t a REST endpoint.

The trouble with that approach is that it puts the onus on the server to respond accordingly. In the case of a client misconfiguration, you might point the client at a valid HTTP server, just not the one you anticipated.

I agree with nerdponx that they were misusing 404. Instead they should have used a different HTTP response status code to indicate something was removed from the system. Perhaps 410 Gone or 417 Expectation Failed.

layer8 · on June 23, 2022

417 is specifically for use with the Expect header, so would be a misuse as well. 410 would be fine, but is technically a sub-case of 404, so if 410 is fine 404 should arguably be too.

TrueGeek · on June 23, 2022

I think the important thing is to be consistent. I worked with a microservice system for years (as a front end engineer) where some teams would use 404 to indicate record not found, some teams would use your system, and a couple of teams sent back the response in the header! For the teams using your system, it would be quite an ordeal to find out the meaning of "status: 1000", especially if the system was 10+ years old and the original team was no longer around.

yardstick · on June 23, 2022

Fair points. My APIs have associated constants classes/header files that define all these values.

So long as you have the source you are fine - and if you (or the third party/maintainer) don’t have the source then it doesn’t matter the approach because there will be bugs you can’t fix throughout the service.

tornato7 · on June 23, 2022

Reminds me of an online ordering app at my university - when the API went down one day the app started reporting to everyone that the wait time for their food was “503 minutes”

throwaway894345 · on June 23, 2022

That's just awful programming, and the parent's suggestion wouldn't have addressed that. It would have instead resulted in "1000 minutes" or whatever their status code was.

reaperducer · on June 23, 2022

> That's just awful programming, and the parent's suggestion wouldn't have addressed that.

It would have if the API used the HTTP status number to indicate the number of minutes.

I've seen crazier things in web APIs.

thedougd · on June 23, 2022

You do have to consider middlebox and client caching when you do this. Returning a 200 with a 'not found' would be cached, and that may or may not be desired for the use case.

yardstick · on June 23, 2022

That’s I guess an issue for cache header instructions to solve.

With 200 codes being cached you will still have problems like stale data. You wouldn’t want the Target registers using yesterday's prices today (especially if yesterday was a super sale day like Black Friday).
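As a sketch of what "cache header instructions" could look like for an in-body "not found", assuming a JSON body and a made-up `found` field: the 200 response carries `Cache-Control: no-store` so that intermediaries and clients are told not to keep it.

```python
import json
from typing import Dict, Tuple


def body_level_not_found(item_id: str) -> Tuple[int, Dict[str, str], str]:
    """Build a 200 response whose body says 'not found', marked as uncacheable."""
    headers = {
        "Content-Type": "application/json",
        # Ask caches not to store this response at all.
        "Cache-Control": "no-store",
    }
    body = json.dumps({"found": False, "item_id": item_id})  # hypothetical body shape
    return 200, headers, body
```

Whether `no-store` or a short `max-age` is the right choice depends on how stale a "not found" is allowed to be for the use case.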

kerbs · on June 23, 2022

Target's mobile apps were down over Black Friday many years back for a very similar reason – logic done on status codes.

A 403 in the API had a very specific meaning, and when the proxy layer started returning 403s everyone had a really bad time.

(That was a long day)

marcosdumay · on June 23, 2022

> A 403 in the API had a very specific meaning

And that meaning wasn't "you are authenticated as a user that cannot access this resource"?

kerbs · on June 26, 2022

It meant the client was expected to then make a request to refresh their session token.

Because of the middle layer sending a 403 instead of the API, clients would request refresh tokens in an infinite loop.
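One way a client could avoid that loop, sketched under assumptions (the `send` and `refresh` callables are hypothetical stand-ins for the real request and token-refresh calls): cap the number of refreshes per request and surface an error once the cap is hit.

```python
from typing import Callable, Optional, Tuple


def fetch_with_bounded_refresh(
    send: Callable[[], Tuple[int, Optional[dict]]],
    refresh: Callable[[], None],
    max_refreshes: int = 1,
) -> Tuple[int, Optional[dict]]:
    """Retry a request after refreshing the token, but only a bounded number of times."""
    refreshes = 0
    while True:
        status, body = send()
        if status != 403:
            return status, body
        if refreshes >= max_refreshes:
            # A fresh token did not help, so the 403 probably means something else.
            raise RuntimeError("still 403 after %d token refresh(es)" % refreshes)
        refresh()
        refreshes += 1
```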

a-dub · on June 23, 2022

also maybe counting any sort of retry or fallback as a trackable and alertable failure.

auto/silent fallbacks seem like clever ways of avoiding beeps and remaining resilient against failures in supporting systems, but in practice they always tend to just cover up real issues until it's too late.

i think the ideal is to have a nice easily included retry library that includes reporting/alerting, configurable backoff schemes and logging that can be used everywhere on things that can have transient failures.
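a bare-bones sketch of that kind of helper, with a reporting hook and exponential backoff. every name here is hypothetical, and `on_retry` stands in for whatever metrics/alerting client is actually in use.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_reporting(
    op: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    on_retry: Optional[Callable[[int, Exception], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op, retrying on exceptions with exponential backoff and reporting each retry."""
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: let the failure surface
            if on_retry is not None:
                on_retry(attempt, exc)  # every retry is a trackable event
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

the point of `on_retry` is that a retry which eventually succeeds still gets counted somewhere a human can see it.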

henry700 · on June 23, 2022

Yeah, this. If they just had a custom counter for "ILS datacenter fetch" retries after receiving a 404 from the local cache and that spiked to, say, 50% of the "every scan counter" value, something would already be seriously wrong: How does the local Target store cache have less than 50% of the needed data in it??
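A toy version of that check, assuming two plain counters collected over the same time window (the names and the 50% default are illustrative, taken from the "say, 50%" above):

```python
def fallback_alert(scan_count: int, fallback_count: int, threshold: float = 0.5) -> bool:
    """True when fallbacks to the datacenter reach the given share of all scans."""
    if scan_count == 0:
        return False  # nothing scanned, nothing to judge
    return fallback_count / scan_count >= threshold
```

In practice the threshold would probably sit far below 50%, since a healthy local cache should almost never miss.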

jaywalk · on June 23, 2022

I think a better solution would be to use the 404 code but include a body with the detailed error. That way the response to an invalid URL looks different from an actual item not being found.
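Roughly like this on the client side, assuming the API's own 404s carry a JSON body with a made-up `error` field, while a 404 from a proxy or a wrong host has no such body:

```python
import json
from typing import Optional


def classify_404(body_text: Optional[str]) -> str:
    """Tell an application-level 'item not found' apart from any other 404."""
    try:
        payload = json.loads(body_text)
    except (TypeError, ValueError):
        # Missing, empty, or non-JSON body: not produced by the API itself.
        return "unknown_endpoint"
    if isinstance(payload, dict) and payload.get("error") == "item_not_found":
        return "item_not_found"
    return "unknown_endpoint"
```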

hn_throwaway_99 · on June 23, 2022

But many clients don't even look at the body if the response is a 404.

jaywalk · on June 23, 2022

True, but we're talking about fully internal systems here, where one party is in control of both the client and server.

joeframbach · on June 23, 2022

The proper response should have been 400 Bad Request, not 404 Not Found, because the problem was that the client's request contained bad request data, not that the item wasn't found.

layer8 · on June 23, 2022

If you place a standard web server in default configuration (serving an empty directory tree of static files) on that host, it will also return 404 for any GET request. By that I mean, the client has in principle to expect that 404 means “I asked the wrong server” or “the server is misconfigured” or “my endpoint is gone”.

hn_throwaway_99 · on June 23, 2022

This is one (of many) reasons why I prefer GraphQL. The error codes are clearly defined and contained in the body of a 200 response.