Hacker News
Write better error messages (wix-ux.com)
566 points by noch on Oct 19, 2022 | 253 comments



I don't disagree with any of the points, but they missed a big one. If at all possible, include an error code on each of your errors that is unique within your application (or, better, an attempt at a globally unique one), e.g. YCOM-HN-9021. When you provide a clearly googleable string you can help your users independently resolve the issue, and you can also set up Google Alerts on the string: if you roll out a new feature that took 3 months to develop and a week later Google tells you that YCOM-HN-9021 is up 9000%, you probably broke something. If at all possible, make yourself open to client communication, but most users won't reach out about an error. Users have very low trust in customer care in the modern world (and it is, honestly, often more trouble than it's worth) and are more likely to turn to Reddit or technical forums for a solution. It is extremely advantageous to try to track these users.


If you do this, please make it alphanumeric, like the above comment's example and make sure your prefix is unique. Modern search engines, especially Google, are very bad at finding literal strings if searched without quotes (many users don't know to do that) and have all sorts of gotchas that make finding errors near impossible.

Some bad examples (both from Microsoft):

- "Code -4" (putting a minus in front of a word makes Google exclude that word from results)

- "0x00071153" (search engines love to omit the 0x and give you a bunch of phone numbers instead)
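One way to keep codes in the searchable shape described above is to validate them when they are defined. This is only a sketch: the `PREFIX-AREA-NNNN` convention, the `ErrorCode` class, and the `LOGIN_FAILED` entry are assumptions made for illustration, modeled on the YCOM-HN-9021 example.

```python
import re
from dataclasses import dataclass

# Hypothetical convention: PREFIX-AREA-NNNN, e.g. "YCOM-HN-9021".
# Letters come first and hyphens only sit between alphanumerics, so the code
# never starts with "-" and never looks like a bare number or a hex literal.
CODE_PATTERN = re.compile(r"^[A-Z]{3,8}-[A-Z]{2,8}-\d{4}$")


@dataclass(frozen=True)
class ErrorCode:
    code: str
    summary: str

    def __post_init__(self) -> None:
        # Reject codes such as "-4" or "0x00071153" at definition time.
        if not CODE_PATTERN.match(self.code):
            raise ValueError(f"error code {self.code!r} does not match PREFIX-AREA-NNNN")

    def render(self, detail: str) -> str:
        # The readable message stays; the code is appended as a search handle.
        return f"{self.summary}: {detail} [{self.code}]"


LOGIN_FAILED = ErrorCode("YCOM-HN-9021", "Could not log in")
print(LOGIN_FAILED.render("the session expired, please sign in again"))
```

Because the check runs when the code is defined, a badly shaped code fails during development rather than in front of a user.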


This one helped me recently to access my UPS MyChoice account. Got an error message on login attempt, but really didn't give enough of a darn to waste my time calling their support (as was recommended in the message). Reddit was full of reports with a tip that it was related to forced password reset where the old password doesn't meet new complexity requirements, and I was able to download the mobile app and reset my password.

UPS owes that reddit user a beer for helping out at least 30 people (by up votes) at zero cost to themselves.


IBM has done this for decades. For many products, especially on the mainframe, every message has a unique identifier, typically with a prefix for the module or subsystem that generated it, a message number, and an action or severity code, such as “I” for information or “E” for error.

A series of Messages and Codes books is supposed to list every message, each with an explanation and suggested user, administrator, or programmer response.

This was more important when messages were practically limited to a single line on a CRT or teletypewriter, and customers had only the printed or microfiche documentation, not online help or websites, but it’s still valuable today.


Likewise Digital: all VMS error codes are "SUB-S-NAME", so a permissions error when dealing with a file would be something like "%RMS-E-NOPERM" (I think that's "rotating mass storage" rather than "Richard M. Stallman"). I think that even harks back to various OSes on the PDP-11, but cannot say for sure.


I'm always amused when I see the HN community get excited about something that the greybeards figured out 40 or 50 years ago. Turns out grandma knew what she was doing after all. If only there was a better way for the wisdom of sages to be transmitted to this generation without eyerolling over IBM and DEC.

Now get off my lawn!


Error -41: sit by a lake

https://xkcd.com/1024/


Another one is to generate a globally unique id for that failure. In a web application the user can share the id with you and you can look in the log and see the associated error.


That seems like a good idea, but the problem is that any ID long enough to be sensibly globally unique is likely to be a long string of random-ish characters, and unless the end user has at least a screen cap – highly unlikely – there's about a 0.0001% chance you'll get the code correctly.

On the flip side, if you're willing to give up globally unique for "unique enough within a reasonable time frame" then you can go with just a few characters or even short words.


We did this. Basically showing the semi-unique request-id / correlation-id to the user, or including it in the response headers of our APIs. So when people contacted us with a screenshot or a dump of a request that failed, it was easy to find the exact one in our logs.


Yeah, just like there is a high fraction of people who can't see 3D movies there are a lot of people who can't cut and paste.


Just use both. A globally unique but static "error code" to google plus a serial number of that distinct instance of the failure.
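A minimal sketch of "use both", assuming a hypothetical `AppError` exception class: the code is static and searchable, while the incident ID is generated for each occurrence and is short enough to read from a screenshot. The 5-byte length is an arbitrary choice for illustration, not a recommendation.

```python
import base64
import logging
import secrets

logger = logging.getLogger("app")


def new_incident_id(num_bytes: int = 5) -> str:
    """Return a short random ID; 5 bytes encode to exactly 8 Base32 characters."""
    return base64.b32encode(secrets.token_bytes(num_bytes)).decode("ascii")


class AppError(Exception):
    """Pairs a static, searchable code with a per-occurrence incident ID."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code                      # same for every occurrence
        self.incident_id = new_incident_id()  # different for every occurrence

    def user_message(self) -> str:
        return f"{self.args[0]} (code {self.code}, incident {self.incident_id})"


try:
    raise AppError("YCOM-HN-9021", "Could not save your changes")
except AppError as err:
    # Log the incident ID so support can find this exact occurrence later.
    logger.error("incident=%s code=%s %s", err.incident_id, err.code, err)
    print(err.user_message())
```

The trade-off raised earlier in the thread still applies: a short ID is only "unique enough" within some time window, so the log search should also be bounded by time.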


This is really important if you have multilingual software. It lets users find help in all the languages they understand, not just their OS language.


I'll add a few for developer-oriented messages.

* Say what the program was trying to do.

* Make the message unique and searchable.

* Make it detailed.

* FFS, include the filename or whatever else the program is having trouble with.

* If possible, include the source code location.

* If possible, include useful contextual information.

* Quote strings. Once in a while, some unexpected whitespace sneaks in somewhere and this can be hard to figure out.

E.g., don't just abort with "Open failed: NOT_FOUND". Abort with "job.c:2105 Failed to open job description file '/var/spool/jobs/125.json' when processing job #5 for user 'alice': NOT_FOUND".

This way I don't have to strace the damn thing to try and figure out what it's looking for, and I know which user it was for, so I don't have to dig around and try and figure out which entry in the database might contain the wrong information.

Also, context-free, generic error messages are awful. A large enough codebase may be impossible to search for some very common keywords.

If possible, googleable error codes are great to have, but they shouldn't replace the error message. It's ideal if you can search the source code and instantly find where the error message originates.
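The job-file example above could look roughly like this in Python. `JobError` and `load_job_description` are hypothetical names; the point is only that the message says what was being attempted, for which job and user, and quotes the strings involved.

```python
class JobError(Exception):
    """Hypothetical application-level error for job processing."""


def load_job_description(path: str, job_id: int, user: str) -> str:
    """Read a job description file, adding context if that fails."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError as exc:
        # !r quotes the strings, so stray whitespace in a path becomes visible.
        raise JobError(
            f"Failed to open job description file {path!r} "
            f"when processing job #{job_id} for user {user!r}: {exc.strerror}"
        ) from exc


try:
    # Note the trailing space in the path: quoting makes it show up in the message.
    load_job_description("/var/spool/jobs/125.json ", job_id=5, user="alice")
except JobError as err:
    print(err)
```

Using `raise ... from exc` keeps the original `OSError` attached, so the traceback still shows the source location of the failure.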


Yup, all of these. Sometimes I look "around" the problem, like, "I found THIS directory but the file 'z.txt' was not in it!" or "Not only could I not find 'z.txt' I could not find THIS directory it was supposed to be in." Check to see that it is really a file, not a directory. "I found 'z.txt' in THIS directory but it was zero bytes in length!"

In terms of "fail early," my larger programs have a section called Pre-Flight Checklist, which looks for files (and that they are files), databases, that the databases have the expected tables and the correct columns, and so on. Are the files sufficiently recent? More or less the expected length? Because this is ETL stuff, it's usually okay to push this stuff up as early as I can.
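A pre-flight checklist along these lines might be sketched as below. It assumes SQLite and a hypothetical `preflight` function; the commenter does not say which database or language they use, so treat every name here as illustrative.

```python
import os
import sqlite3
import time


def preflight(files: list[str], db_path: str,
              expected_columns: dict[str, set[str]],
              max_age_seconds: float = 24 * 3600) -> list[str]:
    """Return a list of problems; an empty list means it is safe to start."""
    problems = []
    for path in files:
        if not os.path.exists(path):
            problems.append(f"input {path!r} does not exist")
        elif not os.path.isfile(path):
            problems.append(f"input {path!r} exists but is not a regular file")
        elif os.path.getsize(path) == 0:
            problems.append(f"input {path!r} is zero bytes long")
        elif time.time() - os.path.getmtime(path) > max_age_seconds:
            problems.append(f"input {path!r} is older than {max_age_seconds:.0f}s")

    if not os.path.isfile(db_path):
        problems.append(f"database {db_path!r} does not exist")
        return problems

    conn = sqlite3.connect(db_path)
    try:
        for table, columns in expected_columns.items():
            # Table names come from our own config here, never from user input.
            # PRAGMA table_info yields one row per column; the name is field 1.
            found = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
            if not found:
                problems.append(f"table {table!r} is missing from {db_path!r}")
            elif columns - found:
                problems.append(f"table {table!r} lacks columns {sorted(columns - found)}")
    finally:
        conn.close()
    return problems
```

Returning every problem at once, instead of stopping at the first, means one run is enough to see everything that needs fixing before the long job starts.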


For SaaS products, this, plus use structured logging so you don't have to grep-parse log messages when searching your log collectors.

I.e., all the meta/log context in a hashmap alongside the error message.
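With Python's standard `logging` module, the "hashmap alongside the message" idea can be sketched as a small JSON formatter. The `context` attribute name and the `JsonFormatter` class are assumptions for this example, not part of the standard library.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # "context" is a hypothetical attribute, supplied via `extra=`.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("jobs")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error(
    "failed to open job description",
    extra={"context": {"path": "/var/spool/jobs/125.json", "job": 5, "user": "alice"}},
)
```

A log collector can then filter on `context.user` or `context.job` directly instead of pattern-matching the message text.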


A couple+ years ago my then employer required I take (what amounted to) Security Training 101 for Software Developers. I believe one of the client orgs expected everyone to go through the program.

That said, pretty much everything you're suggesting was considered a bad idea (for security). Mainly because the more details you give away, the more a hacker can understand about the underlying system. The more they probe and possibly break things, the more you're showing your cards.

It was then the bland cryptic error msg made perfect sense to me.


Well, everything depends on context of course.

I'm talking here mostly of user-facing local applications -- like what would be in your mail client's logs, or the logs of a corporate service, where the logs are there for the admin's/dev's use.

Of course if you're sending feedback to a potential attacker things change considerably.


I understand. But I'm going to assume the rule would be: do X, no exceptions. As you know, doing sec means living with a healthy amount of paranoia. Imagine giving an exception and being wrong.

Sec = better safe than sorry.


I'll also add to make them easy to copy to clipboard in the case of a GUI-based program.

It's easier to search and store in an incident management system.


For developers, maybe. But they need logs, not error messages.

For users, the message needs to be brief: is it fatal (restart), temporary (try again), or limited to this part (do something else)?

Adding words is unhelpful. More information means less communication.


Yes to all - and also don't include boilerplate text, or at least limit it to the minimum polite for your audience.

So if you must show a stack dump:

- don't put in lots of whitespace - it might look pretty but it makes it harder to read/parse.

- if you're giving the error file:line, don't bother showing the source code. If the source is meaningful to the reader, they've probably got access to the code, or are using an IDE.


One thing I've recently started doing when a file related error happens is to retry the command with strace and see which file the program is trying to access


Also, make sure that sensitive information like users' passwords, emails, credit card numbers, etc. is filtered out of the logs and not sent to your servers.
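One possible shape for such filtering is a `logging.Filter` that scrubs each record before any handler sees it. The two patterns below are deliberately simple and are assumptions for this sketch; a real deployment would need a reviewed and much broader list (card numbers, tokens in URLs, and so on).

```python
import logging
import re

# Illustrative patterns only, not a complete list of sensitive data.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SECRET_FIELD = re.compile(r"(?i)\b(password|passwd|token)=\S+")


def redact(text: str) -> str:
    """Replace key=value secrets and email addresses with placeholders."""
    text = SECRET_FIELD.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)
    return EMAIL.sub("[EMAIL]", text)


class RedactingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so values passed as %-style args are scrubbed too.
        record.msg = redact(record.getMessage())
        record.args = None
        return True  # keep the record, now without the sensitive parts


log = logging.getLogger("auth")
log.addFilter(RedactingFilter())
log.warning("login failed for %s password=%s", "phil@example.com", "not-a-real-password")
```

Pattern-based scrubbing is a safety net. The more reliable fix is to never pass credentials to the logger in the first place.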


At a previous job, writing unambiguous error messages was discouraged. Everything just had to be "Oops! Something went wrong"

The reasoning was that "users can't do anything with information we tell them anyways", despite the overwhelming number of help desk tickets we'd get from "Oops!" appearing in a million different scenarios with no clear way for us to tell what error actually caused the message to appear.

Users naturally report the messages that they see because they're helping us to see the problem. I didn't get why that was such a hard concept to understand


That seems like peak uselessness. Even "Error code 0x00ad4829" is a more useful message, because even if it's useless to the user it is useful to somebody.


There is some logic, the "you don't want to expose your internals". Really useful messages might contain a lot of details about the tech stack you use (giving a nice hint into which CVEs to try).

That said, this is an easily solved problem. The best solution is to aggressively log errors AND prioritize having dev teams push that error count to 0. If an error happens, it's a bug.

The next way to solve it is simply a report button. Let the users click a "I'm mad at you for not working" button and embed something like a session ID that allows internal queries into what went wrong.

Error codes are a terrible solution, but perhaps an OK option if this is not hosted software. That said, a more user friendly approach would be a QR code with all the relevant details embedded.


> Really useful messages might contain a lot of details about the tech stack you use (giving a nice hint into which CVEs to try).

Nope. Useful messages contain details about what your software does. Anything about your tech stack is redundant and can be removed.

> The best solution is to aggressively log errors AND prioritize having dev teams push that error count to 0.

Many errors can only be replicated by talking to users. And in the cases where your dev team is not capable enough to remove all errors, you will still want to provide customer support and workarounds.

> The next way to solve it is simply a report button.

A report button is good. But neither session ID nor any data that you can reasonably add to your logs will be enough to let dev know what went wrong. Besides, your report button will have errors too.

And anyway, everything you said applies exclusively to people who create web applications. Many other types of application exist, and everybody writing them is better off not following any of your recommendations.


Why are error codes a terrible solution? I'd rather have an error "bad request f12793b2" than a "bad request". Obviously I'd prefer "bad request, 'expiresAt cannot be after 2022-12-19'. Code f12793b2".

Having a unique ID to be able to search in documentation or even source code is, IMO, preferable. It's still rather technical and helps only those who can search such docs, but at least it gives something unique to google/search for.


> The best solution is to aggressively log errors

Until one day you find some random dev is logging failed authentication attempts and including the email and password in the logs…

(and the most amusing part of that incident was tracing down the offender by finding the earliest of those particular log lines, and getting his real email address and password out of them… “Hey Phil, what’s ‘Dragons87!’?” “Ummm, what? That’s, errr, my gsuite password. How did you know?”)


This seems to be the approach that Android takes. If you try to connect to a WiFi network and it fails, it just gives up. It won't tell you why it failed. This makes it very frustrating to figure out what's wrong. Maybe I wouldn't understand the error message, but at least it would provide a starting place for me to look up more information or ask for help from someone knowledgeable.


> The reasoning was that "users can't do anything with information we tell them anyways",

I mean, I feel like the focus of the OP was on giving them something they could do something with. Like the information that their information was not lost; and the recommendation to change X or try again in Y way; and the fallthrough to contact customer support with a quick link.

The OP was definitely not recommending giving more specific technical info without thinking about what the user could do with it, but instead specifically thinking about what the user could do or would want to know (about their data/account, not about your under the hood services), and giving info to that end.


I have only worked at one place that wanted informative error messages.

All the others wanted to hide the reason, with justifications ranging from "if we know the reason and tell the user, we seem incompetent" or "then hackers will know which API call isn't working right" (apparently the network console in Chrome is beyond hackers) to wanting customers to be dependent as they paid for support.


People who don't know anything about computer security use it as a bludgeon to not do the thing that they didn't want to do anyway.


The epitome of uselessness: making an error message so "user-friendly" that it doesn't help anyone.

At least a "Details" button to unmask the technical details would be useful in some way, while hiding the "ugliness" to the end-user.


The error the user sees and the one you log shouldn't be the same: you can still log complete information about an error, while the user sees only "Oops, something went wrong".


As long as you are logging the error with the context somewhere that's fine. You could always include a timestamp or request ID with the user message to not give away information, but be able to easily search your logs for the occurrence.
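The split between what is logged and what is shown could be sketched like this. The `handle` wrapper and the 12-character request ID are assumptions for illustration; a real web framework would supply its own hooks and ID scheme.

```python
import logging
import uuid

logger = logging.getLogger("web")


def handle(request_fn):
    """Run a request handler; on failure, log everything but show little."""
    request_id = uuid.uuid4().hex[:12]  # shortened for readability (an assumption)
    try:
        return 200, request_fn()
    except Exception:
        # logger.exception records the message plus the full traceback.
        logger.exception("request %s failed", request_id)
        # The user sees no internals, only a reference they can quote to support.
        return 500, f"Something went wrong. Reference: {request_id}"


print(handle(lambda: "ok"))
print(handle(lambda: 1 / 0))
```

Searching the logs for the reference then leads straight to the traceback for that one request.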


20 years ago I was working on Acrobat at Adobe. I was mostly the "Windows guy" but also worked and tested on the Mac.

When I tried to install Acrobat on my Mac, I got this message:

"Your hard disk is too small"

My what is too small?!

Later, on Windows I got this unexpected popup:

"You are not here"

WTF?

I searched the code for that string and found it in a function named "CantHappen()". This function was called in numerous places where the programmer thought there was no possible way for the code to get to that place. But of course CantHappen() did happen.

As I looked through the code I found many other messages that were bizarre and incomprehensible and sometimes downright offensive.

So I started a project to go through all our messages and make them more clear and informative - and even better, when possible to not have the message at all but just take care of the situation.

The underlying cause of these bad messages was twofold:

1. Programmers never got raises for writing great error messages or finding ways to avoid them in the first place. We were just rated on how much work we got done.

2. We did have a product designer who was supposed to specify all user-facing messages. But the designer mainly considered the "happy path" and didn't think about edge cases. It was left to developers working under time pressure to handle those.


> Later, on Windows I got this unexpected popup:
>
> "You are not here"

The absolute best I had in a Microsoft product was this (paraphrasing): "An error happened because your computer may be turned off". I still have a screenshot of that somewhere. What it meant was that a hypothetical computer I may be trying to connect to (which I wasn't, it was all local) was off, but that wasn't the case. This was seriously WTF.

The second most beautiful one from another Microsoft product was whatever software generating a password and asking me, in a pop-up window, to write it down. The problem was the password was something like:

    9mZOvy9E(4)?6b(w(<$KcTU%>9T6cz0Z4YxgQ-<tw035X6S.dLE0[2n0"42`/S=S1{q5{)61s190':&6UHT.4hZXjO6b%l#X7v]~4tIT2Y0._ebFH,>2:G>%*P]7n4"

I probably also still have a screenshot of that somewhere.

Haven't used Microsoft stuff in two decades so it was a long time ago. But it's still seriously WTF.


The best error message on Windows is “The operation completed successfully”: https://www.google.com/search?q=the+operation+completed+succ...

…which is the text for result code zero, which is used to mean “no error”.


Long time ago I ran into a "catastrophic failure" in Word, which sounds quite serious.


“Guru Meditation Error”


There's also that hilarious Windows Phone error that prompts users to insert their Windows installation disc and restart their "computer".


The paradox of CantHappen is that if the programmer truly thought it can't happen then there would be no need for it in the first place. The only reason to include it is because of a fear that it may in fact happen.

Rust, funnily enough, has unreachable!() for that case, but it also has unreachable_unchecked() for actually unreachable code. The latter is undefined behavior if it is ever reached, and exists to help the optimizer.


I’m guilty of writing a “can’t happen” branch (as I’m sure many of us are).

I stripped it out before production after verifying it couldn’t actually happen but it was something like an artefact of my thinking process while writing the code.

It feels like a kind of assertion of underlying assumptions, and I know enough to know I’m fallible.

I’m always careful to make the error message something reasonable if it ever did actually come up in a tech demo or something though. Anything else is tempting the Fates.


I don’t understand why the message would be ‘Can’t happen’ in the first place.

I’d always make it something like “If you see this, something entirely unexpected went wrong.”

Showing and triggering the error is helpful in itself since it’ll generate a trace and all the attendant stuff.


"lp0 on fire" and "PC LOAD LETTER" are two good ones, too.


What does unreachable!() do actually? I had no idea that was a thing.


It terminates the program with panic!

https://doc.rust-lang.org/std/macro.unreachable.html


So is the usual flow to throw these in places where code shouldn't execute but then create tests to try and trip it up to see if that is truly the case? I would hate to be running a release build with this, or does the compiler do something different depending on build type?


> I would hate to be running a release build with this

The usual argument is that the program would be in an invalid state if the condition was reached, so the only option is to crash. If it turns out it's a valid state, then the programmer can handle it in a branch. I don't think tests would capture this because they would be operating under the same assumption that such states cannot exist. Maybe fuzz testing could surface an issue like this.


Rust has a few of those, they all panic but with different default messages: panic!(), todo!(), unimplemented!(), and unreachable!()


ESLint complains if there is no `default` case on `switch` statements. Sometimes reaching it is not possible. I have been trained to add it regardless. While developing, I add some message like "not possible" and sure enough it hits once in a while in dev due to something I didn't consider.
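The same "impossible default branch" habit, sketched in Python with a message that is useful if it ever fires. `State`, `label`, and the `unreachable` helper are hypothetical; Python 3.11 also ships `typing.assert_never`, which serves a similar purpose together with a type checker.

```python
from enum import Enum
from typing import NoReturn


class State(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"


def unreachable(value: object) -> NoReturn:
    # Hypothetical helper: name the "impossible" value so a report is useful.
    raise AssertionError(f"unhandled case {value!r}; this is a bug, please report it")


def label(state: State) -> str:
    if state is State.QUEUED:
        return "waiting"
    if state is State.RUNNING:
        return "in progress"
    if state is State.DONE:
        return "finished"
    # Only reached if a new State is added without updating this function,
    # or if a caller passes something that is not a State at all.
    unreachable(state)


print(label(State.RUNNING))
```

Including the offending value in the message is what separates this from a bare "can't happen": whoever sees it knows which case was missed.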


Probably just me, but I am less concerned with how good my error messages are, and more concerned with trying very very hard to make the errors happen closer to the cause of the problem, rather than further away.

"Fail early, fail hard"

i.e. if I can make the error message happen near the beginning of a process, I can get away with making it a hard error.

Hard errors in the middle of a multi-hour operation tend to annoy people.


This is an attitude I really try to build up in junior devs. Soooo many people seem to default to writing code like, "if input is null return null" (when input should never be null) or "if valueThatShouldBePositive < 0 silently skip the code that was going to use the value". If the app detects that something is in an invalid state _I want it to break_. The worst problems to debug are the ones where you have to work backwards through miles of strange behavior and corrupted data to find the root cause, because the program tried valiantly to soldier on long after it had been shot through the heart with bad data.

I guess this is because no one really teaches error handling. I assume a lot of students end up with a mindset of just make the errors go away instead of, deal with the errors effectively.


Agreed; I've often wondered if this is a result of early CS classes usually expecting students to handle weird/bad inputs. It's only natural for a programmer to want to write a program that gracefully handles all reasonably bad inputs, like nulls. So we're taught early on to write defensive code that handles those. And that's fine when you're writing short, academic programs. But when the complexity goes up by a few orders of magnitude trying to gracefully handle that null value 10 levels deep in some parsing logic maybe isn't the best thing to do. Old habits die hard, however.


Yeah, this is a great point. Both overly defensive programming and (my personal least favorite) overly-commented code are instilled in students at a very early point in their careers by irresponsible teachers trying to find something to grade students on (Didn't handle negative values? 5 points off! Didn't leave a comment on every line? 1 point off per line!)


I think this is a symptom of using weakly typed languages as well. If your argument types are declared to be options/eithers, then you need to handle the empty case, but usually it's easier and better to just move that optional handling further up the callstack or type system.

A lot of `if (input == null)` checks are because you're just not sure whether the argument being passed in will have a value, and it's too much work for your small feature PR to refactor the whole codebase to resolve it.

Use typescript/python-with-mypy/haskell/rust/whatever and this problem mostly disappears.


> A lot of `if (input == null)` checks are because you're just not sure whether the argument being passed in will have a value, and it's too much work for your small feature PR to refactor the whole codebase to resolve it.

Null checks are totally fine, but it should be clear whether or not null is a valid input to the method. If the answer is 'no' then you should throw ArgumentNullException (or whatever's appropriate for the language), not silently ignore the bad input.
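In Python terms, the difference between silently skipping bad input and rejecting it might look like this. `apply_discount` is a made-up function, and `TypeError`/`ValueError` stand in for `ArgumentNullException` and its relatives.

```python
def apply_discount(price: float, percent: float) -> float:
    """Return the discounted price, refusing inputs that can never be valid."""
    # None is never a valid input here, so raise instead of returning None.
    if price is None or percent is None:
        raise TypeError(
            f"price and percent are required, got price={price!r}, percent={percent!r}"
        )
    if price < 0:
        raise ValueError(f"price must be non-negative, got {price!r}")
    if not 0 <= percent <= 100:
        raise ValueError(f"percent must be between 0 and 100, got {percent!r}")
    return price * (1 - percent / 100)


print(apply_discount(200, 25))
```

The failure now happens at the call that passed the bad value, with that value in the message, rather than several layers later when a `None` or a negative total surfaces somewhere unrelated.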


When I was a jr dev, getting exceptions was synonymous with “me messing something up”. Null exceptions were especially annoying, so the naive approach is to check for nulls and avoid the code that will cause the exception. And it “works”! You don’t get exceptions and your code keeps running. It’s only when you need to fix difficult bugs by going through logs that you understand the value of having the right exception with the right message. And you learn to love them and start caring about them.


Exactly. Software must crash as soon as possible and include some context information which is necessary to further debug the problem.


After it fails fast (thank you!), we also want to fix fast. So we need info.


With the important corollary that you need to check for the erroneous condition both early and late.

Otherwise people start e.g. checking in the frontend and not enforcing it in the backend in the worst case, or you get TOCTOU bugs in the best case.


That's not really a respectful practice. Error messages should be clear and actionable.

Users don't care if you consider an error soft or hard.


I think the point is that the higher up you fail, the harder it is to identify why you errored in order to give the user clear and actionable feedback.


That's possibly indicating a bad UI / information architecture if you are unable to tell that.


When you have nested exceptions being caught by other exceptions, how do you determine what level is correct to show the user? Especially when it's a service class or something that is used by a lot of calling code.

It's implied that it would be the upper top-most exception handlers in that code path but those are gonna be more generic in their messages, and anything more detailed has to be manually wrapped to add useful description (that's not some internal developer exception).

Error codes may be the least bad solution to fall back on.


It's hard to give a generic answer to this. I just see way too many bad error messages that could be solved with a little more thought and copywriting skills.

Error messages are part of the user experience and they should not be an afterthought.

If errors are nested, list them all. Give a generic feedback then, and also provide a technical explanation that would help debugging. Most importantly, we should make the user feel safe and in control as much as possible.


I actually do like "collecting" the errors when possible, and having them return in the API response (for example). Instead of the common pattern where there's just singular "error:" in the top-level json.

Works great for things like validation.
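A small sketch of collecting validation errors and returning them together. The field names, the 12-character rule, and the 422/201 status codes are assumptions chosen for the example.

```python
def validate_signup(payload: dict) -> list[dict]:
    """Return every problem found, not just the first one."""
    errors = []
    email = payload.get("email", "")
    if "@" not in email:
        errors.append({"field": "email", "message": f"{email!r} is not an email address"})
    password = payload.get("password", "")
    if len(password) < 12:
        errors.append({"field": "password", "message": "must be at least 12 characters"})
    return errors


def signup_response(payload: dict) -> tuple[int, dict]:
    """Build a (status, body) pair with a list of errors instead of a single one."""
    errors = validate_signup(payload)
    if errors:
        return 422, {"errors": errors}
    return 201, {"status": "created"}


print(signup_response({"email": "alice", "password": "short"}))
```

The client can then show all field-level problems in one pass instead of making the user fix and resubmit once per error.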


Fail Fast means your logging infrastructure is going to report to you more quickly to get the problem fixed.

As opposed to 6 months down the road, when someone finally notices an uptick in complaints by customers and now the potential problem site is literally the entire software stack.

Fail fast is how stable software is made; the question is whether or not you think customers appreciate stable software.


I would, if i had any evidence at all that they would be read and acted on. I’m convinced even seemingly competent people are just rendered contextually blind by the appearance of any error at all.

In the past month, i’ve had about a dozen interactions like this:

  developer: your service crashed, here’s a screenshot of the last 5 lines of the crash

  me: do you see where the final text you just pasted is “RuntimeError: Did not find ENVVAR, ensure this is set to the proper value (see <internal wiki link>) and then restart this service”

  developer: yeah?

  me: well, did you do that thing?

  developer: what thing?

  me: <headdesk>

And this is at work, where the developer in question is intimately acquainted with the context and purpose of the project.


The goal of writing better error messages isn't to help the people who never read error messages, it's to help the people who do and who you never have to hear from.


The trick that I've found is that each error message needs to be unique... not just the stack trace, but the actual wording of the message leading up to that.

Get a screenshot or the exact verbatim of it, and you can identify exactly where in the code it originated.

User reports are unreliable, but when I can pinpoint where the message originated from, it massively cuts down on the troubleshooting time.


About that, the number of developers that can’t read, or even understand the value of, a stack trace is also astonishing.

If only I had a penny every time someone sent me a “log of the error”, that only contains the final line with the unhelpful message saying nothing but KeyError.


At prior work we removed stack traces from the default error output because it was thought to "scare" too many users.

Then for years almost without fail when an error was pasted into a GH issue it would include the big "If submitting a bug report, please include the full stack trace at /var/log/stacktrace.out" message--without the stacktrace. I added some whitespace around it and all caps to it and still nobody read it.


Forget stack traces.

I've met multiple "web developers" (actually working on the backend or "full-stack", building API servers and whatnot) who came complaining about this or that server being "unreachable" and could I check it's up / whether the firewall allows them through. Only to find they were getting HTTP 404 errors or the like. Which were explicit in the errors they'd show me.


A useful thing here is not just to include a unique error code for the type of error (usually numeric), but also to generate some kind of short Base32 or similar hash and print that right next to the error message while logging it to your normal back end. Then whether people send you a screen shot, copy/paste, whatever, you can easily search the logs to find the exact event that occurred.


Better still: add a unique prefix to the error code, so it's googlable.

The Typescript team does this with compilation errors, like `TS12345: frobulating types cannot be transmuted`.


Yes, that type of thing is pretty useful for linters. These error codes act as identifiers if you need to google them and whenever you need to configure the linter the way you like it or for one-off exceptions.


> each error message needs to be unique

Include random numbers. "Error 7743929" is super easy to track down (grep -r 7743929 takes 2 seconds to type), you don't need a NATO alphabet to understand what they're saying on the phone in order to be able to search it correctly, its general purpose is understood internationally, and it won't change between versions (like when you'd encode a file name and line number, for example). When I first figured this out at, idk, 17 years old and mentioned the idea in a game making forum, people called me crazy, but I still use it and don't know of any better system.

Of course, this is alongside an actual error message to help the user help themselves. This is just to trace the line where it originated, which already helps a lot for small software projects like I make.


In RFC 7807 all errors get a unique URI. Message texts might change or be translated into a language you don’t understand.


It turns out translating error messages is controversial.

Users, upon hitting an error, often go check Stack Overflow. If you localize your error messages, you Balkanize the collective wisdom on how to address the error (which will always be larger than your team's ability to troubleshoot errors and offer correctives in your documentation and FAQs).


To be precise, each error type gets a unique URI.

A good way to take advantage of that is to have a central database of all error types, but not many companies bother to do that.
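A rough sketch of how a central registry of error types could feed RFC 7807 "problem details" responses. The member names (`type`, `title`, `status`, `detail`, `instance`) come from the RFC; the registry layout, the `problem` helper, and the example.com URIs are placeholders invented for this example.

```python
import json

# Hypothetical registry: one stable URI per error *type*.
# The example.com URIs are placeholders, not real documentation pages.
ERROR_TYPES = {
    "expires-at-too-late": {
        "type": "https://example.com/problems/expires-at-too-late",
        "title": "Expiry date is too far in the future",
        "status": 400,
    },
}


def problem(kind: str, detail: str, instance: str) -> str:
    """Build an application/problem+json body for one occurrence of an error type."""
    body = dict(ERROR_TYPES[kind])  # type, title, status: fixed per error type
    body["detail"] = detail         # human-readable, specific to this occurrence
    body["instance"] = instance     # URI reference identifying this occurrence
    return json.dumps(body)


print(problem(
    "expires-at-too-late",
    "expiresAt cannot be after 2022-12-19",
    "/requests/f12793b2",
))
```

Since `type` never changes for a given kind of error, it stays searchable even if `title` and `detail` are reworded or translated.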


> have a central database of all error types

do you have any example?


here's ours for pytype (a python type checker): https://google.github.io/pytype/errors.html


I used to lean on line numbers, but those quickly fall out of sync with deployed code and what's currently checked out and available for immediate debugging. I've also switched to using the unique text you mention, as it will always find the place in the code regardless of whether it has been moved.

I wish I had learned that earlier than I had.


I am reminded of the classic non-intuitive survivorship bias example from WWII re: armoring bombers: https://en.wikipedia.org/wiki/Survivorship_bias#In_the_milit...


Or, in the anecdote above, to help yourself, when you are inevitably contacted by the person who never reads error messages.


How many interactions didn't you have, because the developer read the error message, read the Wiki, and ultimately solved the issue themselves ?


I have managed to get a lot of notoriety in my company by just:

1. Paying attention to error messages

2. Reading documentation

3. Looking up stuff I don't fully understand (including googling error messages)

That's it.

Some people don't even read error messages at all. I understand non-technical people doing that, but I've seen far too many engineers doing it. If anything doesn't go exactly as expected, they freeze. I have no idea how a person gets so far in their careers without reading error messages. Actually, I do: those people ask others to figure out stuff for them. That's way too prevalent in enterprise settings. Sure, collaboration is good, but I've seen a lot of instances where there's a massive imbalance – you'll have 10 people pinging a single person to 'unblock' them. They could have spent a couple of minutes trying to figure it out themselves.

I'll move mountains to help someone that comes to me after having done some basic homework to try to fix (or at least triage) an issue. It's very rare though.

It's also amazing how many people will just go ahead without having read a single line of documentation of the thing they are working on. I've even had a developer dive in a Golang codebase without having _ever_ worked on the language. That would have been fine – that's how I learn new languages, just get accustomed, before doing some more formal training and exercises – except that he continued to not read the language documentation before asking a bunch of questions. Needless to say, the questions weren't good.

And number 3... just rubber ducky everything. If you can't explain it, you don't get it. Go read up on the topic. Sometimes I'll find out that I don't fully understand something as I'm writing an email to others.


> I'll move mountains to help someone that comes to me after having done some basic homework to try to fix (or at least triage) an issue. It's very rare though.

These are rare, but they also tend to be the really effective ones. We have a couple of teams who understand the stack, read documentation and read error messages. We generally don't hear of them for months and months, because they are too busy being productive.

But when we hear of them, it's usually time to push boundaries of the infrastructure and the processes. They tried everything and nothing worked and now it's time to make it work.


> I'll move mountains to help someone that comes to me after having done some basic homework to try to fix (or at least triage) an issue. It's very rare though.

This. I actually am OK with people not figuring out even basic stuff. But please, at least try to give the impression that you've put some effort in, instead of just trying to have me do your homework while you browse facebook or whatever.


> except that he continued to not read the language documentation before asking a bunch of questions

Can't really blame people for that too much, most language documentation is utterly unreadable unless you already know exactly what you're doing. And even if you do get it, it's in one eye and out the other. Most people just don't learn very well from reading technical information they don't need to use right away. You might be a happy exception who got to build up your notoriety that way.


This doesn't hold for any popular language. For those, a bazillion tutorials in various formats, books and example projects have been written.

Language documentation is for looking up nitty-gritty details. You go there if you already know what you're looking for. It works for some languages and for some people, but reading it from top to bottom is usually a horrible way to learn a programming language.


It's error message blindness, similar to ad blindness. Even if you make a great banner ad with some very useful information, or the perfect and affordable product for my life I won't see it because I mentally filter out ads because they are junk most of the time.

Some people develop the same with relation to error messages because most of them are not actionable, other than "stuff broke somehow, [gibberish] blabla". Even if your error message is impeccable, it's in the class of things that are noise.

If you come up to me at some busy tourist location, where I'm used to lots of scammers, I won't listen to you even if you are actually a nice person and just want to have a nice chat and we would be compatible friends.

Often it is a good strategy to just ask people. Documentation and comments get out of date very fast. If you are the kind of person who reads everything meticulously and googles around, reads manuals etc. you may be wasting a lot of time. Of course there is a right balance to find. Some people err too much on the side of not thinking themselves and immediately asking for handholding, but overall it's often the right thing to do.

In many cases I found that trying to reason out what was going on was hopeless, because when I eventually gave up and asked someone, it turned out that the solution was unguessable, something like "ah of course, that thing is out of date, do this magic incantation, then this and that, yeah we should update the docs sometime!".

A lot of knowledge is locked up inside people's brains and just spreads around as "rumors" on the grapevine. Is that state of affairs ideal? No. But it's realistic and people are going to adapt by asking first, thinking second.


Asking people is mostly bad habits from a culture too ingrained into the whole 'ask first' thing, and often times it is the people trying to help that are to blame.

I had this recently. Many individuals like to play hero and make sure I don't get stuck because their business is an undocumented mess. Before I even read the thing and tried, they are already trying to give me the answer. When I ask 'is this documented and if so, how would it be discovered easily' their first reaction is 'no' followed by a lengthy explanation which should be in the wiki and easy for newcomers to find.

And it shows when I forget a few days later because my brain never put in the effort to get to the answer and my memory is that of a fruit fly's.


There's also the situation where the program creator likes changing functionality on a whim, and every time you google up your problem, you find a solution for a version of the software that doesn't have the particular menu or whatever that you had the problem with.

(This is a big problem if you've ever had a problem with Android.)


This just means that the error message needs to be more clear. For example, after the error itself, it could give direct advice: “PERFORM THESE STEPS: You must define ENVVAR. Go to <wiki link>. Set ENVVAR to a proper value and restart the service.”

Notice the direct language. It reads like an order. The less direct the message, the higher the chances that the user will not act upon it.
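Something like the following could produce that kind of message. This is only a sketch: `require_env`, the step wording and the wiki URL are placeholders I made up, since the original only shows `<wiki link>`.

```python
import os

# Placeholder; the real link would point at the internal wiki page.
WIKI_URL = "https://wiki.example.invalid/envvar-setup"


def require_env(name: str, environ=None) -> str:
    """Return the variable's value, or fail with numbered, imperative steps."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set.\n"
            "PERFORM THESE STEPS:\n"
            f"  1. Go to {WIKI_URL} to find the proper value.\n"
            f"  2. Set {name} to that value.\n"
            "  3. Restart the service."
        )
    return value
```

Calling `require_env("ENVVAR")` at startup puts the whole instruction in the first error the operator sees.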


>it could give direct advice: “PERFORM THESE STEPS: You must define ENVVAR. Go to <wiki link>. Set ENVVAR to a proper value and restart the service.”

Really, should logs also be documentation now ? Just mindlessly logging the same "advice" over and over again each time the error happen ?


Logs can definitely be a form of documentation.

I write software that is generally run low in the stack, quietly doing some mundane tasks that are business-critical but rarely thought about. If one of our clients has to mess with our software beyond the occasional update, that was a failing. Not all software is like this, but lots of it is -- its value is that no human needs to be involved.

I need to write log messages with the expectation of an audience who doesn't know much about the software -- it's been running uninterrupted for months or years and suddenly something has gone wrong. If the log line doesn't tell the user how to solve their problem, I will end up getting a call.


If it is that simple, then why doesn't the code fix it itself? But no, usually there are 1/2/3 likely causes, but it could also be anything else, and that kind of unexpected error often has no default fix.

No, the best thing is to point to the documentation that covers it, not to print out manpages of docs in error messages.

> I write software that is generally run low in the stack

What stack, how low? Me too... so low that I usually cannot return or even log a "see error code doc at http.." string for various reasons (bandwidth, mem, performance) and only have error codes ;)


In the case at hand, where an environment variable isn't set, how exactly should the code fix itself? Human interaction is necessary, which is the reason the log message should spell out what the human needs to do.

If I'm starting a service and see a pointer in the logs to documentation, that seems like an incredibly broken approach to me. Why would I look at missing or out-of-date documentation that may or may not be at hand when the code that knows the problem is right there and can just tell me? A log message like you're describing might as well say, "Something went wrong, but I don't want to tell you what. Instead check page 43 of the document in the third file cabinet from the left in that room over there on your right. No, your other right."


Similar issues arise with such documentation in error messages. There now has to be a process to make sure that all such information is always accounted for and updated correspondingly when the system changes.

> Something went wrong, but I don't want to tell you what.

is a somewhat disingenuous example. Error logs should tell in exhaustive detail what went wrong. Ops needs that to analyse the situation, and the vendor will have much less trouble reproducing the error. However, suggesting specific fixes could be disastrous. Furthermore, documentation should already be in a form that operations can be expected to work with, even in crisis situations.


I don't want to have to hunt for documentation if it breaks. It may have been 30 years and everything but the binary has been lost, and the vendor is out of business. If in that situation all I get is an error code and a link to documentation that doesn't exist, I'd have to start reverse-engineering. And while doing so I'd definitely be cursing the coder who decided that saving a couple hundred bytes of space in a log file in the event of an "abort the program"-severity event was worth dumping this in my lap.


Running such software is asking for a disaster already. At least documentation should still exist, and operational frameworks like ITIL insist on that. It can happen, but is usually telling of an operational culture that disregards maintenance, counting on being able to kick the can down the road as long as possible.


It will be so much fun when the implementation is refactored and half of these comments are forgotten about and no longer meaningful.


Exactly. At one of my previous workplaces there was a cumulative effect of misattributed error messages so the actions to perform were often of no help.

Not even to mention the fact that new or changed error messages caused a landslide in costs in translations to various languages. I guess this product has no localization? At that time, when I was working at such a product that had it, we had to go through a deliberate process to describe why we want to change it, what the impact is, etc. Tell me you want 100 new messages and you will be stuck in meetings for the next month.

In their case, though, it seems they at least have the support in management for it. I hope it turns out better for them than it did for me.


I had an error message a few months ago that instructed me to reinstall the AWS CLI, I filed a ticket when that didn't work, and the team was annoyed with me because obviously the real problem was a Python configuration warning with no suggested action 10 lines up.


It depends who, what, and when the error is about. Failures are generally a bathtub curve. You have a high rate at start (usually configuration issues), some fairly fixed rate during operation, and then more at end of lifecycle (exhaustion, service hiccups on scale-in).

If it's in the early lifecycle, absolutely, because it's most actionable. X is set wrong, Y can't be reached, etc, guide whoever is operating the system how to fix it.

If it's mid cycle, it's often post-hoc, but context is worth its weight in gold. Less about telling the operator how to fix and more about why it broke, to avoid in the future.

End of cycle, whatever.


Yes!

There are people who don't read formal documentation but do read logs, after all.

If the advice is the same over and over again, then yes, give the advice over and over again. I wouldn't want to assume that someone has read every line of the logs, or has started to read top-to-bottom, so the advice should always be among the most recent lines in the log, and the only way to ensure that is to give the advice again each time the error happens.


Yes! We have tools to filter what gets saved and compression that handles repeated text very well.

So why not provide docs on how to solve the error along with the error.


Logs actually are a form of documentation. Documentation can provide instructions on how to diagnose and fix problems, and that's what logs do: tell a human being what a problem is and how to fix it.

Remember that often the person reading the logs is not the person who wrote the software. Maybe it's an Ops person at 2AM trying to fix a broken deploy. Maybe it's a developer who joined the company 3 years after the software was written. Maybe the log is passing through an error message from 3 layers deep in the stack. The more literate your logs are, the better.


Errors on initialization, fatal errors, and non-recurrent errors that require human/support intervention should be documentation.


If the error results in the program shutting down, it’s once per fatal interaction.

In other words, yes.


Should logs more clearly let the user know how to fix problems? Yes.


This is fairly common in good error logs.


I think you're correct. To add to this (and I think it's the point that the article was trying to make), errors written in fragmented language or "developer speak" I feel are likely to get glossed over. The “Write it like you’re talking to a friend.” advice the article gives I think is spot on. Making the message more conversational is to invite better understanding and comprehension.

I feel there's a trend when it comes to disseminating messaging like this that we adopt an attitude that our audience "is smart, and should figure the rest out". They may be. But they already have lots to do and plenty to figure out. Any opportunity we, the requestor, can take to lighten their mental load is going to increase the odds that they'll be inclined to take action right away.


I’m not seeing how the message as it already is is any less direct or clear than what you’re saying it should be. It straight up tells you it can’t find the var and what to do about it.

Can you help me understand what isn’t clear about the message as is, or maybe point out the ambiguity to someone who just isn’t seeing it? I want to write better error messages but I share the frustration of the above poster. The message tells you specifically what to do, but you’re coming back saying it’s not clear.


I think the original error is quite clear, under normal circumstances.

Not OP but I've noticed that people often get brain fog when something goes wrong and often need BIG, SHORT, WORDS to shake out of it. Or really anything that can shake them out of the 'idunno' state of mind.

But maybe if something like that became standard it would no longer be a context switcher..


I think you're spot on, and I made a similar comment above.

It's easy to say "they can figure it out". Sure, in a restful state. But the people we're asking to take action already have a lot on their plate. Using plain, conversational language whenever possible with exceedingly clear steps means less mental exertion on the receiver. And since we need their help, anything we can do to make it easier on their end helps us.


These are fascinating responses to me, as with the example given my mind first went to someone for whom English is a second language. I would understand that group having trouble with this message, or at least have an easier time understanding it, if only a little.

For someone who was born speaking English and spoke it their entire lives, the example provided couldn’t possibly be more to the point in my opinion.

Though I agree overall with the general idea and that yes there are some pretty baffling and downright awfully written error messages and log entries that take a minute to grok (I just don’t think the example replied to is one of them).


Conversational errors can also be fatiguing. Often what you want is something short and dry that can be pattern matched. Compilers are pretty good at this because all their errors start the same way.

    Error in file foo/bar.c, line 32, missing semicolon. 
No conversation needed. These can then be complemented with more conversational language on the next line to explain why semicolon is needed. Rust is quite good at this.
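To illustrate the split between a dry first line and an optional explanation, here is a small sketch. The `note:` prefix and the function name are my own choices for illustration, not the output format of any particular compiler.

```python
def format_diagnostic(path: str, line: int, problem: str, explanation: str = "") -> str:
    """First line is fixed-shape and easy to pattern match; prose comes after."""
    first = f"Error in file {path}, line {line}, {problem}."
    if explanation:
        # The conversational part goes on its own line so tools and tired
        # eyes can stop reading after the first one.
        return f"{first}\n  note: {explanation}"
    return first
```

For example, `format_diagnostic("foo/bar.c", 32, "missing semicolon")` gives the one-liner above, and passing an explanation adds a second, friendlier line.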


Then there's the delightful (no, I actually mean the opposite) errors that g++ emitted (back when I last wrote C++ and compiled using g++), where I basically could go "OK, there is an error that was detected at line L, in file F; and I think it may be a type error", so a recompile with clang, so I can actually understand what the error was, so I could fix it.


Some people don't read anything that isn't an all-caps command. They have learned helplessness from seeing too much useless error text in the past.


There's a type of error for which the user can be given detailed step-by-step instructions (permission issues, etc). But to some extent, errors should handle situations the programmer didn't expect. If it is possible to provide detailed step-by-step fixes, then the program should do those steps itself.

Adding a URL might not be a great plan, never know how long an old copy of a program will stick around, might not control that website forever.


I can't tell if this is sarcasm or not, this is obviously highlighting a deeper issue in developer culture.

The example given was clear compared to 90% of other error messages, and saying that it needs to be "more clear" is almost dismissive


Don't blame developer culture, if that error cannot be acted on, attribute to incompetence and not culture.


Some of the errors that Gentoo portage can encounter do exactly this - and they do it with beautiful terminal colors that make it easy to figure out what you need to run, or where to go to figure out which of the three options you need.

The problem can come when there's a wall of "useless" logging/error messages, and the last one or near the last one is the actual important one to look at. You have to explicitly call it out on a clear screen and make it obvious - and even then, people won't always read it.


It more likely means that the developer views the service as OP's responsibility. They'll view an order as something OP needs to do.

The clarity of the error message doesn't really matter if the recipient believes it is intended for somebody else.


The problem is people are not rational… and we try to solve that with software.

Many people just lock up when software doesn’t do what they expect.


Not rational people must be fired from IT.


Generally a pipe dream in my experience.


Lots of people find ways to irrationalize being rational.


This is a context where people are used to seeing errors that they don't know what to do with.

If a web app pops a well-written error, it is much more likely to be acted on than when an unmotivated dev sees some (probably badly formatted) text.

Every time I see an error in terminal with a link to documentation I'm delighted. And surprised.


Once upon a time, I worked at a financial startup (the company is irrelevant). I created a little harness around a static analysis tool. It would fail builds when a library had an outstanding vulnerability scored as HIGH or SEVERE with a patch available. The harness put a friendly error message around it. It ran roughly as follows:

> Hi! If you're reading this message, it's likely because this tool failed your build. To understand why and fix it, please click this link <link_to_internal_doc>. Below is a table that lists the packages you need to update and the version you need to update them to.

The doc had at the very top in big flashing red text with siren anigifs a link to the portion that explained that they needed to update their libraries with very clear copy-paste-into-Dockerfile actionable guidance. The page also explained the broader context, such as the point of the tool and why we were doing this despite having a firewall and so on.

This is where you might be delighted and surprised.

What was perhaps less delightful and surprising were the consequences for me. About 4-6 times a week, I would then have a Slack conversation akin to this:

    Dev: Why did you break my build!?!

    Me: Can I see the error message?

    Dev: <pastes message above>

    Me: Thanks! Looking at the message, is there something unclear about the documentation? Does it not work?

    <ten minutes pass>

    Dev: Nope! Docs are great!
At this point the conversation would end.


So? That's no excuse for a developer to disregard the content of an error message in their own application.


It kinda is. Kinda like when documentation is so repeatedly outdated and incorrect, that when you need new information you just skip documentation entirely.

Are you wrong for skipping documentation? Yea, maybe. Is it entirely expected? Yea.

Based on the parent comment, at least.


And yet developers do disregard the content of error messages. Try to figure out why they disregard it. I doubt the answer is "because they're stupid". The answer probably also isn't "because they just aren't trying".

What could it be? Why do people read things and react in similar ways, even if they have different jobs? If only there was some field of study that could answer these mysteries.


This is a lesson I learned while being system owner of the primary user interface that runs on a semiconductor factory floor. No amount of confirmation/warning dialogs will actually stop someone from doing a wrong thing. Doesn't matter how scary the language is. Here's an approximate sample of one:

  "DANGER! Confirming this action may result in 8 figures worth of scrap!!!"
Even if you are super careful and make sure your error messages are terse in all cases, you will still succumb to things like muscle memory among your users. I've caught myself mindlessly dismissing these while testing. How can I expect my users to be better than the person who developed the UI? That is unreasonable.

It got to a point where we started removing these alerts/confirmations because it was training people to do the wrong thing in a few places. If you have part of a UI where all actions are immediate and final, the game theory changes. The moment a user enters into one of these spaces, they are much more cautious.

If the user thinks the UI will save them, they may eventually tire of these protections and forget why they are there in the first place. I feel like this is very similar to the problem of driver assistance and partial self-driving capabilities today.


I like how GitHub asks you to type back the name of the repository you want to delete.


For GitHub scary actions, I will not hesitate to copy and paste the expected repo name on the UI. I can do this so quickly my brain does not process the consequences in time.


Some developers are just lazy, and will likely need some kind of negative feedback to force them to confront their own laziness.

Which can be tricky, because the degree of negative feedback that is appropriate to the person in question can range from

"Polite one-on-one suggestion that you read the error message more than once before calling me"

to

"Full on yelling at the person in the middle of an open-plan office".

Thankfully, type II is rare, but they do occur.


> Some developers are just lazy

I'm really lazy: if I were on the receiving end of emails with error messages that included instructions about how to fix said error I'd automate Freshdesk (or whatever ticketing system I was using) to respond with instructions specific to that error message, in the first instance, along with a note to get in touch again if that didn't solve the problem. I'd also set the ticket to autoresolve after a set period of time.


Send a link to wiki. Last line of page is "if you have questions, reach out and include the keyword $THIS_PAGE_KEY in your message."


I've seen this constantly over the years, people who absolutely refuse to read the simplest instructions, but instead require step-by-step hand-holding from you personally.

I have no idea how these people get through life at all.


I have no idea either.

For example, when someone asks me "how do I get a German work visa" and I reply with a link to a page titled "how to get a German work visa", which is among the first results on Google. A literal minute later, they ask me more questions that the page clearly answers.

Some people can't be arsed to read a 5 minute article you hand-delivered to them, and would rather have you type it back to them.

I think that some people just have zero respect for other people's time.


I suspect that, at least subconsciously, they're to some extent doing that to punish you for writing 'bad' software that they have to struggle with. If they're going to suffer, you're going to suffer right alongside them.


Hey, let's jump on a quick call, so we can go through this together and maybe update docs if they're out of date?


> evidence at all that they would be read

I just had an idea: Put tracking info in the error URL. If your company has an internal URL shortener, that could do the trick.

More practically, I feel like it helps to put an empty line before the call to action. For many people, a traceback is just noise. The empty line helps split the useful info out from the traceback.

Or if it's a script/CLI (and you know the error reason) don't even show a traceback. Just print the error message to stderr, exit non-zero, and be done with it.
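Roughly what I mean for the script/CLI case, as a sketch. `UserFacingError`, `run` and the "no input file" scenario are invented for the example; the parts that matter are the blank line before the call to action, stderr, and the non-zero exit code.

```python
import sys


class UserFacingError(Exception):
    """An error whose cause is known, so a traceback would only be noise."""


def run(args):
    """Hypothetical command body."""
    if not args:
        raise UserFacingError(
            "No input file given.\n"
            "\n"  # empty line separates the problem from the call to action
            "Pass the path of the file to process as the first argument."
        )
    return f"processing {args[0]}"


def main(args, stderr=sys.stderr) -> int:
    """Return the process exit code instead of letting a traceback escape."""
    try:
        print(run(args))
    except UserFacingError as exc:
        print(f"error: {exc}", file=stderr)
        return 1
    return 0
```

The entry point would then be `sys.exit(main(sys.argv[1:]))`. Unexpected exceptions still produce a traceback, which is what you want for bugs.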


The help desk guys are on the other side of a cubicle wall from my workstation, and almost every call I overhear about someone getting errors just convinces me further and further that people don't only not pay attention to the error message, they don't pay attention to the people they're calling to help them get through the situation either.


Actually reading (and understanding, acting upon) error messages seems to be part of the learning process of every developer. And while more senior devs usually do read error messages, even they sometimes, rather than reading it, will jump to behavior like "trying again a different way" before looking closely at what went wrong.


Developers often seem shocked that people can’t find the important error in a wall of text. A particular peeve is when the same error is reported three ways and the real error is sandwiched between others or scrolled off the screen due to spammy behavior.


>>>“RuntimeError: Did not find ENVVAR, ensure this is set to the proper value (see <internal wiki link>) and then restart this service”

I'm laughing as you could not make it clearer if you tried. PEBKAC


The problem is here: "RuntimeError:". Once they saw that, they stopped reading. "Did not find ENVVAR" [..] "ensure this is set to the proper value" [..] "and then restart the service" are also obscure and will stop them from reading.

Why is the user like this? Error message PTSD. Years of staring at obscure errors full of technical jargon that are not helpful to the user, has left them scared to even look at the content of the error message. They have tried to Google these things before and failed, and now they just avoid it entirely and run for help.

I'm sure there's enough detail in the link you provided to help the user. But if that's the case, it will be better for the error message to simply say:

  A problem occurred, but don't worry! You can fix it yourself in 5 minutes! For instructions, visit https://internal-wiki-link/spaces/BLAH/AppUserRuntimeError#A013579
Even if you expect the user to be "smart enough" to fix their own problem, they are more likely to try it themselves if you make it seem easier.


I tried exactly this approach! What I got was a bunch of developers copy-pasting the error message with helpful URL at me and demanding to know what they should do. The number who followed the link and fixed the problem themselves was shockingly small.

Going out on a limb, I think we're all going astray by trying to parse the error messages our fellow developers are reacting to. A great many seem to handle any unfamiliar or unexpected error message by giving up, no matter how friendly or informative or helpful it may be.


They don't parse the error message as a natural language sentence talking to them. They take it as an opaque string, like a big error code. It literally passes through them without getting interpreted.

They learned that the affordances of these error messages are copy pasting into some place: a google search box, or a chat box asking for help. But it has no affordance of "interpret as an English sentence" for them.


If that's the case, then these people may just need training. It's likely that nobody has ever sat them down and explained that they have a responsibility to investigate their own issue. Often people feel they have to rush to get something done, and that they can't take time to troubleshoot. But if their bosses explained that, actually, it's fine if your work is a little late due to troubleshooting, they might do it themselves more often. You also may need to provide back-pressure by interacting via email/ticket.


That's a kind, caring, compassionate, empathetic approach founded on assuming good faith.

Unfortunately, it is perhaps not an ideal fit. I was mostly not dealing with the most junior and new of developers here. I was often dealing with senior developers who fully understood that they were responsible for investigating their own issues in a context where it was understood that troubleshooting takes time.

I often wound up regurgitating the error message back to them, asking them to point to the problems in the documentation getting in the way of them solving their own problems. This generally resulted in a conspicuous silence and the issues shortly thereafter being resolved.

The lesson I drew from this was not that the developers in question needed training. What I learned was that they needed to be convinced to treat these errors as natural-language strings they could interpret themselves.


Isn't this just survivor bias though? You only hear from those that fail to read and act on the error message.


Use the error messages you wrote! Send them the link they sent you, and move on.


Well, imagine the error was simply "RuntimeError: Environment variable not set" instead, then how much of your time would have been wasted by those dozen interactions?


Shouldn't the app gracefully exit with a clear message, and not bail out in a way that looks like a crash? I'd guess that the person who wrote it hooked into the error handler because that was the easy thing to do rather than bother to write a nice way to exit properly.

The fact that you've had this a dozen times points to a problem with the app more than the people using it to me.


This is 100% correct.

In theory, all errors should explain the input, explain the problem, and explain how to solve the problem (actions). That should help reduce the number of support calls. However, error messages and their suggested fixes are read by maybe 1% of users.

The only way to improve your UI is to prevent errors and use standards / familiar design.


I mean it kinda makes sense. When you're coding, you're constructing something. When you're debugging, you're deconstructing something. I feel like it's natural for people to take a sec to codeswitch, bc they were likely in a state of flow w/ considerable momentum up until they saw the error


It's a little annoying, but to be fair, because most error messaging is garbage, it's easy to start to ignore them. How often is the error message shown, and the little fix given, actually going to solve the problem in modern web development? 10% of the time? 25% of the time? I'd be shocked if it's that high.


Still, if you write proper error messages then at least you can figure out what the issue was without SSHing into the person’s computer and checking their logs.


Don't send people somewhere else to learn how to fix the error. The more steps and indirection you add the fewer people will bother doing it themselves, especially if they can bump it to the developer. Make it easy for people to fix their own problems by being explicit, direct and complete. List all the steps and use formatting to make it visually easier to consume.

So your error message while a far cry from the worst I have seen is also pretty far from the good ones I have seen.


I think his point was that the developer tends not to even investigate the ENVVAR at issue or visit the link. If the developer does investigate the link and still has an issue, then you have a point.


Pretty sure his problem was he got contacted about an issue he considers uninteresting, and his preferred solution is the user stops behaving like a human.

Reaching for the easiest way to solve a problem first is a very human thing to do, and in this case he was easier to contact than opening up a browser and reading an article that presumably is written in the same kind of language as the error message.


I admit to doing this. Even many of the useful error messages that clearly indicate the fix are drowned out by the mass of output. I've made this mistake before, and I'll probably do so again.


I feel like this is a problem of overly chatty application logs + lack of formatting for errors.

If the volume of drivel was lowered and errors were formatted with spacing and color to stand out, then they would be easier to focus on.

So log errors to stderr, send it to a separate log file, and format it well (use multiple lines).


> So log errors to stderr, send it to a separate log file, and format it well (use multiple lines).

Oh, for sure. Never:

- send errors to the same log you send normal activity.

- default into logging things that aren't errors on the error log (make this possible to override if you want, but never the default).

- log the errors there, but the necessary context on stdout so it appears correct on a terminal. (E.g. build tools that print entering into target in stdout; error in stderr; leaving target in stdout)

- try to recover just to show a different error later.
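
For what it's worth, here is a rough sketch of the stdout/stderr split using Python's standard logging module. The function name, the format string, and the choice to keep everything below ERROR on stdout are my own assumptions, not something from the comments above:

```python
import logging
import sys


class MaxLevelFilter(logging.Filter):
    """Let through only records below a given level."""

    def __init__(self, level):
        super().__init__()
        self.level = level

    def filter(self, record):
        return record.levelno < self.level


def build_logger(name, error_log_path):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False

    # Normal activity goes to stdout only.
    out = logging.StreamHandler(sys.stdout)
    out.addFilter(MaxLevelFilter(logging.ERROR))

    # Errors go to stderr and to their own file, spread over multiple lines.
    fmt = logging.Formatter("%(levelname)s %(asctime)s %(name)s\n    %(message)s\n")
    err = logging.StreamHandler(sys.stderr)
    err.setLevel(logging.ERROR)
    err.setFormatter(fmt)
    err_file = logging.FileHandler(error_log_path)
    err_file.setLevel(logging.ERROR)
    err_file.setFormatter(fmt)

    for handler in (out, err, err_file):
        logger.addHandler(handler)
    return logger
```

Calling build_logger twice for the same name would attach the handlers twice, so a real setup would do this once at startup.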


Oh, boy, on the "never log expected failures as errors" front, I once worked with a database system that used opportunistic transactions. Basically, each modification to a row effectively carried the original value of that row with the update, and if it failed, the API call triggered an error saying that the transaction failed. So if you did a "SET column=(column+1) WHERE rowid=unique", the client could basically do an automatic retry.

But, it also logged each and every occurrence of this at "Error" severity, instead of at "Info" severity (it is, after all, expected to happen once in a while).

And of course, once our code switched over to using this, the first few times every team member had to deal with a production issue, the immediate reaction was "oh, no, the data store is unhealthy! look at this mass of error logs, I can see one every few minutes!". Thankfully, after the first team member (me, as it happens) spent half an hour reading the relevant parts of the design and implementation docs, we could frequently short-cut a lengthy investigation by "oh, you think $DB is bad because you are seeing transaction failures? no, that's expected, see $URL".
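
As a sketch of the logging behaviour I would have preferred (this is not how $DB actually worked; the exception type and function names are hypothetical), the expected conflict gets INFO and only exhausting the retries gets ERROR:

```python
import logging

log = logging.getLogger("store")


class TransactionConflict(Exception):
    """The row changed between read and write; a retry is expected to help."""


def update_with_retry(apply_update, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return apply_update()
        except TransactionConflict:
            # Expected under contention, so INFO rather than ERROR.
            log.info("transaction conflict, attempt %d of %d", attempt, max_attempts)
    # Only running out of retries is worth an ERROR line.
    log.error("update failed after %d attempts due to conflicts", max_attempts)
    raise TransactionConflict(f"gave up after {max_attempts} attempts")
```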


I’m left wondering at what point the “give a man a fish/teach a man how to fish” method of pedagogy applies in terms of ‘acting like a human’ in this context.

Asking as someone who otherwise generally agrees that there are some truly poorly written errors and exceptions out there, but has also been on the admittedly frustrating end of the constant requests for help deciphering error messages that were very plainly stating what the problem is for someone who didn’t even try looking for the fishing rod.


Sure, clearly there are people who will never try, or learn, but in general as an industry I feel like the vast majority of errors are very, very far from good.

Few error messages are written well, have good formatting, and are self-contained (can be used to fix the issue without having to seek further information elsewhere). Sometimes you see errors that contain one of those elements, but rarely all of them.

There has been an effort the last few years improving compiler errors for some languages, but those same improvements have not reached applications.


I feel like I can't get folks to open the log file and cmd-F "ERROR" half the time.


A big part of this is to direct more of your development time into errors that happen more frequently.

Most systems I was involved in designing have some kind of error tracking system, so we can know exactly how often each error occurs.

An error that never happened needs (usually) no attention.

An error that 28% of installations have seen needs a lot of attention. The error text should be translated into local languages, wiki pages should be written about how to resolve it, efforts should be made to auto-resolve the error. The error message should include helpful info, etc.

Eg. "SSH server can't start. Config file unreadable".

Could be expanded into:

SSH server can't start. Config file error on line 7. 'AllowPasswordLoogin' is an invalid setting. Did you mean 'AllowPasswordLogin'? If you want to make this change, 'sudo nano /etc/sshserver.conf' will let you change this config.
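
The "Did you mean" part is cheap to build. A minimal sketch using difflib from the Python standard library; the list of known settings and the function name are made up for the example:

```python
import difflib

# Hypothetical list of valid keys, for illustration only.
KNOWN_SETTINGS = ["AllowPasswordLogin", "Port", "ListenAddress"]


def describe_unknown_setting(key, line_no, path, known=KNOWN_SETTINGS):
    message = (
        f"Config file error in {path} on line {line_no}: "
        f"'{key}' is an invalid setting."
    )
    # Suggest the closest valid key, if any is reasonably similar.
    matches = difflib.get_close_matches(key, known, n=1)
    if matches:
        message += f" Did you mean '{matches[0]}'?"
    return message
```

When nothing is close enough, the suggestion is simply left out rather than guessing.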


If you're raising an exception deep in some internal code, provide as much detail as possible.

If the error bubbles up to the user, then either the information is over their head, in which case there's no difference to a non-detailed error message, or the user/support person can actually act on it.

The most infuriating error I see is "file not found"... WHICH FILE?!

Of course if the error is found in the higher level due to some consistency check in the business logic, then yeah try to guide the user. But for internal stuff, try to help the person who needs to fix it or find a workaround. It might be you.
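
A small Python sketch of what I mean for internal code, assuming a hypothetical template loader: say which file and what you were trying to do, and keep the original exception as the cause.

```python
class TemplateLoadError(Exception):
    """Raised when a report template cannot be read (hypothetical example type)."""


def load_template(path):
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as exc:
        # Say which file and what we were doing; keep the original as the cause.
        raise TemplateLoadError(
            f"Could not load report template '{path}': {exc.strerror}"
        ) from exc
```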


> The most infuriating error I see is "file not found"... WHICH FILE?!

Filenames might contain user data, which must not be logged outside of a database with proper access control, schema annotations, and access auditing.

We can only display an opaque object key, so authorized devs can look up the filename using secure tools.


Fair enough. I work mostly with good old desktop applications though, so if there's user data, it's almost always the user's data.

For the majority of errors in most applications one can provide some helpful information. But yeah, one needs to be a bit careful if one has PII in the mix.


> If you're raising an exception deep in some internal code, provide as much detail as possible.

> If the error bubbles up to the user,

...then you have an information disclosure vulnerability! There's a really good reason why we don't bubble up deep exceptions to end users: Attackers can use that info to gain information about your back end that they can use to find worse vulnerabilities.

Put all the detail you want in your logs. Keep the end users out of it. They shouldn't be able to tell what line broke things.


Yeah, things are a bit different with web apps. There, users usually can't do anything with the info even if they had details, so internal logs are clearly the place. But my point still stands: you want detailed info in those logs, not just a lone "file not found" without anything else.


This reminds me of the two most annoying error messages of all time [for me].

The first one is from PayPal. Whenever I try to add a US bank account to my PayPal account, it says something like "You cannot add this bank account at this time, period"

After more than a year, it turned out that there was no way to add such an account for a foreigner, despite my friends [from the same country] being able to do it easily a couple of months before.

The second one is, poor me again, trying to edit a Facebook page URL I created for a side project, which should read FB.com/[SIDE_PROJECT], where FB keeps rejecting my request with a generic/unexplained/unhelpful error message even though the page URL name was available.

About a year later, I got it working by, SIMPLY, having my phone number verified! How bad!!


There are fundamentally two classes of error message:

1. Information that can help a technically engaged person debug a problem.

2. Information that can help a user of the system understand what they have to do to overcome the problem.

Since most error messages are created by people responsible for debugging the system they tend to be of the 1st class. There has to be a way to provide different information based on who is getting the error.


> There has to be a way to provide different information based on who is getting the error.

Yes, this concept exists. The error message that is shown to the user (number 2) is what's discussed in the article. The error message that an engineer or someone else debugging the system should get (number 1) is the full stack trace and data dump that should be sent to the application log at the same time that the user is shown the error dialog.

Users can fix the problem by following the instructions in the error dialog and engineers or technical people can come back later and look at the more detailed stack trace to determine the best course of action.


> There has to be a way to provide different information based on who is getting the error.

This is already solved. Provide one error to the user and another to your logging system. In the user error provide a mechanism to point you to the logged error (even a simple timestamp helps).
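
Something like this, as a rough Python sketch. The function name and the eight-character ID are arbitrary choices of mine; a timestamp would do the same job:

```python
import logging
import uuid

log = logging.getLogger("app")


def report_failure(user_message, exc):
    # One short ID ties the user-facing text to the detailed log entry.
    error_id = uuid.uuid4().hex[:8]
    # The log gets the ID plus the full exception details.
    log.error("error_id=%s %s", error_id, user_message, exc_info=exc)
    # The user gets the friendly text plus the ID to quote to support.
    return f"{user_message} (reference: {error_id})"
```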


There's a fatal flaw in assuming that there's no overlap between groups 1 and 2.


There's also a third class which is “Oops! Something went wrong…” which basically means "I don't know. Try and reload the page." Why this is better than a simple "error" is beyond me, but it's mildly frustrating.


The error message that is presented to the user should always be clear and helpful. When an error is presented to the user, you should have matching logging (e.g. sentry) that provides technical reporting on what happened. By having both solutions in place you have error handling that is complete and services both communities.


It's easy. Just provide both, with mark-up to label them.


Watched the new Quantum Leap yesterday (it's not great) and there was this really cringeworthy moment when something goes wrong with their awesome supercomputer and the screen flashes a giant "INTERNAL SYNTAX ERROR". Apparently, somebody didn't run their linter before sending people through time. Too bad.


As with everything, context matters. It's a great run-down of how to empower an error message. Many products can add so much value and save support resources by doing so.

There's one thing I wasn't sure about in this article though. Did they talk to actual users regarding these empowered error messages, or even ask them what they want to see out of common error messages they run into? It seems rather difficult to empower error messages without first understanding the scenarios that got them into the error state to begin with. Next would be understanding if these error messages are helpful to the users and asking them how they go about resolving these types of issues. All of that is hinted at in the "what makes a good error message".


The general approach that I take, is that an error message is one of the most stressful occurrences that a user encounters, so it's incumbent upon me to make it as pain-free as possible.

First of all, unless I'm writing an engineering tool, my users aren't geeks, and don't especially care why the error is happening (geeks always need to know why). They just need to know that what was expected, did not happen. If there is a remedy, and it can be simply stated, then I can add that, but it needs to be short and simple. Longer stuff needs to go into some kind of secondary screen (which probably won't be read).

Also, I take the "shopkeeper" approach. The customer is always right, and it's never the customer's fault. I avoid any hints of blaming the user (even if it is their fault), and try to be polite and helpful[0].

Of course, the best way to deal with errors, is to avoid them. I try to design good affordances.

The rules are different for SDKs, though. In that case, I tend to send a great deal of information back. I take advantage of Swift's enums, and the ability to associate data. It can allow me to nest error reports.

[0] https://littlegreenviper.com/miscellany/the-road-most-travel...


Over time I've come to believe in the "grepability" of error messages, and the code-lines that construct them.

Sometimes the data (and error messages) flow up and down through so many different modules and APIs and job queues and whatnot that, when an error pops up, it saves a lot of developer time when you can just text-search the code repo(s) and see exactly the line that generated it in the first place.


A "Try again" button is the worst way to solve the problem of having no connection. GMail does it right by trying again automatically and periodically while showing an error bar at the top of the screen, without stopping the user from using the application.

If Wix can save the data locally, why not just copy the GMail error interface and let the user decide when to connect to the internet?


All the 'do this' versions suffer from the same problems as the 'don't do this' versions. Aside from fixing the tone, they are still generic, still inactionable, and still verbose.


It is my opinion that software problems tend to be analyzed along these four axes:

- Can an end-user solve the problem themselves? If so, tell them how, if not, display a generic error message telling them to ask for support (with an error identifier they can tell the support)

- Developers and end-users need different information: developers need as much information as possible, like file names, contents of important variables and especially where the error happened in the source code with a backtrace, sometimes even two backtraces: the backtrace for the cause of the error, too; and end-users only need to be told what they can do, but this needs to be worded clearly and carefully. This means that error messages need to be written twice.

- Is the problem serious? If so, report, crash and restart; if not, just report and abort the affected operation when necessary.

- The problem should be logged. Sometimes it can be sent to developers automatically.


My recent experience with Docker: I am a total newb, so I was running a tutorial step by step, and then I got some error about apt certificates/keys/repo stuff. After a lot of googling, it turned out the issue was that there was not enough disk space, but the fucking error was pointing in a different direction. This is also a good example of why Stack Overflow is useful, for the dudes that hate on it and RTFM everyone else.

This is why I love exceptions. I had an issue with a C# game, but with a stack trace I could figure out myself that the issue happens when the app initializes and fails to open a file.

I think we should always give the users a detailed log and stack traces. Docker should also fucking have some way to catch the issue when there is not enough space and report the error properly.


I really like this. There are clear shibboleths which identify the author as a person who deeply respects and cares for the readers of error messages, and their experiences. It makes me hopeful for the future of software when I see that there are others. Thanks for sharing.


Tried to download my data from takeout.google.com and got this error:

"500. It's an error."

Thanks, google. I tried to start a chat (I'm a Workspace customer) and could not continue because all the language choices were disabled (even English).


You should be happy that you got an error message from Google, because their default is not to give any.


This is great, I would add one critical ingredient: provide actual customer care.

Meaning, the "way out" is to point users to customer care, but this still does not help if customer care is shit. And we know it often is.

Customer care should be an email address (and/or phone number) in the footer. Not a contact form. Self-help/FAQ is fine, but no replacement for direct contact. Nor is a shitty AI bot.

And when contacting support directly, answers should not be scripted non-sense completely ignoring the actual issue at hand.

I don't care if it doesn't scale. Make it scale. Your problem.


3 things I'd add (and have used with success):

1. Always have an error specific URL to point at. Changing a document stored outside the system is often significantly easier than redeploying a system (order of magnitude seconds or minutes vs. hours or days in the worst case). There are many benefits to this approach. It's available when your system is not. It's possible to look at metrics and collect NPS scores on the information. It's easy to add pictures, steps, links etc.

2. Try to add an operation specific correlation ID. This allows the user to talk about a specific instance of an error easily when dealing with support and developers to look for specific log info. This is also useful if you provide a 'get support' link on errors that require manual intervention.

3. Add an error specific identifier to help developers map error strings back to source code. Often with error messages that are string interpolated the unique values tend to obscure the non-unique parts of the message. Also messages that are fairly similar can make it more difficult for a developer to find the specific cause.

These are not alternatives, but additions to TFA's suggestions.
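
To make the three points concrete, here is a minimal Python sketch of an error type that carries all of them. The class name, the APP-1042 code, and the example.com URL are placeholders, not from any real system:

```python
import uuid


class AppError(Exception):
    """Error carrying a stable code, a help URL, and a per-occurrence ID."""

    def __init__(self, code, message, docs_url, correlation_id=None):
        self.code = code                  # stable identifier, grep-able in the source
        self.message = message
        self.docs_url = docs_url          # error-specific page, editable without a redeploy
        # Unique per occurrence, so support can find the matching log lines.
        self.correlation_id = correlation_id or uuid.uuid4().hex
        super().__init__(self.render())

    def render(self):
        return (
            f"[{self.code}] {self.message}\n"
            f"  More help: {self.docs_url}\n"
            f"  Correlation ID: {self.correlation_id}"
        )
```

Raising AppError("APP-1042", "Upload failed.", "https://example.com/errors/APP-1042") would then print the code, the link, and a fresh correlation ID together.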


I believe that any language that treats errors and error management as an afterthought is bad. Also, any programmer that treats errors as an afterthought or simply ignores them is going to write bad code/programs. Errors are hard and need first-level language support. People talk about “higher order functions” but never about how to deal with errors (mainly because it’s boring and complicated). Also, errors are tightly coupled with intentions: if you fail to do something, well, that’s an error. But that also means that it’s tightly coupled with what the program is trying to achieve. So anywhere an error happens should be close to what it tries to do. It also settles what an error is all about, which makes it easy to describe what it should be. Yes, there are errors that may not fall into this category, as they are much less related to what you are trying to do functionally. Any program which ignores how errors work and flow, in my experience, has always been bad in general, as its structure is also bad because there’s no organization.


For me error messages come in two forms.

1. For the user.

You can't do that (maybe explain why). Don't do that.

2. Error that's actually there for the support or engineering team for a customer to convey to support, probably with a handy copy to clipboard link (that the user has at best a 50/50 chance of using no matter how much prodding).

That's it.

Humans generally lock up hard when they see an error in my experience. No amount of information or hand holding will help most of them figure it out. It's better to try to solve it in software.

If the software can't fix the issue internally then they get an error message and 2 things happen:

1. The user is going to try something else and solve it themself (awesome) regardless of the error because they're smart and capable people and could probably solve it no matter what you told them.

2. Their brain locks up, they do the same thing 20 times and get the same result and complain to support with some form of "doesn't work". Doesn't matter what error you give them, they won't even try to tell you what the error was / doesn't register in their brain unless it had a cute cat on it or something (that actually works... so forget this "tone" stuff).

I like the article, but I am skeptical about a UX team who doesn't answer support tickets ... just magically knows what the user is thinking / will work. I get lots of advice on error messages, I change them when they ask, but when it's from folks inside the company who know the product it often isn't helpful.

Heck even users give bad advice about errors. I've had them tell me "Well it should have said X" where X is exactly word for word what it said (they forgot...).

Granted I still try to help the user along, but I'm skeptical that software with any large user base can have "good" error messages.


Really, error handling has been my big beef with CS education for like 40 years. There is none.

Error handling has been left to engineers, and when left to their own devices engineers will almost always make the wrong choice from a user point of view.

Engineers need to think of error messages this way: the error message is there to help people (which might be fellow engineers, support, and/or your consultants) identify the error quickly so that they can manage the user's expectations, fix the error, or both.

Unfortunately, many engineering paradigms make this an impossible task.

Layering and encapsulation means that you have little idea what's happening downstream or how the downstream stuff actually works, but the lower-level you are the less likely the error will mean anything to the end-user.

Then, it's a question of who's responsible for handling the error? If you're on the backend, where does it go? Does the user care that the backend microservice can't connect to the database? Heck, the UI probably has no idea what's happening back there.

However, for accurate troubleshooting detail is needed.

For many orgs, leaving transaction IDs in your log files is the primary way that you figure out errors, especially in big distributed systems. That doesn't really help end-users, and requires developer discipline, something many engineering teams find challenging.

Ideally error objects would aggregate error codes up the stack, so that if an error occurs you can at least present technical people with the errors that were thrown..and they can search through the source code trying to find that unique error code. But designing that is difficult; conceptually you don't want a list of 500 error codes being thrown upwards, one from each function in the call chain. But sometimes you do.
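
In Python you can get part of the way there with exception chaining. A sketch, assuming each layer raises its own coded error "from" the one below it; the class, the helper, and codes like ORDERS-204 are all hypothetical:

```python
class CodedError(Exception):
    """An error with a unique, searchable code (hypothetical example type)."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def error_code_chain(exc):
    """Collect codes from the outermost error down to the root cause."""
    codes = []
    while exc is not None:
        if isinstance(exc, CodedError):
            codes.append(exc.code)
        # Follow explicit "raise ... from ..." links only.
        exc = exc.__cause__
    return codes
```

Only layers that add their own code show up in the chain, which keeps you away from the 500-codes-per-call-stack problem, at the cost of relying on each layer to chain properly.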

Anyway, error handling design really should be part of the initial architecture, but it usually isn't because architecture guys don't really understand support.


I've been guilty of this in the past - I remember writing an error message that looked like "if you used X setting, do this, otherwise that". The code should have instead checked what settings the user enabled and given a clearer error for the situation at hand.


Internet connectivity is an obvious candidate for this.

Could not connect to server? Check if WiFi is on. Check if DNS is working. Check if ping to router is working. Check if ping to google is working. Link to wifi settings.

Whatever you do, just don’t do this the reverse way, like my smart ass Samsung tv does! It determines if internet is working by pinging a Samsung server, before it even allows other apps to use their internet. You can probably figure out what will happen when Samsung servers are down.
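
The ordered checks listed above could be sketched roughly like this in Python. The checks themselves are passed in as plain callables, since how you actually test WiFi or DNS is platform-specific; everything named here is an assumption for illustration:

```python
def explain_connection_failure(checks):
    """Run (description, check, advice) triples in order; report the first failure."""
    for description, check, advice in checks:
        if not check():
            return f"Could not connect to server: {description} failed. {advice}"
    # Every local check passed, so the problem is probably on the far side.
    return (
        "Could not connect to server, but your network looks fine. "
        "The server may be down."
    )
```

The checks only run after the real connection attempt has failed, so they explain a failure instead of gating access the way the TV does.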


That sort of code is a bit tricky though.

Since the fault code paths (hopefully) are very rarely executed, the error messages are easy to overlook, and tend to rapidly become stale. This is to an extent always a problem with error messages, but it's an even bigger problem when you have half a dozen error messages depending on various parameters, since they create more and even rarer code paths for staleness to hide in.


If you have tech support or knowledge base articles for your product, you can include unique error codes in your error messages so that Googling the error code will find the appropriate support article. Microsoft is pretty good about this with their KB article numbers and their compiler error messages like C4000: https://learn.microsoft.com/en-us/cpp/error-messages/compile...


How about just engineering stuff to not have errors in the first place.

My toaster is a complex bit of engineering - it has thousands of parts which all work together to take power from the wall to make toast.

Yet it has no errors. It just does the job I ask it to do.

A computer on the other hand seems to have a lot of ways to fail, and does so nearly every day. I suspect everyone reading this comment has seen at least one error today. Can't we engineers make the software better so that these errors can't/don't happen?


A toaster is probably a bad example, given the common error states (burnt toast, stuck toast) which are no doubt amplified by design flaws in some units. I've never seen a toaster with 2000+ components, so maybe such a machine is different. A toaster is also historically famous for a dangerous error state: if the plug is inserted the wrong way round, the coils will be switched on neutral. A toaster which is "off" is thus liable to shock an unwitting person using a fork to resolve the stuck-toast error state.


I don’t know what kind of toaster you have but mine doesn’t have thousands of parts. Maybe 20 or so.


What the article is missing is how they learned the new error messages are now more helpful to the end user. Some kind of metrics: maybe, the number of support tickets/angry reviews decreased? Otherwise without clear criteria for success I'm not sure if it was worth it and wasn't just changing the error messages for the sake of changing. Sure what they talk about makes sense but "it makes sense" is not a business metric.


Redesign the error messages or UX however you want, but I hope there is always a "more" button to show exactly what went wrong. I hate eventvwr.msc or less -nir walls of log text.


In my current gig, I would be content if we just had consistent error handling. But across dozens of endpoints (REST and grpc), there are almost as many different kinds of error responses. Some will return a 400 instead of a 404, some will return a 500 for any error, some will return a sensible error code but the status and message amount to "you called GET on XXX and it failed".


Very well-written article with good examples and advice.


"Passing the Blame" in particular is a personal pet peeve. I hate when apps phrase errors like I did something wrong by clicking the totally normal link. Closely related is the general trend of "lol wut" tone in error messages, which really grates when you're frustrated and doing something that might be very important. "Whoops! We made an Oopsies! Sorry :("


I'm not sure we'll ever eclipse the awesomeness of the VB6 error: "Method ~ of object ~ failed".

On a more serious note, error messages are something I always try to keep in mind in code reviews. Most error messages the code I review deals with are only ever seen in production logs, so I try to think what I'd do with that message (and accompanying details) if I saw it in production.


It reminds me of an article from Byte magazine back in 1981, but the basics stay the same.

I'd like to learn how to make more meaningful error messages in compilers, particularly "low code" compilers that slice code transformations thinly and thus have a hard time explaining which lines of code are interacting to create this situation that happens at phase 39.


Please also put variable names at the end of sentences when possible. For example, instead of "Your file /user/foo/bar.baz did not load correctly because of whales", how about "This file did not load correctly because of whales: /user/foo/bar.baz". The search is much easier.


Even their example of terrible, "Whoops something went wrong" is miles ahead of Chrome's "Oh snap!"


Nice. While working on a large, long-running project, I at one point started redoing all the error messages to be more helpful, include implied suggestions, and be divided by alert level and category, because I decided to pause and take care of my users.


Here's another link for how to write useful error messages: https://www.bbc.co.uk/gel/features/how-to-write-useful-error....


I completely agree with this article, but it never bothers me in particular. But I'm a developer, so I'm an outlier. That said, I do wish that the error message I see every day would be simpler.

    <looks at TypeScript>


> If the issue keeps happening, contact Customer Care.

This actually means "if you like wasting your time and want to speak to incompetent fools who will pass you to an endless stream of their 'colleagues' then dial this number."


> Unable to connect your account

Do they mean “Unable to connect to your account”? Because otherwise it’s not clear to me what this is about. Connect my account to what? This doesn’t read like a user-level concept.


Bonus points if your link to customer care auto-populates the fields necessary to get the ticket where it needs to go and can attach relevant diagnostic information to the resulting ticket.


Nicely written piece with clear examples. It would be great to know the impact of this work. Perhaps one metric to look at would be the number of tickets submitted to customer care?


Reminds me of years ago when a junior developer I was working with got a lot of good-natured ribbing for a validation message that simply said, "You can't do that."


They recommend, avoid technical jargon so change it to:

'due to a technical issue on our end'

but isn't that also generic and obvious which they were trying to avoid too.


So essentially go back to dev style error messages?

A UX person telling us not to do what the previous UX person thought was cute.

Thank you sooo much! Ask PM for a pat on the back.


This quote from a textbook in my graduate studies helped me a lot: “Error messages should be how to fix it messages.”


Just tell me you can’t connect with a big red Error message. I don’t give a damn about polite error messages.



@Microsoft read this article! ;)


Was hoping to get insight on better logging for engineering users, not UX design.


I saw an error message the other day:

“Deployment failed because: deployment succeeded”


0x0000001E, KMODE_EXCEPTION_NOT_HANDLED

That is all.


> Even in today’s world of user-centered design, technical jargon still sneaks its way into error messages. You couldn’t fetch my data? My credentials were denied? What? The technical stuff is not important to the user

This is the opposite of what I want. Stop condescending and just tell me what actually went wrong.


I have this issue with Google Family Link, where I want to add my child's voice to a Nest Audio. The app straight up tells me that I'm not connected to the wifi, which is clearly not true. Furthermore, the app knows I'm connected because in the logging you can see it finding the Nest Audio.

It's impossible to figure out what goes wrong. Plenty of people have the same problem, but Google only has this forum where superusers assume everybody else is either lying or an idiot. Meanwhile, they take such error messages at face value, despite many people saying they have wifi.

All that to say that I'd rather have an overly technical error that actually tells me what's wrong, instead of a friendly error message that's straight up wrong.


This; particularly because more and more, "support" seemingly has no means to access logs, no ability to do the debugging, and no way to escalate obvious bugs in the application to the developers.

I need the technical jargon to do support's — and the company whose product I'm using's — job for them.

Is it not helpful to laypeople? Perhaps not, but it is what the technical friend they're going to drag into the problem needs.


I think the point they are making here is that clearly stating what went wrong doesn't necessitate using "technical jargon".

Now, "your credentials have been denied" seems pretty clear and does not use jargon in my opinion, but telling the user "the ajax request failed, returning a 403 http error code" seems unhelpful and doesn't tell them what happened.


Even in your example there's a world of difference. "Your credentials have been denied" implies a problem with credentials, while 403 clearly states that the credentials are valid, they are just denied access to this resource.

I know it is a made up example, but it does show the problem with "dumbing down" the error messages. Details matter.
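
One way to keep that distinction without dumping raw jargon on the user is to pair a plain-language summary with the technical detail. This is only a sketch: the `UserError` type, the `explain` helper and all of the wording are hypothetical, and only the 401/403 status names come from HTTP itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserError:
    summary: str  # plain-language statement of what went wrong
    action: str   # what the user can try next
    detail: str   # technical detail kept for support or a web search


# Hypothetical wording; 401 and 403 get different messages because they
# point to different fixes.
MESSAGES = {
    401: UserError(
        summary="We couldn't verify who you are.",
        action="Check your username and password, then sign in again.",
        detail="HTTP 401 Unauthorized",
    ),
    403: UserError(
        summary="This account isn't allowed to open this page.",
        action="Ask the owner for access or switch accounts.",
        detail="HTTP 403 Forbidden",
    ),
}


def explain(status: int) -> UserError:
    """Return a specific message when one exists, otherwise a generic one
    that still carries the status code."""
    return MESSAGES.get(status, UserError(
        summary="Something went wrong on our end.",
        action="Try again in a few minutes.",
        detail=f"HTTP {status}",
    ))


err = explain(403)
print(f"{err.summary} {err.action} ({err.detail})")
```

The `detail` field can sit behind a "show details" toggle, so the layperson sees the summary and the technical friend still gets something to search for.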


...then they clearly didn't make their point at all. Big error (in communication) on their part. Your single (2nd) sentence communicates everything required.


It all depends on the context. If it's a web application that can't connect to some backend service, for example, what exactly are you going to do with that information?


I'm going to web search it and find advice from other users or devs. Maybe I need to use my email address instead of username, or delete my cookies, or something.

If it's proprietary locked down user-hostile junk, then yeah, all I want in the error message is a statement of a refund on my payment, and a link to a competitor website.


Depends on why it didn't connect, right?

Was it a timeout? Maybe an HTTP 401? Was it a DNS failure? Was there a TCP reset immediately?

Each one has a myriad of troubleshooting steps associated with it. Some could be local to the host, some could be network/firewall, and some could be from the remote host or behind that.
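
As a rough illustration of keeping those cases apart, here is a sketch using only Python's standard library exception types. The `describe_failure` helper and its message wording are hypothetical, not taken from any particular product.

```python
import socket
import urllib.error


def describe_failure(exc: BaseException) -> str:
    """Map a low-level connection error to a specific description."""
    # Order matters: HTTPError is a subclass of URLError, so check it first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP {exc.code}: the server answered but rejected the request"
    if isinstance(exc, urllib.error.URLError):
        # urllib wraps the underlying cause in .reason (an exception or a string).
        if isinstance(exc.reason, BaseException):
            return describe_failure(exc.reason)
        return f"Connection failed: {exc.reason}"
    if isinstance(exc, socket.gaierror):
        return "DNS failure: the host name could not be resolved"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "Timeout: the server did not respond in time"
    if isinstance(exc, ConnectionResetError):
        return "TCP reset: the connection was dropped by the remote side or something in between"
    return f"Unexpected error: {exc!r}"


print(describe_failure(ConnectionResetError()))
```

Each branch gives the reader a different place to start looking, which a single "can't connect" message throws away.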


Error messages should definitely be written with a target audience in mind. For Wix, a blogging platform, the target audience is usually decidedly non-technical. For many of the tools I use, more technical detail would be welcome. Then again, my parents are unlikely to use the same tools, while they might use Wix.


How many times have I had to strace an application because the fucking error message didn't give enough information!!


I don't understand why they wouldn't have a dropdown below the error that would reveal the technical jargon.


They do. Press F12 ;)


> Stop condescending

Wix is mostly a platform for non-techie DIY website builders. I can't imagine they'd know what to do with a highly technical error.



