Nice to see how Faster Payments actually works, explained in an understandable way.
It does sound weird that it was caused by a corrupted date formatter. I'm assuming it's something like the formatter resetting itself to the language default.
Yeah, instead of `20190530` we were getting `['bert', 'time', 1559, 238096, 0]`. My understanding is a race condition put the code into a situation where it couldn't determine how to format that field, and we just got a default string representation of the underlying value.
That reminds me of a race condition I encountered a while ago:
There was this massive Java application processing hundreds of parallel requests per second. For each request it wrote a line with billing information into a log file. Those lines looked fine at a quick glance, but when we tried processing them later we encountered invalid dates. The log records contained future dates as well as impossible dates like 2019-02-30. Long story short: in the end we figured out that the date formatting was not thread-safe (it might have been SimpleDateFormat, but I don't remember the details anymore), so the date components from multiple threads got interleaved. Ouch, I guess somebody learned a lesson back then.
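For anyone who hasn't hit this before, here is a minimal sketch of the usual way to avoid it, assuming the culprit really was a shared SimpleDateFormat (which I can't confirm). SimpleDateFormat keeps mutable state and is documented as not synchronized, while java.time.DateTimeFormatter is immutable and thread-safe, so a single shared instance is fine. The class and method names are made up for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not the original application's code.
class BillingLogDates {
    // DateTimeFormatter is immutable and thread-safe, so sharing one
    // instance across request threads cannot interleave date components.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ISO_LOCAL_DATE;

    // Formats a date as yyyy-MM-dd for a billing log line.
    static String formatDay(LocalDate day) {
        return DAY.format(day);
    }
}
```

With this, `BillingLogDates.formatDay(LocalDate.of(2019, 5, 30))` gives `2019-05-30` no matter how many threads call it at once. If you are stuck with SimpleDateFormat, the common workarounds are one instance per thread or synchronizing access to it.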
Which platform gives you unformatted data if a format can't be determined? Or was the code similar to getDate(format) and format happened to be null? Was the race condition in your code or in the platform? (EDIT: I suppose these would be questions more for the developers of the third-party gateway.)
> The bug was in a computer program the [third party] Gateway uses to translate payment messages between two formats. When the program was operating under load, the system tried to clear memory it believed to be unused (a process known as garbage collection).
> But because it was using an unsafe method to access memory, the code ended up reading memory that had already been cleared away, causing it not to know how to translate the date field in payment messages.
IIUC, the problem seems to have been that the code was looking into freed memory and so the date format was essentially random data. I can imagine a case statement where you have a default case of "don't translate the date" with a comment over it saying, "This should never happen". I'm sure I've naively written similar code when I was sleepy and it tends to pass review because it's innocuous.
It's easy to be hard on the programmer -- probably crashing is better than data in the wrong format, but then you are just pushing out the problem to a different layer. Error handling is completely non-trivial in complex systems. Maybe they should have thrown an exception in that case, but are you sure it's going to be handled? What is the downside in that case? It could easily be worse -- we have no way of knowing. Sometimes it gets down to, "Well, you need to make sure there are no mistakes in the code". If we're going to go down that route, then the incorrect timing of the memory freeing is the real cause (or if I'm being particularly nasty I might say, "You really shouldn't be using threads" ;-) ).
I guess what I'm trying to say is that there is certainly a better way of doing defensive programming in this case, but I wouldn't be able to tell what it was without seeing the code. I also wouldn't expect any large codebase to be completely free of these kinds of problems because it's easy to make a mistake.
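To make the "default case" idea concrete, here is a rough sketch of what failing loudly could look like. This is purely hypothetical: the format codes, class name, and method are invented, and I have no idea what the gateway's code actually looks like or what language it is written in. Whether throwing is actually the better outcome depends on what the caller does with the exception, as discussed above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical translator, only to illustrate the default-case trade-off.
class DateTranslator {
    // Invented format codes for the sketch.
    static final int COMPACT = 1; // e.g. 20190530
    static final int DASHED = 2;  // e.g. 2019-05-30

    static String translateDate(LocalDate date, int formatCode) {
        switch (formatCode) {
            case COMPACT:
                return DateTimeFormatter.BASIC_ISO_DATE.format(date);
            case DASHED:
                return DateTimeFormatter.ISO_LOCAL_DATE.format(date);
            default:
                // "This should never happen": reject the message instead of
                // passing an untranslated value downstream.
                throw new IllegalArgumentException("Unknown date format code: " + formatCode);
        }
    }
}
```

Here a garbage format code stops the message at the translator rather than producing a payment message with a malformed date, which just moves the question to who catches the exception.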