Nice to see how Faster Payments actually works, explained in an understandable way.
It does sound weird that it was caused by a corrupted date formatter. I'm assuming it's something like the formatter resetting itself back to the language default.
Yeah, instead of `20190530` we were getting `['bert', 'time', 1559, 238096, 0]`. My understanding is a race condition put the code into a situation where it couldn't determine how to format that field, and we just got a default string representation of the underlying value.
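For the curious, here's a minimal sketch of how that raw value maps back to the expected date, assuming the list is a BERT-encoded Erlang-style `{MegaSecs, Secs, MicroSecs}` timestamp (that interpretation and the class name are my own illustration, not the Gateway's actual code):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class BertTimeSketch {
    public static void main(String[] args) {
        // The numeric parts of ['bert', 'time', 1559, 238096, 0], read as
        // an Erlang-style {MegaSecs, Secs, MicroSecs} timestamp.
        long megaSecs = 1_559;
        long secs = 238_096;
        long microSecs = 0;

        long epochSeconds = megaSecs * 1_000_000L + secs; // 1559238096
        Instant instant = Instant.ofEpochSecond(epochSeconds, microSecs * 1_000L);

        // The yyyyMMdd form the payment message should have contained.
        String formatted = DateTimeFormatter.ofPattern("yyyyMMdd")
                .withZone(ZoneOffset.UTC)
                .format(instant);

        System.out.println(formatted); // 20190530
    }
}
```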
That reminds me of a race condition I encountered a while ago:
There was this massive Java application processing hundreds of parallel requests per second. For each request it wrote a line with billing information into a log file. Those lines looked fine at a quick glance, but when we tried to process them later we encountered invalid dates. The log records contained future dates as well as impossible dates like 2019-02-30. Long story short: in the end we figured out that this was caused by the date formatting not being thread-safe (might have been SimpleDateFormat, but I don't remember the details anymore), so the date components of multiple threads got interleaved. Ouch, I guess somebody learned a lesson back then.
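For anyone who hasn't been bitten by this, a minimal sketch of the failure mode and two safer alternatives (not the actual code from that application, and assuming it really was SimpleDateFormat):

```java
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Date;

public class LogDateFormatting {
    // BROKEN: SimpleDateFormat keeps mutable internal state, so sharing one
    // instance across request threads lets the date components of concurrent
    // calls interleave, producing garbage like 2019-02-30.
    private static final SimpleDateFormat SHARED = new SimpleDateFormat("yyyy-MM-dd");

    static String formatUnsafe(Date d) {
        return SHARED.format(d); // racy under concurrent access
    }

    // SAFER: java.time's DateTimeFormatter is immutable and thread-safe,
    // so one shared instance can be used from any number of threads.
    private static final DateTimeFormatter SAFE = DateTimeFormatter.ISO_LOCAL_DATE;

    static String formatSafe(LocalDate d) {
        return SAFE.format(d);
    }

    // If you're stuck on the old API: give each thread its own instance.
    private static final ThreadLocal<SimpleDateFormat> PER_THREAD =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    static String formatPerThread(Date d) {
        return PER_THREAD.get().format(d);
    }
}
```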
Which platform gives you unformatted data if a format can't be determined? Or was the code similar to getDate(format) and format happened to be null? Was the race condition in your code or in the platform? (EDIT: I suppose these would be questions more for the developers of the third-party gateway.)
> The bug was in a computer program the [third party] Gateway uses to translate payment messages between two formats. When the program was operating under load, the system tried to clear memory it believed to be unused (a process known as garbage collection).
> But because it was using an unsafe method to access memory, the code ended up reading memory that had already been cleared away, causing it not to know how to translate the date field in payment messages.
IIUC, the problem seems to have been that the code was looking into freed memory and so the date format was essentially random data. I can imagine a case statement where you have a default case of "don't translate the date" with a comment over it saying, "This should never happen". I'm sure I've naively written similar code when I was sleepy, and it tends to pass review because it looks innocuous.
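To make that concrete, a purely hypothetical sketch of the kind of fallback I mean; nobody here has seen the Gateway's actual code, and the enum and class names are made up:

```java
enum DateFieldFormat { ISO_BASIC, ISO_EXTENDED, UNKNOWN }

class DateFieldTranslator {
    String translate(String rawValue, DateFieldFormat format) {
        switch (format) {
            case ISO_BASIC:
                return rawValue.replace("-", ""); // 2019-05-30 -> 20190530
            case ISO_EXTENDED:
                return rawValue;                  // already 2019-05-30
            default:
                // "This should never happen" -- but if format is ever garbage
                // (say, derived from memory that was already freed), the raw
                // value passes through untranslated instead of failing loudly.
                return rawValue;
        }
    }
}
```

Whether throwing in that default branch would actually have been better is exactly the trade-off I get into below.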
It's easy to be hard on the programmer -- probably crashing is better than data in the wrong format, but then you are just pushing out the problem to a different layer. Error handling is completely non-trivial in complex systems. Maybe they should have thrown an exception in that case, but are you sure it's going to be handled? What is the downside in that case? It could easily be worse -- we have no way of knowing. Sometimes it gets down to, "Well, you need to make sure there are no mistakes in the code". If we're going to go down that route, then the incorrect timing of the memory freeing is the real cause (or if I'm being particularly nasty I might say, "You really shouldn't be using threads" ;-) ).
I guess what I'm trying to say is that there is certainly a better way of doing defensive programming in this case, but I wouldn't be able to tell what it was without seeing the code. I also wouldn't expect any large codebase to be completely free of these kinds of problems because it's easy to make a mistake.
Was the original BERT formatting from the Hub or from something inside the Gateway's network? I always associated BERT with Erlang; do you know if Erlang was involved, or if it was something else?
Cheers for this blog post, by the way. It was really informative about the issue, and about how FPS works.
It seems odd that it took them so long to identify and rectify the issue. Have they given you any assurances about what they'll do to reduce this time or prevent it from happening in future?
Specifically this class of incident? Yes. They seem to be reasonably good at operational runbooks for known flaws, so I can imagine this class of incident being handled faster in future.
What worries me is the next unknown unknown, which is why we are insourcing. One thing I think Monzo can be particularly proud of is our incident response and our ability to debug our own systems at speed.
That's why this article focuses more on what actions we are taking, and not what actions they are taking.
I believe it's PayPort, which is run by Vocalink - who also run FPS itself.
As far as I understand, PayPort was/is the recommended option for all new "direct" connections, though it seems it is also possible to go more directly into the FPS system itself.
I'll hang around here to answer any more technical questions if anyone's interested.