Nice to see how Faster Payments actually works, explained in an understandable way.
It does sound weird that it was caused by a corrupted date formatter. I'm assuming it's something like the formatter resetting itself back to the language default.
Yeah, instead of `20190530` we were getting `['bert', 'time', 1559, 238096, 0]`. My understanding is a race condition put the code into a situation where it couldn't determine how to format that field, and we just got a default string representation of the underlying value.
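For the curious, here's a minimal sketch of how that raw value maps back to the expected date, assuming the list is a BERT-encoded Erlang-style `{MegaSecs, Secs, MicroSecs}` timestamp (that interpretation and the class name are my own illustration, not the Gateway's actual code):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class BertTimeSketch {
    public static void main(String[] args) {
        // The numeric parts of ['bert', 'time', 1559, 238096, 0], read as
        // an Erlang-style {MegaSecs, Secs, MicroSecs} timestamp.
        long megaSecs = 1_559;
        long secs = 238_096;
        long microSecs = 0;

        long epochSeconds = megaSecs * 1_000_000L + secs; // 1559238096
        Instant instant = Instant.ofEpochSecond(epochSeconds, microSecs * 1_000L);

        // The yyyyMMdd form the payment message should have contained.
        String formatted = DateTimeFormatter.ofPattern("yyyyMMdd")
                .withZone(ZoneOffset.UTC)
                .format(instant);

        System.out.println(formatted); // 20190530
    }
}
```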
That reminds me of a race condition I encountered a while ago:
There was this massive Java application processing hundreds of parallel requests per second. For each request it wrote a line with billing information into a log file. Those lines looked fine at a quick glance, but when we tried to process them later we encountered invalid dates. The log records contained future dates as well as impossible dates like 2019-02-30. Long story short: in the end we figured out that this was caused by the date formatting not being thread-safe (might have been SimpleDateFormat, but I don't remember the details anymore), so the date components of multiple threads got interleaved. Ouch, I guess somebody learned a lesson back then.
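For anyone who hasn't been bitten by this, a minimal sketch of the failure mode and two safer alternatives (not the actual code from that application, and assuming it really was SimpleDateFormat):

```java
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Date;

public class LogDateFormatting {
    // BROKEN: SimpleDateFormat keeps mutable internal state, so sharing one
    // instance across request threads lets the date components of concurrent
    // calls interleave, producing garbage like 2019-02-30.
    private static final SimpleDateFormat SHARED = new SimpleDateFormat("yyyy-MM-dd");

    static String formatUnsafe(Date d) {
        return SHARED.format(d); // racy under concurrent access
    }

    // SAFER: java.time's DateTimeFormatter is immutable and thread-safe,
    // so one shared instance can be used from any number of threads.
    private static final DateTimeFormatter SAFE = DateTimeFormatter.ISO_LOCAL_DATE;

    static String formatSafe(LocalDate d) {
        return SAFE.format(d);
    }

    // If you're stuck on the old API: give each thread its own instance.
    private static final ThreadLocal<SimpleDateFormat> PER_THREAD =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    static String formatPerThread(Date d) {
        return PER_THREAD.get().format(d);
    }
}
```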
Which platform gives you unformatted data if a format can't be determined? Or was the code similar to getDate(format) and format happened to be null? Was the race condition in your code or in the platform? (EDIT: I suppose these would be questions more for the developers of the third-party gateway.)
> The bug was in a computer program the [third party] Gateway uses to translate payment messages between two formats. When the program was operating under load, the system tried to clear memory it believed to be unused (a process known as garbage collection).
> But because it was using an unsafe method to access memory, the code ended up reading memory that had already been cleared away, causing it not to know how to translate the date field in payment messages.
IIUC, the problem seems to have been that the code was looking into freed memory and so the date format was essentially random data. I can imagine a case statement where you have a default case of "don't translate the date" with a comment over it saying, "This should never happen". I'm sure I've naively written similar code when I was sleepy, and it tends to pass review because it looks innocuous.
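To make that concrete, a purely hypothetical sketch of the kind of fallback I mean; nobody here has seen the Gateway's actual code, and the enum and class names are made up:

```java
enum DateFieldFormat { ISO_BASIC, ISO_EXTENDED, UNKNOWN }

class DateFieldTranslator {
    String translate(String rawValue, DateFieldFormat format) {
        switch (format) {
            case ISO_BASIC:
                return rawValue.replace("-", ""); // 2019-05-30 -> 20190530
            case ISO_EXTENDED:
                return rawValue;                  // already 2019-05-30
            default:
                // "This should never happen" -- but if format is ever garbage
                // (say, derived from memory that was already freed), the raw
                // value passes through untranslated instead of failing loudly.
                return rawValue;
        }
    }
}
```

Whether throwing in that default branch would actually have been better is exactly the trade-off I get into below.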
It's easy to be hard on the programmer -- probably crashing is better than data in the wrong format, but then you are just pushing out the problem to a different layer. Error handling is completely non-trivial in complex systems. Maybe they should have thrown an exception in that case, but are you sure it's going to be handled? What is the downside in that case? It could easily be worse -- we have no way of knowing. Sometimes it gets down to, "Well, you need to make sure there are no mistakes in the code". If we're going to go down that route, then the incorrect timing of the memory freeing is the real cause (or if I'm being particularly nasty I might say, "You really shouldn't be using threads" ;-) ).
I guess what I'm trying to say is that there is certainly a better way of doing defensive programming in this case, but I wouldn't be able to tell what it was without seeing the code. I also wouldn't expect any large codebase to be completely free of these kinds of problems because it's easy to make a mistake.
Was the original BERT formatting from the Hub or from something inside the Gateway's network? I always associated BERT with Erlang; do you know if Erlang was involved, or if it was something else?
Cheers for this blog post, by the way. It was really informative about the issue, and about how FPS works.
It seems odd that it took them so long to identify and rectify the issue. Have they given you any assurances about what they'll do to reduce this time or prevent it from happening in future?
Specifically this class of incident? Yes. They seem to be reasonably good at operational runbooks for known flaws, so I can imagine this class of incident being handled faster in future.
What worries me is the next unknown unknown, which is why we are insourcing. One thing I think Monzo can be particularly proud of is our incident response and our ability to debug our own systems at speed.
That's why this article focuses more on what actions we are taking, and not what actions they are taking.
I believe it's PayPort, which is run by Vocalink - who also run FPS itself.
As far as I understand, PayPort was/is the recommended option for all new "direct" connections, though it seems it is also possible to go more directly into the FPS system itself.
I'll hang around here to answer any more technical questions if anyone's interested.