This is a well-written post-mortem for public reading. I encourage people to read through it.
Being someone who also works in the payments space currently, relying on gateways, I have gone through several similar outages, where we detected a gateway issue causing an outage, notified the gateway who ack’d... and then we waited. More than one time, like Monzo, we built a workaround on our end, before the gateway provider could even mitigate the outage.
Hats off to the Monzo team, who clearly have a solid on-call and incident mitigation strategy in place. They detected the outage within 4 minutes, built the best workaround they could and deployed it within 2 hours, while the gateway provider took 9 hours just to mitigate the change that caused the issue in the first place. Granted, the issue seemed complex, but this is still slow.
Unfortunately, in cases like this, the best one can do is make sure there is a clear SLA in place with the third party, with a contract stating financial liability in case the third party fails to meet this SLA. Monzo will not tell us much about this part, but I suspect the gateway will have to pay a hefty fee to Monzo, as their availability dropped to under 99% for this month, which should trigger payments/fee reductions from the third party with a well-written contract. It is good to see they are pushing the third party to carry out a proper post-mortem and take preventive action, as well as holding them accountable.
What I think is fascinating is not just that we applaud Monzo for this, but that we allow other important services that control our lives to get away with revealing nothing about what happened or what they've changed to prevent it. Can you imagine any large bank (for the US, say JP Morgan Chase, Citi, Bank of America, etc.) putting out a note with this level of transparency, accountability, and clear direction to change?
It happens all the time, and they just don't tell you. We get automated notices when banks connect and disconnect from the Faster Payments network, something happens every few days. Not always this length, but occasionally.
Just yesterday a major high street bank stopped sending payments for an hour, and was telling customers on Twitter that there were no problems.
Hell, the central system (what I called the Hub in this article) had a 12 hour split brain meltdown last July which had banks emailing each other spreadsheets back and forth for two weeks afterwards.
Emailing each other spreadsheets for two weeks is nothing! Try a year of it combined with angry telephone calls between various 'heads'.
Good write-up and very lucid writing to a wide audience. As a new Monzo customer, the customer service I've received has been excellent and far exceeding any service - in retail banking - from a legacy bank that I've had, just to pop that in.
Various banking (sub)systems break all the time. We just never get any postmortems or public apologies.
This reminds me of the saying "Never admit a wrongdoing and you'll never be wrong".
It's great that we get several of those new startup banks (Monzo, N26 etc.) that provide superior experience and slowly show what horrible things traditional banks were getting away with.
Some months ago I transferred some money between two accounts at different banks. The money arrived _twice_ in two different accounts of mine at the destination bank. I contacted them and asked them to cancel the second transfer, but they just told me it would get automatically denied for insufficient funds at the origin. They also warned me that if it happened again my account might have restrictions placed on it.
An apology would have been nice, but I suppose unwarranted threats are more in character.
Clear, detailed but accessible, plans in place moving forward, apology read sincerely, providing support to affected customers immediately, and answering follow up questions to technical users who are interested in more detail.
A+ job on handling the unfortunate situation, Monzo.
We can only hope more companies follow this great example.
Every time I see a company say "we're sorry" I can't help but think about the South Park episode where the BP CEO says "we're sorry!" It's either that or the one with the Time Warner employees with the nursing flaps in their shirts.
Personally, I prefer something written in a relaxed style rather than a formal-voice only in most cases, especially for blog posts.
Formal only generally comes across, to me, as cold and distant. Great for a persuasive essay or other mediums where you want to remove the topic from the author, not so great for communicating with your audience and wanting to come across as sincere.
If anything, a strict formal-voice only blog post would come across, to me, as contrived.
I'm not advocating for formal over relaxed. I find blog posts work best in a relaxed style from the first person singular. This sort of apology should come from an individual at the top, and make reference to the whole team. You're probably right about personal differences though.
Not to take away from their good communication, but they are still relatively small. If they keep this up while growing, then they might have some secret sauce.
Nice to see how Faster Payments actually works, explained in an understandable way.
It does sound weird that it was caused by a corrupted date formatter. I'm assuming it's something like the formatter resetting itself back to the language default.
Yeah, instead of `20190530` we were getting `['bert', 'time', 1559, 238096, 0]`. My understanding is a race condition put the code into a situation where it couldn't determine how to format that field, and we just got a default string representation of the underlying value.
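For anyone curious how the two values relate, here is a minimal sketch. It assumes the garbled value follows the BERT time term layout (`bert`, `time`, megaseconds, seconds, microseconds) and that the date is taken in UTC; neither is confirmed by the Gateway provider, and `bert_time_to_yyyymmdd` is a hypothetical helper, not anything from their code.

```python
from datetime import datetime, timezone


def bert_time_to_yyyymmdd(term):
    """Convert a BERT-style time term to a YYYYMMDD string (UTC assumed)."""
    tag, kind, megaseconds, seconds, microseconds = term
    if (tag, kind) != ("bert", "time"):
        raise ValueError(f"not a BERT time term: {term!r}")
    # Megaseconds and seconds together give whole seconds since the Unix epoch.
    epoch_seconds = megaseconds * 1_000_000 + seconds
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    moment = moment.replace(microsecond=microseconds)
    return moment.strftime("%Y%m%d")


print(bert_time_to_yyyymmdd(["bert", "time", 1559, 238096, 0]))  # 20190530
```

Read that way, the raw value carries the right date; it just never went through the formatting step.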
That reminds me of a race condition I encountered a while ago:
There was this massive Java application processing hundreds of parallel requests per second. For each request it wrote a line with billing information into a log file. Those lines were fine from a quick glance, but when we tried processing them later we encountered invalid dates. Those log records contained future dates as well as invalid dates like 2019-02-30. Long story short: In the end we figured out that this was caused by the date formatting not being done thread safe (might have been SimpleDateFormat, but I don't remember the details anymore), causing the date components of multiple threads to get interleaved. Ouch, I guess somebody learned a lesson back then.
Which platform gives you unformatted data if a format can't be determined? Or was the code similar to getDate(format) and format happened to be null? Was the race condition in your code or in the platform? (EDIT: I suppose these would be questions more for the developers of the third-party gateway.)
> The bug was in a computer program the [third party] Gateway uses to translate payment messages between two formats. When the program was operating under load, the system tried to clear memory it believed to be unused (a process known as garbage collection).
> But because it was using an unsafe method to access memory, the code ended up reading memory that had already been cleared away, causing it not to know how to translate the date field in payment messages.
IIUC, the problem seems to have been that the code was looking into freed memory and so the date format was essentially random data. I can imagine a case statement where you have a default case of "don't translate the date" with a comment over it saying, "This should never happen". I'm sure I've naively written similar code when I was sleepy and it tends to pass review because it's innocuous.
It's easy to be hard on the programmer -- probably crashing is better than data in the wrong format, but then you are just pushing out the problem to a different layer. Error handling is completely non-trivial in complex systems. Maybe they should have thrown an exception in that case, but are you sure it's going to be handled? What is the downside in that case? It could easily be worse -- we have no way of knowing. Sometimes it gets down to, "Well, you need to make sure there are no mistakes in the code". If we're going to go down that route, then the incorrect timing of the memory freeing is the real cause (or if I'm being particularly nasty I might say, "You really shouldn't be using threads" ;-) ).
I guess what I'm trying to say is that there is certainly a better way of doing defensive programming in this case, but I wouldn't be able to tell what it was without seeing the code. I also wouldn't expect any large codebase to be completely free of these kinds of problems because it's easy to make a mistake.
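To make the trade-off concrete, here is a rough sketch of the "fail loudly" option at a message boundary. It assumes the date field is expected as an eight-digit `YYYYMMDD` string; `parse_payment_date` and `InvalidDateField` are made-up names for illustration, not anything from the Gateway's code. Whether raising is actually the right behaviour depends on what the caller does with the error, which is exactly the hard part.

```python
from datetime import date, datetime


class InvalidDateField(ValueError):
    """Raised when a payment message carries an unusable date field."""


def parse_payment_date(raw) -> date:
    """Parse a YYYYMMDD string, rejecting anything else instead of passing it on."""
    if not isinstance(raw, str) or len(raw) != 8 or not (raw.isascii() and raw.isdigit()):
        raise InvalidDateField(f"expected YYYYMMDD, got {raw!r}")
    try:
        # strptime also rejects impossible calendar dates such as 20190230.
        return datetime.strptime(raw, "%Y%m%d").date()
    except ValueError as exc:
        raise InvalidDateField(f"not a real calendar date: {raw!r}") from exc


print(parse_payment_date("20190530"))  # 2019-05-30
```

A check like this turns silent corruption into an explicit rejection, but it only moves the problem: something upstream still has to decide whether to retry, queue, or reverse the payment.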
Was the original BERT formatting from the Hub or from something inside the Gateway's network? I always associated BERT with Erlang, do you know if it was involved or if it was something else?
Cheers for this blog post, by the way. It was really informative about the issue, and about how FPS works.
It seems odd that it took them so long to identify and rectify the issue. Have they made you any assurances about what they'll do to reduce this time/prevent this from happening in future?
Specifically this class of incident? Yes. They seem to be reasonably good at operational runbooks for known flaws, so I can imagine this class of incident being handled faster in future.
What worries me is the next unknown unknown, which is why we are insourcing. One thing I think Monzo can be particularly proud of is our incident response, and debugging of our own systems at speed.
That's why this article focuses more on what actions we are taking, and not what actions they are taking.
I believe it's PayPort, which is run by Vocalink - who also run FPS itself.
As far as I understand, PayPort was/is the recommended option for all new "direct" connections. Though it seems that it is also possible to connect more directly to the FPS system itself.
This is a perfect post-mortem. Their communication and support has always been really good. I've been using Monzo as my primary bank account ever since they registered as a bank, and I've converted a lot of friends to it. But... over the last year, the iOS app has fallen in quality: long UI freezes, frequent sign-outs with no explanation, silly UI bugs. My non-technical friends have noticed the same issues. It's a real shame.
Agreed. This caused me to leave them for Starling Bank, though I’m considering switching back - I’d rather take a faulty app but good customer support than a good app but no support at all.
Recently updated the iOS app and it's definitely got quite laggy, especially on the pots screen (I only have about 5 pots too). Used to be so nippy as well.
I made the adventurous mistake of upgrading my main iPhone to iOS 13 and the Pots screen just refuses to load - tapping the icon freezes the app. As I keep most of my money in pots, I didn't have any money until I got out an old phone and installed Monzo on it.
@robinson-wall Nice writeup, definitely raises the standards in the banking industry! I have a few questions:
1. Was this post-mortem part of an official process or something of an individual initiative? I saw it published on the blog, but it might be helpful to have this information disambiguated from marketing material on a separate site: https://status.cloud.google.com/summary
2. I'm not sure how payment processors work, but would having multiple payment processors from Monzo's interface make sense from a cost/benefit perspective?
3. Any plans to expand to the U.S. anytime soon, or recommend any banks that follow Monzo's best practices? ;-)
1. A mix of both, we have a culture of being transparent by default - it's one of the first things that attracted me to come and work here. I was the incident lead for this on the day, and volunteered to write up this post-mortem. I did have help from colleagues in the marketing team to try and make this as accessible as possible.
As another poster mentioned, we already have a status page where we post about incidents as they happen (though obviously not in quite as much detail as here). Personally I think our main blog is a reasonable place to have this.
2. Multiple redundant payment processors would be great, but ultimately infeasible. As a settling FPS participant we have to have a single Bank of England settlement account, tied 1:1 to a "bank code". Multiple sort codes map to a single bank code, and migrating sort codes between bank codes is non-trivial.
It'd be great if we could migrate sort codes easily between redundant connections, but as we build our own Gateway we'll have complete control over how our failover mechanisms work. Here's to much greater uptime in the future!
3. As another commenter mentioned - yes! We're just doing staff testing for now, but we've got a waiting list up. It'll be a prepaid product issued by another bank before we get a US banking license, just like we were in the UK a couple of years ago.
> The bug was in a computer program the Gateway uses to translate payment messages between two formats. When the program was operating under load, the system tried to clear memory it believed to be unused (a process known as garbage collection).
> But because it was using an unsafe method to access memory, the code ended up reading memory that had already been cleared away, causing it not to know how to translate the date field in payment messages.
Is that really proper use of the term garbage collection? If you are doing memory management manually, it sounds more like the lack of garbage collection. Unless they were using an unsafe GC for C/C++?
Sounds like they were hanging onto a pointer to an object allocated by GC. For example, in Python/C API if you use a borrowed reference PyObject* after it has gone out of scope and been GC'd.
This is a very well-written postmortem. It’s clear enough that a non-technical customer affected by the outage could understand the explanation, at least at a high level. It’s also detailed enough that a technical person can trace the root cause to a buggy garbage collector in a format transformation function. The whole thing uses clear language with a bare minimum of jargon. Nice work!
Or, rather, unsafe access of memory managed by a garbage collector:
> The bug was in a computer program the Gateway uses to translate payment messages between two formats. When the program was operating under load, the system tried to clear memory it believed to be unused (a process known as garbage collection).
>
> But because it was using an unsafe method to access memory, the code ended up reading memory that had already been cleared away, causing it not to know how to translate the date field in payment messages.
What I still don't understand with bank transfers is: what control is there to ensure that debits and credits are offsetting? Doesn't this rely on the bank being honest? Can't the sending bank just not debit the sender's account?
The important thing to understand is "clearance and settlement". Banks either maintain accounts with each other ("nostro/vostro accounts") or, within a country, at the central bank. So e.g. Halifax and Monzo will have accounts at the Bank of England.
Settlement will either be immediate or delayed. For immediate, at the same time as Halifax is sending a "please credit £10 to Bob" message to Monzo, they will send a message to the Bank of England to transfer £10 between their account and Monzo's.
For delayed settlement, the banks wait until the end of the day, add up the total money sent in each direction, net the two totals off against each other, and transfer only the difference.
A lot of work goes into making sure all the necessary entries line up. So, in the example, if the bank sent a payment message but didn't debit their user's account, either they would have made the central bank transfer (in which case they've lost £10 and effectively given it to their user's account), or they haven't, in which case Monzo will notice and demand payment for the discrepancy.
Banking is eventually consistent, and has been for centuries.
I can speak to how it works for card processing and ACH is most likely similar. To participate in payment processing in the banking network, you have to have a Merchant ID that is tied to a bank account. The processor or gateway is holding a suspense/escrow account on your behalf throughout the day and when a batch of transactions settles, it will resolve the balance difference with your bank account. The amount of payments allowed into or out of your escrow account is set by the processors based on your company's financial health and a risk analysis since if you just debited say $10 million from the escrow account and you only had $5 in your account, the processor would need to collect that debt from you, and they do not have a guarantee that they'll be able to do so. This is how it works for debit cards and bank accounts since $ amounts are real. It's slightly different for credit cards because the $ amount is in a way fictional, so they don't do the escrow holding and just temporarily "allocate" part of the credit limit (this is called an authorization) and when it is settled this is "captured", which enqueues the authorization for future processing. A few days later it will process and be included in a lump sum of funding into your merchant account. This reply is my personal understanding and meant for educational reasons and doesn't represent opinions or viewpoints of any company, and should not be considered advice of any kind and it may be inaccurate.
Bank account balances aren't money, they're IOU's, a record of debt - a bank handing out a statement that shows an account balance of $100 quite literally means the bank saying "we acknowledge that [as of date X] we owe you $100" and nothing else.
The standard money transfer from Bob to Joe is a deal where the bank says "ok, Bob, we owed you $100 but if you want that then we'll now owe that $100 to Joe instead".
It's also worth noting that's just a record of debt not reality - there has to be some legal basis for that transaction to actually change the liability between the bank and the account holder, simply changing the balance in a database doesn't change the amount of debt but just the record i.e. "bank's opinion" of that debt; and if that record/opinion is wrong, then that balance can and will be disputed, and if the dispute can't be resolved otherwise, then it'll be up to courts to decide if that debt is valid or not.
If you record just the credit without the debit, then it's the equivalent of the bank unilaterally agreeing to new debt, the bank asserting that it now owes $100 to Joe just because. It's free to do that, but it would mean that its "books won't balance", i.e. their accounting isn't consistent with itself and doesn't match reality, so to properly account for that transaction they'd have to book a debit to their profit & loss statement, since they lost money by acknowledging that balance increase (i.e. debt) without an offsetting balance/debt decrease to someone else.
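A toy model of the point above, with the usual caveats: this is a sketch, not how any real core banking system works, and `Ledger`, `transfer` and `credit_only` are hypothetical names. Balances are whole dollars the bank owes each account holder.

```python
class Ledger:
    """Toy record of what a bank owes each account holder."""

    def __init__(self, opening_balances):
        self.balances = dict(opening_balances)

    def transfer(self, sender, recipient, amount):
        """Balanced entry: the debt moves from one holder to another."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        if self.balances.get(sender, 0) < amount:
            raise ValueError("insufficient funds")
        self.balances[sender] -= amount
        self.balances[recipient] = self.balances.get(recipient, 0) + amount

    def credit_only(self, recipient, amount):
        """Unbalanced entry: the bank acknowledges new debt with no offset."""
        self.balances[recipient] = self.balances.get(recipient, 0) + amount

    def total_owed(self):
        return sum(self.balances.values())


ledger = Ledger({"bob": 100, "joe": 0})
ledger.transfer("bob", "joe", 100)
balanced_total = ledger.total_owed()    # the bank still owes 100 in total

ledger.credit_only("joe", 100)
unbalanced_total = ledger.total_owed()  # the bank now owes 200 in total
bank_loss = unbalanced_total - balanced_total
print(balanced_total, unbalanced_total, bank_loss)  # 100 200 100
```

The balanced transfer leaves the bank's total liability unchanged; the credit-only entry increases it, and that increase is the amount the bank would have to book as its own loss.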
The sending bank will have their Bank of England settlement account debited by the central FPS system, based on the central FPS system's view of the world, at the end of the settlement cycle.
The recipient bank will receive the money into their settlement account at that time. If the sending bank doesn't debit their customer then both sending and receiving customers will have the money in their accounts, but the sending bank will be out of pocket.
The enforcing entity here is the central bank they both have accounts with. If a customer of Monzo sends £10 to a customer of RBS, then that money never leaves the central bank, and both records are just updated. But the total amount in the central bank must still add up to the accounts of both Monzo and RBS, otherwise there is a discrepancy.
The settlement process through a central bank is a way of ensuring that banks don't need to literally send truckloads of cash to each other at the end of the day.
Monzo says to the central bank, "today I sent RBS £1,500,000", and RBS says to the central bank, "today I sent Monzo £1,200,000". So the central bank just debits Monzo's account with them by £300,000, and credits RBS's account with them by £300,000. The total amount in the central bank remains the same.
So, sure, a bank could claim they sent less money to another bank than they did, but eventually the numbers wouldn't add up, and it would trigger a bucketload of auditing, likely resulting in revocation of banking licenses, and legal issues for both the bank and people involved.
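The Monzo/RBS arithmetic above can be sketched in a few lines. This is only an illustration of the netting idea using the figures from this comment; `net_settlement` is a made-up helper and real settlement cycles involve many more participants and controls.

```python
def net_settlement(gross_flows):
    """Turn gross (sender, receiver) -> amount totals into one net position per bank.

    A negative position means the bank's settlement account is debited,
    a positive position means it is credited.
    """
    positions = {}
    for (sender, receiver), amount in gross_flows.items():
        positions[sender] = positions.get(sender, 0) - amount
        positions[receiver] = positions.get(receiver, 0) + amount
    return positions


flows = {
    ("Monzo", "RBS"): 1_500_000,
    ("RBS", "Monzo"): 1_200_000,
}
positions = net_settlement(flows)
print(positions)  # {'Monzo': -300000, 'RBS': 300000}

# Every debit has a matching credit, so the positions always sum to zero.
assert sum(positions.values()) == 0
```

The zero-sum check is the property the central bank relies on: if the positions reported for a cycle do not cancel out, someone's numbers are wrong.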
There is also a technique involving things called nostro/vostro accounts, where banks have money on deposit with each other, and the sending bank's deposit with the receiving bank is used to cover transfers:
Of course, then they need to keep their accounts topped up, and they can do that by transfers through other banks, which might be central banks or commercial ones. The nostro/vostro system is suitable for use where banks don't trust each other so much, e.g. because they are in different countries. I think it was used more in the past, before reliable central settlement schemes were established, but I'm not sure.
You can think of net settlement as being a bit like nostro/vostro where the accounts have infinite free overdraft facilities, and so the banks never build up a credit balance, and just settle their debts at the end of the day.
The project I'm currently working on has a QA lag of 4-5 days for code to reach production.
I'm seriously impressed they were able to deploy mitigations to production twice in the same few hours, especially given they are a bank (and a small one, at that), and the consequences of fucking up are enormous.
It's been said here many times already, but I'll join those saying "well done" for handling this so well, and for the extraordinary level of transparency!
Maybe this comment was about UK banks which I can't speak to, but we have banks in the US (Ally as one that I'm familiar with and use, but there are many more) that offer high-interest savings accounts at above 2% interest.
Could be keeping a good chunk of their deposited cash in money-market funds (or similar) which are currently paying around 2-2.5%, while providing customers with immediate access to funds with their remaining cash (insulating their customers from the delays associated with buying/selling those funds and so forth).
There's a good writeup by Oliver, our head of engineering, about our tech stack on our blog[1] with an accompanying Kubecon talk[2].
TL;DR- Largely Go microservices running on k8s, with http-based RPC calls for synchronous communication, and kafka for asynchronous communication.
As for sending and receiving of this kind of payment message, they are largely async but it does depend on the payment system we're talking about. When we build our own FPS gateway we're going to have to have something to manage "sessions" (TCP connections) which will block waiting for a response to an individual payment message. Right now our communication with our third party Gateway is via a queue.
We have learnt a lot from you guys as we build out similar systems in India. Thank you for putting this stuff out!
Quick question that I have always wondered about - would you have used something like Uber Cadence (https://github.com/uber/cadence) as the core of your infrastructure if it had been available back then?
I've been using Monzo less and less since I moved to the US due to the cost of topping it up. It's really sad that there is no true equivalent to Monzo here :(
Not related to the outage, but are there any plans to provide banking on PCs instead of just phones, and any plans to provide small business accounts in the future?
I registered my interest a while ago but they haven't given me one yet, so I wouldn't say they offer business accounts, so much as they're going to/are testing this.
Starling does offer business accounts now, but you can only have one Person of Significant Control, i.e. over 25% owner. There is no monthly fee with their offering though, so it's probably the better offer.
Note that Starling will prevent you opening a business account if you've held a personal account in the past and closed it. That was the position I found myself in.
For what it's worth, I've seen some suggestions regarding business account pricing on the Monzo Slack, and there are plans for separate tiers (including a free tier which lacks some of the more advanced accounting integrations).
Sorry, perhaps this isn't very clear as I've tried to simplify the explanation to make it accessible to a wide audience.
What I meant here is they could tell that the corruption was being introduced by some component in their infrastructure, and they were only observing it for messages passing through one of their two active-active sites.
TL;DR: unsafe memory management in a third party's software corrupted dates (under high load, due to garbage collection), causing transactions to fail or get reversed.
Nice work!