I've seen the guts of a few major financial organizations, and there are some common themes regarding their infrastructures.
The one that really sticks out to me as an engineer is the fact that the whole system in most cases seems to be tied together by a fragile arrangement of 100+ different vendors' systems & middleware that were each tailored to fit specific audit items that cropped up over the years.
Individually, all of these components have high-availability assurances up and down their contracts, but combine all these durable components together haphazardly and you get emergent properties that no one person or vendor can account for comprehensively.
When the article says a full reset entails killing the power and restarting, this is my actual experience. These complex leviathans have to be brought up in a special snowflake sequence or your infra state machine gets fucked up and you have to start all over. When dependency chains are 10+ systems long and part of a complex web of other dependency chains it starts to get hopeless pretty quickly.
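To make the ordering problem concrete, here's a minimal sketch of deriving a bring-up sequence from declared dependencies with a topological sort. The service names and the dependency map are made up for illustration; nothing here reflects any real exchange or bank.

```python
from graphlib import TopologicalSorter

# Hypothetical services: each key maps to the set of services
# that must already be running before it can start.
DEPENDS_ON = {
    "storage": set(),
    "database": {"storage"},
    "message_bus": {"storage"},
    "matching_engine": {"database", "message_bus"},
    "market_data_gateway": {"matching_engine"},
}


def startup_order(depends_on):
    """Return one valid bring-up order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(depends_on).static_order())


order = startup_order(DEPENDS_ON)
print(order)
```

The catch in real life is that nobody has the complete, accurate dependency map to feed into something like this, which is exactly why the sequence ends up being tribal knowledge.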
I’ve worked in a lot of banks, and other similar organisations, and the truth is that big enterprise just sucks at compliance. You can approach compliance frameworks in two ways: you can put a lot of effort into designing your infrastructure and compliance testing methodology (I would suggest Amazon as a canonical example of how to do this well), so that it performs well and meets all the requirements, or you can take out the sledgehammer, implement a new control for every occasion, and create a cumbersome bureaucracy.
The good design approach is obviously superior in many ways. But the downside of it is that you have to trust the competency of a lot of different business units to maintain it. A cumbersome bureaucracy on the other hand ensures that incompetent/lazy/low-initiative internal actors can’t impact your compliance. If you fail, at least you fail in a compliant and auditor-approved way.
That said, a lot of the failures I’ve seen in organisations like this stem from siloed expertise. People don’t know that much about the systems that are outside their remit, so they will make changes that impact connected systems in ways they failed to imagine. As an example, I have seen 3 separate banks have non-trivial service disruptions stem from the same independently made mistake: a person enabling debug logging on the SIP phones. The traffic DoSes their networks, and all of a sudden, core network infrastructure starts to die. Afterwards they send the right reports off to the right people, make the correct adjustments to the bureaucracy, and proceed with their compliance intact.
There is also a 3rd way to approach compliance, via negligence. But the more you are in the regulatory spotlight, the less of an option that is.
> that were each tailored to fit specific audit items that cropped up over the years.
Or worse, "compliance" line items that some tool or some company identified in their cookie-cutter processes. As long as that line item goes away, no one really cares what the long-term implications are.
Yup. This is why you're seeing rapid adoption of event-based architectures. You can more confidently surmise the larger system state in a well-designed system.
There are so many trade-offs within this statement that I felt like it deserved some color:
* Spinning up an event-based architecture is prone to the same issues GP describes - For example what if you spin up a 'pub' side without the corresponding 'sub' side?
* Event-based architecture does not inherently give you better observability of the whole system. I would argue it's actually worse to begin with because of the decoupled nature of the services. You have to build tooling to monitor traces across services or cross-system health.
* Any 'well-designed' system will solve these problems. A monolith with good tooling and monitoring will be more easily understood than a poorly designed event-based system.
* I think metrics and monitoring are really the key to knowing about a system, regardless of the architecture.
Basically, I think you can remove your first sentence and leave the last to make a truer statement. A great system should have design considerations and tools to make it understood at various scopes by various people of various skills. I think people have falsely conflated event-based architecture with systems health due to having to rewrite a system almost from scratch, which results in better tooling since it's something you're thinking about actively.
This. Event-based systems do not satisfy the "passive" part of active/passive dist-sys design - ergo, they are not fault tolerant by themselves, and most machinery designed around making them quasi-fault-tolerant tends to be a bigger headache in the long run (see Apache Kafka).
I've had to explain this to a dozen teams so far in my career and most of them go ahead with the design anyway, regretting it not even 6 months later.
What's so bad about Kafka? I've only ever used mature Kafka systems that were already in place, so I don't know what the teething issues are from an operational or development perspective.
Kafka persists events locally, which when mishandled can cause synchronization issues. If an event-based system has to cold-restart, it becomes difficult if not impossible to determine which events must be carried out again in order to restart processes that were in progress when the system went down.
This is a characteristic of all event-based systems, but persistence-enabled event systems (such as Kafka) make it even harder because now there are events already "in flight" that have to be taken into account. Event-based systems that do not have persistence (and thus are simply message queues used as a transport mechanism) have a strong guarantee that _no_ events will be in-flight on a cold-start, and thus you have an easier time figuring out the current overall state of the system in order to make such decisions.
The only other way around this is to make every possible consumer of the event-based system strongly idempotent, which (in most of the problem spaces I've worked in) is a pipe dream; a large portion certainly can be idempotent, but it's very hard to have a completely idempotent system. Keep in mind, anything with the element of time tends not to be idempotent, and since event systems inherently have the element of time made available to them (queueing), idempotency becomes even harder with event-based systems.
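For anyone unfamiliar with what "idempotent consumer" means in practice, here's a minimal sketch, assuming every event carries a unique ID (the class, the IDs, and the balance example are all hypothetical). Replaying an event that was already applied has no effect.

```python
class IdempotentConsumer:
    """Applies each event at most once, keyed by a unique event ID."""

    def __init__(self):
        self.balance = 0
        # In a real system this set would have to be persisted atomically
        # together with the state it protects; here it only lives in memory.
        self.seen_ids = set()

    def handle(self, event_id, amount):
        if event_id in self.seen_ids:
            return False  # duplicate delivery, ignore it
        self.balance += amount
        self.seen_ids.add(event_id)
        return True


consumer = IdempotentConsumer()
# "e1" is delivered twice, as can happen when events are replayed after a restart.
for event_id, amount in [("e1", 100), ("e2", -30), ("e1", 100)]:
    consumer.handle(event_id, amount)
print(consumer.balance)  # 70
```

Note that this only works because the effect depends on nothing but the event itself. Once the outcome depends on when the event is processed (prices, timeouts, expiries), deduplicating by ID is no longer enough, which is the hard part described above.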
A rule of thumb when I am designing systems is that a data point should only have one point of persistence ("persistence" here means having a lifetime that extends beyond the uptime of the system itself). Perhaps you have multiple databases, but those databases should not have redundant (overlapping) points of information. This is in the same spirit as "source of truth", but that term tends to imply a single source of truth, which isn't inherently necessary (though in many cases, very much desirable).
Kafka, and message queues or caches like it (e.g. Redis with persistence turned on), break this guarantee - if the persistence isn't perfectly synchronized, then you have, essentially, two points of persistence for the same piece of information, which can (and does) cause synchronization issues, leading you into the famously treacherous territory of cache invalidation problems.
As with most technologies, you can reduce your usage of them to a point that they will work for you with reasonable guarantees - at which point, however, you're probably better off using a simpler technology altogether.
>Basically, I think you can remove your first sentence and leave the last to make a truer statement.
+1, not sure how an event-driven arch would have mitigated anything here. A well-designed system (with redundancy where necessary, better tooling, monitoring, scalability, etc.) would.
So the sub side boots up, says it is ready, accepts an event, then power goes down.
You reboot the sub side. The event never gets processed because pub already sent it and recorded this fact.
Or, the sub side doesn't reboot, the pub side does. The pub side accepts an event for publishing, sends it to the sub side, and promptly loses power.
The pub side reboots and either it resends the event and the sub side receives the event twice (because the pub side didn't record that it had already sent it before power was lost), or it doesn't resend the event and the sub side never receives it (because the power loss killed the network link while the packet was on its way out).
If you think you can make these and other corner cases go away with a simple bit of acknowledging here and there, good luck!
The corner cases can be solved, but it's not half as simple as "wait in a queue and be consumed when ready".
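Here's a toy simulation of the pub-side crash described above, just to show the trade-off. It is a deliberately simplified model (one event, one crash, no network loss), and the function and parameter names are mine, not from any real broker.

```python
def deliveries_seen(record_before_send, crash_after_first_step):
    """Simulate one publish attempt plus a restart.

    Returns how many times the subscriber receives the event.
    """
    recorded_as_sent = False
    deliveries = 0

    # First attempt: the publisher does two steps in some order
    # and may lose power after the first one.
    if record_before_send:
        recorded_as_sent = True
        if not crash_after_first_step:
            deliveries += 1
    else:
        deliveries += 1
        if not crash_after_first_step:
            recorded_as_sent = True

    # After the restart, the publisher resends anything not recorded as sent.
    if crash_after_first_step and not recorded_as_sent:
        deliveries += 1
    return deliveries


print(deliveries_seen(record_before_send=True, crash_after_first_step=True))   # 0
print(deliveries_seen(record_before_send=False, crash_after_first_step=True))  # 2
```

Record first and a crash loses the event (at-most-once); send first and a crash duplicates it (at-least-once). Neither ordering gives you exactly one delivery on its own, so something else in the system has to absorb the difference.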
I think that's just unfair. Synchronous system finishes a task, tries to change state, power goes out, system is in an inconsistent state. If you think you can fix this with a couple of write-ahead logs and a consistency checker, good luck! That's how you sound.
> If you think you can fix this with a couple of write-ahead logs and a consistency checker, good luck! That's how you sound.
I'm literally saying it can't be fixed with a simple solution, in the parent comment and other comments, so I'm not sure where you get the idea that I'm saying it can.
Dealing with inconsistent states from failures in a distributed system is solvable but it's not simple unfortunately. It's not even simple to describe why.
All 100 services should have end-to-end integration testing, and any change made to that chain of tooling should have to run through a massive integration test. If anything fails, the change is no longer acceptable.
You can comment this on any outage that ever happens. "Why didn't they have a test for that?"
The answer is that tests are never perfect. If you want to create an integration environment that mimics prod, you have to fork an entire parallel universe into your integ environment to run the test. Anything else will diverge from the reality of the future.
Even if every vendor's service or hardware had integration tests, that doesn't mean that the integration tests covered every case. It doesn't mean there's not an emergent property of two systems behaving in a slightly unexpected way that turns into a catastrophic result.
It's not necessarily even possible to have two copies of some of the systems; who knows how expensive a given vendor's hardware box is.
It's definitely not possible to exactly mimic future traffic. Perhaps it works in the test environment, then in the prod environment the requests are different, so it fails.
Hardware errors happen, and integ testing those is difficult to say the least.
"A data device critical to the Tokyo Stock Exchange’s trading system had malfunctioned, and the automatic backup had failed to kick in. It was less than an hour before the system, called Arrowhead, was due to start processing orders in the $6 trillion equity market. Exchange officials could see no solution."
You know, just from a human perspective, talk about a bad day.
It's like seeing the cruise liner heading for the port too fast, knowing it is going to crash and cause immense damage, and realizing there is absolutely nothing that can now be done to prevent the damage.
A large industry is set up, with how many [thousands] of people employed and expected to execute on their jobs related to this market, and they just lost a whole day of doing their job.
So you have the opportunity costs for those people being paid to do nothing for a day and generating nothing in the form of revenue. Rightly or wrongly, a significant amount of profit generated in the stock market is from short term moves / games, not from buying and holding for long periods.
Second, and I don't know if this is the case, but what about companies that were issuing equities to raise cash on that day? They had geared up entire investment banking teams, with customers orders lined up, and had a whole game plan for executing the offerings.
That all went up in smoke and they will have to re-prepare everything for another day. That's true of IPOs, general follow-on offerings, to some extent ATM programs, etc.
Oh, my biggest worry was futures expiring, but I hope nobody thought to switch over to a new system on the same day, and same with an IPO/2ndary offering.
> How much was lost by being unable to sell something today that you weren’t willing to sell yesterday?
Could be a lot. Markets are not disconnected. If only Japan existed, sure. Some orders would no longer exist, but nothing major, because no one else would be trading anyway.
But suppose this was during an economic downturn. You are trying to SELL SELL SELL because all countries worldwide are feeling some economic pressure and you can't because the exchange is down.
The next day, whatever assets you had are now worth a fraction of what they were one day before.
Oh, and there are some financial instruments that expire.
> You are trying to SELL SELL SELL because all countries worldwide are feeling some economic pressure and you can't because the exchange is down.
Some exchanges deliberately close in those circumstances. Or when big news is about to be released.
But as much as people were unable to sell, there are nearly as many situations where being unable to sell is a good thing.
I guess we’re eventually going to argue for 24/7 stock exchanges and for some reason very few are.
My own experience in Canada is that a lot of big Canadian stocks also trade in NY, and when one market is closed and the other isn’t because of different holidays, not a lot happens.
I guess an unexpected crash is different, but my guess is that everyone takes the day off, avoids releasing any big news out of respect for the situation and gets back to it tomorrow.
No need to pull 24/7 uptime into the conversation, I would say. The impact is that the exchange was not open when it was expected to be open. An outage like that is more disruptive than scheduled "downtime" because it is unplanned.
Except for FX, there is no need for 24/7 and plenty of drawbacks, both in terms of overhead and execution quality.
In fact the trend is in the opposite direction, towards shorter trading hours as well as more turnover in the opening and closing auctions.
Shorter hours especially would be a huge win for work-life balance and diversity in the industry, and just as importantly for reducing costs.
Having the markets open longer only spreads liquidity thinner across the day, and moreover having announcements made, corporate actions processed, etc. is better done when the market is closed so everyone can be on the same page when they happen.
Stock exchanges earn a major share of their profit from processing orders.
They just lost at least a whole day of revenue.
After compiling some numbers from their annual report[1], I would say the loss from TSE revenue alone is roughly 2 million USD. But considering the effect trickling down the revenue stream to other securities partners whose revenue depends on earning transaction fees, I would say the real damage would be many times that 2 million loss from TSE.
More or less by definition, this won't impact their yearly bottom line by more than 0.3% (at least to within first order). Which is quite a bit, but it still seems super minor. Just like 2 million. Isn't that peanuts for a stock exchange?
Agreed that 2 million USD is not much compared to total revenue.
However, you see, TSE YoY growth from 2017 to 2018 is roughly 4 million USD. Growth is hard as it is for TSE, so leaving money on the table like this definitely hurts.
The exchange lost a lot of revenue. Traders' and investors' time was wasted; you can't just change the open time without screwing up a lot of systems. Risk management would've been an issue too, with a lot of people crossing the spread to derisk and transferring wealth to market makers.
Opening late won't break much. Most systems will be like "well shucks there's no trades to process, better just sit here and wait". Opening early will break everything.
A large number of people suffered a small amount of efficiency loss. The number of utils or QALY or whatever metric you want lost was probably fairly substantial.
The line of reasoning “well, no one died” is a classic example of humans’ unfortunate tendency to round small numbers to zero in utilitarian calculations, which leads to all sorts of bad decisions when you’re dealing with things that affect lots of people in small ways.
If you waste 10 seconds of 220 million people's time, which by my calculation adds up to about 70 years, it is not equivalent to depriving one person of their entire life.
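For anyone who wants to check the "about 70 years" figure, the arithmetic is just people times seconds divided by seconds per year. The only assumption is a 365.25-day year.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds

people = 220_000_000
seconds_wasted_each = 10

total_years = people * seconds_wasted_each / SECONDS_PER_YEAR
print(round(total_years, 1))  # 69.7
```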
That doesn't mean I think anything can't be measured or fit into some sort of framework, I just think there is some fundamental truth in the rounding to zero you refer to, that needs to be accounted for. It's too glib to dismiss it.
The opposite end of the spectrum causes problems too: "It's ok for some people to die, if it results in a small amount of good for a large number of people." One way to look at the problem with that in practice is that small amounts of good and large numbers of people both tend to involve a lot of uncertainty. When we're allowed to do motivated reasoning with enough uncertainty, we can usually prove whatever we want, and that sort of simple utilitarianism ends up being too easy to abuse. I think we can look at cultural taboos as a practical solution to this problem: "Ok, clearly there's no reliable way for us to say what killing is good and what killing is bad, and we all agree that most of it is bad, so let's make the simplifying assumption that all killing is bad."
> A large number of people suffered a small amount of efficiency loss. The number of utils or QALY or whatever metric you want lost was probably fairly substantial.
If the market closing is so bad for its participants, why aren’t they lobbying to keep it open past 4PM? Or on Saturdays?
We can use depository receipts and similar strategies to keep trading 24/7. But the markets would probably be a lot more efficient if they were open all the time. That’s not to say that we (traders) would necessarily make much more money off them that way. We’d probably make less, actually, because it would simplify/remove the kind of arbitrage market opens/closes let us do.
The traders still sleep and stuff. The efficiency loss was time wasted that was scheduled for trading, but not used. It's no different from unexpected downtime in any other job.
I guess I question why exchanges haven't moved toward a highway or hospital model where it's open, unless it really needs to be closed. Those places have the procedures in place to operate after hours.
Or at least adopt the retail model where they get some more retail traders by operating some weeknights/weekends.
FX trading is at least 24/5, there may be venues trading on the weekend but I expect spreads are very wide. CME is open 23/5 already. EUREX has extended their trading hours into the Asia shift in the past few years, I think 20/5. Many other European and Asian derivatives exchanges have extended trading hours over the past decade.
Most equity exchanges are only open or liquid during local business hours. US markets are open longer, but generally illiquid and there are fewer protections outside of "market" hours.
I think for equities, nobody wants to be a market maker 24/7 without a sufficiently wide spread to protect them from news events such as the death of an executive. So even with 24/7 markets you'd probably see very poor conditions outside of core hours. With derivatives, there's more interest in never sleeping. The world is a very interconnected place and less hinges upon a single person's death.
Leaders frankly accepting responsibility is known as the Asoh Defense[0], named after Capt. Kohei Asoh, who accidentally landed a DC-8, JAL Flight 2, in San Francisco Bay.
Depending on your POV, financial exchanges are a great/awful example of the "behind the surface" complexity of modern life. You'd think with only a few order types and not that many tickers you could stand up an exchange using rust in a few nights and weekends, no? fsync liberally, pay colin his tarsnap dues, and off you go! /s
Trading systems are generally regarded to be some of the most challenging and simultaneously unrewarding type of engineering out there. I don't think there is anyone, even on HN, who will look at the NYSE platform and go "I could build that in a weekend".
Yeah, you probably could. The problem is that major exchanges like NYSE or CME have hundreds of order types and absolutely staggering volume. Not to mention the billions of dollars on the line as well.
I wonder if a solution such as "Cancel all orders and publicly announce this as well as trading commencement time such as +1 hours ahead, then reboot the server just before that and resume the day as normal" was considered, and if so why it wouldn't have worked, given it seems to satisfy the constraints mentioned in the article
If a system supports "Good 'Til Canceled" orders, casual application of "cancel all orders" will wipe out orders that are months old. Maybe that is called for in an extreme case (data center destroyed, offsite backups cannot be recovered), but at a minimum it is extremely rude. GTC orders even live through corporate actions, getting repriced as appropriate. They treat that order lifetime seriously.
It is not actually a very bad thing for the whole market to go down for everybody at the same time. What is bad is for part of the market to go down, or for it to go down for some people but not others.
This failure highlights how frequent, scheduled testing of your failover system is needed in order to be able to say, honestly, that you even have a failover system, and not just another box burning power and doing nothing. When you have a choice, it is often better to have both systems running all the time, at less than half capacity and sharing the load; or each doing all the work, and throwing away half.
If you choose the former, traffic can sometimes peak over half capacity, without loss. If the latter, you can check that they are producing the same answers, too.
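The "each doing all the work, and checking the answers" variant can be sketched in a few lines. This is only an illustration under made-up assumptions: the two replica functions stand in for independent systems, and a real setup would have to decide what to do on disagreement rather than just raising.

```python
def replica_a(order):
    # Hypothetical primary: notional value of an order in integer units.
    return order["price"] * order["quantity"]


def replica_b(order):
    # Hypothetical second system computing the same thing independently.
    return order["quantity"] * order["price"]


def process_on_both(order, replicas):
    """Run the same request on every replica and refuse to answer if they disagree."""
    results = [replica(order) for replica in replicas]
    if any(result != results[0] for result in results[1:]):
        raise RuntimeError(f"replicas disagree: {results}")
    return results[0]


print(process_on_both({"price": 101, "quantity": 5}, [replica_a, replica_b]))  # 505
```

The point is that the second box is exercised on every single request, so you find out it is broken long before you need it.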
Every person's profit would have been somebody else's loss. They match exactly, less transaction costs which were saved by both sides. Of course somebody else didn't collect on the transaction costs.
The exchange obviously does not get to collect transaction fees for transactions it fails to mediate. But that does not harm buyers or sellers. As noted, they get a vacation from paying fees, along with avoiding any profits and losses.
The exchange suffers, paying salaries and rent without income, but they deserve it for providing crappy service.
The interesting aftermath of this situation is that the top management of the Tokyo Stock Exchange (東証) as a company understands the technical situation well. That's not the case for most companies in Japan, as most of them simply outsource their systems, but 東証 did not.
Meanwhile, Japanese so-called journalists are completely blind to technology. Even worse, they lack basic literacy or listening skills: they ask about things they have already been told, or fart out a question like "But computers don't break, do they?" when the cause is likely to be a soft memory error.
Sounds very much like a fault plus a bad HA implementation took down the market. Couldn't care less about the fault. I'd really like to hear about the bug(s) in the HA software.
If you don't regularly test your failover, chances are it will not kick in when the primary fails. Especially if the primary is very reliable. Very common pattern.
Ideally, you periodically test your ability to failover. But if it doesn't work, well, there's a chance that you just caused a user-facing outage with your test.
Active->standby failover is usually that way because it doesn't support active->active. Which likely means that the active->standby system is bolted on with some 3rd party technology that isn't integrated with the application.
I've been involved with a number of failover systems where even when it worked there was the possibility that you might hit a _known_ condition that causes the fail-over to fail.
Pretty scary knowing the product you're working on has a couple of critical holes in the fail-over that management papered over, which, while rare, could happen. A lot of these solutions are the equivalent of "pull the power on one machine, move the disk to the other, and power it on". The assumption being that the storage mirror/replication/etc. being used to maintain transactional consistency for the "move the disk" part is actually going to be consistent when that happens.
This happened (or at least, was detected) an hour before trading opened. It should have failed over then. To me, that means that you could validly test your failover an hour before trading opened (or, perhaps wiser, an hour after trading ended). If it doesn't work, you learned without causing a user-facing outage.
The system is online, publishing market data and accepting orders before trading starts. The exchange initially announced a delay in opening, and only later announced staying down for the day. It doesn't sound like they had the option to do what you described. The staff trying to resolve the problem were presumably doing everything they could before giving up for the day.
Failover should be tested, but what should be tested? There are many components and possibly innumerable failure causes, so 100% confidence isn't possible.
I have worked with Fujitsu in Japan before. I could see the following as one way it could happen.
There is a large disconnect with large system integrators there and the actual developers / architects; usually 2 or 3 levels of subcontracting. This means there are 2 to 3 levels of intermediaries taking a margin too, and as such, the actual developer doesn't have the fiscal wherewithal to push back on requirements or deadlines too strongly. (How do you save money on a 300k yen salary in Tokyo?)
If I was to guess, the requirement was there, but by the time someone technical got into the real nitty gritty, they discovered that the timeline was too tight to effectively do the testing. Instead of pushing back, they just rushed it through with a lot of overtime work.
Also, though maybe this is unfair, Japanese culture doesn't work very well for software development. I worked with a couple on a project and they were absolutely unwilling to ask for help, ask questions, or proactively fix problems. Of course this is a very small sample, but I've heard very similar things from a friend who worked at a Japanese conglomerate for a couple of years in Japan.
They treat a mistake/bug/problem as a failure of the developer or engineer, not something to proactively look for and take as a lesson for improving both the engineer and the code.
"When the error happened, the system should have carried out what’s called a failover -- an automatic switching to the No. 2 device. But for reasons the exchange’s executives couldn’t explain, that process also failed. That had a knock-on effect on servers called information distribution gateways that are meant to send market information to traders."
I would be very curious to hear about your favorable experience with untested failover systems. My last job we tested our active/standby failover like clockwork every 3 months. It served us very well.
My experience is that they worked when needed and it was a sigh of relief, and a little pat on the back that we perceived reality correctly when we had the foresight to set them up.
It's sort-of a saying in the industry that can be applied to many things. The biggest problem is the people who might know how to make it work are no longer there or have forgotten. If it's automated there's a chance it might work, but if it's slightly off then it's even harder to comprehend and adjust.
I've seen backups and failovers not work so many times that it's an amazing surprise when one actually works--usually after laborious manual off-script intervention and invention. I was being somewhat kind to failovers, I have seen smallish backups work on several occasions. An untested disaster recovery is the worst.
Edit: think of an unexercised procedure as not "It works for me" but rather "it worked for me once."
I don't think failover would fail in every failure scenario; it just didn't work in this particular one. We can't test all possible failure situations regularly.
Someone speculated that since this was a RAM failure, the storage in question might have kept crawling along with reduced RAM and a full workload, so failover didn’t kick in (it either failed from the high load, or didn’t trigger because the faulty RAM was successfully isolated). Sounds plausible.
RAM failures are fun. Maybe 15 years ago, I had one box reboot itself and come back up with only ~16 MB out of an expected 1 GB or so; it was pretty overprovisioned and was able to sort of keep up working from swap, but was sending alarms on a system that was always quiet (near-idle CPU normally; an important but low-throughput system). More recently, I've had a couple of systems with so high a rate of correctable ECC errors that machine check processing ate nearly all the CPU. They still had enough left to pass healthchecks, but not to handle requests in a timely fashion, and they never hit an uncorrectable error that would have panicked the machine so failover could work.
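The "alive but useless" case is the one a plain liveness check misses. A rough sketch of a stricter check, with invented thresholds and metric names (nothing here comes from a real monitoring product), would also fail a node that answers too slowly or is drowning in corrected errors:

```python
def is_healthy(responded, latency_ms, corrected_ecc_errors_per_min,
               max_latency_ms=200, max_ecc_errors_per_min=100):
    """Treat a node as unhealthy if it is alive but too slow or too noisy to be useful.

    The thresholds are placeholders; sensible values depend entirely on the system.
    """
    if not responded:
        return False  # classic liveness failure
    if latency_ms > max_latency_ms:
        return False  # answers, but not in a timely fashion
    if corrected_ecc_errors_per_min > max_ecc_errors_per_min:
        return False  # hardware is limping even though nothing has crashed
    return True


# A box that still answers the healthcheck, but slowly and with an ECC error storm.
print(is_healthy(responded=True, latency_ms=4500, corrected_ecc_errors_per_min=20000))  # False
```

Whether failing such a node over is the right call is its own judgment, since an overly eager check can trigger failovers you didn't want.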
Shouldn't something like Lasp[1] help to build better distributed and fault-tolerant systems of this scale? Or using the purer programming languages and frameworks, like Jane Street with OCaml, Standard Chartered with Haskell.