
I've been defending Robinhood since the beginning but this is just too much. I will start looking at new brokers this afternoon.

At this point, for me it's not so much about the lost opportunity as it is about what the hell actually exploded so hard that they need to be down for 2 consecutive days. From my perspective, it's looking less and less like some one-off technical or hardware screwup and more like a fundamental limitation of their core architecture.



I actually switched most of my "real" investing to https://www.m1finance.com/ a few months ago. It is also commission-free, but unlike Robinhood you can do IRAs as well as taxable accounts, and it automatically reinvests dividends. One interesting difference is that they limit buys/sells to once daily (twice if you pay a small monthly fee). Day traders won't like this, but for a buy-and-hold investor not trying to time the market to the minute it makes sense to me. As a developer I'd think it has to be easier to code than real-time. I feel like it's essentially a cheaper Betterment/Wealthfront. I keep a taxable "short term savings" account in 90% bonds/10% stocks, and then my IRAs in 10% bonds/90% stocks.

I was a little concerned they were front-loading orders. Still a valid question, but they say no: https://www.m1finance.com/blog/how-m1-makes-money/

Not to be too spammy, but here's a referral link if you want the $10: https://mbsy.co/swlqd


I haven't looked into the technical credibility of this at all [EDIT: I should have more strongly indicated my doubt here], so this isn't any sort of condemnation on my end, but I thought the Leap Day theory was interesting:

https://twitter.com/jtech63/status/1234600045787394048


That was probably a joke, see Robinhood's reply: https://twitter.com/AskRobinhood/status/1234861941413351434

Seems like infra problems.


It's starting to seem like they use some sort of message bus to tie everything together. If the servers actually processing messages can't handle the volume, the entire show comes grinding to a halt. I've seen similar things in other industries where the entire system looks like it's perfectly healthy, and then before you know it 100% of your systems are down because you simply can't push peak message volume or because one participant is infinite-looping messages.
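To make the "can't push peak message volume" failure mode concrete, here is a toy sketch in Python. It is not based on anything known about Robinhood's system: `simulate_backlog`, the tick model, and all the numbers are made up for illustration. It only shows that a consumer pool with fixed capacity looks perfectly healthy until arrivals exceed that capacity, after which the backlog grows without bound.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick, ticks):
    """Return the queue depth at the end of each tick (hypothetical model)."""
    backlog = 0
    depths = []
    for _ in range(ticks):
        backlog += arrivals_per_tick                 # producers publish messages
        backlog -= min(backlog, capacity_per_tick)   # consumers drain what they can
        depths.append(backlog)
    return depths


# Assumed numbers: consumers can process 100 messages per tick.
normal_day = simulate_backlog(arrivals_per_tick=80, capacity_per_tick=100, ticks=5)
peak_day = simulate_backlog(arrivals_per_tick=150, capacity_per_tick=100, ticks=5)

print(normal_day)  # [0, 0, 0, 0, 0]
print(peak_day)    # [50, 100, 150, 200, 250]
```

Every message stuck in that growing backlog gets older, so anything downstream that depends on timely delivery starts failing too, which is how one slow consumer can take the whole system with it.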


Correct, here is their Kafka-based stream processing package that presumably ties everything together: https://faust.readthedocs.io/en/latest/


Their director of infra tried to recruit me and basically said that they wanted to replace every relational data store with message queues instead... that struck me as a bit weird and overzealous.


It's worth noting that they were down 4 years ago on March 2 as well.


I am seeing this parroted everywhere this comes up and it is very unlikely to be the case. For one, we have no idea what that API call is doing. The date parameter could be an upper bound, i.e., it's just asking for all of the most recent data. In addition, Robinhood was up over the weekend, after the leap day occurred, and was even available early on Monday before the market opened.


Robinhood only trades in the US, and the US markets are closed on Saturday and Sunday. That means there were no live orders on those days, which in turn means that certain code paths were not executed and errors were not thrown on Saturday and Sunday, the two days when the problem would otherwise have been detected.
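A quick way to check the calendar side of this argument, as a Python sketch: `first_weekday_on_or_after` is a hypothetical helper that only skips Saturdays and Sundays and knows nothing about exchange holidays.

```python
from datetime import date, timedelta


def first_weekday_on_or_after(day):
    """Return the first Monday-to-Friday date on or after `day`.

    Assumption: weekends are the only closed days; holidays are ignored.
    """
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day


leap_day = date(2020, 2, 29)
print(leap_day.strftime("%A"))              # Saturday
print(first_weekday_on_or_after(leap_day))  # 2020-03-02
```

So the first trading session after the leap day was Monday, March 2, which is consistent with order-handling code paths not running until then.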


There was a discussion last night about this. Some of the more technically competent posters dismissed it because it's unbelievable that a financial platform would roll their own date-time implementation.


I work in finance, and at a previous employer (not Robinhood), I partially rolled a datetime implementation. Mostly it was a wrapper around Boost Date Time [0]. It was a facade that smoothed out the interface and implemented some missing functionality, like a cross-platform strptime and loading of the Olson timezone database.

I spent the better part of a year working on it (along with other things). Modeled the interface after Python's datetime module (which I think is one of the simpler and easier-to-use date-time libraries across the various languages I've used). More than $20B USD trades using that library.

The motivation was the firm used to use RogueWave's date-time facilities, but we moved away after they jacked up the licensing. Think we used to have a site-wide license, but they were moving to a per-core licensing, wanting something like $2K per core annually.

Needless to say, testing was extensive. Overflows were found in Boost in far distant dates, had to work around those. Tested against a ton of historical dates. Tested against Northern and Southern Hemisphere daylight saving time (a lot of people don't realize that Southern is inverted from Northern). Learned a lot about timezones and the history of timezones along the way.

[0] https://www.boost.org/doc/libs/1_72_0/doc/html/date_time.htm...
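For anyone wondering what "tested against a ton of historical dates" can look like in practice, here is a minimal Python sketch. It is not the library described above: `is_leap_year`, `days_in_month`, and `find_mismatches` are hypothetical hand-rolled functions, cross-checked against the standard library's `calendar.monthrange` as the reference for every month in the range the `datetime` module supports.

```python
import calendar


def is_leap_year(year):
    """Hand-rolled Gregorian leap-year rule (the code under test)."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_month(year, month):
    """Hand-rolled month length (the code under test)."""
    if month == 2:
        return 29 if is_leap_year(year) else 28
    return 30 if month in (4, 6, 9, 11) else 31


def find_mismatches(first_year=1, last_year=9999):
    """Compare the hand-rolled month lengths with the stdlib reference."""
    mismatches = []
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            expected = calendar.monthrange(year, month)[1]
            if days_in_month(year, month) != expected:
                mismatches.append((year, month))
    return mismatches


print(find_mismatches())  # [] when both implementations agree
```

The same pattern extends to timezone rules and parsing: pick a trusted reference, sweep a wide range of inputs, and report every disagreement rather than stopping at the first.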


While I agree in spirit (Occam tells me it was just unprecedented load), that explanation doesn't make any sense to me in terms of disproving the Leap Year theory. The mistake wouldn't even have to be in their code.

As for believing/unbelieving, I mean, I've seen dumber mistakes in financial products I've worked on. Not as high stakes or damaging, but definitely dumber.


That means people should reevaluate the weight of the opinions of those technically competent posters. We have a screenshot showing that Robinhood did roll their own implementation, because yesterday was March 2nd and its app was requesting March 3rd.

That at least means that some portion of their stack used a roll-your-own datetime library. That would not actually be a problem as long as the entire stack got the same library or the same rules for datetime. The problem is of course that it probably did not, so the libraries were not bug-compatible, which of course caused errors, and those errors need to be handled, preferably very fast.
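Here is a small Python sketch of what "not bug-compatible" could mean. It is purely hypothetical and says nothing about Robinhood's actual code: `naive_next_day` stands in for a hand-rolled component that uses a fixed month-length table with no leap-year rule, and `reference_next_day` stands in for a component that uses the standard `datetime` module.

```python
from datetime import date, timedelta

# Fixed table with no leap-year handling (the hypothetical bug).
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def naive_next_day(year, month, day):
    """Hand-rolled 'tomorrow' that always treats February as 28 days."""
    day += 1
    if day > DAYS_IN_MONTH[month - 1]:
        day = 1
        month += 1
    if month > 12:
        month = 1
        year += 1
    return (year, month, day)


def reference_next_day(year, month, day):
    """'Tomorrow' computed by the standard library."""
    nxt = date(year, month, day) + timedelta(days=1)
    return (nxt.year, nxt.month, nxt.day)


print(naive_next_day(2020, 2, 28))      # (2020, 3, 1)
print(reference_next_day(2020, 2, 28))  # (2020, 2, 29)
print(naive_next_day(2020, 3, 2) == reference_next_day(2020, 3, 2))  # True
```

Note that in this particular sketch the two components only disagree on the day before the leap day and agree again afterwards, so a bug of this exact shape would have shown up on February 28, not March 2.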

Error load and error handling is the least tested part of every system because it can only be properly tested in production and, unless your company embraces the Chaos Monkey approach to testing, the C-level would have a heart attack when anyone proposes doing it.


That particular API endpoint is just returning market open times, in which a request for tomorrow might be perfectly reasonable.

https://api.robinhood.com/markets/XASE/hours/2020-03-03/

vs

https://api.robinhood.com/markets/XASE/hours/2020-03-07/

This whole discussion lacks context to the nature of the requests, aka a front end code review.
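As a sketch of why a client might legitimately ask for tomorrow's date: the Python below just builds the URL pattern shown above for a given day. `market_hours_url` is a hypothetical helper, it makes no network call, and nothing here is taken from Robinhood's client code.

```python
from datetime import date, timedelta

BASE_URL = "https://api.robinhood.com/markets"


def market_hours_url(market_code, day):
    """Build a market-hours URL following the pattern quoted in this thread."""
    return f"{BASE_URL}/{market_code}/hours/{day.isoformat()}/"


today = date(2020, 3, 2)
tomorrow = today + timedelta(days=1)

# An app showing "next market open" would plausibly request tomorrow's hours.
print(market_hours_url("XASE", tomorrow))
# https://api.robinhood.com/markets/XASE/hours/2020-03-03/
```

A request for March 3rd made on March 2nd is exactly what this produces with correct date arithmetic, so the screenshot alone doesn't show a date bug.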


The issue is going to end up being related to some sort of date being injected by the front-end and propagating that incorrect date into the infrastructure.

Somewhere within the infrastructure there's going to be an assertion such as

  if (delta(order_live_date, system_live_date) > max_drift_delta) {
    // oh crap, something is completely broken in our system; where did this come from?
    blow_up("terrible things are happening! how did we get this order")
  }
which is an excellent and correct catcher for "terrible things are happening" since those things should never happen. That blow_up() code path is likely to be very expensive which kills performance of the system, which in turn means that it no longer can handle the load.
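A runnable Python version of that kind of check, as a sketch only: `MAX_DRIFT`, `OrderDateDriftError`, and `check_order_date` are hypothetical names, and the one-day tolerance is an assumption. Raising a cheap, specific exception is one way to keep the rejection path from being more expensive than the happy path.

```python
from datetime import date, timedelta

MAX_DRIFT = timedelta(days=1)  # assumed tolerance, not a known value


class OrderDateDriftError(Exception):
    """Raised when an order's date is implausibly far from the system date."""


def check_order_date(order_live_date, system_live_date, max_drift=MAX_DRIFT):
    """Return the drift if acceptable, otherwise raise OrderDateDriftError."""
    drift = abs(order_live_date - system_live_date)
    if drift > max_drift:
        raise OrderDateDriftError(
            f"order date {order_live_date} is {drift.days} days away "
            f"from system date {system_live_date}"
        )
    return drift


print(check_order_date(date(2020, 3, 2), date(2020, 3, 2)))  # 0:00:00

try:
    check_order_date(date(2020, 3, 4), date(2020, 3, 2))
except OrderDateDriftError as exc:
    print(f"rejected: {exc}")
```

If every incoming order trips this check at once, the cost of whatever runs in the handler (logging, alerting, stack capture) is multiplied by the full order volume, which is the performance cliff described above.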

And since RH has lots of people using installed apps, it's not as if they can just push an immediate bugfix.


This definitely does not come from a front-end:

https://twitter.com/jhyu/status/1234617361467990018


Something similar happens to crypto exchanges. They go down during days with extremely high volatility. During Q4 2017 most of the crypto exchanges were fighting hard to stay online, sometimes in precarious conditions, sometimes with execution delays of up to a few minutes.



