Read the presentation. The answer was what I expected: we had a unique problem, and because we make oil drums' worth of cash, dipping in a bucket and using that cash to solve the problem was an easy justification.
These are really smart people solving problems they have, but many companies don't have buckets of cash to hire really smart people to solve those problems.
Also, the questions after the presentation pointed out that the data isn't always analyzed in their database, so it's more like a storage system than a database.
>Participant 1: What's the optimization happening on the pandas DataFrames, which we obviously know are not very good at scaling up to billions of rows? How are you doing that? On the pandas DataFrames, what kind of optimizations are you running under the hood? Are you doing some Spark?
>Munro: The general pattern we have internally and the users have, is that your returning pandas DataFrames are usable. They're fitting in memory. You're doing the querying, so it's like, limit your results to that. Then, once people have got their DataFrame back, they might choose another technology like Polars, DuckDB to do their analytics, depending on if they don't like pandas or they think it's too slow.
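For illustration, a minimal sketch of the pattern Munro describes: keep the query bounded so the returned pandas DataFrame fits in memory, then hand it to Polars or DuckDB for the analytics step. The LMDB URI, library and symbol names are placeholders, and it assumes `arcticdb`, `polars` and `duckdb` are installed.

```python
# Sketch of the workflow described above: ArcticDB returns a bounded pandas
# DataFrame, and heavier analytics are handed off to Polars or DuckDB.
# The LMDB path, library name, and symbol are placeholders.
import duckdb
import pandas as pd
import polars as pl
from arcticdb import Arctic

ac = Arctic("lmdb://arcticdb_demo")                      # local storage for a quick test
lib = ac.get_library("demo", create_if_missing=True)

# Write a small frame so the read below has something to return.
lib.write("prices", pd.DataFrame(
    {"price": [100.0, 101.5, 99.8]},
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
))

df = lib.read("prices").data                             # plain pandas DataFrame, in memory

# Hand off to another engine if pandas feels too slow for the analytics step.
pl_df = pl.from_pandas(df.reset_index())                 # Polars view of the same data
avg = duckdb.sql("SELECT avg(price) FROM df").fetchone() # DuckDB can query the local df
print(pl_df.shape, avg[0])
```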
I skipped to the "why build a database" section and then skipped another two minutes of his tangential thoughts - seems like the answer is "because Moore's law"?
Not sure if it's standard practice when using the Business Source License, but it seems to have additional terms too, like:
> BSL features are free to use and the source code is available, but users may not use ArcticDB for production use or for a Database Service, without agreement with Man Group Operations Limited. [...] Use of ArcticDB in production or for a Database Service requires a paid for license from Man Group Operations Limited
So not just "source available, go ahead and use it" but basically "free for personal use only".
> Free as long as you aren't a competitor, or you use ArcticDB for anything we could consider "in production"
This is how I understand their text, and I wouldn't be super excited about trying to figure out exactly what "in production" means with their lawyers in court.
Yeah that’s fair, not open source, I was incorrect, though I think it’s more than personal use only. “Production use” is doing a lot of heavy lifting and I don’t know exactly what that means (though I could make a guess).
I still didn’t get why they built this, there’s a better explanation of the feature set in the FAQ comparison with parquet: https://docs.arcticdb.io/latest/faq/
> How does ArcticDB differ from Apache Parquet?
> Both ArcticDB and Parquet enable the storage of columnar data without requiring additional infrastructure.
> ArcticDB however uses a custom storage format that means it offers the following functionality over Parquet:
> Versioned modifications ("time travel") - ArcticDB is bitemporal.
> Timeseries indexes. ArcticDB is a timeseries database and as such is optimised for slicing and dicing timeseries data containing billions of rows.
> Data discovery - ArcticDB is built for teams. Data is structured into libraries and symbols rather than raw filepaths.
> Support for streaming data. ArcticDB is a fully functional streaming/tick database, enabling the storage of both batch and streaming data.
> Support for "dynamic schemas" - ArcticDB supports datasets with changing schemas (column sets) over time.
> Support for automatic data deduplication.
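For a concrete picture of the headline feature in that list, here is a minimal sketch of the versioned "time travel" behaviour using the documented Python API; the library/symbol names and LMDB path are made up, and it assumes `arcticdb` is installed.

```python
# Every write of a symbol creates a new version; older versions stay readable.
import pandas as pd
from arcticdb import Arctic

ac = Arctic("lmdb://arcticdb_demo")
lib = ac.get_library("research", create_if_missing=True)

idx = pd.date_range("2024-01-01", periods=3, freq="D")
v0 = lib.write("signal", pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=idx))
lib.write("signal", pd.DataFrame({"value": [1.1, 2.1, 3.1]}, index=idx))

latest = lib.read("signal").data                       # newest version
original = lib.read("signal", as_of=v0.version).data   # "time travel" back to the first write
print(latest["value"].tolist(), original["value"].tolist())
```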
The other answer I was looking for was why not kdb since this is a hedge fund.
I think people are getting a little tired of being held ransom to kdb/q and kdb consultants. Even if you have 'oil barrels' of money, eventually it is annoying enough to look elsewhere.
The idea they had was to just build something themselves: something like pandas plus a time-series/columnar database. There are other competitors to kdb/q, but they are not as entrenched and maybe not a perfect fit. These guys cooked up a closer fit for their own systems than, say, ClickHouse and other tools.
It had to be a very close fit to what they do, as kdb/q is pretty damned good at some things in finance. Maybe there is not enough money in the highly specialised areas it does very well at, for other people to come in with something new.
It would be a huge mistake to think SQL is a replacement for Q.
Given how high a selling point this is, it is something that I cannot recall ever using in ArcticDB (5+ years of use). It's (financial) time series data; the past doesn't change.
Back adjusting futures contracts or back adjusting for dividends / splits come to mind as I write this, but I would just reprocess these tasks from the raw data "as of date" if needed.
For pricing data, sure, but for things like fundamentals or economics where there are estimates that are revised over time, one way to store this data is with a versioning feature. It allows for PIT data without much overhead, in theory.
And actually, pricing can be revised as well, though it is much less common.
That said, versioning is not the only way to handle these kinds of things.
Versioning can also be useful as an audit trail for data transformation. Though again, these could be stored in another way as well.
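As a sketch of that point-in-time pattern: write each revision as a new version of the same symbol, then read "as of" an earlier moment to see only what was known then. Names are placeholders, and passing a timestamp to `as_of` is an assumption based on my reading of the ArcticDB docs (a version number or snapshot would work as well).

```python
# Revised estimates stored as versions of one symbol; a timestamped read
# reconstructs the pre-revision view (no look-ahead).
from datetime import datetime, timezone

import pandas as pd
from arcticdb import Arctic

ac = Arctic("lmdb://arcticdb_demo")
lib = ac.get_library("fundamentals", create_if_missing=True)

idx = pd.to_datetime(["2023-03-31", "2023-06-30"])       # reporting periods

lib.write("acme_eps", pd.DataFrame({"eps": [1.20, 1.35]}, index=idx))  # first estimate
cutoff = datetime.now(timezone.utc)                      # "what did we know at this point?"
lib.write("acme_eps", pd.DataFrame({"eps": [1.18, 1.35]}, index=idx))  # later revision

pit = lib.read("acme_eps", as_of=cutoff).data            # pre-revision, point-in-time view
latest = lib.read("acme_eps").data                       # current view, includes the revision
print(pit["eps"].tolist(), latest["eps"].tolist())
```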
100% makes sense, it depends what you're looking for.
There are however lots of time-series that do change in finance, e.g. valuations, estimates, alt-data (an obvious one is weather predictions). The time-travel feature can be super useful outside of external data changing as well, as an audit log and as a way to see how your all-time evaluations have changed (say backtests).
I’ve never seen it used for backtests personally. Generally backtested results are saved as a batch of variants following some naming convention as different symbols.
You would use some convention for naming and parametrising backtests; 'different' backtests would get stored separately. But once you start updating backtests, running them in a loop with changing data, that's when the time-travel feature starts to be useful.
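A brief sketch of that loop scenario, with made-up names: the parameters live in the symbol name, each re-run overwrites the symbol, and the version history becomes the record of how the results drifted.

```python
# Re-running a parametrised backtest on refreshed data; each write is a new version.
import pandas as pd
from arcticdb import Arctic

ac = Arctic("lmdb://arcticdb_demo")
lib = ac.get_library("backtests", create_if_missing=True)

symbol = "momentum_lookback_20"              # naming convention carries the parameters
for run in range(3):                         # e.g. nightly re-runs with changing data
    results = pd.DataFrame({"pnl": [0.01 * run, 0.02 * run]},
                           index=pd.date_range("2024-01-01", periods=2))
    lib.write(symbol, results)

print(len(lib.list_versions(symbol)))        # 3 versions: the audit trail
first_run = lib.read(symbol, as_of=0).data   # time travel back to the first run
```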
I believe when a bunch of ex-Goldman people were at BofA (they move from bank to bank, trying to re-implement SecDB [1]), they were creating their own time series database as well.
Hi. I'm the presenter. Thanks for the interest. Opinions here are my own.
I'll put in a TLDR as the presentation is quite long. The other thing I'd like to say is that QCon London impressed me: the organisers spent time ensuring a good quality of presentations. The other talks that I saw were great. Many conferences I've been to recently are just happy to get someone, or can choose and go with well-known quantities. I first attended QCon London early in my career, so it was interesting coming back after over a decade to present.
TLDR:
Why did we build our own database? In effort terms, successful quantitative trading is more about good ideas well executed than it is about production trading technology (apart from perhaps HFT). We needed something that helped the quants be the most productive with data.
We needed something that was:
- Easy to use (I mean really easy for beginner/moderate programmers). We talk about day 1 productivity for new starters. Python is a tool for quants, not a career.
- Cost effective to run (no large DB infra, easy to maintain, cheap storage, low licensing)
- Performant (traditional SQL DBs don't compare here; we're in the Parquet, ClickHouse, kdb, etc. space)
- Scalable (large data-science jobs 10K+ cores, on-demand)
This sort of general architecture (store parquet-like files somewhere like s3 and build a metadata database on top) seems reasonably common and gives obvious advantages for storing lots of data, scaling horizontally, and scaling storage and compute separately. I wonder where you feel your advantages are compared to similar systems? Eg is it certain API choices/affordances like the ‘time travel’ feature, or having in-house expertise or some combination of features that don’t usually come together?
A slightly more technical question: what are your time series indexes? Is it about optimising storage, or doing fast random-access lookups, or more for better as-of joins?
We do have a specialist time-series index, optimised for things like tick data. It compresses fairly well, but we generally optimise for read time. Not all-over-the-place random access, but slicing out date ranges. There are two layers of index: a high-level index of the data objects, and the index in each object in S3.
A built-in as-of join is something we want to build.
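For illustration, a minimal sketch of the read pattern that index is optimised for, slicing a date range out of a large timeseries symbol rather than random-access lookups; names and sizes are placeholders (and no as-of join here, per the answer above).

```python
# Date-range slicing: only the data objects covering the window need fetching.
import pandas as pd
from arcticdb import Arctic

ac = Arctic("lmdb://arcticdb_demo")
lib = ac.get_library("ticks", create_if_missing=True)

idx = pd.date_range("2024-01-01", periods=1_000_000, freq="s")   # ~11 days of 1-second ticks
lib.write("eurusd", pd.DataFrame({"px": range(len(idx))}, index=idx))

window = lib.read(
    "eurusd",
    date_range=(pd.Timestamp("2024-01-05"), pd.Timestamp("2024-01-06")),
).data
print(window.index.min(), window.index.max(), len(window))
```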
I feel like ‘exactly’ is doing a lot of work in your comment and I am interested in the reasons that that word may not be quite the right word to describe these situations.
Well yes, in the right context, like hobby/personal programming. But things like "we built our own database" tend to be really hard to justify and mostly represent technical people playing with toys when they actually have an obligation to the business that is spending the money not to play with toys but to spend it wisely and build a wise architecture, especially for business critical systems.
It's indulgent and irresponsible to do otherwise if standard technology will get you there. The other point is that there are very few applications in 2024 that would have data storage requirements that cannot be met by Postgres or some other common database, and if not, then perhaps the architecture should be changed to do things in a way that is compatible with existing data storage systems.
Databases in particular, as the core of a business system, have not only the core "put data in/get data out" requirement but hundreds of sub-requirements that relate to deployment, operations, and a million other things. You build a database for that core set/get requirement, and before you know it you're wondering how to fulfill that vast array of other requirements.
This happens everywhere in corporate development: some CTO who has a personal liking for some technology makes the business use that technology when in fact the business needs nothing more than ordinary technology, avoiding all sorts of issues such as recruiting. The CTO moves on, leaving behind a project built with the technology flavor of the month, which the business either has to struggle to maintain into the future or has to replace with a more ordinary way of doing things. Likely even the CTO has by now lost interest in the playtime technology fad that the industry toyed with and decided isn't a great idea.
So I stand by my comment - they are likely to replace this within a few years with something normal, likely Postgres.
I think you just fundamentally misvalue what he is optimising for. It's live or die by alpha generation, and his tradeoffs are not HFT. There is going to be a whole other infra and opsec for executing the strategy.
I think it is somewhat like git's creation story. Sometimes a senior dev sees a tool that is close to ideal but needs to work a little differently than what the industry has built.
Databases are up there with encryption in the "don't roll your own..." mentality.
But sometimes they don't fit the problem you're solving. Sometimes the data never changes, so why have infrastructure for updates?
Having a big DB running all the time could be too expensive for your business model.
Also, it is good to be curious about "what is an index" and how a parquet file looks in a hex editor (see the sketch below). Why can't I write the underlying DB table outside of Postgres? Why are joins hard?
And then you discover your tools give you a competitive edge.
Most of the time there are existing tools, but sometimes there aren't.
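On the hex-editor curiosity above, a tiny sketch of what you find when you poke at a Parquet file's raw bytes: the 4-byte magic 'PAR1' at both ends, with the footer length stored just before the trailing magic. The path is a throwaway placeholder and `to_parquet` needs pyarrow (or fastparquet) installed.

```python
# Peek at the raw bytes of a Parquet file: leading/trailing 'PAR1' magic and
# the 4-byte little-endian footer length just before the trailing magic.
import pandas as pd

path = "/tmp/example.parquet"                     # throwaway file for illustration
pd.DataFrame({"a": [1, 2, 3]}).to_parquet(path)   # requires pyarrow or fastparquet

with open(path, "rb") as f:
    data = f.read()

print(data[:4])                                   # b'PAR1' (leading magic)
print(data[-4:])                                  # b'PAR1' (trailing magic)
footer_len = int.from_bytes(data[-8:-4], "little")
print(footer_len)                                 # length of the Thrift-encoded footer
```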
They have tons of money, enough to support a development team improving their database. In addition there is a long legacy, both technically and politically, making it hard to even propose to get rid of it. The only likely switch is a gradual decline, with parts of the system moved to another database.
No, but you’re the first person I’ve ever seen mention TileDB in the wild. I applied there a few years ago; didn’t go anywhere, but I’ve been keeping an eye on them ever since, because I think it’s an interesting idea.
I know there are tons of problems that are solved in Excel when they really shouldn't be. Instead of getting the expert business analyst to use a better tool (like pandas), money is spent to "fix" Excel.
Apparently there is also a class of problems that outgrow pandas. And instead of the business side switching to more suitable tools, some really smart people are hired to build crutches around pandas.
Oh well, they probably had fun doing it. Maybe they get to work on no-GIL Python next.
There's value in 'backwards compatibility' from a process/skills perspective. I agree that companies usually pay too high a premium on that, but there is value.