Read the presentation. The answer was what I expected: we had a unique problem, and because we make oil drums' worth of cash, dipping in a bucket and using that cash to solve the problem was an easy justification.
These are really smart people solving problems they have, but many companies don't have buckets of cash to hire really smart people to solve those problems.
Also, the questions after the presentation pointed out that the data isn't always analyzed in their database, so it's more like a storage system than a database.
>Participant 1: What's the optimization happening on the pandas DataFrames, which we obviously know are not very good at scaling up to billions of rows? How are you doing that? On the pandas DataFrames, what kind of optimizations are you running under the hood? Are you doing some Spark?
>Munro: The general pattern we have internally and the users have, is that your returning pandas DataFrames are usable. They're fitting in memory. You're doing the querying, so it's like, limit your results to that. Then, once people have got their DataFrame back, they might choose another technology like Polars, DuckDB to do their analytics, depending on if they don't like pandas or they think it's too slow.
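For illustration, a minimal sketch of the pattern Munro describes: keep the query bounded so the returned pandas DataFrame fits in memory, then hand it to Polars or DuckDB for the analytics step. The LMDB URI, library and symbol names are placeholders, and it assumes `arcticdb`, `polars` and `duckdb` are installed.

```python
# Sketch of the workflow described above: ArcticDB returns a bounded pandas
# DataFrame, and heavier analytics are handed off to Polars or DuckDB.
# The LMDB path, library name, and symbol are placeholders.
import duckdb
import pandas as pd
import polars as pl
from arcticdb import Arctic

ac = Arctic("lmdb://arcticdb_demo")                      # local storage for a quick test
lib = ac.get_library("demo", create_if_missing=True)

# Write a small frame so the read below has something to return.
lib.write("prices", pd.DataFrame(
    {"price": [100.0, 101.5, 99.8]},
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
))

df = lib.read("prices").data                             # plain pandas DataFrame, in memory

# Hand off to another engine if pandas feels too slow for the analytics step.
pl_df = pl.from_pandas(df.reset_index())                 # Polars view of the same data
avg = duckdb.sql("SELECT avg(price) FROM df").fetchone() # DuckDB can query the local df
print(pl_df.shape, avg[0])
```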
I skipped to the "why build a database" section and then skipped another two minutes of his tangential thoughts - seems like the answer is "because Moore's law"?
Not sure if it's standard practice when using the Business Source License, but it seems to have additional terms too, like:
> BSL features are free to use and the source code is available, but users may not use ArcticDB for production use or for a Database Service, without agreement with Man Group Operations Limited. [...] Use of ArcticDB in production or for a Database Service requires a paid for license from Man Group Operations Limited
So not just "source available, go ahead and use it" but basically "free for personal use only".
> Free as long as you aren't a competitor, or you use ArcticDB for anything we could consider "in production"
This is how I understand their text, and I wouldn't be super excited about trying to figure out exactly what "in production" means with their lawyers in court.
Yeah that’s fair, not open source, I was incorrect, though I think it’s more than personal use only. “Production use” is doing a lot of heavy lifting and I don’t know exactly what that means (though I could make a guess).
I still didn’t get why they built this, there’s a better explanation of the feature set in the FAQ comparison with parquet: https://docs.arcticdb.io/latest/faq/
> How does ArcticDB differ from Apache Parquet?
> Both ArcticDB and Parquet enable the storage of columnar data without requiring additional infrastructure.
> ArcticDB however uses a custom storage format that means it offers the following functionality over Parquet:
> Versioned modifications ("time travel") - ArcticDB is bitemporal.
> Timeseries indexes. ArcticDB is a timeseries database and as such is optimised for slicing and dicing timeseries data containing billions of rows.
> Data discovery - ArcticDB is built for teams. Data is structured into libraries and symbols rather than raw filepaths.
> Support for streaming data. ArcticDB is a fully functional streaming/tick database, enabling the storage of both batch and streaming data.
> Support for "dynamic schemas" - ArcticDB supports datasets with changing schemas (column sets) over time.
> Support for automatic data deduplication.
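For a concrete picture of the headline feature in that list, here is a minimal sketch of the versioned "time travel" behaviour using the documented Python API; the library/symbol names and LMDB path are made up, and it assumes `arcticdb` is installed.

```python
# Every write of a symbol creates a new version; older versions stay readable.
import pandas as pd
from arcticdb import Arctic

ac = Arctic("lmdb://arcticdb_demo")
lib = ac.get_library("research", create_if_missing=True)

idx = pd.date_range("2024-01-01", periods=3, freq="D")
v0 = lib.write("signal", pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=idx))
lib.write("signal", pd.DataFrame({"value": [1.1, 2.1, 3.1]}, index=idx))

latest = lib.read("signal").data                       # newest version
original = lib.read("signal", as_of=v0.version).data   # "time travel" back to the first write
print(latest["value"].tolist(), original["value"].tolist())
```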
The other answer I was looking for was why not kdb since this is a hedge fund.
I think people are getting a little tired of being held ransom to kdb/q and kdb consultants. Even if you have 'oil barrels' of money, eventually it is annoying enough to look elsewhere.
The idea they had was to just build something themselves: something like pandas plus a time-series/columnar database. There are other competitors to kdb/q, but they are not as entrenched and maybe not a perfect fit. These guys cooked up a closer fit for their own systems than, say, ClickHouse and other tools.
It had to be a very close fit to what they do, as kdb/q is pretty damned good at some things in finance. Maybe there is not enough money in the highly specialised areas it does very well at, for other people to come in with something new.
It would be a huge mistake to think SQL is a replacement for Q.
Given how high a selling point this is, it is something that I cannot recall ever using in ArcticDB (5+ years of use). It's (financial) time series data; the past doesn't change.
Back adjusting futures contracts or back adjusting for dividends / splits come to mind as I write this, but I would just reprocess these tasks from the raw data "as of date" if needed.
For pricing data, sure, but for things like fundamentals or economics where there are estimates that are revised over time, one way to store this data is with a versioning feature. It allows for PIT data without much overhead, in theory.
And actually, pricing can be revised as well, though it is much less common.
That said, versioning is not the only way to handle these kinds of things.
Versioning can also be useful as an audit trail for data transformation. Though again, these could be stored in another way as well.
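As a sketch of that point-in-time pattern: write each revision as a new version of the same symbol, then read "as of" an earlier moment to see only what was known then. Names are placeholders, and passing a timestamp to `as_of` is an assumption based on my reading of the ArcticDB docs (a version number or snapshot would work as well).

```python
# Revised estimates stored as versions of one symbol; a timestamped read
# reconstructs the pre-revision view (no look-ahead).
from datetime import datetime, timezone

import pandas as pd
from arcticdb import Arctic

ac = Arctic("lmdb://arcticdb_demo")
lib = ac.get_library("fundamentals", create_if_missing=True)

idx = pd.to_datetime(["2023-03-31", "2023-06-30"])       # reporting periods

lib.write("acme_eps", pd.DataFrame({"eps": [1.20, 1.35]}, index=idx))  # first estimate
cutoff = datetime.now(timezone.utc)                      # "what did we know at this point?"
lib.write("acme_eps", pd.DataFrame({"eps": [1.18, 1.35]}, index=idx))  # later revision

pit = lib.read("acme_eps", as_of=cutoff).data            # pre-revision, point-in-time view
latest = lib.read("acme_eps").data                       # current view, includes the revision
print(pit["eps"].tolist(), latest["eps"].tolist())
```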
100% makes sense, it depends what you're looking for.
There are however lots of time-series that do change in finance, e.g. valuations, estimates, alt-data (an obvious one is weather predictions). The time-travel feature can be super useful outside of external data changing as well, as an audit log and as a way to see how your all-time evaluations have changed (say backtests).
I’ve never seen it used for backtests personally. Generally backtested results are saved as a batch of variants following some naming convention as different symbols.
You would use some convention for naming and parametrising backtests; 'different' backtests would get stored separately. But once you start updating backtests, running them in a loop with changing data, that's when the time-travel feature starts to be useful.
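A brief sketch of that loop scenario, with made-up names: the parameters live in the symbol name, each re-run overwrites the symbol, and the version history becomes the record of how the results drifted.

```python
# Re-running a parametrised backtest on refreshed data; each write is a new version.
import pandas as pd
from arcticdb import Arctic

ac = Arctic("lmdb://arcticdb_demo")
lib = ac.get_library("backtests", create_if_missing=True)

symbol = "momentum_lookback_20"              # naming convention carries the parameters
for run in range(3):                         # e.g. nightly re-runs with changing data
    results = pd.DataFrame({"pnl": [0.01 * run, 0.02 * run]},
                           index=pd.date_range("2024-01-01", periods=2))
    lib.write(symbol, results)

print(len(lib.list_versions(symbol)))        # 3 versions: the audit trail
first_run = lib.read(symbol, as_of=0).data   # time travel back to the first run
```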
I believe when a bunch of ex-Goldman people were at BofA (they move from bank to bank, trying to re-implement SecDB [1]), they were creating their own time series database as well.
Hi. I'm the presenter. Thanks for the interest. Opinions here are my own.
I'll put in a TLDR as the presentation is quite long. The other thing I'd like to say is that QCon London impressed me: the organisers spent time ensuring a good quality of presentations. The other talks that I saw were great. Many conferences I've been to recently are just happy to get someone, or can choose and go with well-known quantities. I first attended QCon London early in my career, so it was interesting coming back after over a decade to present.
TLDR:
Why did we build our own database? In effort terms, successful quantitative trading is more about good ideas well executed than it is about production trading technology (apart from perhaps HFT). We needed something that helped the quants be the most productive with data.
We needed something that was:
- Easy to use (I mean really easy for beginner/moderate programmers). We talk about day 1 productivity for new starters. Python is a tool for quants, not a career.
- Cost effective to run (no large DB infra, easy to maintain, cheap storage, low licensing)
- Performant (traditional SQL DBs don't compare here; we're in the Parquet, ClickHouse, kdb, etc. space)
- Scalable (large data-science jobs 10K+ cores, on-demand)
This sort of general architecture (store parquet-like files somewhere like s3 and build a metadata database on top) seems reasonably common and gives obvious advantages for storing lots of data, scaling horizontally, and scaling storage and compute separately. I wonder where you feel your advantages are compared to similar systems? Eg is it certain API choices/affordances like the ‘time travel’ feature, or having in-house expertise or some combination of features that don’t usually come together?
A slightly more technical question: what are your time series indexes? Is it about optimising storage, or doing fast random-access lookups, or more for better as-of joins?
We do have a specialist time-series index, optimised for things like tick data. It compresses fairly well, but we generally optimise for read time. Not all-over-the-place random access, but slicing out date ranges. There are two layers of index: a high-level index of the data objects, and the index in each object in S3.
A built-in as-of join is something we want to build.
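For illustration, a minimal sketch of the read pattern that index is optimised for, slicing a date range out of a large timeseries symbol rather than random-access lookups; names and sizes are placeholders (and no as-of join here, per the answer above).

```python
# Date-range slicing: only the data objects covering the window need fetching.
import pandas as pd
from arcticdb import Arctic

ac = Arctic("lmdb://arcticdb_demo")
lib = ac.get_library("ticks", create_if_missing=True)

idx = pd.date_range("2024-01-01", periods=1_000_000, freq="s")   # ~11 days of 1-second ticks
lib.write("eurusd", pd.DataFrame({"px": range(len(idx))}, index=idx))

window = lib.read(
    "eurusd",
    date_range=(pd.Timestamp("2024-01-05"), pd.Timestamp("2024-01-06")),
).data
print(window.index.min(), window.index.max(), len(window))
```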
I feel like ‘exactly’ is doing a lot of work in your comment and I am interested in the reasons that that word may not be quite the right word to describe these situations.
Well yes, in the right context, like hobby/personal programming. But things like "we built our own database" tend to be really hard to justify and mostly represent technical people playing with toys when they actually have an obligation to the business that is spending the money not to play with toys but to spend it wisely and build a wise architecture, especially for business critical systems.
It's indulgent and irresponsible to do otherwise if standard technology will get you there. The other point is that there are very few applications in 2024 that would have data storage requirements that cannot be met by Postgres or some other common database, and if not, then perhaps the architecture should be changed to do things in a way that is compatible with existing data storage systems.
Databases in particular, as the core of a business system, have not only the core "put data in/get data out" requirement but hundreds of sub-requirements that relate to deployment, operations, and a million other things. You build a database for that core set/get requirement, and before you know it you're wondering how to fulfill that vast array of other requirements.
This happens everywhere in corporate development: some CTO who has a personal liking for some technology makes the business use that technology when in fact the business needs nothing more than ordinary technology, avoiding all sorts of issues such as recruiting. The CTO moves on, leaving behind a project built with the technology flavor of the month, which the business either has to struggle to maintain into the future or has to replace with a more ordinary way of doing things. Likely even the CTO has by now lost interest in the playtime technology fad that the industry toyed with and decided isn't a great idea.
So I stand by my comment - they are likely to replace this within a few years with something normal, likely Postgres.
I think you just fundamentally misvalue what he is optimising for. It's live or die by alpha generation, and his tradeoffs are not HFT. There is going to be a whole other infra and opsec for executing the strategy.
I think it is somewhat like git's creation story. Sometimes a senior dev sees a tool that is close to ideal but needs to work a little differently than what the industry has built.
Databases are up there with encryption in the "don't roll your own..." mentality.
But sometimes they don't fit the problem you're solving. Sometimes the data never changes, so why have infrastructure for updates?
Having a big DB running all the time could be too expensive for your business model.
Also, it is good to be curious about "what is an index" and how a parquet file looks in a hex editor (see the sketch below). Why can't I write the underlying DB table outside of Postgres? Why are joins hard?
And then you discover your tools give you a competitive edge.
Most of the time there are existing tools, but sometimes there aren't.
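On the hex-editor curiosity above, a tiny sketch of what you find when you poke at a Parquet file's raw bytes: the 4-byte magic 'PAR1' at both ends, with the footer length stored just before the trailing magic. The path is a throwaway placeholder and `to_parquet` needs pyarrow (or fastparquet) installed.

```python
# Peek at the raw bytes of a Parquet file: leading/trailing 'PAR1' magic and
# the 4-byte little-endian footer length just before the trailing magic.
import pandas as pd

path = "/tmp/example.parquet"                     # throwaway file for illustration
pd.DataFrame({"a": [1, 2, 3]}).to_parquet(path)   # requires pyarrow or fastparquet

with open(path, "rb") as f:
    data = f.read()

print(data[:4])                                   # b'PAR1' (leading magic)
print(data[-4:])                                  # b'PAR1' (trailing magic)
footer_len = int.from_bytes(data[-8:-4], "little")
print(footer_len)                                 # length of the Thrift-encoded footer
```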
They have tons of money, enough to support a development team improving their database. In addition there is a long legacy, both technically and politically, making it hard to even propose to get rid of it. The only likely switch is a gradual decline, with parts of the system moved to another database.
No, but you’re the first person I’ve ever seen mention TileDB in the wild. I applied there a few years ago; didn’t go anywhere, but I’ve been keeping an eye on them ever since, because I think it’s an interesting idea.
I know there are tons of problems that are solved in Excel when they really shouldn't be. Instead of getting the expert business analyst to use a better tool (like pandas), money is spent to "fix" Excel.
Apparently there is also a class of problems that outgrow pandas. And instead of the business side switching to more suitable tools, some really smart people are hired to build crutches around pandas.
Oh well, they probably had fun doing it. Maybe they get to work on no-GIL Python next.
There's value in 'backwards compatibility' from a process/skills perspective. I agree that companies usually pay too high a premium on that, but there is value.