Hacker News
Scaling Financial Reporting at Airbnb (medium.com/airbnb-engineering)
175 points by knighthacker on March 16, 2017 | 40 comments



If this is generalizable, and it looks like it is, it would be pretty smart to spin this out as a new company, make Airbnb the company's first customer, and go after Oracle's financial products.

(Disclosure: I used to work at Oracle on Financials Cloud.)


There seems to be an accounting equivalent of Greenspun's tenth rule:

"Any sufficiently complicated financial reporting system contains an ad-hoc, informally-specified, bug-ridden implementation of double-entry accounting."

It's such a powerful yet simple way of thinking about money that anyone who builds a billing or accounting system should be intimately familiar with it.
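
For anyone who hasn't run into it, the core invariant fits in a few lines. This is only a minimal sketch in Python, not anything from the article: the `Ledger` class and the account names are hypothetical, and the amounts are borrowed from the $100 / $90 example quoted further down the thread. Each transaction's debits must equal its credits, so the sum over all accounts is always zero.

```python
from collections import defaultdict
from decimal import Decimal


class Ledger:
    """Toy double-entry ledger (hypothetical): every transaction must balance."""

    def __init__(self):
        # account name -> running balance, stored as debits minus credits
        self.balances = defaultdict(Decimal)

    def post(self, debits, credits):
        """Post one transaction given {account: amount} dicts for each side."""
        if sum(debits.values()) != sum(credits.values()):
            raise ValueError("unbalanced transaction")
        for account, amount in debits.items():
            self.balances[account] += amount
        for account, amount in credits.items():
            self.balances[account] -= amount

    def trial_balance(self):
        """Sum over all accounts; zero as long as every posted entry balanced."""
        return sum(self.balances.values(), Decimal(0))


ledger = Ledger()
# Guest pays 100: 10 is the platform's fee, 90 is owed to the host.
ledger.post(
    {"cash": Decimal("100")},
    {"deferred_revenue": Decimal("10"), "host_payable": Decimal("90")},
)
```

The point of the check in `post` is that an unbalanced entry is rejected up front, rather than being discovered later as a reconciliation problem.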


Related, might be interesting: "Building a powerful double-entry accounting system" -> https://www.youtube.com/watch?v=aw6y4r4NAlw


Agreed; it's unfortunate that explanations are often filled with accounting jargon that obscures its advantages. The concept itself is awesome, and avoids so much pain down the line.


"Going from imperative programming to functional programming has been a powerful paradigm shift for us to think about financial processing and accounting. We can now think of this system as a straightforward actor/handler system rather than getting mired in complicated SQL-join logic."

Though whether SQL is a functional language (or a programming language at all, if you're talking ANSI SQL) is a subtle question, I would at the very least not describe it as a traditional imperative programming language. I think this is an important distinction for the article because, contrary to what the article suggests, I've found SQL to be quite helpful for understanding functional and declarative programming concepts. That said, it might be a lot easier to express the types of tasks in the article as straightforward functions rather than getting wrapped up in all this set-based talk in SQL.


Relational algebra is a really useful model for thinking about ETL tasks generally; SQL is an awkward dialect for expressing relational algebra, but it is at least a well-known one, and reasonably portable for a subset of querying. You can see the payoff in the Hadoop ecosystem too: Hive with HQL, spark-sql, Impala - SQL being used to express a data flow graph with a bunch of relational operators.

When you program directly against Spark, you're effectively building SQL plans explicitly. It's both more indirect - instead of writing a program that does stuff, you write a program that creates a data flow graph that does stuff; and you have more responsibility for performance, for good and bad.

I think to get good performance, you simply can't think on a per-item basis. You need to orient your thinking towards what can be efficiently performed at the bulk level. Whether it's column scanning in HDFS, or index scanning in an RDBMS, you need to be aware of the engineering properties of the operators you're applying. Doing lots of things per-item is a recipe for blowing your budgets, whether it's cache, memory, I/O, whatever. You want to iteratively do a little work to lots of items, and then join, rather than lots of work to each item one at a time.
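
A toy illustration of the difference, in plain Python rather than Spark or SQL. The data and the function names (`per_item`, `bulk`) are made up for the example; in a real pipeline the lookup would be a remote call or a join against another table, which is where the cost shows up.

```python
from collections import defaultdict

# Hypothetical data: (reservation_id, amount) payments and a reservation -> host map.
payments = [(1, 100), (2, 250), (1, 40), (3, 75)]
hosts = {1: "alice", 2: "bob", 3: "alice"}


def per_item(payments, lookup_host):
    """Resolve the host for every single payment: one lookup per row."""
    totals = defaultdict(int)
    for reservation_id, amount in payments:
        totals[lookup_host(reservation_id)] += amount
    return dict(totals)


def bulk(payments, hosts):
    """Aggregate first (a little work on many rows), then join once per key."""
    by_reservation = defaultdict(int)
    for reservation_id, amount in payments:
        by_reservation[reservation_id] += amount
    totals = defaultdict(int)
    for reservation_id, amount in by_reservation.items():
        totals[hosts[reservation_id]] += amount
    return dict(totals)


lookups = []


def counting_lookup(reservation_id):
    """Stand-in for an expensive per-row lookup; records each call."""
    lookups.append(reservation_id)
    return hosts[reservation_id]


slow = per_item(payments, counting_lookup)
fast = bulk(payments, hosts)
```

Both versions produce the same totals, but the per-item version does one lookup per payment, while the bulk version touches the host map once per distinct reservation.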


Hi, I'm the author. Yeah, you have a good point. Imperative programming was just the way that we were using SQL to build that system.


Did you consider an off the shelf ERP product?

I've written something similar to the first part - extract raw data out of various source DBs using SQL queries, then push it to our organisation's ERP product (SAP) using A2A messaging.

From my view SAP is a black box, but it handles the actual accounting/financial logic part, i.e. ledgers, product tracking, inventory management etc. Our accountants all seem pretty comfortable using it.


Airbnb's data models weren't initially designed to be financially reported on, and by the time we needed better financial reporting, it was too late to change those models. 90% of the work was about rethinking the way to think about all of this, what financial impact should be booked, and how it could be derived from the data. None of the ERP solutions fit our use case, and I think it would have been very difficult to integrate.

We still use a general ledger to book the outputs of our new financial pipeline, but I don't think we have a traditional business model (no traditional inventory management). That's more in the finance and accounting department though, and I can't speak much to that.


I don't know the volume of transactions they have, but in my past experience with SAP it was extremely hard and expensive to implement and make it scale (1~2 years for the initial implementation and migration and 10M+ USD spent), and we weren't even that big (~3M sales orders per year). Debugging it was also a nightmare, and they still had frequent data integrity issues, even just between SAP modules.


Do you plan on using Scala and Akka Persistence in the future for the entirely event-based system? http://doc.akka.io/docs/akka/current/scala/persistence.html.

We have been using it at the finance company I work for to maintain customers' ledgers.


It's definitely something to consider; it depends on our project roadmap.


Thanks for the writeup! I was on the payments team at Groupon and have fond memories of the same challenges.


SQL is in fact one of the most well known declarative languages. It operates on sets so I'm not sure it even knows how to do imperative.

Can it?


They mention triggers, my guess is that they were using something like PL/SQL to push data into the financial tables.


This is very interesting, but I would love to know more details. Are they using something like Amazon Kinesis or Kafka to send events and handle missed events? What serialization format are the messages? How do they manage keeping the schemas of the events in sync?


At my company we built accounting on top of the Xero.com API. Even at our small scale we're already busting their capacity. Some reports can't even be generated anymore because we have too many transactions.

This post is a fascinating look at how things can be run at a big corp. I always wondered how big companies did accounting. This stuff can't be outsourced since it's so tightly coupled to your business logic.

Are there any tech conferences geared toward this stuff?

Any accounting SaaS that can scale indefinitely? As opposed to Xero?


> Any accounting saas that can scale indefinitely? As opposed to Xero?

Need to wait for AWS to release a version of Airbnb's system. ;)


>I always wondered how big companies did accounting

Many big corporations use SAP or Oracle.


This is great, thanks for the writeup! We are actually building something quite similar. We have a rough landing page up at hireross.com. Airbnb folks, I'd love to chat if you're up for it, just to learn more. Email is at the bottom of the landing page.


Changed the landing page a bit to include a signup form instead of email. Email is ross@hireross.com.


Did you consult your auditors through this? I would expect their answer to be something along the lines of "Spark? Scala? We don't think we can trust this so we'll need to do a full audit." versus "Oh, you're using SAP, ok."


„...On the date of the reservation confirmation, we have a guest receivable of $100, and a future host payable of $90...“

The one thing I hate about AirBnB.... Looking to rent a nice place for 10k 9 months in advance? Ok...we'll charge it RIGHT NOW.

Why not do it like Amazon and charge when the package is shipped (trip date has arrived)? I get the cashflow thing but come on... (100 bookings and counting...)


I bet if they didn't charge now, suddenly they have to deal with "card charge failed the day before the flight and now the reservation isn't paid for and the person is in front of the door"-style issues.

Of course AirBnB could forward the money to the host, but doing it the way they do cuts out the problem entirely.


It's the same with booking airfares. I guess when you pay in advance for services to be rendered at a future date you're technically buying a financial "future", which does have a present value. However, I wish these futures were tradable!


I've previously worked with enterprise financial systems, and most of what was described here was standard functionality, or would be customisable to do so without too much effort. I would have loved to have heard more about whether they evaluated traditional ERP systems and why they chose to build their own event sourcing system instead.


One thing to consider was that we had to build this in about 5-6 months from the ground up, and match it to our existing financial system. Most of the work was to derive the financial meaning from a system not designed to provide it. The traditional ways you would extract data into an ERP would be with raw data import, or with some slightly processed data. If we did that, we would still have the problem of tightly coupling data models and accounting logic, which makes for very slow engineering progress on the product front. Not a tradeoff we want to make.

If you were talking about exporting the data from the event based system, then that's still possible, and I think it's something that our finance team may still be evaluating, but I can't speak for them.


Same story with Superset, their homegrown BI system. Why re-invent the wheel?


If it was not for "reinventing the wheel" we'd all be coding in COBOL or Fortran against IBM-supplied mainframes.

In other words - because they can and should (and can afford to, and not be hooked up to some third-party vendor milking them for the rest of their company's existence).


Why sell your soul to Tableau when you can leverage your large pool of engineering talent to create something that only requires investing once, rather than large recurring license fees forever?

In my opinion Superset actually fills a niche that was open for far too long.


Because it is not rocket science and you need people inside the company that know the insides of it when you want to adapt it?


NIH syndrome?


I'm worn out on articles dissing the performance of SQL databases without quoting any hard numbers and then proceeding to replace the systems with months of development in the latest and greatest tech. I have nothing against Spark, but I find it very hard to believe that Scala code is now more readable than SQL. In fact, my experience is just the opposite.

AirBnB is using an extract, load and transform architecture. No mention of the hardware, the data throughput, or whether they have a message broker/queue to ease the burden of peak-volume work.

I have a strong feeling that they could have 1.) kept the system exactly how it is and done some performance tuning. But that's not sexy anymore. Things are just supposed to scale. Which brings me to

2.) moved transformation logic to its own server or multiple servers, using a message broker and queue to aid the transfer of data between systems. It would have been more readable and could have been done in a month or less.

In summary, I believe they should have put some effort into keeping SQL, especially for the purpose of accounting, because Spark does not lend itself to readable logic.


Alice, do you recognise guest receivables the day a booking is made, not the day after check-in ("services rendered")? If that's true, I'm not sure this is correct, i.e., this event doesn't seem to meet the asset recognition criteria... My intuition would tell me Dr Accounts receivable - Cr Revenue on the day after check-in.


On a second look, I don't seem to understand the flow of your accounting entries:

Booking date (I understand that on that day, the booking is confirmed, but the guest's payment method has not yet been charged)

  Dr Receivables from guests 100

    Cr Future host payout 90 (Problematic: how can you pre-recognise a liability on your balance sheet? Unless this is an off-balance sheet account...)

    Cr Deferred income 10 (Problematic: deferred revenue arises when you receive a pre-payment from customers and have a standing obligation to them to render the services)

Payment made by guest

  Dr Cash 100

    Cr Receivable 100

Check-in date

  Dr Deferred revenue 10

  Dr ??? ??? (seems to be missing to balance the double-entry)

    Cr Payable 90

To me, the natural way to do this would be:

Booking date - no accounting entries, no effect on books. You did not render a service to the customer yet, nor fulfil or incur any obligations yet.

Payment received from guest - you received a prepayment for future services to be rendered, so now have a liability to the customer to fulfil this obligation, i.e., deferred revenue.

  Dr Cash 100

    Cr Deferred revenue 10

    Cr Payables to host 90 (unsure about this one, as this goes into the whole gross vs. net revenue recognition discussion for marketplace-type businesses)

Day after check-in - recognising the revenue on one single day might work now, but it's only a makeshift solution - what if your customer stays for a longer period of time, or between two accounting periods, and this becomes material? By definition, you recognise revenues proportionally, but I get that the cost vs. benefit of doing this now might be unfavourable.

  Dr Deferred revenue 10

    Cr Revenue 10
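
To make the "recognise proportionally" point concrete, here is a rough Python sketch of spreading the 10 of deferred revenue evenly over the nights of a stay and grouping it by month. The function name, the even per-night split, and the rounding rule (remainder goes on the last night) are my own assumptions for illustration, not how Airbnb or any accounting standard necessarily does it.

```python
from collections import defaultdict
from datetime import date, timedelta
from decimal import Decimal, ROUND_HALF_UP


def recognise_by_month(fee, check_in, nights):
    """Split `fee` evenly across the nights and total it per (year, month)."""
    per_night = (fee / nights).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    schedule = defaultdict(Decimal)
    for i in range(nights):
        night = check_in + timedelta(days=i)
        # Put any rounding remainder on the last night so the total ties out.
        amount = per_night if i < nights - 1 else fee - per_night * (nights - 1)
        schedule[(night.year, night.month)] += amount
    return dict(schedule)


# A 4-night stay starting 30 March 2017 straddles two accounting periods.
schedule = recognise_by_month(Decimal("10.00"), date(2017, 3, 30), 4)
```

With these inputs, half of the fee lands in March and half in April, instead of all of it being booked on a single day.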


This new system that uses events looks more flexible since it is decoupled from the application logic. I think the downside with this one is that the new system has a lot of moving parts. Also, changes in logic/new events must be communicated and should be supported before the main app is put into production.


Is it really that big that we need to talk about "scaling" it? 4-5 hours of processing time is a lot... I mean how many transactions would they do on a normal day?


Yea - it does seem a bit high. We use Spark for our adtech data pipeline and we're handling tens of billions of events a day in less time. It may be a function of how much data they're pulling in from other systems or dumping the data back into a variety of systems. Spark itself is parallelizable so in theory can be sped up just by running more nodes.


financial processing is typically sequential - can't calculate some metric until some other thing was calculated (or pulled data for)... not well parallelizable in other words. or so it is with some systems I deal with.


tl;dr: They built an event/messaging system. When an event occurs they broadcast that event with metadata (e.g. event: reservation_booked, meta {..}). Before, they tightly coupled reporting SQL inside the business logic.
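
Roughly the shape of that, as a toy in-process Python sketch. The article does not show its event schema or transport, so everything here is assumed: the `subscribe`/`publish` helpers, the metadata fields, and the account names are all hypothetical, and the amounts reuse the $100 / $90 example quoted elsewhere in the thread.

```python
from collections import defaultdict

handlers = defaultdict(list)  # event name -> subscribed handler functions
journal = []                  # (side, account, amount) rows built by the reporting side


def subscribe(event_name):
    """Decorator that registers a handler for one event name."""
    def register(fn):
        handlers[event_name].append(fn)
        return fn
    return register


def publish(event_name, meta):
    """Broadcast an event; the publisher knows nothing about accounting."""
    for fn in handlers[event_name]:
        fn(meta)


@subscribe("reservation_booked")
def book_reservation(meta):
    # All accounting logic lives in the handler, not in the booking code.
    fee = meta["guest_total"] - meta["host_payout"]
    journal.append(("debit", "guest_receivable", meta["guest_total"]))
    journal.append(("credit", "future_host_payable", meta["host_payout"]))
    journal.append(("credit", "deferred_revenue", fee))


# The product code only emits the event with its metadata.
publish("reservation_booked",
        {"reservation_id": 42, "guest_total": 100, "host_payout": 90})
```

The booking code only calls `publish`; changing how a reservation is booked in the ledger means changing the handler, not the product logic.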



