How We Built the Software that Processes Billions in Payments (braintreepayments.com)
85 points by drewolson on June 30, 2011 | 38 comments



And that is how you write a great job advert.

More specifically, it gets across:

- the company culture

- the development practices/methodologies in place

- what's exciting about the work/company

- what technologies they like (and the fact they use a variety of technologies)

- the fact that they get that typical job adverts suck

... all while looking like a regular article.

Great work.


How do you guys maintain an ever-growing codebase?

It seems to me that the practices you have--TDD with full code coverage and regression tests, pair programming everything, many client libraries--would lead to a massive amount of code that needs to be maintained.

In my experience I've found that TDD often leads to a huge amount of code that needs to be rewritten if you make a major change to your codebase. If you have a new pattern you want to implement, it may require rewriting dozens or more tests. Also, when I'm working alone I constantly go back and clean up functions and whatnot to make them shorter and clearer. I've found I do that less while pairing, because you want to keep moving forward.

My opinion is probably just biased by the fact that I've done relatively little pair programming and TDD. I'm sure the more you do the better you get.

But I'm curious if you could talk some about the size of your codebase and how these practices affect that?


How do you safely make those major changes to the code base without all those tests?


One of my favorite stories from an old and very experienced coder: someone had been fired after spending around a year on a project that never quite worked; it was hack upon hack and had failed several passes through QA with major issues, etc. So he spent two weeks cleaning the thing up, redoing about half of it in the process, and sent it to QA to see if anything else was missing.

Anyway, a week later he gets a call that some fairly basic functionality was broken in production, which he fixed. Afterward he asked some people in QA how it ended up in production so quickly and how they had missed such a basic issue, and their response was: "Oops, we stopped testing your code a few years ago; this is the first time it bit us."

TDD always struck me as an attempt to answer the question: what do I do if I don't know whether this actually works and nobody is going to QA this crap? But if you actually consider what happened, TDD was unlikely to help, because the coder was simply unaware that the existing code was broken and that he needed to do something else.


Sure, but that just means that QA knew something about how the application worked that the developer didn't. In an ideal world the developer would be familiar enough with the product to write good tests. In reality, TDD is good for checking the really repetitive edge cases, and QA is good for catching business logic failures that the devs aren't aware of.


Good question. Separate your dev team from QA, and have QA write the automated tests. That way, you still get the eventual safety of test coverage, but developers are never slowed by the need to write or rewrite tests.


And what is the benefit of that separation? Seems like that sort of thing makes it easy to blame the other guy: "I didn't know to test that" / "I'm not the tester, not my job".

The "cost" of high-coverage automated tests is overblown, IMO. The few extra minutes spent writing test code are worth the extra safety and confidence. However, I believe striving for 100% coverage on every method/branch is overkill.

It doesn't have to be a choice between doing TDD religiously and having no tests at all.


I agree. The middle ground is better than writing no tests.

I do think it's more than a few minutes, though. The problem is that when you do a big rewrite, you've sometimes got a lot of failing tests. Do you go back and delete those tests and write new ones? Or refactor the test code? It's just a big distraction, because sometimes you're "holding the program in your head," as PG would say, and worrying about tests might cause you to lose your focus on the more important code.


By using one's brain. Understanding the code. Knowing what affects what. And by running it, particularly the changed flows. The disadvantages to tests/TDD are (a) it assumes the developer is stupid, and (b) even if he is not, he has the additional burden of creating and updating the tests as he goes along, slowing him down. Yes, there can be downsides to not having tests/TDD. But the upside can vastly outweigh the downside in many situations. YMMV.


Testing: testing is at the forefront of our development philosophy. We never need to check our code coverage to know that it's at 100%: with disciplined TDD, no line of code will be written without a test.

Bravo. In my experience that can be overkill, but with finance, I agree: why risk it? TDD everything.
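
For anyone unfamiliar with the discipline being described, here's a minimal sketch of the test-first loop in Python. The charge() helper and its behavior are hypothetical, invented for illustration, not Braintree's actual API:

  import unittest

  # Hypothetical payments helper; under strict TDD it is written only
  # after the tests below have been watched to fail ("red").
  def charge(amount_cents):
      if amount_cents <= 0:
          raise ValueError("amount must be positive")
      return {"status": "authorized", "amount": amount_cents}

  class ChargeTest(unittest.TestCase):
      def test_charges_positive_amount(self):
          self.assertEqual(charge(500)["status"], "authorized")

      def test_rejects_nonpositive_amount(self):
          with self.assertRaises(ValueError):
              charge(0)

  if __name__ == "__main__":
      unittest.main()

Done this way, 100% coverage falls out as a side effect: no line of charge() exists without a test that demanded it.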

We don't have a QA team.

WTF?

That might be terrifying when you consider the type of software that we're building, but we're confident that our automated testing is thorough and will catch any regression bugs.

Are regression bugs the only kind of bugs?

We use continuous integration to test every version of every client library against our gateway.

What happens when someone uses your client library in a way you didn't anticipate?

What happens when johnny-botnet hits your API directly without using the client library?

I spent several years developing games, with QA teams that outnumbered the developers. The QA team did not just play the levels through and say "it works!". Sure, they did that for the first hour. Then they'd start doing all those things that they thought someone might try (e.g. in a fit of boredom, or for a laugh). After that they'd just try breaking stuff. What a lot of bugs they would find!

As I write, I find it hard to believe that I, a game developer, am having to explain the importance of QA to a financial company.


Agreed -- 100% code coverage is impressive, but it only means you are testing the code you've written.

What about all the cases you forgot about? By definition, you forgot about the tests too.

What about all the interactions between module A & module B? They may work 100% on their own, but not together.

A good QA department can generate many more tests than the developer could, and do so without the 'author blindness' that is somewhat unavoidable when writing code.
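
To make the module-interaction point concrete, here's an invented example: two components whose own unit tests pass, but whose glue code mixes up units. The cents/dollars confusion is hypothetical, purely for illustration; only a cross-module test catches it:

  # Module A: computes an order total in cents. Its unit tests pass.
  def order_total_cents(items):
      return sum(price_cents for _, price_cents in items)

  # Module B: formats a receipt, assuming dollars. Its unit tests pass too.
  def format_receipt(total_dollars):
      return "Total: $%.2f" % total_dollars

  # Glue code: passes cents where dollars are expected.
  def checkout(items):
      return format_receipt(order_total_cents(items))

  def test_checkout():
      # Only this cross-module test exposes the bug: checkout() returns
      # "Total: $1250.00" instead of "Total: $12.50".
      assert checkout([("book", 1250)]) == "Total: $12.50"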


Dan from Braintree here. I think a QA team would be valuable; we just don't have one right now. To answer your rhetorical question, regression bugs aren't the only type of bug, but they're one of the most dangerous in a payments system. If there's a bug with an unanticipated use of a library, it will show up in our sandbox environment before merchants hit it in production. But if functionality that works in production breaks because of a change, that's a really serious problem.


How are regression bugs more dangerous than other kinds of bug?

Here's how I'd rate the "dangerousness" of a bug:

  1. How easy is it to detect?
  2. How easy is it to reproduce?
  3. What are the consequences of it occurring?
  4. How likely is it to happen?

Look, bravo for all the TDD. TDD eliminates a huge chunk of bugs. But by definition, the bugs that you find with CI are easy to detect, easy to reproduce, and 100% likely to happen. Sure, without a TDD/CI system, these bugs may not have been detected, may not have been easy to repro. But the reverse does not hold: a TDD/CI system doesn't make all bugs easy to detect and easy to repro.

So all the other bugs that your system has right now, are the ones that are left: hard to detect, hard to reproduce, and don't always happen. Now turn on a thousand users. How many users are you hoping to have btw?

Your worst kind of bug:

  * Is not detected for months.
  * Unable to reproduce.
  * Company killer. (Reputation, lawsuits, whatever).
  * Happens once every 40,000,000 sessions.

Not detectable using TDD and CI. Company still dead.


And QA would be able to detect these worst kind of bugs?

I'm not suggesting that QA is useless. I think QA should guide developers in terms of testing, as in QA should help write the test cases, including the corner cases in the spec, and let the developers write more tests around those things.

I also think that QA should help perform benchmark tests and load tests, and probably write end-to-end automated tests (what do they call them these days? Acceptance tests?).

Last but not least, QA should redefine the software processes if bugs happen regularly in a particular area. Consider QA to be a manager that is responsible for the productivity of your software team: if a software process doesn't work (let's say one day you find out that TDD doesn't work well), QA should detect that and figure out a better way.

Unfortunately, QA these days is still old-school button clickers and test-case fanatics (i.e., prepare 1000 test cases and ask the director for a week to run them all).

But at the end of the day, bugs exist. No amount of humans or practices will cover those exotic bugs.


Love the insight into BT's process.

I'd love to hear some more about pair programming, has anyone here done it extensively enough to shed some light on the pros and cons? My gut is that it would be less productive than a simple code review procedure, but does it reduce bugginess to a level that offsets that productivity loss?


I've heard it's bad for creative stuff, but really good if you know your requirements.

For payment processing, that's a good thing.

As I understand it, the productivity isn't too bad, as the programmers egg each other on. Sort of like having an obnoxious micro-manager over your shoulder, without the obnoxious bit.


We do pair programming for nearly all development (apart from trivial copy changes, etc.).

I find it increases productivity because:

* Every problem that comes up is solved more quickly with two minds on it discussing the problem. (Often more than twice as quickly as working alone.)

* One person is usually the driver (uses the keyboard) and the other the navigator. This means one person can keep in mind the overall picture of what they're building, all other parts of the codebase that need to be updated and so on, while the other person concentrates on the nitty gritty of implementing a specific method.

* You get distracted less easily (IRC, Twitter, HN, etc) when you're both working on something.

* You produce better-quality code: there's no temptation to cut corners when the other person is right there as you're writing it; your pair can help you stick to TDD properly; and both people have insights into the best way of writing things. This means less time wasted refactoring later.

This last point is why it is so much better than a code review process. Once the code is written, there is a temptation not to go back and change it, especially for a major redesign that someone else spots. Changing things involves the risk of introducing new bugs. There's also the risk of offending the person who wrote the code.


I once pair-programmed an IK animation harness. The math was hard (for me anyway). There were plenty of times where I'd have balked and gone and made a cup of coffee but my partner had the insight and we plowed ahead. Sometimes I was the one with the clarity. The increase in productivity is vastly more than just twice as fast. Sometimes it can mean that a feature that could not be done can now be done. Or take days instead of weeks.

For the other 80% of what goes into any software: total waste of time, or worse. Two people, independently, could get twice as much done. But worse: pairing on something like that can result in over-engineering, language battles, just about any kind of drama to make the day more interesting. Seen it happen.

If I had to choose between 100% pair-programming or 100% single-programming, I'd have to choose pair-programming. In this artificial comparison, pair-programming is more productive. Which is why, I think, so many companies fall into this trap. 100% PP teams beat 100% lone-coder teams.

However, teams that pair when appropriate will destroy the zealots of either denomination. Rationality vs religion.


Whenever I have done it, it encourages off-screen programming: paper, diagrams, etc.

In these situations the actual typing occurs after the programming has been done.


I've paired with several people at Braintree before on other Agile teams. It is much more productive and often a pleasure, particularly with these guys. Code reviews don't begin to compare to constant collaboration.


Different strokes for different folks, I guess. I know a lot of people will be fans of this, but:

>we pair program to write all of our software. We work on Mac Pros with two keyboards and two monitors. We work in an open team room; no cubicles or private offices. 

No thanks, if I'm the developer. And if I wouldn't do it, why would I make my employees?


I've worked at Pivotal Labs' office in San Francisco and done pair programming at a few companies now.

When you're pair programming, you're not wearing headphones and "getting into the zone". You're openly collaborating, sharing thoughts, bouncing ideas, prototyping things on a whiteboard, and so on. Just like at a party, you subconsciously filter out the surrounding noise when you're talking to your pairing partner. An open office is perfectly fine for this. I didn't think it would be a good situation at first, either. :)

However, I personally do not enjoy pair programming. There are a few reasons why, but the big one is that it's mentally exhausting. Eight hours of engaging in conversation completely wipes me out, even if it results in amazing code. I couldn't deal with it any more. To a lesser degree, I don't get the same sense of accomplishment from completing tasks when pair programming that I do from completing them by myself.

That said, if I ever run a company with a handful of programmers or more, I'm going to hire engineers who like pair programming.


Ironically, their website is not handling the load from HN (and anywhere else they're presently linked from), from what I can tell.

I don't mean to complain - when you're so focused on other parts of your business, it's easy to let things like preparing your front-end for heavy bursts slip by.


Yeah, you're spot on. We've been so focused on scaling our Gateway and services that we didn't prioritize the infrastructure serving our marketing site. It could really use some love, which we'll be giving it in the very near future. There should be enough caching in place now to handle the HN bump, and we'll be keeping an eye on it.


I wouldn't advertise you don't have a QA team.


I agree, and I disagree with their rationale. IMHO, just because every line of code is covered by a unit test doesn't imply that the product is adequately tested. Errors can occur at more systemic levels via the integration of well-tested components. Furthermore, developers have blind spots and hidden biases, and can't fully compensate for the lack of an aggressive QA person who is trying to break their code.

On the other hand, they're clearly doing something right, so far.


I agree with you so much that I wrote a substantially identical response to a sibling comment, with the exception being that I neglected to congratulate them on their success. Clearly what they are doing is working for them right now. I have spent years testing payment systems, and it isn't easy. They must have an exceptional group of developers.

I wish them continued success, but I encourage them to start looking for a rockstar QA/release engineer.


They didn't say that they ONLY write unit tests. They possess an automated test suite that covers multiple levels of application integration and acceptance checking, in which "unit" is only the first line of defense.


Why shouldn't they? With TDD and pairing, I'd say that their developers are the QA team.


Well.. I disagree, but I wouldn't have downvoted you because it's a fair point.

It's a bad idea for developers to QA their own code for a number of reasons: 1) developers have cognitive blinders, like everyone else, and might not test for something that they think is unlikely; 2) some errors can be impractical to catch outside of top-level integration testing (getting into unexpected states in state machines, or race conditions); 3) there is a conflict of interest between deadlines and meeting requirements.

I have seen companies reach 100+ developers using this approach, conclude it's unsustainable, and be forced to make exceptional efforts to build a QA team. I believe reason 3 is the biggest risk.


Eek. I believe in TDD, but a QA team adds a whole other layer of checks, that you cannot automate. Especially, when developing any kind of UI, there is no substitute for a grumpy QA with a detailed test plan.


Developers by definition are not QA. And TDD and pairing are a very poor substitute for proper QA.


...we are able to perform all our maintenance without downtime. We can deploy new versions of our software, make database schema changes, or even rotate our primary database server, all without failing to respond to a single request. We can accomplish this because we gave ourselves the ability to suspend our traffic. To make this happen, we built a custom HTTP server and application dispatching infrastructure around Python's Tornado and Redis.

Why is it necessary to suspend traffic to make these kinds of changes? Just curious.


Probably because the data transformations and storage required to complete a transaction need to be handled by a coherent version of their code. Processing payments with a half-updated stack sounds painful and error-prone.


Because schema changes and deployments are generally non-atomic, running a transaction through an untested configuration (front-end code = 1.0, schema = 1.2, backend = 1.0) generally results in unknown things happening. When dealing with money, people generally prefer right to fast.
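
The article doesn't show their dispatcher, but the idea can be sketched: the dispatching layer checks a shared flag in Redis and holds requests while it is set, so migrations and restarts run against quiesced traffic. This is a rough outline, not Braintree's actual code; the key name, timings, and helper functions are invented:

  import time
  import redis

  r = redis.StrictRedis()
  SUSPEND_KEY = "traffic:suspended"  # hypothetical flag name

  def process(request):
      return "processed %s" % request  # stand-in for real dispatch logic

  def handle(request):
      # Hold (rather than reject) incoming requests during maintenance.
      while r.get(SUSPEND_KEY):
          time.sleep(0.1)
      return process(request)

  def deploy(run_migrations, restart_app):
      r.set(SUSPEND_KEY, 1)      # stop dispatching; new requests queue up
      try:
          run_migrations()       # schema changes run against quiesced traffic
          restart_app()          # swap in the new code version
      finally:
          r.delete(SUSPEND_KEY)  # resume; held requests proceed

As long as the hold is shorter than clients' timeouts, every request sees a single consistent code/schema version, which is the point of suspending.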


This one made me go "Huh?" and reread.

I don't understand why you'd need a custom HTTP server. It might be overly simplistic, but why not just write a module for nginx? Surely that would have been faster. You get controllable timeouts, you could tweak their throttling code to stop it throwing 50x errors, and modify/take inspiration from nginx-ey-balancer. And you get a bullet-proof HTTP engine for free?


Awesome post. I deal with payment gateways daily, and you guys seem to be a cut above the rest. Keep it up!


Too bad you're US only.

That's one thing you can't seem to do.




