Data-Oriented Design (Why You Might Be Shooting Yourself in The Foot With OOP) (gamesfromwithin.com)
59 points by gruseom on Dec 19, 2009 | 28 comments



I find it useful to consciously separate input data structures from intermediate data structures (accidental complexity). I try to structure my code so it doesn't rely on intermediate data structures; instead, it knows how to recompute them. When this works it can be very pleasing: just input data structures, with caching all over the place.
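
A minimal sketch of what I mean, in C++ for concreteness (the World/Bounds names are just made up for illustration): the input data is the only authoritative state; the derived structure is recomputed on demand and merely cached.

    #include <optional>
    #include <vector>

    struct Point  { float x, y; };                     // input data: the only real state
    struct Bounds { float minX, minY, maxX, maxY; };   // derived data, never the source of truth

    class World {
    public:
        void add(Point p) { points_.push_back(p); bounds_.reset(); }  // invalidate the cache

        const Bounds& bounds() {
            if (!bounds_) bounds_ = computeBounds();   // recompute from input on demand
            return *bounds_;
        }

    private:
        Bounds computeBounds() const {
            if (points_.empty()) return {0, 0, 0, 0};
            Bounds b{points_[0].x, points_[0].y, points_[0].x, points_[0].y};
            for (const Point& p : points_) {
                if (p.x < b.minX) b.minX = p.x;
                if (p.y < b.minY) b.minY = p.y;
                if (p.x > b.maxX) b.maxX = p.x;
                if (p.y > b.maxY) b.maxY = p.y;
            }
            return b;
        }

        std::vector<Point> points_;      // input
        std::optional<Bounds> bounds_;   // cache of a derived structure
    };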

Sometimes I think I'm chasing pure functional programming, but in a more pragmatic form and separated from type-checking.


That's very interesting. We're doing something similar. I'm curious as to how you reconcile your "intermediate data structures" with one of the principles in the OP, that of minimizing the transformations you have to do on your data in the first place. The latter is a profound insight that I am slowly digesting. One thing it throws out the door, for example, is layered architectures. Not a small deal! Yet it makes sense to me, because my experience with layered architectures has been that the more nicely modular and well-defined you make each layer, the more bloated and nasty the mappings between layers become.

Sometimes I think I'm chasing pure functional programming

No question this is more suited to FP than OO.

Edit: this really is a rich subject. It's interesting that a lot of this discourse is coming out of the game dev world, because that's a section of the software universe which is relatively free of pseudo-technical bullshit (probably because it's so ruthlessly competitive and the demands on the apps are so high).


I usually don't think about performance. nostrademons sounds right: "minimize transformations when you productionize." I don't have experience with productionizing.

So far when I've found myself scaling one piece of my pipeline up, I do one of two things: 1) I switch key pieces from arc to scheme or C. 2) I add a periodic precompute stage to reduce cache misses.

Like pg said somewhere (update: http://ycombinator.com/newsnews.html, /15 Jan), the goal isn't optimizing but keeping performance mediocre as you scale up.

Update: After reading http://news.ycombinator.com/item?id=1005145 I see 'productionize' isn't a big-bang, stop-the-world step. By the new definition I find I do rewrite things for performance fairly often.

The image in my head now: a ball of mud (http://www.laputan.org/mud) with rewrites as layers. Older layers that have proven themselves harden and fossilize as subsequent rewrites focus more on their performance without changing semantics. But even they aren't immune to the occasional tectonic upheaval.


I usually don't think about performance.

I was wrong then. We're not doing something similar :)

Like pg said somewhere (update: http://ycombinator.com/newsnews.html, /15 Jan), the goal isn't optimizing but keeping performance mediocre as you scale up.

Where did he use the word, or the concept, "mediocre"?


Yes, not there, but I seem to remember it.

Update: Ah, it's here: http://www.paulgraham.com/hackernews.html


Oh. But he didn't say the goal was mediocrity, only that performance at least wasn't getting worse as the site grew.

But then I'm sure you didn't really mean that mediocrity was your goal, either, right? Right? :)


:)

I think my choice of the word 'goal' was incorrect. If you have other priorities, performance only needs to be good enough.


I found that minimizing transformations on your data is a principle you apply when you productionize. For most of the development cycle, you want to keep things as debuggable as possible (at the possible expense of performance), and intermediate data products + debugging hooks are a good way to do this.

This brings up a much bigger question of when to productionize, though. Most programs are never actually "done", but at some point you have to release to the public and hopefully get millions of users. You need to make the performance/maintainability tradeoff sometime. The later you push it off, the more productive you can be in the critical early stages, and the better a product you can bring to market. But if you push it off too long, you miss the market window entirely and don't get the benefit of user feedback.


Is this related to 'fusion' from functional programming? There, the compiler removes some of the intermediate structures.

E.g. http://homepages.inf.ed.ac.uk/wadler/papers/deforest/defores...
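
To illustrate what I mean by removing intermediate structures, here's the transformation done by hand, sketched in C++ (the f/g bodies are placeholders):

    #include <vector>

    // Unfused: map g, then map f, materializing an intermediate vector.
    std::vector<int> unfused(const std::vector<int>& xs) {
        std::vector<int> tmp;                       // intermediate structure
        for (int x : xs) tmp.push_back(x * 2);      // g
        std::vector<int> out;
        for (int x : tmp) out.push_back(x + 1);     // f
        return out;
    }

    // Fused: one traversal, no intermediate vector. This is what the compiler
    // transformation in the paper aims to do automatically.
    std::vector<int> fused(const std::vector<int>& xs) {
        std::vector<int> out;
        out.reserve(xs.size());
        for (int x : xs) out.push_back(x * 2 + 1);  // f . g
        return out;
    }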


But these are fundamental design issues. You can't change fundamental design when you "productionize"; coming up with that design and implementing it is the development cycle.


Productionize usually means "rewrite". I think that software engineers in general have become too averse to rewriting code; as long as you do it with the same team that wrote the prototype, it's often a good idea to throw away everything and start from scratch.

The development cycle for me is much more about collecting requirements than coming up with a design that satisfies those requirements. That's what iterative design is about - you try something out, see if it works for the user, see what other features are really necessary for it to work for the user, and then adjust as necessary. Once you know exactly what the software should do, coming up with a design that does it is fairly easy.

My current project is nearing its 3rd complete rewrite since September, plus nearly daily changes that rip out large bits of functionality and re-do them some other way.


"software engineers in general have become too averse to rewriting code"

Fervently agree. I was one of them.

No amount of rewriting is too much - as long as you constantly have a working app.


Productionize usually means "rewrite".

Oh ok, you meant something quite different than I thought, and I don't disagree.


I try not to have a 'fundamental design'. If you rely wholly on caching, you have no intermediate data structures, and code becomes easier to change in dramatic ways. This is the ideal I've been striving for.

Check out arc's defmemo function. Given the ability to memoize (or cache) function invocations, changing your data structures can become simply a matter of refactoring your function boundaries and deciding which ones perform caching.
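
For anyone not in arc, here's a rough analogue of defmemo sketched in C++ (a hypothetical memoize helper, single-argument functions only):

    #include <functional>
    #include <map>
    #include <memory>

    // Wrap a function so repeated calls with the same argument hit a cache
    // instead of recomputing. Roughly what defmemo does for you in arc.
    template <typename Result, typename Arg>
    std::function<Result(Arg)> memoize(std::function<Result(Arg)> f) {
        auto cache = std::make_shared<std::map<Arg, Result>>();
        return [f, cache](Arg a) {
            auto it = cache->find(a);
            if (it != cache->end()) return it->second;
            Result r = f(a);
            cache->emplace(a, r);
            return r;
        };
    }

Then shifting where the caching happens really is just a question of which function boundaries you wrap.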


Why bother with the acronyms? Just look at the problem and figure out a beautiful solution. It's a lot tougher than just picking a design philosophy, but the result justifies the mental effort.


That was a bit of my reaction too. But then I thought:

Object oriented programming solves a great many problems with the construction of large systems.

However, when you're writing real-time or interactive systems, there's no escaping the fact that you must understand how CPUs, memory, and caches work.

If your game turns out to be successful and you need to fit its frame updates in 16 milliseconds (60 frames per second), then you'll need to optimally map your algorithms to the hardware.

However, most startups and most games fail. So why not optimize for whatever it takes to prove a product and scale an engineering team? As long as you understand the optimal capacity of the hardware, is initially writing your system with OOP so bad? I don't think so.

On the other hand, these types of discussions are a great way to teach people about the realities of modern hardware.


Why did you conflate startups with games? A game ships once (unless it's online); a startup ships endlessly.

The article is a bit confusing, but the way I took it when it ran, and now, is that there aren't just performance benefits to thinking "data flows" vs "objects," there's source readability benefits too. If you can define a bespoke data structure that manages state in exactly the way you want it, that's far better than a cluster of objects that mostly do the job but need a little massaging at key points. Better on the hardware, simpler to read, less likely to cause bugs. A 5% improvement in low-level state management multiplies many times over, because the management pattern is likely to be replicated over hundreds or thousands of slightly different game features that all rely on that data model.
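
A sketch of the kind of thing I mean, in C++ (a hypothetical particle example): instead of a heap of objects each holding a bit of state, one bespoke structure laid out for the loop that actually runs every frame.

    #include <cstddef>
    #include <vector>

    // Object-per-entity version: each particle is its own little bundle of state.
    struct Particle {
        float x, y;
        float vx, vy;
    };

    // Data-oriented version: one bespoke structure, laid out the way the
    // per-frame update wants to walk it (struct-of-arrays).
    struct Particles {
        std::vector<float> x, y, vx, vy;

        void update(float dt) {
            for (std::size_t i = 0; i < x.size(); ++i) {
                x[i] += vx[i] * dt;   // contiguous, cache-friendly accesses
                y[i] += vy[i] * dt;
            }
        }
    };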


I'm not talking about shipping, but instead about development risk. I've seen too many teams start projects with lots of low-risk but high-cost "engine" work like the example given in the article, when they don't even know if the game will succeed in the market.

... crap, I confused the linked article with a very similar article which I read today: http://research.scee.net/files/presentations/gcapaustralia09...

Anyway, my point stands. When starting a project, you should understand the eventual end state (high-performance algorithms making effective use of cache and memory) but don't think you need to implement it all up front.

If a data flow or procedural approach is clearer and easier to maintain, then by all means. But don't discount OOP as an intermediate state simply because you'll eventually have to translate the code to fit better on the hardware.

That's all. :)


The trouble with this ubiquitous argument is that it is a cost-benefit argument that simply ignores one of the major costs, that of changing the design later. Yet that cost can easily be as high or higher than the cost of building the original system.

Treating this sort of design change as an optimization problem (i.e., we'll measure and fix the bottlenecks later) is a category error. There are many OO systems that simply can't be refactored to solve the problems the OP is talking about.

Does this turn out to matter? Sometimes yes, sometimes no. Is there any way to measure it in advance? I doubt it. But that means there's no real cost-benefit argument here at all, only gut-feeling judgment and confirmation bias.


You should definitely weigh the cost of doing the extra work now versus doing it later. In profitable, stable ventures, time now and time later have similar costs. However, in new projects, time now is dramatically more expensive than time later.

Can you give an example of an OO system that can't be refactored to a data-driven system later? I ask because I've made very similar changes to Cal3D, converting overly-object-oriented code to memory-efficient data transformations and, thanks to unit tests, it wasn't hard at all.


Can you give an example of an OO system that can't be refactored to a data-driven system later?

The systems I was thinking of are ones I've worked on or consulted on. Mainly, they were just big and hard to change. The OO aspects didn't help, mainly because of their tendency toward object-graph spaghetti.

There was an inaccuracy in what I said (mainly for brevity). It's not true that you can't refactor such systems to solve their design problems. Technically, you can refactor anything into anything. What I mean by "can't be refactored" is "can't be refactored at a cost less than writing a whole new program". Even then, that's too strong, since you can't prove that. So strictly speaking I should have said "There are many OO systems where nobody who works on them can think of a way to refactor them to solve the problems the OP is talking about in a way that is easier than just rewriting the program." :)

I agree that test coverage makes this easier, although it also adds a maintenance burden.


Once, I had to evaluate a program for a rewrite; it was written in a naive "object-oriented" style. The program built up a large graph of objects, did a few transformations, and wrote its stuff out. It ran out of memory on small subsets of the data it needed to work on.

I evaluated the program's data usage and rewrote it with the metaphor that I had to process the whole thing from a tape drive. It was still object-oriented, but the memory needs were now bounded.
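
Sketched in C++, the shape of the rewrite looked something like this (the real transformation is elided; doubling a number stands in for it):

    #include <iostream>
    #include <sstream>
    #include <string>

    // "Tape drive" style: read one record, transform it, write it out, forget it.
    // Memory use is bounded by the size of a record, not the size of the data set.
    int main() {
        std::string line;
        while (std::getline(std::cin, line)) {
            std::istringstream record(line);
            double value;
            if (record >> value) {
                std::cout << value * 2.0 << '\n';   // stand-in for the real transformation
            }
        }
        return 0;
    }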

tl;dr: I don't see the dichotomy.


"tl;dr:" doesn't go down well around here. (Perhaps you should have just posted the first part of your comment. That part's good.)


Ironic, considering it isn't a comment on the OP but a summary of their own post, and could easily be replaced by something like "In other words..."


Oh, you are right. I assume the down-voters did not recognize the colon, either.


I've always thought OOP was an overused pattern. If you don't need inheritance or information hiding, what does OO give you that can't be accomplished more easily with functions and arrays/hashtables?


Whether inheritance is an essential quality of OOP is IMHO debatable. This leaves us with data abstraction and polymorphism.


And Haskell solves those two problems pretty nicely without OOP.



