"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious." -- Fred Brooks, The Mythical Man Month (1975)
Guy Steele's extension of that idea sounds contentious, though: that OO encourages "data-first" design because the code itself is what gets encapsulated:
"Smart data structures and dumb code works a lot better than
the other way around."
This is especially true for object-oriented languages,
where data structures can be smart by virtue of the fact
that they can encapsulate the relevant snippets of "dumb
code." Big classes with little methods–that's the way to go!
Or maybe he is just encouraging OO programmers to think more in this vein?
The tricky part is that smartness of data structures is context-sensitive.
One of the most common design errors in OO systems seems to be building systems that beautifully encapsulate a single object's state… and then finding that the access patterns you actually need involve multiple objects, and that it's impossible to implement those algorithms efficiently because the underlying data points are all isolated within separate objects, and often not stored in a cache-friendly way either.
Another common design problem seems to be sticking with a single representation of important data even though it's not a good structure for all of the required access patterns. I'm surprised by how often it does make sense to invest a bit of run time converting even moderately large volumes of data into some alternative or augmented structure, if doing so sets up a more efficient algorithm for the expensive part of whatever you need to do. Again, though, it can be difficult to employ such techniques if all your data is hidden away inside generic containers of single objects, and if the primary tools you have for building your algorithms are generic algorithms over those containers plus methods on each object that operate on its own data in isolation.
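As a rough illustration of that "alternative or augmented structure" idea, here is a minimal C++ sketch with invented names (not taken from any particular codebase): spend O(n) up front building an index so the expensive phase isn't forced through the original layout.

    // Hypothetical sketch: records arrive as a flat vector, but the expensive
    // phase does many lookups by id. Spending O(n) up front on an index turns
    // each lookup from a linear scan into an average O(1) hash probe.
    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct Record {
        std::uint64_t id;
        double amount;
    };

    // Augmented structure: an index from id to position in the original vector.
    std::unordered_map<std::uint64_t, std::size_t>
    build_index(const std::vector<Record>& records) {
        std::unordered_map<std::uint64_t, std::size_t> index;
        index.reserve(records.size());
        for (std::size_t i = 0; i < records.size(); ++i)
            index.emplace(records[i].id, i);
        return index;
    }

    // The expensive part of the job now runs against the structure that suits it.
    double total_for(const std::vector<Record>& records,
                     const std::unordered_map<std::uint64_t, std::size_t>& index,
                     const std::vector<std::uint64_t>& wanted) {
        double total = 0.0;
        for (std::uint64_t id : wanted) {
            auto it = index.find(id);
            if (it != index.end()) total += records[it->second].amount;
        }
        return total;
    }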
The more programming experience I gain, the less frequently I seem to find single objects the appropriate granularity for data hiding.
The exercise of compartmentalising data into atomic islands of objects that dutifully encapsulate it becomes difficult during reassembly, simply because we recreate the need for a declarative style of accessing data in an imperative (OO) world. It's ye olde object-relational impedance mismatch.
A (relational) data model is a single unit. It has to be seen this way. Creating imperative sub-structures (like encapsulating data into objects) breaks this paradigm with serious consequences when attempting to rejig the object-covered data into an on-demand style architecture. The whole model (database?) must be seen as a single design construct and all operations against the entire model must be sensitive to this notion - even if we access one table at a time. Yes, at specific times we may be interested in the contents of a single table or a few tables joined together declaratively for a particular use case, but the entire data model is a single atomic structure "at rest".
When paradigmatic lines like this are drawn, I side with the world-view that getting the data model "right" first is the way to go.
Fred Brooks and Linus Torvalds speak from experience in the trenches.
This also comes up in relational databases. There might be a nice, canonical way to represent what the data really is (i.e. what it represents), but then the access patterns for how it is used mean that a different representation is better (usually, somewhat de-normalized). Fortunately, relational algebra enables this (one of Codd's main motivations).
A programming language is even more about data processing than a database is. But it still seems that data structures/objects represent something. I recently came up with a good way to resolve what that is:
In a data processing (i.e. programming) language what you are
modelling/representing is not entities in the world, but computation.
Therefore, choose data structures that model your data processing.
This definition allows for the messy, denormalized-like data structures you get when you optimize for performance. It also accounts for elegant algebra-like operators that can be easily composed to model different computations (like the +, . and * of regular expressions).
Have you noticed yet that programming ideologies go around in circles? Programmers may currently be defending data-first (/data-oriented/...) programming, but it isn't the first time they have done so.
The way I experienced it:
micro-services/small tools "that do one thing well"/... (pro: reasonably generic, con: loads of tools, requires expert users, if one tool out of 30 is not working well things break)
data-first/data-oriented programming (really easy to manipulate data, very, VERY hard to maintain consistency)
database-oriented programming (enforces consistency; otherwise data-oriented. Works well: where data-oriented programming would have let your data go quietly inconsistent, in this paradigm you get errors instead. Needless to say, "every operation errors out" is better than "our data suddenly became crap", but it still blocks entire departments/systems unpredictably)
event-driven programming (really easy to make button X do Y) (to some extent built into data-base oriented programming, also available separately. Works well, but gets extremely confusing when programs get larger)
object-oriented programming (massively simplifies the "I have 200 different message types and forgot how they interact" problems of event-driven programming, also provides the consistency of database-oriented programming)
<web browsers start here>
event-driven programming with UI designers (makes event-driven programming and later object oriented event-driven programming accessible for idiots)
declarative object-oriented programming / aspect-oriented programming / J2EE and cousins / "Generating forms from data" / Django
micro-services/"do one thing well" (same disadvantages as during the 80s)
data-first/data-oriented programming (same disadvantages as end-of-80s) * you are here *
How much do you want to bet that servers that enforce data consistency and store it are next ?
So, so much truth in this. I've also seen batch processing with terminal IDs move to real-time programming and operating systems, and then move back to browsers and session IDs (i.e. HTTP GET/PUT are the new punch cards).
Algorithms + Data Structures = Programs by Niklaus Wirth - does no one read this book anymore? It takes both, and both must be considered throughout the coding process. It's all math / computing is a matter of encoding and logic. 'A' = 41 hex = 01000001 binary. An array of 3 x 3 is also an array of 1 x 9. - A L Turing
Both are good places to start, and both should be given serious consideration early in the project.
Code is usually the worst place to start. If you do need code early on, I think it's fine to write it as quickly as possible, even ignoring edge cases and simplifying algorithms.
If you have good data structure design (and understand the invariants) and you understand the problem space well from a functional standpoint (including functional edge cases), you can get the code into shape later. The bugs you encounter will be mostly the trivial kind.
But if you misunderstand the user experience, or the data structures are a mess, then it's difficult to recover.
That's one of the reasons SQL databases are so great: they help you get the data design up and going quickly, and will enforce a lot of your invariants (through schema normalization, PK/FKs, and CHECK constraints). If anything goes wrong, rollback returns you to a known-good state.
I don't know anything equivalent at the user experience level, though.
This is one reason why prototyping is often so necessary. When you start from the user experience, you usually end up working your way back from the front-end: first you visualize what the user should see, then you mock up screens to give them that, then you figure out what they should be able to interact with and how, then you write code to make that happen, then you define data structures that make the code possible. When you start from data, you identify what data the user will need to manipulate, then you identify relationships between that data, then you write code to manipulate it, then you slap a UI in front, one which usually mimics the relationships you identified a priori.
There's a bit of an impedance mismatch here. Start from the UX, and you end up with a bunch of ad-hoc data structures that are very difficult to rationalize and inefficient to access. Start from the data, and you end up with a UI that mimics the data you were given and not how the user thinks about achieving their task.
The solution is to write a quick & dirty prototype focused on UX, nailing only the happy path that's core to the user experience. Then take careful note of the data structures you ended up with, and throw away the prototype. Then start with a carefully planned data architecture that captures everything you learned in the prototyping phase, but eliminates the redundancies and awkward access paths you wrote into the quick & dirty prototype.
The better solution is to start by designing the API, and to leave the design of the data model (to support the API) and myriad of user interfaces (as clients of the API) for later.
You don't know what should be in the API until you've had a couple clients built on top of it. Users will surprise you; oftentimes that little UI detail (the one that's really hard to fit into your API) is what will make them buy.
A similar argument also applies in the opposite direction. That is, it’s all very well beginning with a beautiful, idealised API, but if there is no efficient way to implement that interface then you’ve backed yourself into a corner before you even start.
Also, UIs that don't have to support actual data manipulation tend to underestimate the edge cases of that manipulation - which I tend to think is why you never get too far from the shell on Unix-likes.
And why so many database-backed webapps look like a frontend over a database, and virtually all mobile messaging apps are frontends over TCP/IP, and web directories before Google were just webpages with a whole lot of links.
It turns out it can be pretty profitable to break with the machine model of the underlying data and instead present things the way your users want to see them, even if it makes the code necessary look complicated and gross. The danger with building the data model or API first is that you'll build the easy, obvious API, which is the same easy, obvious data model & API that all your competitors build. Start with the user instead and you can end up with some really powerful differentiators, at the cost of it being a bitch to transform into working, sane computer code.
> It turns out it can be pretty profitable to break with the machine model of the underlying data and instead present things the way your users want to see them, even if it makes the code necessary look complicated and gross.
It certainly can, though I think you’re still figuring out a data model first on those projects. It’s just that you’re starting with how the information is organised and manipulated from the user’s point of view, then figuring out an internal representation to support it afterwards, rather than the other way around.
Experience (with this style of development) helps out here. For stuff you're completely clueless on though, it still happens. And that's when you revise your design.
It's far better to go in this direction than the other, in terms of code clarity, and it's pretty rare for it to impact performance... at all, as long as you're willing to go back and rewrite the ideal code to be a bit more practical when you notice an issue.
Which API? There are many different APIs within one application. You can't really start at one of them; you first need to decide which parts of your program need APIs between them.
The data constraints in an application will inform the user experience. Until you know what your application has to do with data any UX design will be guesswork.
I agree that the user experience development should come before any code, but it can't come before designing the data structures.
> Or, start from the user experience.
> Both are good places to start
Reminds me of Merise - a modeling method where the data model and the process model co-evolve in parallel. After many years I still have a fondness for it - but maybe that is just the French 80's kid in me...
Ah yes, takes me back to the time when 'Data Analyst' was an actual profession, and organizations planned their workloads in order to place an appropriate order for 'enough' IBM hardware and DASD. Back then, during the first wave of business automation via computer, the processes being automated were so well defined that this level of understanding was achievable; when you have rooms full of clerks calculating monthly payroll, you KNOW what the data is.
In so many domains today, you don't know, and the business is forgiving enough (or amateur enough?) that THEY don't even fully understand their processes.
Everyone has bashed Cobol since the dawn of microcomputing, but the definition of a "Data Division" section, where all data is specified before literally any actual line of code, is something very interesting.
This aspect of the language is largely what made the Y2K transition so successful, and today these Fortune 500 companies are still stuck with their legacy systems, which were written by people "with a clue" back then.
Cobol is interesting, but horrifically impractical in practice. Cobol is my bugbear language... I just couldn't solve problems in it. It somehow separates what should go together and puts together what is unrelated.
> the business is forgiving enough (or amateur enough?) that THEY don't even fully understand their processes.
After the first few waves of automation, the business made the savings to justify automation by sacking everyone who understood the processes, all the RPG programmers who moved the processes into code have retired, and now we're getting by on hope and hacks.
Or agile enough. Most of the time, when business processes are a mess, it comes from learning from their customers and being relentlessly customer-focused.
Code samples or GTFO. No seriously: an article that brings up messy real-world OO code and then handwaves it away with "just use FUNCTIONS, yo!" and some box-and-arrow diagrams is useless and borderline unhelpful. Something like:
is, IMO, more valuable in the discussion. It points out the tradeoffs explicitly: pure functions mean (in this case) immutable data, which means more-complex data structures - which means the OPPOSITE of the micro-optimization stuff this article closes with.
It's also worth noting that if you're fussing about the cache efficiency of code you haven't written yet, STOP. JUST STOP. Write the code and MEASURE IT.
There are some things worth doing the napkin math for before even bothering to write the code. Cache efficiency is one of them. A design mistake discovered long after you've written a few thousand lines of heavily depended-on code is rather difficult to tear out once you realize the design was wrong to begin with.
Since it's mostly game engine programmers presently driving this design philosophy imagine a scene in a game. There are some number of entities in the scene. Is that number infinite? No. Do each of those entities exist in their own bubble? No. You want to process all of the entities that need updates all at once. One transformation. You also know that from frame to frame there may be 0 - N entities in the scene but you don't want to be constantly allocating, deallocating, and resizing data structures. So you sketch out a reasonable number of entities to support in a scene. You know the data each entity needs to track to participate in the scene and you figure the best way to pack all of that into a struct of arrays so that processing that update is fast and doesn't waste a byte of cache.
You haven't even written a line of code yet.
There is a give and take. You have to test out your assumptions and not be afraid to throw away bad code and start again. Once you start down a path you may find that the statistically significant number of accesses for entity data happens for the members related to the physics update. You might choose a different structure to pack that data into apart from the cold data that doesn't change much.
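Concretely, here is a minimal C++ sketch of that layout (names and capacities made up): a struct of arrays sized for a budgeted number of entities, with the per-frame "hot" fields split from the rarely touched "cold" ones, and one pass over the hot arrays to do the update.

    #include <cstddef>
    #include <vector>

    constexpr std::size_t kMaxEntities = 4096;   // budgeted up front, not unbounded

    struct Vec3 { float x, y, z; };

    // Struct of arrays: hot data the per-frame update touches, cold data kept
    // out of the hot loop's cache footprint.
    struct Scene {
        std::size_t count = 0;
        std::vector<Vec3> position;   // hot
        std::vector<Vec3> velocity;   // hot
        std::vector<int>  material;   // cold: rarely read, never in the hot loop
    };

    Scene make_scene() {
        Scene s;
        s.position.resize(kMaxEntities);   // allocate once, up front
        s.velocity.resize(kMaxEntities);
        s.material.resize(kMaxEntities);
        return s;
    }

    // One transformation over every entity that needs the update.
    void integrate(Scene& s, float dt) {
        for (std::size_t i = 0; i < s.count; ++i) {
            s.position[i].x += s.velocity[i].x * dt;
            s.position[i].y += s.velocity[i].y * dt;
            s.position[i].z += s.velocity[i].z * dt;
        }
    }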
Much of data-oriented design is like this: save the modelling for documentation and diagrams and things that humans understand. Write the code for the machine since the machine is the platform. You lose mechanical sympathy when you start designing data structures based on intuition.
It doesn't cost you anything to think about these things. It's just engineering.
I don't think the article was advocating pure functional programming. More of a data-oriented/array-programming style, which has been gaining traction in game programming. There are similarities, but more differences than similarities.
I recommend you watch the Mike Acton talk he linked at the end (even if you aren't a C or C++ developer).
P.s. And yes, I agree you should measure (obviously), however that is only one half of the equation. You make assumptions about the performance issues and verify them by measuring. Measurement on its own is very poor signal-to-noise, but your assumptions can be wrong. Both are needed to optimize code well.
If what you said were true, it wouldn't be possible to design for efficient cache access. And it is.
Huh? I read the article as being about simple code that is easier to understand and maintain because it's not semantically overloaded with abstractions, and picking data structures that fit the need of your process, instead of some patterns because Herb said so or something. Usually you don't need code samples for that because it's such a common bad thing.
AFAIK (I'm no Haskell guru), Haskell doesn't let you specify memory layout explicitly. Without that you won't see the massive gains that come from data-oriented design.
No, the lazy evaluation in Haskell would make it difficult to reason about the runtime performance in his case.
No, the lazy evaluation in Haskell would make it difficult to reason about the runtime performance in his case. The conclusion is to use tightly packed data structures to take advantage of the cache locality of modern CPUs when there are high performance requirements. That can be done in most languages.
It can be done less easily in Haskell, because even if arrays are available (albeit as second-class citizens) and hopefully well packed, the actual structure and memory layout of their elements is an implementation detail (and, in practice, it tends to be unusually pointer-rich because it has to support lazy evaluation).
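For a rough C++ illustration of the layout difference being described (made-up types, not from the article): both functions compute the same thing, but the second is closer to what a pointer-rich, boxed representation looks like in memory.

    #include <memory>
    #include <vector>

    struct Sample { float value; float weight; };

    // Packed: one contiguous block, sequential access, predictable prefetching.
    float sum_packed(const std::vector<Sample>& samples) {
        float total = 0.0f;
        for (const Sample& s : samples) total += s.value * s.weight;
        return total;
    }

    // Boxed: same arithmetic, but every element sits behind its own pointer
    // and may live anywhere on the heap.
    float sum_boxed(const std::vector<std::unique_ptr<Sample>>& samples) {
        float total = 0.0f;
        for (const auto& p : samples) total += p->value * p->weight;
        return total;
    }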
That's interesting, we don't generally consider relativistic effects on computation.
There could be an interesting synthesis between Minkowski spacetime, lambda calculus, and Lamport clocks that will either a) show that the fundamental unifying element in the universe is functions (and really annoy at least one side in the disagreements over the Black Hole Information Paradox), or b) make for a terrific Greg Egan novel.
My first programming (Fortran!) teacher listed input, storage, decisions, and output as the four concepts needed. Later I learned about Turing machines, same deal.
Point to where the decisions are in SKI calculus. SKI calculus is as strong as Turing machines and lambda calculus.
(If you squint right, you can see the decision making bit. But it's a tight squint.)
--- (from Wikipedia:)
The evaluation operation is defined as follows:
(x, y, and z represent expressions made from the functions S, K, and I, and set values):
I returns its argument:
Ix = x
K, when applied to any argument x, yields a one-argument constant function Kx , which, when applied to any argument, returns x:
Kxy = x
S is a substitution operator. It takes three arguments and then returns the first argument applied to the third, which is then applied to the result of the second argument applied to the third. More clearly:
Sxyz = xz(yz)
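For what it's worth, here is where the decision-making hides, using the standard encoding of booleans in SKI (not part of the quoted excerpt): take T = K and F = SK. Then
Txy = Kxy = x
Fxy = SKxy = Ky(xy) = y
so applying a "boolean" to two branches, bxy, is the decision: it returns one branch and discards the other.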
Does the conclusion boil down to "use functional programming"? The author doesn't specify the way data or functions are encapsulated in the final figure. But my gut reaction is that this describes FP better than OO or any other paradigm.
Not really, it's basically 'composition over inheritance', and laying out data structures in a cache-friendly way, and only then build the functions that work on those data structures (the common term is 'data-oriented design').
It's the mantra that game developers have been preaching for at least the last 5..10 years and which slowly makes its way into the general programming world, because the CPU/memory performance gap is getting wider, not narrower. The bitsquid engine blog (now: Autodesk Stingray engine) has a lot of really great articles on that topic (for instance http://bitsquid.blogspot.de/2014/08/building-data-oriented-e...).
No. Because the data structures he is referring to need not be immutable. Think "C struct". Structs in C are more closely aligned with components and lend themselves to composition much easier than a tower of inheritance. If you ever try Unity after bashing your head in with C++, you'll know what a godsend components are.
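A minimal sketch of what that looks like (invented names; Unity's actual component API is much richer than this): entities are just ids, behaviours are plain structs stored in flat containers, and "what an entity is" becomes a question of which components it currently has.

    #include <cstdint>
    #include <unordered_map>

    using EntityId = std::uint32_t;

    struct Transform { float x = 0, y = 0, z = 0; };
    struct Health    { int hp = 100; };
    struct Engine    { float rpm = 0; };

    struct World {
        // A real engine would use packed arrays here; maps keep the sketch short.
        std::unordered_map<EntityId, Transform> transforms;
        std::unordered_map<EntityId, Health>    healths;
        std::unordered_map<EntityId, Engine>    engines;
    };

    // A free function over the data it needs, not a method on a deep hierarchy.
    void rev_engines(World& w, float amount) {
        for (auto& entry : w.engines) entry.second.rpm += amount;
    }

    int main() {
        World w;
        EntityId player = 1;
        w.transforms[player] = {};
        w.healths[player]    = {};
        w.engines[player]    = {};   // change what the player is by adding a component
        rev_engines(w, 500.0f);
    }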
I admire the combination of theoretical audacity and sweeping pragmatism taken by the authors of that paper.
They mention Oz a few times. I am big fan of "Concepts, Techniques, and Models of Computer Programming" by Van Roy and Haridi, which shows the reader how different computing paradigms can be constructed from a kernel language in Oz. That makes the text something of a kindred spirit to the paper, insofar as it tries to aid the student in grasping many different paradigms of computation (functional, object-oriented, concurrent, and relational are covered).
The Haskell using folks at Standard Chartered Bank have been implementing some of the "Out of the Tarpit" ideas for a few years now. I can confirm, functional relational programming does work really well in practice.
It's interesting that he's advocating this on games.
I've been annoyed at most OOP codebases, but conceded that games and GUIs were the canonical applications of OOP. Perhaps you could say that highly stateful apps running on a single machine can use state encapsulated in objects.
In contrast, for web UIs and their "big data" back ends, I believe you also want to use a data-oriented approach, rather than a code-oriented approach (or at least this is the style that my code has converged upon). You don't need to wrap everything up in objects all the time when you're really just passing strings through the network and doing light processing. And when you're not even using a type-safe language to begin with.
But he's advocating the same style on games, just with structs and functions rather than buffers and processes. This makes me think OOP is more of a mistake, although I will admit that I use objects as modules now (a longer conversation, but the idea is to instantiate almost all your objects at startup, and not allocate during normal program flow).
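As a rough sketch of that "objects as modules" idea (hypothetical names, obviously simplified): the handful of long-lived objects are built once in main and wired together, and the steady-state path just calls into them without allocating.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    class Logger {
    public:
        void log(const char* msg) const { std::fputs(msg, stderr); }
    };

    class Store {
    public:
        explicit Store(std::size_t capacity) { data_.reserve(capacity); }  // allocate up front
        void put(int value) {
            if (data_.size() < data_.capacity()) data_.push_back(value);   // never reallocates
        }
    private:
        std::vector<int> data_;
    };

    class App {
    public:
        App(Logger& log, Store& store) : log_(log), store_(store) {}
        // Normal program flow: no new objects, no allocation, just data moving through.
        void handle(int value) {
            store_.put(value);
            log_.log("handled\n");
        }
    private:
        Logger& log_;
        Store&  store_;
    };

    int main() {
        Logger logger;            // all "modules" instantiated at startup
        Store  store(1024);
        App    app(logger, store);
        for (int i = 0; i < 10; ++i) app.handle(i);
    }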
I've long thought there was this interesting historical oddity in that we came up with languages in the 90's and designed them for GUIs, but we're using them in the 2010's for web and server programming. Then you get web apps that are a mess of misleading objects, when really the architecture is stateless (or the state is stored in a database, which is not programmable using your language).
I think our main error was in thinking that objects could be pervasive, in the same way that today one encounters "test-driven" culture.
They work reasonably well at a lower level of granularity - make a fairly complex data model using components under the hood, and then publish the resulting api as a more streamlined mutable object.
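Something like this, perhaps (a sketch with made-up names): the data model underneath is component arrays, and what gets published is a small mutable object that indexes into them.

    #include <cstddef>
    #include <vector>

    struct Positions  { std::vector<float> x, y; };   // component storage under the hood
    struct Velocities { std::vector<float> x, y; };

    // The streamlined, mutable object published as the API.
    class Particle {
    public:
        Particle(Positions& p, Velocities& v, std::size_t i) : p_(p), v_(v), i_(i) {}
        void  set_velocity(float vx, float vy) { v_.x[i_] = vx; v_.y[i_] = vy; }
        float x() const { return p_.x[i_]; }
        float y() const { return p_.y[i_]; }
    private:
        Positions&  p_;
        Velocities& v_;
        std::size_t i_;
    };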
But the diminishing returns are fast once you start needing each piece of data to be fungible and adapt to new requirements, and then wrap it up in an object. The relational database has been standardized for some time in large part because it guides you towards coherent data automatically - systems that resist that model are more likely to experience a catastrophic failure, and "ORM" systems are likewise brittle.
Compare with "The Next Mainstream Programming Language: A Game Developer's Perspective" (https://news.ycombinator.com/item?id=1029820), by the people behind Unreal. They advocate a functional approach for games, too.
When I took my first APL class the prof drilled into our heads to focus on data representation before writing any code. Over the years and across a dozen languages or so this has proven to perhaps have been the most valuable thing I learned in CS. I lost count of how many problems I've seen go from difficult to manageable and even simple to solve by devoting more effort towards finding the best way to represent the data or the problem than simply taking things as presented and writing code right away.
I feel the author has chosen poor examples for illustrating 'bad code'. Does he really think so little of the id software team that they aren't aware of these things? There's no mention of the fact that in the real world, all the patterns and weird tricks there are won't save you from compromising on a perfect design to get stuff done & shipped.
I don't think it's a bad example. I remember John Carmack saying Doom 3 was their first C++ project, that they weren't very familiar with the language, and that they made some poor decisions because of it.
This idea of data-oriented design has been influencing much of how I write code and approach problems these days. In dynamic GC'd languages it's all too easy to trick yourself into believing that all problems are unbounded. However most problems are not. And doing the math ahead of time allows you to write the minimal amount of code to do the transformation required.
I think the author has some great points to begin with, but the finale is a little.. lacklustre.
It's all well and good to say we can separate process from objects, but in reality it's not uncommon to require instantiating new objects/components dynamically or changing structure in ways that don't map so easily to the input->process->output diagram shown.
I also think that maintainable code and testable modules were among the niceties of OOP, ahead of all the inheritance spaghetti that started happening in many poorly-OOPed projects.
Kudos though to the author for showcasing his first few points by commenting on Doom's code- you don't often see that.
Couldn't agree more. Whatever program we write, it is manipulating and transforming data. If our code revolves around data, it couldn't make more sense to formulate the data before actually starting on the code that is supposed to work on that data.
I used to think I can communicate the architecture (including code) of an entire website using just the tables. And a good enough developer+designer could turn the schema into a website.
I so wish C++ would let me code like this without getting in the way with its OO-ishness. Recent gripes:
- got a few bit flags per struct, need to know if there is at least one instance with the flag set (a counter per flag). I cannot bit-pack the flags without duplicating the flag getter/setter/destructor code.
- no way to declare static-linkage functions as friends (the linkage is forced to external), so I need to expose types that are implementation details in a public header.
It would be nice if C++ could get some RAD-slanted design work, instead of trying to get all RUSTy.
(expanding on the above while using non-touchscreen keyboard)
The problem I continually encounter writing data-first in C++ is the inability to group data members that relate to a single implementation detail but are scattered across multiple classes, and to indicate in a compiler-parsable way which functions have the responsibility to manage those data fields. For example, consider maintaining M-to-N relation links between classes A and B, without putting the links in a parent class, while grouping all implementation details of those links in a single file. Or even maintaining a reference counter without using inheritance or the other heavy-OO machinery mentioned in the article.
Even worse if your data is only a bitfield.
C++ assumes a single implementation detail to be a single class, contained in a single allocation. That can be subverted using "public", "friend", or CRTP, but it feels like hammering nails with pliers.
Newer standards go all-out on supporting really powerful data structures in the library, and on writing incredible templates, but there is nothing for low-level, C-style custom data handling.
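For what it's worth, here is roughly what the "M-to-N links as their own implementation detail" version looks like if you give up on access control and just write the C-style data (invented names; the pain point above is that C++ offers no clean way to say "only this file may touch these fields"):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct A { /* domain data */ };
    struct B { /* domain data */ };

    // The whole relation lives in one flat table, maintained only by the functions below.
    struct ABLinks {
        struct Pair { std::size_t a, b; };
        std::vector<Pair> pairs;
    };

    inline void link(ABLinks& links, std::size_t a, std::size_t b) {
        links.pairs.push_back({a, b});
    }

    inline void unlink_all_for_a(ABLinks& links, std::size_t a) {
        links.pairs.erase(
            std::remove_if(links.pairs.begin(), links.pairs.end(),
                           [a](const ABLinks::Pair& p) { return p.a == a; }),
            links.pairs.end());
    }

    inline std::vector<std::size_t> bs_linked_to(const ABLinks& links, std::size_t a) {
        std::vector<std::size_t> out;
        for (const ABLinks::Pair& p : links.pairs)
            if (p.a == a) out.push_back(p.b);
        return out;
    }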
> The boss comes in and says "Hey, change of plans. The player is now a car."
Stupid decision-making is why I will never work in games. With regular coding, you can iterate towards the domain and so long as you keep your wits about you and never make the same mistake twice, never be subjected to such a demand.
In the arts, everything's got to hew to some asshole's vision. The people building it can never really know what's going on in his head and the asshole never understands how much work it is to change things to fit the evolving vision. Killer features on games are rarely evident from day one, game shops seem to always wind up looking like sausage factories.
> Stupid decision-making is why I will never work in games. With regular coding, you can iterate towards the domain and so long as you keep your wits about you and never make the same mistake twice, never be subjected to such a demand.
In the ideal environment, perhaps; all too often in real environments, radical shifts in requirements not discoverable by advance consideration of the domain can happen outside of games (sometimes, because there are radical shifts in the domain -- if your system is automating implementation of corporate or government policy, whims of policymakers can change things as radically as "the player is now a car").
I work for a marketing company, I'm well aware of shifting whims. You can build flexibility into your infrastructure if you really need it. You come up with a model for how things are likely to change in the future and work modularity into your codebase.
This approach doesn't really work for games. Everything is so interdependent that modularity allowing for all the possible ways the system could change would slow it down to a crawl. There's just no way to avoid expensive, painful rewrites.
This is somewhat true, but you structure the code in a way that makes rewriting any one part easy. You keep your modules fairly large, your abstractions thin, and the code very literal. It's much easier to do a total rewrite of code like this than of code that's heavily abstracted.
It's very quick to write code like this, and the maintenance cost of this isn't as large as you might think (honestly, after you get used to it, you often end up preferring it. It's very, uh, blunt, for lack of a better term).
Edit: Worth noting that these rewrites aren't always due to a change in creative direction. There are often ones for technical reasons.
via "Objects have not failed" -- Guy Steele, http://www.dreamsongs.com/ObjectsHaveNotFailedNarr.html