Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> Seven-table joins. Ugh.

What? That's what relationship databases are for. And seven is nothing. Properly indexed, that's probably super-super-fast.

This is the equivalent of a C programmer saying "dereferencing a pointer, ugh". Or a PHP programmer saying "associative arrays, ugh".

I think this attitude comes from a similar place as JavaScript-hate. A lot of people have to write JavaScript, but aren't good at JavaScript, so they don't take time to learn the language, and then when it doesn't do what they expect or fit their preconceived notions, they blame it for being a crappy language, when it's really just their own lack of investment.

Likewise, I'm amazed at people who hate relational databases or joins, because they never bothered to learn SQL and how indexes work and how joins work, discover that their badly-written query is slow and CPU-hogging, and then blame relational databases, when it's really just their own lack of experience.

Joins are good, people. They're the whole point of relational databases. But they're like pointers -- very powerful, but you need to use them properly.

(Their only negative is that they don't scale beyond a single database server, but given database server capabilities these days, you'll be very lucky to ever run into this limitation, for most products.)



People hate joins because at some point they get in the way of scaling, and getting past that is a huge pain.

Or at least, that's where the original join-hate comes from.

In reality of course, most of us don't have that problem, never had and never will, and it's just being parroted as an excuse for not bothering to understand RDMS's.

Relational database design is a highly undervalued skill outside the enterprise IT world. Many of the best programmers I've worked with couldn't design a proper database if their lives depended on it.


People hate joins because at some point they get in the way of scaling...

No, in fact, they don't.

Poor relational modeling gets in the way of scaling, and that can be geometrically exacerbated by JOINs. A JOIN, in and of itself, is neither good nor bad. It's just a tool, and like all tools, how you use it is what makes it "good" or "bad" — just like you can build a house or bash in a skull with a hammer.


In most relational database implementations, joins stop scaling after 10-50 million rows or so assuming an online transactional site.

A time series data warehouse could go into the billions of rows with scalable joins with partitioning and bitmap indices ... but is also only applicable in the unlikely case you could afford oracle at $60-90k/CPU list price

Also, most databases that aren't Oracle don't have high performance materialized views to "preprocess" joins at upsert time, therefore people resort to demoralized tables and their own custom approach to materializing those views.

Then even denormalized tables begin to stop scaling at around 250 million to 500 million rows. So people resort to sharding managed in a custom way.

I haven't even begun to express the scalability impacts of millions of users on a LRU buffer cache used in most RDBMS - that usually is resolved through an in-memory cache (Memcached, Redis) whose coherency is also managed in a custom manner. Or you could spend $$$ for Coherence, Gigaspaces, Gemfire, etc. but that's also unlikely in most web companies.

At the end of all this, even if you bought a cache, you wonder why you're using an RDBMS at all since you're so constrained in your administrative approaches. Cue NoSQL.

of course in practice many devs ignore all of this history and "design by resume" assuming their new social-mobile-chat-photo-conbobulator will be at Facebook scale tomorrow.


This is not true at all. I've worked on several databases with billions of rows in several tables. A good solution for improving your query performance is to use a multi column index http://www.postgresql.org/docs/9.3/static/indexes-multicolum...


What part isn't true?

I'll restate my narrative: Single instance, normalized, unpartitioned databases run into scaling problems the several-hundred million row range especially when under heavy concurrent load.

But once you start moving to multi-instance, partitioned databases, you start to lose the benefits of the relational model as most databases have to restrict how you accomplish things -- e.g. joins are severely restricted.


Oracle will handle anything you throw at it, assuming you have the $$$. Ebay uses it for 2 Petabytes of data:

http://www.dba-oracle.com/oracle_news/news_ebay_massive_orac...


That's a link discussing an analytic database from SEVEN YEARS ago. eBay has moved on.

Please understand what I am saying:

- Traditional database architectures have limitations on what you can express in SQL for highly available and scalable online transaction processing once you introduce partitioning and clustering.

- Oracle has probably the best support for partitioning and clustering out of all RDBMS, but even that has limits in the billions of rows

- Many companies do not use Oracle for business reasons (licensing/sales/pricing practices)

What I am not saying:

- Oracle sucks (it's the most feature complete and robust RDBMS out there and is );

- Oracle is not used (Amazon, Yahoo, eBay, etc. all use Oracle in various contexts);

- Oracle does not scale (it does, though it requires you, the SQL developer, to intimately know the database physical design at a certain point of scale, which defeats much of why SQL exists to begin with)


I routinely deal with joins on a 100 million row table and they work just fine. Other than that, I also use a 10 billion row table for searches. This is in Oracle.


I've also used databases with more than a 100 million rows in a single table and received realtime query performance in multi-table joins. And this is using Sqlite! No expensive DB licenses - but it was using a high-end SAN since we actually ran thousands of these multi-million row databases in parallel on the same server.


"demoralized tables" :)


For me, it was a database schema with 38 joins (and 2 additional queries) to effectively get the data to display a single page. For that use case, mirroring the data on save to MongoDB was a no-brainer... with geospacial queries out of the box, and a few other indexing features it made a lot of sense.

I wouldn't even think to use MongoDB for certain use cases... but for others, it's a great fit. I think that Cassandra, Riak, Couch, Redis and RethinkDB all have serious advantages and disadvantages to eachother and SQL.

I do find that MongoDB is a very natural fit for node.js development, but am not shy about using the right tool for a job.

Another thing that tends to irk me, is when people use SQL in place of an MQ server.


> For that use case, mirroring the data on save to MongoDB was a no-brainer

I think you just confirmed the OP's point -- MongoDB makes a good cache, not a good primary store. I'm guessing you didn't do updates into that MongoDB store, and always treated the SQL source as "authoritative" when it became necessary. Am I right?


I no longer work at the company in question, but the plan was to displace SQL for the records that were being used in MongoDB, for mongo to become the authority. NOTE: this was for a classified ads site for cars. Financial and account transactions and data would remain in the dbms, but vehicle/listing records would have become mongo authoratative.

The transition was difficult because of the sheer number of systems that imported/updated listing records in the database... there wasn't yet a 100% certainty that all records were tagged on update properly, so that they could be re-exported to mongo... each day, all records were tagged ... took about 24 minutes to replicated the active listings (about 50K records), and we're not talking "Big Data" here, but performance was much better doing search/display queries against MongoDB.


> In reality of course, most of us don't have that problem, never had and never will

Maybe you never had any problems, but I don't believe "most of us" can say the same. At least me, I'd encountered problems derived from join-abuse in almost every job I've had.


That's funny, because I've mostly encountered problems with people who prefer to nest SQL queries inside a sucession of loops in their code, rather than learn how to use SQL properly.


Yeah, me too. But having said that, I've also seen problems with mongodb and they're much, much, much, much harder to solve.


7 isn't necessarily nothing. Each join is O(log(n)), so I believe you're stuck with O(log(n)^7) as a worst case, although in practice it will probably not be so bad since one of the joins will probably limit the result set significantly.

The other problem is that with 7 joins, that's 7! permutations of possible orders in which the database can perform the join. That's a lot of combinations, and often you can run into the optimizer picking a poor plan. Sometimes it picks a good plan initially, and then as your data set changes it can choose a different, suboptimal plan. This leads to unpredictable performance.

I think that in practice, you're best off sticking with only a few joins...


If you're regularly doing 7 joins it's a good sign of an over normalized databased.


Nonsense. It very much depends on the problem domain.


> A lot of people have to write JavaScript, but aren't good at JavaScript, [...] they blame it for being a crappy language, when it's really just their own lack of investment.

I think it's pretty much an accepted fact that JS has its problems. Even Brendan Eich has been quoted as admitting it.

(Note: I am a JS developer myself)


This is true, but the "wtf js is such a fucked up language" meme is outsized compared to the actual problems of javascript. Having worked full time in python for a couple years I could easily show you just as many weird python semantics that will inevitably bite you[1]. I think the grandparent's point has merit, that people expect to invest in their primary language for a project, but when circumstances dictate that they need to use a bit of javascript they find it annoying.

1] What does this program do?

    print object() > object()


  >>> print(object() > object())
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: unorderable types: object() > object()
Yet another reason to upgrade to Python 3! :)


About your code snippet, it prints a "random" boolean value

You're creating two objects (with random addresses, which affect the __str__ method result, which in turn result in a string comparison that returns False or True)


That's nothing compared to the "Perl is Satan!!1" meme that us Perl programmers usually have to put up with ;)


Actually, Perl is many different Satans, depending on the particular stylistic quirks of the programmer in question.


TMTOWTDI: There's more than one way to damn it!


Not so much with the usual coding standards (see "Best Practices" book etc).


IMHO, it's well deserved. Start by stop having those $, @, % identifiers for variables and then we 'll talk again about how many more daemons you need to impale.


Sigils are what make Perl standout, in syntax and behaviour, to (most) other languages so it would be silly to get rid of them!

i.e. it's a differentiator to what is or isn't Perl.


That one is somewhat reasonable (and relatively obscure code you'd probably never write).

A more realistic example: inner classes can't see class variables from their enclosing classes. (Why enclose classes? - builder pattern)


Every language has its quirks. Python is not perfect either. But JS shows clear signs of bad design decisions, such as the behavior of the == operator.


What == does is pretty simple and easy to understand. If you have a hard time with it, use ===. Problem solved.


The problem is that JavaScript does not exist in isolation, there are other languages that use this operator. If you're familiar with any other C-derived language, the way == acts in JavaScript is very unexpected.

Yes, you can learn to deal with it, but that doesn't mean it wasn't a bad design choice. If both forms of equality are required to be operators, == and === should have been swapped. Too late to do it now, woulda, coulda, shoulda, but it certainly is, IMO, a "bad design" smell in the language, and hardly the only one that still bites people.

Another example: the way 'this' scoping works is similarly busted in that while the rules for it are reasonably straightforward in isolation, it is different enough compared to other languages that share the same basic keywords and syntax that it should have been called something else.

To be fair, I don't think much of this has to do with Brendan Eich being a bad PL designer as much as it has to do with the odd history of JavaScript that is still represented in the name of the language ("Take this client language you made which has no connection to Java, and make it look kinda like Java, please!").


I agree with you to a point. Joins are your friend. But trying to pull out all of the information about a graph of 'objects' using a single query with multiple one-to-many and many-to-many joins is just as foolish in SQL as in Mongo.


Do you have any resources you would recommend to understand or at least give an overview of indexes?

I learned basic SQL once-upon-a-time and understand the relational algebra side of things, but only truly picked up the finer details and specific engines in piecemeal manner, as needed in various projects.


http://use-the-index-luke.com/

This is a really good resource for understanding how the queries you do relate to the actual actions that the database engine takes.


http://use-the-index-luke.com/ is a great online book (free) targeted at programmers and developers. It's practically required reading in my opinion.



Star, constellation, snowflake, flat.. Developers (not the author) would benefit from database introductory course even if they are not using databases. I think Stanford did one that was open to everyone.


This article ends up agreeing with you at the end, by the way.


When program errors pass silently, that is a legitimate problem in the toolchain.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: