Things in SQL Server which don't work as expected (tech.pro)
53 points by dangoldin on July 21, 2013 | 40 comments



I've worked with SQL Server for 20 years now, and you'd be amazed at how often the issues listed here actually come up. My Damascus moment was using Postgres for the first time. I kept discovering that it "just did the right thing" in a way that SQL Server never seems to.


SQL Server by default creates databases with non-ANSI SQL NULL behaviour. Explicitly specifying ANSI SQL compatibility when creating the database makes it "do the right thing".

Edit: I was doing this many years ago, so I'd forgotten the details (or they changed :). Just checked - it's "ALTER DATABASE" that can switch the various "ANSI_NULL" default settings from their default of OFF to ON.
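
For reference, a minimal sketch ("MyDb" is a hypothetical database name):

    -- Session-level setting (T-SQL):
    SET ANSI_NULLS ON;

    -- Database-level default:
    ALTER DATABASE MyDb SET ANSI_NULLS ON;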


In upcoming versions of SQL Server, the default will be the ANSI NULL behavior.

http://msdn.microsoft.com/en-us/library/ms188048.aspx


> Oracle and PostgreSQL sort NULL values last by default. Also, those two RDBMS support an extension to ORDER BY clause: each expression accepts possible modifiers NULLS FIRST and NULLS LAST

The wording may make one think NULLS FIRST and NULLS LAST are a proprietary feature of Oracle and Postgres. They're actually an optional feature of the SQL standard (T611, "Elementary OLAP operations").
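
For example (table and column names hypothetical), this works in both Oracle and PostgreSQL:

    SELECT name, hire_date
    FROM employees
    ORDER BY hire_date DESC NULLS LAST;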


If your expectations are learned in one RDBMS and you move to another, you will always find things that "don't work as expected."

This is why it's a bad idea to assume you can treat RDBMSs as interchangeable black boxes, unless you confine yourself to the most basic subset of standard SQL.


That would be a good defence if he were talking about leaving out optional parts.

He is giving examples of violating the standard.


Just to clarify, I'm not trying to defend anything. Just stating a fact.

I've seen the same thing with experienced SQL Server developers who come into an Oracle-based project: they gripe about how Oracle is "weird" or "doesn't do what I expect" and then they eventually accept reality and get on with the project.


So, just out of curiosity, which other RDBMS doesn't violate the standard?


I never said that.


I honestly think it's a very informative article. Clearly SQL Server doesn't adhere to the standard (whichever one you choose), but I'm not aware of any RDBMS that does, so I was curious as to whether there was one.

There are lots of amazing databases around, each with its own quirks, and as the gp says it's generally difficult to drop in replacements because of those quirks. A few of SQL Server's are listed in the article; I'm sure anyone could easily find where other databases deviate from the spec too. But anyway, getting back to what you said:

> He is giving examples of violating the standard

That's absolutely true and it's something every other RDBMS does.


> That's absolutely true and it's something every other RDBMS does.

To various degrees, and there's violating the standard (by adding non-standard constructs) and violating the standard (by failing to implement a feature entirely, or implementing it with different syntax or semantics).

http://troels.arvin.dk/db/rdbms/ is generally my go-to page for that stuff, as it lists standard solutions (with references, noting when the features used are optional) and alternative db-specific solutions (whether or not the database also accepts the standard one). Sadly it is not really up to date, and obviously does not cover the whole standard.


Regarding my comment above, my troll filter obviously had serious issues yesterday. : )


I've learned that lesson as well. As much as web frameworks try to make the database interchangeable, past a certain point it becomes amazingly difficult to switch.


This was mostly well done and really makes me want to take a closer look at Postgres. But a cheap shot came on the last one, "Readers and writers block each other". Both locking and snapshots are implemented in SQL Server, and you can pick your poison. Is Postgres or Oracle better? Is either method more efficiently or intelligently implemented in either of those systems than in SQL Server?


PostgreSQL is 100% pure MVCC, a little bit like the "SNAPSHOT" method described in the article: updates and deletes always create a new copy of a row, and every row carries visibility information - the oldest and the newest transaction that can see that row (if the row hasn't been deleted by any currently active transaction, the 'newest transaction' ID is empty).

There are definitely some drawbacks to MVCC. For example, many people are surprised how long "select count(*) from X" takes to run, because it has to do a full table scan to check which rows the current transaction can actually see. You need to periodically 'vacuum' tables, which updates a cached list of which table rows are deleted so they can be reused by future updates (modern Postgres will auto-vacuum, but it's another set of knobs for the DBA to tweak). Plus, there's the storage overhead mentioned by the article (although that really depends on the amount of churn in your particular application, and regular vacuuming keeps it to a minimum).
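
You can actually peek at the MVCC bookkeeping yourself (table name hypothetical); every row exposes its visibility bounds as system columns:

    -- xmin/xmax are the transactions that created/deleted this row version:
    SELECT xmin, xmax, * FROM mytable;

    -- Reclaim dead row versions and refresh planner statistics:
    VACUUM ANALYZE mytable;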


In fact, the latest stable Postgres supports fast, index-only scans for count queries: https://wiki.postgresql.org/wiki/Index-only_scans.


I think the last one is actually the most important one. The READ COMMITTED isolation level is a bad default. I have encountered many SQL Server databases that are unknowingly stuck with this default, with "WITH (NOLOCK)" added all over the place.

It is kinda funny that you used the phrase "pick your poison". That's exactly my feeling when working with SQL Server. I have never had a "pick your poison" moment with PostgreSQL :)
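
For reference, switching READ COMMITTED to the snapshot-based flavor is a one-line database option in SQL Server ("MyDb" is a hypothetical name):

    ALTER DATABASE MyDb SET READ_COMMITTED_SNAPSHOT ON;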


Ever since I got exposed to Haskell, I've been wondering why RDBMSes--notionally the things which should have the strongest type systems of all--have the moral equivalent of nullable pointers, rather than an Option sum type. There are many cases in the use of SQL queries where weird inconsistencies with NULL could be solved by just distinguishing between Nothing, Just Nothing, Just (Just Nothing), etc.


NULL doesn't mean "no value", it means "unknown value". It's much closer in meaning to IEEE floating point "NaN" than to Haskell/ML "None". In particular, NULL != NULL. Also like NaN (and unlike, say, Bottom), it is testable via special predicates.

(I think, though I may be wrong, that a lot of the trouble with NULL stems from the very fact that it is testable. Were it more like Bottom – i.e., merely a means of lazily representing erroneous expressions to overcome limitations of a non-dependent type system – designers would be forced to encode meaning explicitly and we wouldn't be having this conversation.)
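
A quick illustration of the NaN-like comparison semantics (with ANSI_NULLS ON):

    SELECT CASE WHEN NULL = NULL THEN 'equal' ELSE 'unknown' END;  -- 'unknown'
    SELECT CASE WHEN NULL IS NULL THEN 'is null' END;              -- 'is null'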


NULLs are (NULL is?) the way they are in SQL (terrible) because SQL is, in fact, terrible. I never allow nullable columns in my tables, but have to deal with null semantics because of e.g. outer joins. It is to weep.


The db designer can handle that, just add an fk to a table that defines the values required for a particular problem domain.

In other words, don't succumb to the laziness of using nulls.


> don't succumb to the laziness of

No, see, that's what I was questioning. Why is using NULLs easier, such that it's the "lazy" thing you can "succumb" to? UI is fundamentally about incentives--making good things easy to do, and bad things hard--and an API, or a protocol, is a UI for programmers.

Sure, instead of a NULLable column, you can have a pair of columns, one with an ENUM type to represent the sum-typeclass's type-instance, and the other to represent a value of that type. But that sounds like you're working at the wrong level of abstraction, fundamentally.

You're using an RDBMS because it lets you speak in terms of tuples, relations, projections, and other Relational Algebra concepts, rather than in terms of keys, values, and indexes. I can see doing the pair-of-columns thing in a NoSQL store--but why did the designers of SQL let this implementation detail infect my RDBMS?

It's kind of funny, actually. I see developers treating their RDBMS as a dumb object store... but then implementing a "domain model" layer right beyond it, basically creating a "model server" accessible over an API, which enforces the business domain constraints on the objects it sends and receives. In other words, they've Greenspunned an RDBMS of their own in all but name, with its own custom protocol and half-assed validation logic. I feel like a large reason for this is that current RDBMSes just aren't up to representing the types and constraints and relations that developers need to express, so they put up their noses and say "business logic doesn't belong in the database!" ...and proceed to build a database for their business logic to reside in.

It's times like this when I almost get an urge to develop my own RDBMS that actually keeps the Relational-Algebra abstraction clean. No concept of a "table" or a "view" or an "index"; just relations between projections of an infinite theoretical tuplespace, where tables are created to store data that can't be inferred from an existing view, and garbage-collected behind the scenes when their values are no longer referenced by any projection. Indexes created gradually when a tracing-JIT-like query planner notices repeated access patterns. Strong, structural, parametric type declarations on every piece of data. Constraints that actually allow you to express things like "a user-to-user 'friend' relationship must be symmetric", rather than just being a weak declarative dress over triggers. Etc.

Would anyone be interested in such a beast?


You don't need a pair of columns to replace a nullable column. I was addressing the second part of your comment not your questioning of allowing nulls.

Nulls have their place I believe, but they are grossly overused. For example, if you have a Person table, then something like a nullable DateDied is logical and acceptable over a sub-type solution. But something like HairColor should not be nullable to cater for bald people; it should be non-nullable and reference a table that has a 'none' value.
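
A rough sketch of that design (all names hypothetical):

    CREATE TABLE HairColor (
        HairColorId INT PRIMARY KEY,
        Name        VARCHAR(20) NOT NULL   -- includes an explicit 'none' row
    );

    CREATE TABLE Person (
        PersonId    INT PRIMARY KEY,
        DateDied    DATE NULL,             -- nullable, as argued above
        HairColorId INT NOT NULL REFERENCES HairColor (HairColorId)
    );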

On another note, I don't think you can eliminate the null concept, because NULLs are the implicit outcome of outer joins, and therefore a means of comparing and aggregating NULLs needs to be built in.


> if you have a Person table, then something like a nullable DateDied is logical and acceptable over a sub-type solution.

But this is exactly where NULLs are bad. "DateDied IS NULL" could mean many different things, semantically:

- This person hasn't died

- We don't know if this person has died (the information was not provided with the record when imported)

- DateDied is not an applicable property for this subtype of Person (for example, if you have an ImmortalPerson subtype.)

There are probably even more, if I wasn't just going off the top of my head. When you do an outer join, you get another level of NULLability--which would perfectly map to another layer of "Nothing | Just something"--but without that, you get those NULLs mixed in with the other three types of NULLs above. And then you insert all those into a table, export it, and import it somewhere else... and nobody knows what any of the NULLs mean.
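
To make the outer-join case concrete (tables hypothetical), the manufactured NULLs are indistinguishable from stored ones:

    SELECT p.Name, d.DateDied
    FROM Person p LEFT JOIN Death d ON d.PersonId = p.PersonId;
    -- d.DateDied IS NULL: no matching Death row, or a stored NULL date?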


Yes. Absolutely.


Yes. You would be a hero. I guess Date and Darwen's D language could be a starting point?


This is why you want to use SQL databases quite conservatively. Nearly every product I've seen that heavily leverages SQL has been an unportable mess.

... then you hit the performance wall, and say, "Well, just /one/ of these little features won't hurt, right?" And a year later you're singing the blues.

Standards should be executable.


An enlightening and thorough article. Note that it is also not possible to SUM() a BIT column. Why, I will never know.
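
The usual workaround is a cast (table and column names hypothetical):

    SELECT SUM(CAST(IsActive AS INT)) FROM Users;  -- SUM(IsActive) alone errors on BIT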


Is someone who finds something enlightening qualified to say whether something is thorough?


Good article. Another one that I find counterintuitive is that the sum of an empty set returns NULL, which goes against set theory (which says it should return 0), but apparently this is by design.
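
For concreteness (table and column names hypothetical), COALESCE is the usual way to get the set-theoretic 0 back:

    SELECT COALESCE(SUM(Amount), 0) FROM Orders WHERE 1 = 0;  -- returns 0, not NULL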


Why shouldn't the "sum" of an empty set return 'NULL'? I could understand why a 'count' of an empty set should be 0 (which it is in SQL Server).


    sum(A) + sum(B) = sum(A ∪ B)
    sum(A) + sum({}) = sum(A ∪ {})
    sum(A) + sum({}) = sum(A)
    sum({}) = 0


(This assumes either A and B are multisets, or they are disjoint.)


Indeed, but in the context of minor changes to how SQL should behave, we can assume that everything is a multiset.


As one of the comments notes, ORDER BY in subqueries is sometimes used for implementing pagination. I only have a limited amount of experience with SQL Server, but you ought to be able to accomplish the same using window functions:

    SELECT blah FROM
        (SELECT blah, ROW_NUMBER() OVER (ORDER BY blah) AS num FROM mytable) AS subq
    WHERE num < 10
Of course, it's anyone's guess how efficiently SQL Server will optimise that :-).
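
Worth noting: SQL Server 2012 added OFFSET/FETCH to ORDER BY, which does pagination directly (same hypothetical names as above):

    SELECT blah FROM mytable
    ORDER BY blah
    OFFSET 0 ROWS FETCH NEXT 9 ROWS ONLY;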


That is correct: http://troels.arvin.dk/db/rdbms/#select-limit-offset

Although note that if you're one of the Cursed Few who still has to use SQL Server 2000, you don't have ROW_NUMBER.


That's a useful site, thanks!


That was quite dull and for the large part spurious. ORDER BY on an IN clause, aggregating BITs, multi-column IN clauses, TRIM, ordering of NULLs, USING... All of these are of zero practical interest.

Also, all this talk of 'standards' is strange because there is no mention of what standard is being referred to.


Dull perhaps, but spurious? As someone who writes SQL in a couple of dialects, these are the sort of things that I need to keep in mind.


Lots of good tidbits and examples in there - I really enjoyed reading that. There are several not-quite-inaccuracies-but-not-really-how-you-would-think moments in it but, overall, it's very well written. The pictures were annoying after a while haha.

Just to clarify, this isn't "specific to SQL Server" - SQL Server was originally a straight-up "licensed by Microsoft" version of Sybase SQL Server, and thus its early code base (the things that defined the Transact-SQL language) has almost all to do with Sybase and naught to do with Microsoft. Things like the plus sign as a string concatenator, the way numbers are handled when compared to strings - all of that was the way Sybase did it and, for whatever reason, Microsoft didn't change it (we're talking way back in the late 80s and early 90s). In fact, there's not a single thing in that article that has changed or been implemented by Microsoft - nearly every one of the nitpicks/topics referred to by OP refers to the original choices made by the Sybase engineers. You could say, "Yes, but Microsoft hasn't changed this despite having 20+ years with the source code + standards!" and that's a valid point.[0]

1) TRIM - This has been asked for and discussed ad nauseam in the SQL Server and Sybase communities. It always circles back to VARCHAR: the variable character type automatically trims trailing blanks, so, years ago, developers said, "Ahhhh, screw it - no need for a TRIM() function! They can just use the LTRIM() function to get rid of the leading blanks." It's not correct IMO but that's always been the logic.
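
So the idiom everyone ends up with is nesting the two one-sided functions (names hypothetical):

    SELECT LTRIM(RTRIM(col)) FROM mytable;  -- the poor man's TRIM()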

2) "This behavior violates the standard and is specific to SQL Server." - I hate this statement. Which standard? What section? This is a case of it being easy to say something but, without backing it up with a citation, it implies that I trust the OP. I dont' know OP - maybe he's right but maybe he's interpreted "the standard" incorrectly, or is referring to an outdated standard, or is comparing SQL Server 2000's interpretation of a later-written standard. There's no way to know - it's just too ambiguous.

3) ORDER BY in subqueries - I laughed when I saw this:

    SELECT * FROM (SELECT TOP 100 PERCENT * FROM mytable ... ORDER BY id) q

In nested queries and inline (single-query) table valued functions, ORDER BY is not supported unless TOP is used with the SELECT.

Kinda, sorta - it depends on which version of SQL Server you are referring to. SQL 2012 - yes. SQL 2008 - yes. The "original" SQL Server would have actually sorted with TOP, as OP expects - but, at some point, MSFT changed it to the behavior he describes. I can't remember when - maybe SQL 2005 or, at worst, 2008 - but this is an example where using specific version names would be helpful.

[0] http://en.wikipedia.org/wiki/Microsoft_SQL_Server#History



