User-defined Order in SQL (begriffs.com)
314 points by grzm on March 21, 2018 | 134 comments



Having run into this problem a few times now, Approach 1 is unfortunately best. Two reasons:

(1) If updates fail, you don't get a resulting broken state

(2) The size of the lists is not that large (<50), so the extra writes are not that expensive.

Essentially, with any variation on fractional updates (where only 1-2 rows are updated per list update), the client is sending to the server "move item a to position x". Then, shortly after, "move item b to position y", etc. And if any intermediate step fails, the server and client are out of sync.

If, instead, the client message is "here is the new state of the list: (a,b,c)" and the server updates, failed requests only leave the state temporarily out of sync. The client will continue to send the complete correct state at each update.
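For what it's worth, the "full new state" write can be a single statement in Postgres. A minimal sketch, assuming a hypothetical todos table with id and pos columns:

    -- the array is the client's complete desired order;
    -- WITH ORDINALITY numbers its elements 1..n
    UPDATE todos t
    SET pos = new_order.ord
    FROM unnest(ARRAY[3, 1, 2]) WITH ORDINALITY AS new_order (id, ord)
    WHERE t.id = new_order.id;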


I've run into this problem when keeping a single large (100000+) list sorted. In my case it was very important not to change the sort value on rows that didn't need to.

I had considered most of the solutions listed in the article except for the true fraction method. I believe the author explored that solution because they thought floating point was somehow solving a problem that couldn't be solved by integers. I settled on an approach that is essentially Solution 2 but using integers instead, each time bisecting the two neighboring integers. The important aspect is that when inserting at the head or tail of the list, you bisect against MININT (-2^63) or MAXINT (2^63-1) respectively.


Have you considered using a string and having them sort lexicographically? To insert, you have to create a new string that sorts between the two, which this library proposes an algorithm for: https://github.com/fasiha/mudderjs


This was my thought, too. Use a point-based versioning-like scheme as a string (1.1.25, 1.1.26, 1.1.26.1, 1.2, etc). You will have to worry about zero-padding, though, as the numbers cross powers of ten, if you want it to sort correctly, and of course there is presumably a hard limit on the size of the string.

It might be possible to have a second table that keys to the first and holds a level and a value. For 1.1.26.1 above, it would have the rows (1, 1), (2, 1), (3, 26), (4, 1) (with keys to the first table's rows). Not sure exactly how to write the SQL to join it all together and sort it, but it seems possible, at least if you have some known maximum number of levels.
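One possibility, as a sketch: Postgres compares arrays element-wise (shorter prefixes sort first), so aggregating the levels into an array gives the version-style order. Here item_path(item_id, level, value) is the hypothetical side table described above:

    SELECT i.id, i.task
    FROM items i
    JOIN item_path p ON p.item_id = i.id
    GROUP BY i.id, i.task
    ORDER BY array_agg(p.value ORDER BY p.level);  -- array comparison does the multi-level sort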


If you use the full Unicode space as the alphabet, you can probably get a lot of mileage out of just having a 255-char column (or bigger, if expecting a large number of items constantly changing). Most databases have an inbuilt way to sort text lexicographically, so you won't have to do anything extra. The smarts are in the computation of the ordering string.


Nope but I will next time. This is quite clever. It's likely equivalent to using arbitrary precision numbers but probably performs a lot better.


Agree. I wonder what problem anyone could have with Approach 1 for which the other approaches would be the optimal solutions.

If your problem is there are too many items in the list, that's going to be a usability problem so you'd probably need to categorize the items.

If performance is such an issue that ordering <50 items with approach 1 is unacceptable, you undoubtedly have performance issues with the rest of your app too.


I assume that this method was invented for storing some large ordered list - not for a literal Todo list app. Any realistically sized Todo list can be stored anywhere. Hell, you could probably use pingfs for it.


You could even use paper, actually :)


I agree, but I use a small improvement to the first approach: split each insert into two steps: "insert at end" and "reorder". That way: (a) the insert is nice and simple and (b) you can reuse the code for re-order which you likely need anyway (and which solves the foreign key constraint issue, perhaps by doing it in a single SQL update).


That was my first thought. Disallow inserting into the middle of the list and solution 1 becomes much simpler.


Er, wouldn’t you do all the updates in a single transaction? If one fails then it rolls back all the changes.


Yes, ideally in a single UPDATE statement.


I'm not sure how to do that because a single UPDATE statement only has one WHERE clause. You'd have to come up with an expression to put in the SET part that gives the right values for different rows.

I guess you can always swap pairs of rows with subtraction, like swapping row 5 and 7 would be:

    update t set user_order = 5 + (7 - user_order) where id in (123, 456)
Maybe you could use case statements to make an enumerated function in the more general case of updating multiple rows, but that's a bit ugly.


Postgres bulk update:

http://sqlfiddle.com/#!17/b3d19/1

You can do something similar in MySQL with temp tables.

Bulk update through upsert (slightly different than a bulk update, since it can create rows too)

MySQL: http://sqlfiddle.com/#!9/a6f050/1

Postgres: http://sqlfiddle.com/#!17/a8367/1


And preferably you use the postgres specific "unnest(array1, array2, ...)" function which turns array parameters passed into the query into a temporary table that you can query.

    UPDATE example
    SET position = CAST(temp.position AS INTEGER)
    FROM unnest(:ids, :positions) AS temp (id, position)
    WHERE example.id = CAST(temp.id AS INTEGER);


OK, so SQL extensions - so just the usual reasons to do bulk changes. I thought they might have been mentioned here because there was something unique to say about them in the context of this problem.


This is just off the top of my head, but to move an item from position P to Q, you would do something like this:

    update items
    set pos = case
        when pos = P then Q
        when P < Q and pos between P and Q then pos - 1
        when Q < P and pos between Q and P then pos + 1
        else pos
    end
Plus a where clause to limit the shuffling to the actual list being reordered, since you would keep all users' lists in the same table.


idk if this is what the parent had in mind, but this seems straightforward enough:

    update t
    set user_order =
        case
        when id = 123
        then 500
        else user_order + 1
        end
    where id = 123 or user_order >= 500;


If you do them in a transaction with the correct isolation level, it will appear as a single atomic update.


We did #1 at a previous job - the users wanted their common countries in a list of countries to appear at the top of the drop-down (US, Canada, Germany, etc) and not in the natural sort position. It was something that practically never got changed once it was set up, so the update problem was handled manually via DBA scripts.


You could probably do #1 pretty easily with a descending sort on each country's population.


I agree. I have found this to be the 'easiest' with 'smaller' list sizes (definitely less than 50 - ordering of UI elements).

Implement a specific call that takes a list of all IDs in the new order, and batch run the UPDATEs on the server side.

I did enjoy the article, but did not realize it was PostgreSQL specific until towards the end.

Very clever, Argyle.


We also went with option 1, but sidestepped the OP's problem by not requiring the index to be unique. I guess our naive approach was pretty expensive, but in two steps, we first incremented all values from the target position, then inserted the new item. It was performant at our scale, and seemed elegant at the time.


You should generally use some concurrency control approach like optimistic or pessimistic locking (UPDATE state = new WHERE state = old / SELECT FOR UPDATE - but the latter isn't a good idea for UI).


I agree, but I note that Hibernate will put nulls in your list if you have gaps in the sequence. I have run into issues as a result of this.


> Robust? Yes. It would take virtually forever to run out of space through repeated list reordering.

I think the author missed a pathological case with using the Stern-Brocot tree.

The goal is to sort each item with a score. Then to insert something between two items, you calculate a new score that is "between" their existing scores.

- Option 1) Take the average of the two scores a/b and c/d:

  (ad+bc)/(2bd)
- Option 2) Take the mediant(https://en.wikipedia.org/wiki/Mediant_(mathematics) ):

  (a+c)/(b+d)
We don't want to use the first option, the average, because almost by definition you would require at least one more bit of precision in the denominator for each insert, due to the multiplication by 2. This limits you to a meager ~64 repeated inserts before overflowing the denominator.

So the mediant is better because the precision requirement is limited by the number of times you can add instead of multiply. This might seem to grow a lot slower at first. And the author even showed that if you're repeatedly taking the mediant of 0/1 and the new score, you get the pattern 1/2, 1/3, 1/4, 1/5, 1/6, .... Which means you can handle up to 2^precision inserts, which is much better than before.

But this only applies to inserting at the front (medianting with 0/1) or the back (medianting with 1/1), because each step only adds a 0 or a 1.

If you instead go down a zigzag path through the middle of the tree, you can get the two numbers being added to be much closer in magnitude, and things explode. For example: 1/1, 1/2, 2/3, 3/5, 5/8, 8/13, 13/21, 21/34, ... (it's Fibonacci!).

So essentially you get the same thing as the averaging case: the denominator grows geometrically (by a factor of the golden ratio, 1.6180339887..., per step instead of 2) and it will overflow quickly.

This isn't a purely theoretical edge case either. The example from before can be realized by repeatedly inserting between the last two items inserted, which arises naturally if you're repeatedly adding to the middle of a list.


As far as I can tell, there's really no advantage to using floating point/fractions at all. I've used an integer-based solution for this very problem in the past.

You can apply the same logic of finding the midpoint between two integers using very simple integer math, store the result as an integer, and get high performance. Just don't start by populating your list with small numbers: the first number would be 0, and the second one would be the midpoint between 0 and either MAXINT or MININT, depending on whether it comes before or after. This ought to have all the advantages of approach 3 (True Fractions) with none of the disadvantages. I'd create a function integer_intermediate just like the rational_intermediate function documented in the article too.

The problem with floating point is that the available precision is highly variable. Sure you can bisect between 0 and 1 many times, but what about between 1000 and 1001? Those 10 bits are lost.
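A sketch of what that integer_intermediate function might look like in Postgres (the name is from the comment above; the NULL-means-head/tail convention is an assumption):

    CREATE FUNCTION integer_intermediate(a BIGINT, b BIGINT) RETURNS BIGINT AS $$
      -- NULL a means "insert at head", NULL b means "insert at tail".
      -- Compute in numeric so that a + b cannot overflow BIGINT.
      SELECT floor((COALESCE(a::numeric, -9223372036854775808)
                  + COALESCE(b::numeric,  9223372036854775807)) / 2)::BIGINT
    $$ LANGUAGE SQL IMMUTABLE;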


This seems like the best approach on this thread. Decimals and floating points are plain out of place here. An extremely simple bisection between two elements a and b where either is null if at the head or tail of the list would do it:

bisect(a,b) = (coalesce(a, MININT)/2) + (coalesce(b, MAXINT)/2)

gns24 is right: if you're using 64 bits of precision, no matter the format, there must be some combination of at most 64 insertions that triggers the edge case of trying to insert between two items that have no representable value between them. I feel like there's a simple proof via e.g. the pigeonhole principle.

So all of these approaches will have to be paired with a normalization routine that takes the bisected values and spreads them back out across the space to allow insertion again. Probably an after update/insert trigger that only kicks in when a new item is placed directly next to another, and spreads out all the nearby keys so that there is some minimum distance between them.


Exactly. If we're storing the positional value into 64-bits then there must be some patterns of 64 inserts which don't fit.

The decimal solution is really not bad. You could improve on it by not taking the midpoint between the two values you're inserting between, but rather another point which doesn't require an extra digit, when that's possible - but really, that's very likely an unnecessary optimisation. So continual insertion at the end might give 0.5, 0.8, 0.9, 0.95, 0.98, 0.99, 0.995...


That's a really elegant way to put it. Yay for information theory.


What if, periodically, the orders are recalculated? 95% of cases are not going to involve reordering items back and forth 64 times in a session. If that does happen, take the extra time to reset everything (1, 2, 3, 4, ...).

And maybe periodically do the same for all values in your table? Like when users are not interacting with the items maybe.
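The reset itself is a one-liner with a window function. A sketch, assuming an items table with id and pos columns (a UNIQUE constraint on pos would need to be DEFERRABLE for this):

    -- spread positions back out to 1, 2, 3, ... preserving the current order
    UPDATE items i
    SET pos = r.rn
    FROM (SELECT id, row_number() OVER (ORDER BY pos) AS rn FROM items) r
    WHERE i.id = r.id;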


Thank you! I was intuitively sure this was the case (i.e., that you couldn't do markedly better than doubles when working with a fixed size of 64 bits) but I didn't see how to construct the pathological sequence like you did.


Vadim Tropashko wrote a book "SQL Design Patterns" that, among others, presents non-trivial approaches to storing trees and graphs in SQL databases. I highly recommend it. The PDF version is free here: https://vadimtropashko.wordpress.com/%E2%80%9Csql-design-pat...


Throwing my quick-and-dirty solution to this problem in here.

  id | name | at | date_modified
   1 | foo  | 1  | 2018-03-21 12:00
   2 | bar  | 2  | 2018-03-21 12:01
   3 | baz  | 3  | 2018-03-21 12:02
Updating the ordering is a matter of updating a single row with its new intended position and triggering a date_modified update.

  -- Move last item to top
  UPDATE ordered_set SET at = 1, date_modified = NOW() WHERE id = 3;
Retrieving can easily be done using a multicolumn ORDER BY:

  SELECT * FROM ordered_set ORDER BY at ASC, date_modified DESC;

  id | name | at | date_modified
   3 | baz  | 1  | 2018-03-21 12:10
   1 | foo  | 1  | 2018-03-21 12:00
   2 | bar  | 2  | 2018-03-21 12:01
If you want to explicitly get the current ordering as a sequence number, some databases have ROW_NUMBER() OVER (ORDER BY <same as above>) as a potential solution.
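For example, a sketch against the table above:

  SELECT id, name,
         ROW_NUMBER() OVER (ORDER BY at ASC, date_modified DESC) AS seq
  FROM ordered_set;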

That said, this seems like a pretty trivial non-issue. You wouldn't want to do this on very large datasets, and updating many rows for small sets performs just fine in my experience.


If you have two rows with at=1, how would you insert a row between them?


You just convert the datetimes to integer timestamps, then find the midpoint between the two entries... wait this sounds familiar :)


Since this is a transactional database, why not simply UPDATE … SET at = at + 1 WHERE at >= new_at, and then INSERT? Things like this usually aren't insanely write-heavy.
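A sketch, reusing the ordered_set table from above (inserting a new item at position 2):

  BEGIN;
  -- make room at position 2
  UPDATE ordered_set SET at = at + 1 WHERE at >= 2;
  INSERT INTO ordered_set (id, name, at, date_modified)
  VALUES (4, 'qux', 2, NOW());
  COMMIT;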


At the start of the article, elegance is defined as:

>The solution shouldn't require complicated PL/pgSQL functions or parsing lists of numbers in strings.

Not to be overly pedantic, but if creating an extension that implements a new data type isn't "[requiring] complicated PL/pgSQL functions", I don't know what is.


One of the benefits of Postgres is having user-defined extensions like this that are effectively drop-in. Yeah, it's a trade-off. The whole submission is a discussion of the trade-offs between various implementations. One I'd make is using a custom datatype rather than custom PL/pgSQL functions (and I've written my share of those, including in this domain—they can be tough to get right in all of the edge cases) as the type is likely more efficient and correct.


My immediate thought was to store the information in a doubly-linked list type structure in SQL. Obviously the space efficiency is not optimal, but I am surprised that the article didn't even point this out.

The table could be designed with a unique PK, the todo item, and some binary or bool field to determine first (head) and last (tail). You would then add 'previous' and 'next' fields that would be updated for insert/delete/updates.


The problem with that is that it kind of breaks SQL to traverse the list.

Let's say you have a 1M element list, and you want to get the first 100 elements. Normally you would write something like:

  > select top(100) * ... order by itemPosition asc
To walk a linked list, you'd either have to:

- walk it on the client, one query per element, resulting in way too many roundtrips to the server.

- get all the data in one go and then walk it on the client. Here we're selecting and returning 1M records and throwing most of those away.

- walk the list in the database with a recursive common table expression. CTE's aren't appropriate for this; it'll be slow, and we'll run up against a recursion limit for large lists.

- walk the list in the database with cursors/loops/etc. Very icky, and breaks composability.


You are right and make many succinct points. I imagined this in the context of the todo list example given in the article, and assumed there would be a sane limit on items in a list, and that each record in the database would be tied to a list ID, or user ID, or some other method to allow only pulling the items that are actually in the list.

That being said, this would not alleviate the problems you mentioned when walking the list. It would need to be arranged once retrieved either by the client, or by the backend before passing to the client. I imagined that I would traverse the list recursively to order it before passing it off to the client. This should take O(n) time, since we only need to traverse the list once, and we haven't retrieved from the database any unrelated rows (list ID, user ID, etc - there has to be some boundary in place).


This is probably decently efficient if you use recursive CTE's to traverse it, and use keyset pagination rather than limit/offset.

    with recursive cte (udo_id, prv, nxt, label) as (
        select * 
        from udo
        where udo_id = :first_item_id
        union all 
        select u.*
        from udo u
        join cte
        on u.prv = cte.udo_id
    ) select * from cte limit 100


How would you query such a structure to retrieve elements in order?


Your best solution would probably be to use a recursive CTE if your DB supported it, but as a sibling post mentioned, it's probably not a good approach.


What's it called when the primary key is also a "foreign" (but not really) key into the same table? Anyway just do that. Each Item has a "PrecededByItemId" value. To insert X between A & B, just make X preceded by A and B preceded by X. Or probably better, go the other way (FollowedByItemId). It's the SQL version of a linked list. Efficient? Yes they're integers and an insert takes only two changes. Robust? Yes, infinite inserts possible. Elegant? Beats any of that complicated stuff in the article, unless I'm missing something. I suppose you could run out of integers but if your to-do list is that long, well...


There may be other reasons, but here's one. The solution with fractions supports ORDER BY syntax. With the self-referential foreign keys, you would have to traverse from the beginning with multiple queries to get that behavior.

It's also a lot harder to query results in order while specifying an offset.


Couldn't you just order by preceded_by? Offsets should work.


I think one of the first design choices you have to make is: can preceded_by be NULL? The first item in the list has to do something with preceded_by. As a general rule, making columns NOT NULL is preferable; here I declared it NOT NULL, and the first row must reference itself as a result:

  nworks=# create table todo (todo_id int primary key, task text, preceded_by int not null references todo (todo_id));
  CREATE TABLE
  nworks=# insert into todo values(1, 'do homework', 1);
  INSERT 0 1
  nworks=# insert into todo values(2, 'clean bedroom', 1);
  INSERT 0 1
  nworks=# insert into todo values(3, 'walk the dog', 2);
  INSERT 0 1
  nworks=# insert into todo values(4, 'call mom', 3);
  INSERT 0 1
  nworks=# select * from todo;
   todo_id |     task      | preceded_by 
  ---------+---------------+-------------
         1 | do homework   |           1
         2 | clean bedroom |           1
         3 | walk the dog  |           2
         4 | call mom      |           3
  (4 rows)

  
  nworks=# update todo set preceded_by = 1 where todo_id = 4; -- call your mother first, it's more important
  UPDATE 1
  
  nworks=# select * from todo order by preceded_by;
   todo_id |     task      | preceded_by 
  ---------+---------------+-------------
         2 | clean bedroom |           1
         1 | do homework   |           1
         4 | call mom      |           1
         3 | walk the dog  |           2
  (4 rows)

I didn't do any operation to fix the sorting order once I updated item #4, so there's more work to do here. I don't think it's a practical solution.


If you’re using a surrogate key in the preceded_by column, you won’t be able to reorder items in the list without updating the relevant keys, which obviates the use of the surrogate key (which is meant to be immutable).


Would you elaborate on how to select data in the custom order efficiently? What you're describing can't be easily indexed, if I understand you correctly. The strategy outlined in the submission ensures minimal table rewrites on updates (insertion, reordering, deletion) as well as efficient retrieval.


  with recursive todo_list_sorted as (
  	select
  		*
  	from
  		todo_list tl1 where prev_id is null
  	union all
  	select
  		tl1.*	
  	from
  		todo_list tl1
  	join
  		todo_list_sorted tl2 on tl1.prev_id = tl2.id
  )
  select * from todo_list_sorted


Yes, there are ways to do it. The question is how to do so efficiently. Recursive queries don't benefit from indexes as efficiently as, for example, the solution outlined in the submission.


Well I think this achieves minimal rewrites, but you do lose a bit on indexing. You can't retrieve by numeric position index for example. But reproducing the order is just a matter of starting from the head (where PrecededById is null) or from any point and then repeatedly selecting the entry whose PrecededById = the current ID.


That's more than a bit on indexing. Following a linked list through repeated selections is not nearly as efficient as an index scan.


Yeah that's true. Oh well, fractions it is then!


For most practical uses, something like the transaction described in Approach 1 should be good enough.

It has no limit imposed by precision, and the benefit of a compact and well-supported storage format (just one BIGINT) will probably compensate for any inefficiency caused by updating several rows. The multiple steps are not fragile at all if you trust PostgreSQL to handle transactions properly. If you're really worried, just increment the sequence before you update a bunch of rows, not afterward.

But what if you have millions of users reordering billions of todo items?

Well, change the uniqueness constraint to (user_id, list_id, pos) or something composite like that. Only reorder items belonging to the same list owned by the same user. If a person is manually reordering items, there can't be too many items in any given list in the first place. There's no need to touch billions of other items belonging to other lists and other users.
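A sketch of that constraint (column names assumed); making it deferrable lets a bulk reorder pass through intermediate duplicate positions inside a transaction:

    ALTER TABLE todos
      ADD CONSTRAINT todos_list_pos_key
      UNIQUE (user_id, list_id, pos)
      DEFERRABLE INITIALLY IMMEDIATE;  -- SET CONSTRAINTS ... DEFERRED during bulk updates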


Rewriting even a sublist generates unnecessary updates—on both the table and associated indexes. (While the article doesn't specifically describe using sublists, this is likely for pedagogical reasons: he doesn't include surrogate keys either. The pathological case of a full table rewrite on any update would be avoided in any event.)

As for storage, the rational datatype described in the submission has the same size as a BIGINT (64 bits), so there's no storage disadvantage there.


I encountered this problem on what must have been the first web programming project I ever tried and it’s haunted me since. I still think all the time about ways to solve this issue. It crops up all over the place.

I’ll never brush aside the SQL-object impedance mismatch like many people do, as something to work around in an otherwise great system that you can adapt to anything. SQL and relational modeling is an abstraction that leaks way too easily when it comes to these sorts of “human” use cases, wherein you want to do something that’s not perfectly rectangular like arbitrary sort. SQL probably didn’t leak like this when used for enterprise invoice-orders that it was designed for, but using it for applications where a touch screen is the primary user input that motivates the model: I’m not so sure it holds up.


Related is modeling trees in SQL, which has been discussed on HN: https://news.ycombinator.com/item?id=13517490


I've had success just using letters between a and z.

- The first item added has rank "m"

- If I want to add an item before that, it's the midpoint of "a" and "m" (somewhere around "f")

- If I want to put something between "b" and "c" that's "bm"

- The database can sort it just fine.


This is the "what about leaving room?" variant described in the submission with either (a) a smaller range of available values if you're limiting yourself to single characters and/or (b) less efficient storage.


It's less efficient storage-wise, sure, but there's no "occasionally take the hit of shifting" problem.


Your strings could get REALLY long.

Suppose you have 3 items:

    id  pos  name
    1   'a'  apple
    2   'm'  banana
    3   'z'  cherry
You move cherry to the middle of the list. Now, you have:

    id  pos  name
    1   'a'  apple
    3   'g'  cherry
    2   'm'  banana
Now you move banana back to the middle of the list, which gives you this:

    id  pos  name
    1   'a'  apple
    2   'd'  banana
    3   'g'  cherry
Keep doing this, and the gap keeps shrinking. You need to make pos longer to be more precise to fit in the gap. Eventually you run out of space in your string field.


I read their response as not limiting themselves to a single character (which you have too, if I read you correctly). If they use a text column, it's practically arbitrarily large. I believe the limit is around 1GB in Postgres. They've also dismissed concerns of storage, so it's a price they're willing to pay.


Yes, they could get long in theory. In practice, in the place I'm currently using this method, the strings are 3 characters max.


Why not strings? For two different strings A and B it's trivial to build one that is between them in alphabetic order.


That is equivalent to the "Arbitrary Precision" section under Approach 2.


Simple solution: Don't use `unique`.

(Let's assume that your db is very fast, and let your app update all needed positions on save)


Yes! I'm reading through all this discussion and thinking "why are these people so bent on overthinking such a simple concept"? Also using text as the sort order solves most of the issues people are bringing up.


Pretty much every implementation of arbitrary sort order I have seen uses varchar for this very reason, and requires none of the cleverness in the article.


There's an O(log N) data structure for this. Unfortunately I don't know the name of it.

Keep a (balanced) binary tree where each node contains a pointer to one of the items you want to keep ordered. (In this case, each node has the primary key of one of your database rows.)

Every node also contains an integer which tells the size of that subtree (the tree whose root is that node).

To enumerate the entire list, do an in-order traversal of the tree.

To find the Nth item in the list, check the left child to see if it contains enough nodes that your Nth item would be in that subtree. If so, go left. If not, it's either the current node (if off by exactly one) or you go right. But if you go right, you must reduce N by 1 to account for the current node and also reduce N by the size of the left subtree.

To enumerate a range, you basically combine the previous two.

To move an item, just move it in the normal way you'd delete/add a node in any regular tree, but remember to update the counts stored inside any nodes that are affected, i.e. all parents/ancestors of any node that is removed or added. If you delete an internal node, you need to be careful to preserve the order of everything, but there are ways to do that by exchanging nodes.

This data structure can definitely be modeled in database tables. Whether the operations on it can all be expressed in SQL is an interesting question. It might be impossible, or it might just be really tricky.
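(This is usually called an order statistic tree.) A sketch of how the nodes might be modeled as a table - all names assumed:

    CREATE TABLE tree_node (
      node_id      BIGINT PRIMARY KEY,
      item_id      BIGINT NOT NULL,             -- the row being ordered
      parent_id    BIGINT REFERENCES tree_node,
      left_child   BIGINT REFERENCES tree_node,
      right_child  BIGINT REFERENCES tree_node,
      subtree_size INTEGER NOT NULL             -- node count of the subtree rooted here
    );

Finding the Nth item then walks down from the root comparing N against left_child's subtree_size, as described above - straightforward in application code, awkward in pure SQL.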


While it's vendor-specific, SQL Server has a built-in solution for it: http://technet.microsoft.com/en-us/library/bb677173.aspx


It doesn't do what I supposed it would do.

> A column of type hierarchyid does not automatically represent a tree. It is up to the application to generate and assign hierarchyid values in such a way that the desired relationship between rows is reflected in the values. Some applications might have a column of type hierarchyid that indicates the location in a hierarchy defined in another table.


PostgreSQL has ltree[0] which does the same thing, but that's not exactly what the OP is talking about.

[0] https://www.postgresql.org/docs/current/static/ltree.html


Why not use a linked list made of self references?

  id | text               | prev | next
   1 | "steal underpants" | NULL | 2
   2 | "???"              | 1    | 3
   3 | "profit"           | 2    | NULL
For the use case of rearranging and inserting at arbitrary positions, wouldn't a linked list be ideal?


The table you show is not actually a linked list. Traversing a table with self references in order requires a recursive query, and doesn’t benefit from indexes for ordering.

There is also the issue of how many writes you do: finding a fraction between two others lets you do just one INSERT or UPDATE (effectively the same thing in MVCC), but rewiring a linked list implies altering two rows. That may be okay, but it's not ideal.


For the query to retrieve the items back in order, logically it goes like:

   select * from (items) order by `prev`  --`next` doesn't help
   -- also hope that NULL is at the beginning
So it's no better than Approach 1 in the article, but actually worse, with an extraneous column in storage and head scratching for the next maintainer of the system.

Exercise for the reader: what happens when item id #3 is moved in between item id #1 and #2, how many update statements are needed?


No, this query doesn't retrieve the items in the correct order in general. It only works for the particular data in GP because the row ordering coincides with the user's ordering there. Try ordering by `prev` on the table below, which is a different valid representation of "steal underpants" -> "???" -> "profit".

  id | text               | prev | next
   1 | "profit"           | 2    | NULL
   2 | "???"              | 3    | 1
   3 | "steal underpants" | NULL | 2


Yep, that's fine - highlights the complexity of implementing the ordering in a linked list. The fundamental thing is that both columns aren't needed, only one column is needed to determine the order.


Pagination will also be complicated if not spaghetti code.


How do you make a SELECT that gets the items in the correct order?

The point of the solution here is so you can simply go `SELECT * FROM list ORDER BY pos` and have the correct order.


  with recursive todo_list_sorted as (
  	select
  		*
  	from
  		todo_list tl1 where prev_id is null
  	union all
  	select
  		tl1.*	
  	from
  		todo_list tl1
  	join
  		todo_list_sorted tl2 on tl1.prev_id = tl2.id
  )
  select * from todo_list_sorted


You don't. You order it on the client!


Because you need to optimize for reads. This setup doesn't make it easy to, say, select only the first 5.

I'd say, unless the list is really large, just change all the "pos" values in a concatenated set of update queries (transactioned of course). Also, don't forget the unique constraint on whatever makes the list unique + the pos.


If you have a hashtable-based index on the PK, lookup by it is O(1), so getting a list of the first k elements is O(k). So what you'll end up with is: insert/update/delete in O(1) (insert/update/delete the row, plus update up to 2 other rows - each individual operation is O(1) and there's a fixed number of them), and getting the first k in O(k). Can't be better than this? Elegant, fast, robust, not reinventing the wheel?


It's less about whether the index is fast, and more about performing self joins (often hierarchical depending upon the DB engine and what's supported) with arbitrary count. When it comes to DB queries, n queries is not really acceptable for a page of n items.


That's not how databases are generally optimized. You're essentially doing random access, extracting a single tuple each time, in a datastructure optimized for locality. Think of it more like a memory-mapped region.


Nowhere in that article was there anything about hardware-related issues. I believe what you are talking about has to do with how HDDs store db files on disk, and how it's (much) faster to read sequentially from such a disk than to seek for every record. Hence all sorts of optimizations that DBs do - B+trees, etc. If your dataset fits in memory, it's a non-issue; the engine will never reach for the disk anyway. Even if it doesn't fit in memory (in which case I would start all optimizations by adding more memory, if possible), it's much less of an issue with SSDs. But again - the article has more of a science-ey tone than an applied/practical one. In some scenarios (large datasets on HDDs) random seek may be an issue.


I must admit, I don't like the use of floats/decimals/rationals in this way. They are unfit for sorting.

Say I wanted to present a list of users with their top three to-do items, that's impossible using the above method.

    select 
        u.user, item1=i1.text, item2=i2.text, item3=i3.text
      from
        users u
          join items i1 on i1.user = u.user and i1.pos = 1
          left join items i2 on i2.user = u.user and i2.pos = 2
          left join items i3 on i3.user = u.user and i3.pos = 3
Them being integers means I can rationally reason about them. I can also infer their position easily. With floats, I don't know where pos=3 is without counting how many have a lower number.

Floats only solve two issues: Making inserting easier while still being able to use `order by`. But reasoning about your data becomes a lot harder.


Two ways to do this with window functions (there's probably a better one but this is what I came up with quickly):

1. Subquery

    SELECT * FROM (
        SELECT u.user, i.text, row_number() OVER (PARTITION BY i.user ORDER BY i.pos) AS row
        FROM users u INNER JOIN items i USING(user)
    ) numbered
    WHERE row <= 3;
2. CTE

    WITH numbered AS (
      SELECT u.user, i.text, row_number() OVER (PARTITION BY i.user ORDER BY i.pos) AS row
      FROM users u INNER JOIN items i USING(user)
    )
    SELECT * FROM numbered WHERE row <= 3;

Same thing, really. CTE looks a bit clearer, subquery seems to generate a faster plan but I don't really have a large enough dataset handy for it to make a difference.


They are fine for sorting! You just need to think about things differently.

  SELECT user, text FROM (
    SELECT u.user, text, row_number() OVER (PARTITION BY user ORDER BY rank) as rank
    FROM users u
    LEFT JOIN items using(user)
  ) ranked_items
  WHERE rank < 4
This option works for unlimited top-n without creating a bunch of unnecessary columns in the response.


> Say I wanted to present a list of users with their top three to-do items, that's impossible using the above method.

Not really:

    SELECT u.user, i.text
    FROM users u,
    LATERAL (SELECT * FROM items WHERE items.user = u.user ORDER BY items.pos DESC LIMIT 3) i;
(You can also make these columns instead of rows, but as someone already pointed out, you'd usually not do that in SQL.)


You are not wrong, although your example would not work in every SQL dialect (even when rewritten). Sybase's T-SQL does not have an equivalent for LIMIT.[0]

My point being, there are some issues with regards to data purity and reasoning, particularly if you want a system where users can build custom reports.

And I just wish the original article highlighted that using floats would have these issues, and if writing SQL like this is important to you, then you might want to reconsider.

(As for your note in brackets, where I work, SQL is basically used as a scripting language. So we would do my example in SQL.)

[0] Yes, I know there is 'set rowcount', but you cannot do that within a sub select.


> Sybase's T-SQL does not have equivalent for LIMIT.[0]

Are you looking for SELECT TOP 3 ...?


First available from 12.5.3, and I cannot be certain we have that version or newer running.

But you are right, I should have clarified that.


> "I don't like the the use of floats/decimals/rationals in this way. They are unfit for sorting."

Would you elaborate on why you consider these types unfit for sorting? Their orderings are deterministic, and efficiency depends on the implementation of the indexes for those specific types, not inherent in the data types themselves.

> "select u.user, item1=i1.text, item2=i2.text, item3=i3.text"

It would be more natural in SQL to return three rows than return three columns.

    SELECT u.user, i.text
      FROM users AS u
      JOIN items AS i USING (user)
      WHERE u.user = ?
      ORDER BY i.pos ASC LIMIT 3;
Why would you prefer to do it the way you describe?


> Would you elaborate on why you consider these types unfit for sorting?

I should have changed that language, what I meant was that they were unfit for reasoning about your data from an isolated point of view, i.e. standing with just one row. To understand an item's position using floats, you must know the entire context.

> It would be more natural in SQL to return three rows than return three columns.

Look at my select again: yours has a specific user in mind, mine doesn't. I want a list of all users who have at least one to-do item. And I want to see their top three to-do items in the report.

Without doing sub-selects in the joins, there is no way to limit this to three when the position is stored as float. And even your sub-selects are going to be complicated.

It might be better to create a temporary table first with each user and their first three items, but that's hardly efficient.

> Why would you prefer to do it the way you describe?

Because I need an overview. And I don't want up to three rows per user, particularly because some users will only get one row, some only two and most three rows. That creates an inconsistent overview.

(And that's assuming you write some SQL to properly limit the joins to three rows.)


Gotcha. Thanks for clarifying. You're right that you're effectively encoding the position, and that's part of the trade-off. Edit to add: as for knowing the position by looking at it, I'd argue that in most use cases one wants to know the positions relative to others, which means you want to know them in context, next to the other rows. If you want to know the absolute numbers, you can use rank or row_number to add them - perhaps even provide a view if this is something you do frequently.

As for the report query you describe, you can get at the data you want with windowing functions. There are strategies for performing the pivot as well, but I'd likely do that in application code rather than in the SQL query.


SELECT * FROM list ORDER BY pos LIMIT 3

If there's an index on pos, there's no cost to the sort.


Sorry, but that's just one user. Assume we have 1000 users, 800 of which have at least one item in the `items` table.

Now I want a full select of those 800 users with their first three to-do items in neat columns next to the user.

Your select doesn't do that.


Ah, ok, I misread. Yeah, that requires a subquery then, but not because floats are unsuitable for sorting - because you're doing a 1-to-many thingy in a 1-to-1 way by hardcoding 3 columns. It's like writing

   items[0].text + items[1].text + items[2].text
In code to simplify it, instead of

   items.slice(0, 3).map((e) => e.text).join('');
The former mostly works, but it's definitely not elegant. And you have to be sure that the number of items, 3, is set in stone.


Another approach: users often want the newest items listed first, but occasionally want exceptions. Thus, have a "sort date" with a time portion that governs the order. The sort date defaults to the data-entry (add) date/time. The display date and sort date may differ, if both are present. A sort date doesn't work for everything, but it is simple, intuitive, and effective for news- or blog-like content.
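A sketch of the idea (table and column names assumed):

    CREATE TABLE posts (
      id       BIGSERIAL PRIMARY KEY,
      title    TEXT NOT NULL,
      added_at TIMESTAMPTZ NOT NULL DEFAULT now(),  -- display date
      sort_at  TIMESTAMPTZ NOT NULL DEFAULT now()   -- governs order; nudge it for exceptions
    );

    -- pin an old item above everything newer by bumping its sort date
    UPDATE posts SET sort_at = now() WHERE id = 42;

    SELECT * FROM posts ORDER BY sort_at DESC;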


This is pretty cool!

Honestly I don't think there is a "best" solution here. Even in a "regular" programming language where you can easily use any data structure, it's not obvious which is best for maintaining the order.

Updating the order "value" of all the rows = moving a value around in an array (causing you to also shift a bunch of other values around).

Storing the ID of the row that should show up next in the list (not mentioned in the article, but mentioned in the comments below) = moving nodes around in a linked list.

Creating a new index that will fit in between two existing rows = inserting a value into a tree (because of the index).

It is a little bit trickier in a database, just because, no matter what you do, you also have to pay the price of putting something into a tree as well. Still... I can think of situations where I'd use any of these techniques. In general, this is not a problem with a clean solution for every use case.


At Pinterest, we expanded the precision of our existing timestamp-based sequence column to accommodate reordering:

https://medium.com/@Pinterest_Engineering/how-we-built-rearr...


What about storing all the item ids in an integer array (on another table, one related to these items somehow)?

  int[]
If the items are being ordered manually by the user, then the list cannot be very long.
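Reading the list back in order is reasonably clean in Postgres; a sketch, assuming a hypothetical list_order(list_id, item_ids int[]) table:

    SELECT i.*
    FROM list_order lo
    JOIN LATERAL unnest(lo.item_ids) WITH ORDINALITY AS o (item_id, ord) ON true
    JOIN items i ON i.id = o.item_id
    WHERE lo.list_id = 1
    ORDER BY o.ord;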


That may meet your needs. You’ll lose some of the benefits of database constraints and indexing.

And while the submission uses it as an example, the use case of reordering items in a list needn’t be limited to lists manually ordered by a user.


It's not "an example", it's the entire point of the article.


Your to-do list is how big?


I have the same problem right now.

I use an unsigned int "pos" column in the db.

For a new list the "pos" starts with 1..n

When the user changes the order, I find the largest "pos" in the list and, with a for loop, assign largest + n as the new "pos" of the n-th item in the new order.

There will be about 20 items in a list but let's just assume it's a 1000, so with this I'm good if the user edits the list less than 4 294 967 295 / 1000 times.

And if largest + count(list) would overflow (wrapping below the lowest pos), I can reset and start the pos over from 1.


I implemented something like this in a naive way. I took the modified date (previously the default sort order) and stored the Unix timestamp * 100 in a 64-bit indexed SortVal column. Drag-n-drop sets the SortVal +/-1 relative to the item it is being dropped against. There's only ~30 items per page, so I figure it won't break except in the pathological case.

Curious if there is some obvious flaw with this plan or if I really need the power of true fractions.


How do you handle situations where two items (lets call them A and B) are both moved before a third item C?

Then wouldn't A and B both have a "SortVal" that's 1 less than C's SortVal? That would make your SortVals non-unique and order arbitrary.


Well, only if your UI allows dragging two items at once. If you reorder multiple items, I suppose you can resolve it by passing some relative position. If you want to squeeze something in between reordered items, then -1 probably is not a good choice; calculating some middle value between the previous and next item would be better.


What would the timestamp represent? Are you providing a timestamp which encodes the position somehow, or using CURRENT_TIMESTAMP? Updating the timestamp for all of the rows in the (sub) list? How does this permit you to reorder the items in the list?


I haven't implemented this, just brainstorming.

The timestamp would only represent position, and it enables putting something in between, rather than an integer which adds +1. If I put A (timestamp = 100) between B (timestamp = 200) and C (timestamp = 300), I just set A's order value to the middle (250). Naive.

It actually looks the same as "What about leaving room?" and, yeah, introduces complexity when you organize items for too long.


Interesting, thanks for posting. I wish Postgres would add a native rational type to core; it would be far more useful than the umpteen other esoteric types they have. How battle-tested is your rational number datatype? I just use 2 columns for fractions and some UDFs; is there a huge advantage to a single column? Did you look at the pgmp extension? It includes an mpq rational type based on the GMP library. It looks very powerful, but the license put me off.


You can also use ORDPATH[1], which is built for tree structures but should work just as well for a flat list as it does support "in-between inserts."

[1]: http://www.dbis.informatik.hu-berlin.de/fileadmin/lectures/W...


There are a lot of solutions to this problem, each with their own merits.

Personally, I'd probably add a second table with FKs to the todos table PK. That allows you to remove the calls to nextval. It doesn't remove the need for a procedural language, but it's a much cleaner solution, I think.


The problem of swapping two entries in such a list is also quite funny.


The author has imposed some rather arbitrary constraints on possible solutions, probably for pedantic reasons. I'd prefer to look for ways to use an index.


What about using linked lists? If the write performance is more of a concern than read performance then it is a possible option too.


It's cheaper to write to the DB when doing ordering updates, but you lose easy 'order by sort_order' in queries since the nodes you're linking to will likely be primary keys that are in a different order than the creation sequence. This probably has index ramifications but I don't know enough about them to say.

That's fine if you sort everything on the front end though. It's product-dependent.

Like someone else said, usually these kinds of lists are small enough and PCs/databases are fast enough for it to not really matter.


A relational table is a flat row set, while a linked list is a directed graph. To traverse an arbitrary-depth graph stored in a relation, one needs recursive queries - quite an expensive operation.

Moreover, once the relation's data gets filtered (`WHERE`), some links get broken/lost.


And with a graph database this would be trivial.


You say that as if it solves all problems (e.g. fast reads) at the same time.

You could very easily build a linked-list-type datastore in SQL and use a CTE to build the list, but if your main use case is viewing the list rather than updating the order, then it's a slower compromise than something like the solutions OP was discussing.


Would you elaborate? Using a graph database would likely include tradeoffs in data integrity guarantees (which generally need to be handled by application code) and performance (due to indexing). If you have specifics of how you would implement this with the same guarantees and performance in a graph database, please share. Graph databases certainly have a place, but they're not a cure-all (just as RDBMSs are not).


Use a blob. Do the reordering in the program. Write the todo list (in order) into a blob with some delimiter between each item.

Pro: simple

Con: might as well use a flat file


Arrays?


I think I would simply use json to store the list in a field. Makes it very easy to read and replace the json but then again, I would not be using SQL for this so I'm cheating.


It depends a lot on how many items are in the list. If there are fewer than 10,000 items in each sorted set, you should probably let the client handle the ordering and just do bulk inserts with an order index in a denormalised JSON col or something, but beyond that you probably want something db-native.
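A sketch of the denormalised variant, assuming one JSONB document per list:

    CREATE TABLE lists (
      list_id BIGINT PRIMARY KEY,
      items   JSONB NOT NULL  -- e.g. '[{"id": 3, "text": "profit"}, ...]' in display order
    );

    -- read back in stored order
    SELECT elem->>'text' AS item_text
    FROM lists, jsonb_array_elements(items) WITH ORDINALITY AS t (elem, ord)
    WHERE list_id = 1
    ORDER BY ord;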


What is the use case where your app would need to let the user sort such a large list? I cannot think of any use case. At that point I would let the user set the order field (integer) as it would not be feasible to have a drag and drop UI anymore. That would be cumbersome to use.


It's not 10k but I have Spotify playlists with 4k+ tracks (queue of new albums I listen to as radio at the office) and I always fiddle with the order of the albums at the tail of the playlist.


You can just do ids.join(",") to store the ordering separately from how you store the list itself. It could still use SQL; it's just not as normalized. Once you switch to NoSQL you can just store the list in an array or linked list.



