
If that is the case then why shouldn't the storage system hash the IDs itself, to spread them as it requires?



Because sometimes you want some data to be co-located, while the rest is sharded.

For instance, you might use a random object ID as the prefix of the index key, followed by an attribute ID, which isn’t random. Or a modified time, so you can keep a history of values which can be read out linearly.

If you use that key directly, objects and their data are sharded randomly, but when looking up an object’s attributes (or an attribute by time), the index entries are always co-located and you can read them out linearly with good performance.
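A rough sketch of that key layout, assuming a sorted in-memory list stands in for the real ordered index. The names `put` and `scan_object` are made up for illustration and are not any particular store's API; attribute IDs are assumed to be non-negative integers.

```python
import bisect
import secrets

# Sorted list of ((object_id, attribute_id), value) entries.
# A stand-in for an ordered index; a real store would shard by key range.
index = []

def put(object_id, attribute_id, value):
    """Insert an entry, keeping the index sorted by (object_id, attribute_id)."""
    bisect.insort(index, ((object_id, attribute_id), value))

def scan_object(object_id):
    """Return all (attribute_id, value) pairs for one object, in attribute order."""
    # (object_id, -1) sorts before every real attribute of this object.
    i = bisect.bisect_left(index, ((object_id, -1),))
    found = []
    while i < len(index) and index[i][0][0] == object_id:
        (_, attribute_id), value = index[i]
        found.append((attribute_id, value))
        i += 1
    return found

# Random object IDs spread objects across the keyspace...
object_ids = [secrets.token_hex(8) for _ in range(3)]
for oid in object_ids:
    for attribute_id in (3, 1, 2):
        put(oid, attribute_id, f"value-{attribute_id}")

# ...but each object's attributes still sit next to each other.
print(scan_object(object_ids[0]))
```

The random prefix decides where an object lands; the non-random suffix keeps that object's entries adjacent, so one seek plus a sequential read gets all of them.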

If you blindly hash keys to distribute them, you can’t do that. You also can’t really do a linear read at all, since no entry can be associated with any other: the index value is randomized, and what is stored in the index has no relation to the key provided by the user.

You can only do a straight get, not a read. That is very limiting, and expensive with large data sets, as most algorithms benefit greatly from having ordered data. (Well, you could do a read, but you’d get back entries in completely random order.)
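To make that concrete, here is a small sketch, assuming the store orders entries by a SHA-256 of the whole key. That choice of hash, and the helper names `hashed_key` and `is_contiguous`, are purely illustrative.

```python
import hashlib

def hashed_key(object_id, attribute_id):
    """Hash the full user key, as a blindly hashing store might (assumed scheme)."""
    raw = f"{object_id}:{attribute_id}".encode()
    return hashlib.sha256(raw).hexdigest()

def is_contiguous(ordered_keys, object_id):
    """True if all entries for object_id sit next to each other in ordered_keys."""
    positions = [i for i, (oid, _) in enumerate(ordered_keys) if oid == object_id]
    return positions == list(range(positions[0], positions[0] + len(positions)))

object_ids = [f"obj-{n:02d}" for n in range(20)]
keys = [(oid, attribute_id) for oid in object_ids for attribute_id in range(5)]

# Index ordered by the user's key: each object's attributes are adjacent.
plain_order = sorted(keys)
# Index ordered by the hash of the key: the same entries, scattered.
hashed_order = sorted(keys, key=lambda k: hashed_key(*k))

print(all(is_contiguous(plain_order, oid) for oid in object_ids))
print(all(is_contiguous(hashed_order, oid) for oid in object_ids))
```

Both orderings hold the same entries, and a point lookup works either way. Only the key-ordered one lets you read an object's attributes as a single range.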

Needless to say, this is ‘advanced’ usage and requires pretty deep understanding of your data and indexing/write/read patterns, which is why random hashing is the most common hash map behavior.


Sounds like it should be an attribute of the index and not require a change in the data. To me, anyway.

    CREATE INDEX ... USING HASH;


I’ve never seen that kind of optimization on a dataset that would fit on a database server of any kind. Tens of PB or EB usually, but sometimes only several hundred TB if it’s high load/in-memory only.


Just swizzle the ID.



