What about when you delete rows? Do you just leave the space unused? And if you update a row to be larger, do you rewrite the whole array (so possibly O(n) updates)?

How do you deal with data that gets accessed in different orders depending on its relationships to multiple other entities? Duplicate the data? If so, updates now get amplified, and you can fit less data in RAM, so you're more likely to require disk IO. If not, you need a layer of indirection (so you have an array of pointers instead of an array of data).
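
A minimal sketch of the indirection option in Python (the row dicts and variable names are made up for illustration; Python lists already hold references, which is the point):

    # Rows are stored once; each additional access order is an
    # array of references into the same storage, so nothing is duplicated.
    rows = [
        {"id": 2, "name": "beta"},
        {"id": 1, "name": "alpha"},
    ]
    by_id = sorted(rows, key=lambda r: r["id"])      # one access order
    by_name = sorted(rows, key=lambda r: r["name"])  # another access order

    by_name[0]["name"] = "aleph"        # mutate via one ordering...
    assert rows[1]["name"] == "aleph"   # ...and every ordering sees it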

Even with a layer of indirection, an update that grows a row and requires a reallocation forces you to go update every pointer to it (and those pointers need to be indexed so you can find which ones to update). To avoid that write amplification, you can use an array of ids instead of an array of pointers. Now you want an id <-> pointer lookup, which can be done with a hash map in O(1), or with a tree in O(log N) if you also want efficient range queries.
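
A rough sketch of the id-indirection idea, assuming a toy in-memory store (the names `store`, `order_by_name`, `insert`, and `grow` are hypothetical, just for illustration):

    # Orderings hold ids, not pointers, so a row can be reallocated
    # (grown, rewritten) and only the id -> row map has to change.
    store = {}           # id -> row: the single id <-> location lookup
    order_by_name = []   # a secondary access order, as ids

    def insert(row):
        store[row["id"]] = row
        order_by_name.append(row["id"])
        # Sketch only; a real engine would keep this sorted incrementally.
        order_by_name.sort(key=lambda i: store[i]["name"])

    def grow(row_id, extra):
        # Growing the row replaces it in the map; no ordering is touched.
        store[row_id] = {**store[row_id], **extra}

    insert({"id": 1, "name": "alpha"})
    insert({"id": 2, "name": "beta"})
    grow(1, {"payload": "x" * 1024})
    assert [store[i]["name"] for i in order_by_name] == ["alpha", "beta"]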

I'm not exactly sure of the implementation details, but for ballpark numbers: with an 8 kB page size, an 80% fill factor, and 16 B entries (8 B key + 8 B pointer), each page holds ~400 entries, so for 10^9 rows the tree height is log base ~400 of 10^9, which is ~4 (page IOs). Not ideal, but the point is that for a lot of real-world engineering, O(log N) is effectively O(1) (with a small enough constant factor that it's fine for general use).
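
Back of the envelope in Python, just to check that ballpark:

    import math

    page_bytes = 8 * 1024    # 8 kB page
    fill = 0.8               # 80% fill factor
    entry_bytes = 16         # 8 B key + 8 B pointer
    fanout = page_bytes * fill / entry_bytes  # entries per page
    height = math.log(1e9, fanout)            # log base ~410 of 10^9
    print(f"fanout ~{fanout:.0f}, height ~{height:.1f}")
    # fanout ~410, height ~3.4 -> ~4 page IOs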

So if your data is not write-once, trees are a well-rounded default data structure.



