Show HN: How to store a set of four 5-bit values in one 16-bit value (github.com/isometric)
188 points by andromeduck on Jan 28, 2018 | 140 comments



Ah, without preserving order. That's a set of four 5-bit values, not a sequence of four 5-bit values.


To answer the obvious next question:

Four 5-bit values in order have 20 bits of entropy, so cannot be stored in 16 bits.

Four 5-bit values without order have 20 - log2(4!) =~ 20 - 4.59 = 15.41 bits of entropy (corresponding to 2^20/4! possible configurations), and thus can fit in 16 bits of data if you're clever about it.


Your math is wrong.

If two of the numbers happen to be the same, order no longer matters for those numbers, so your log2(4!) needs to be larger...


Right. So the true answer is

    log2((32*31*30*29)/(4*3*2) + (32*31*30)/2 + (32*31) + (32*31)/2) = 15.675

which is still smaller than 16.


Are you sure you don't mean

   log2((32*31*30*29)/(4*3*2*1) + (32*31*30)/(3*2*1) + (32*31)/(2*1) + 32/1) = 15.34
using the number of sets of values from {0, ..., 2^5-1} with at most 4 elements?

If on the other hand you want to store exactly 4 values, possibly with duplicates (while still ignoring order), you need to count multisets ( https://en.wikipedia.org/wiki/Bag_(mathematics)#Counting_mul... )

  log2( (35*34*33*32)/(4*3*2*1) ) = 15.676


The second answer, 15.676, is the correct one. It can also be arrived at by:

log2(32C4 + 32C3 * 3 + 32C2 * 3 + 32C1)


What the parent message did is: log2(32+4-1 C 4), because the number of k-sized combinations of n objects with repetitions is (n+k-1)Ck. One way to prove it is that you sort the objects (so you can establish an order among the objects in the combination) and add k-1 dummy objects for "the second is the same as the first, the third is the same as the second,..., the last is the same as the penultimate".
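
If you want to sanity-check these counts, here's a small C++ sketch (mine, not from the linked repo) that computes the multiset formula and the breakdown by number of distinct values:

  #include <cstdint>
  #include <cmath>
  #include <cstdio>

  // n choose k; the running product stays exact because each
  // partial product is itself a binomial coefficient.
  uint64_t choose(uint64_t n, uint64_t k) {
      if (k > n) return 0;
      uint64_t r = 1;
      for (uint64_t i = 1; i <= k; i++) r = r * (n - k + i) / i;
      return r;
  }

  int main() {
      uint64_t multisets = choose(35, 4);          // C(32+4-1, 4) = 52360
      uint64_t breakdown = choose(32, 4)           // 4 distinct values
                         + 3 * choose(32, 3)       // 3 distinct (one doubled)
                         + 3 * choose(32, 2)       // 2 distinct (3+1 or 2+2)
                         + choose(32, 1);          // all four equal
      printf("%llu %llu %.3f\n",
             (unsigned long long)multisets,
             (unsigned long long)breakdown,
             std::log2((double)multisets));        // 52360 52360 15.676
  }

Both counts agree, and log2(52360) = 15.676 is the figure above.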


I followed the original link as well as this calculation. Why not use this to map them directly (instead of breaking it into 12 bits and 4 bits)? We'll just have some unused headroom in 16 bits. This would seem more natural than thinking of the 12 + 4 solution.


It uses more memory.


Does this implementation allow duplicates? Typically a set is defined as an unordered collection with no duplicate items.


The maximum entropy is what's important to the encoding challenge, as I understood it.


Actually the log2(4!) needs to be smaller, because as yorwba below says, duplicates don't have 24 permutations.


His math is right. The factorial already took care of the fact that order does not matter.

If you want to optimize storage of duplicates, you still have to store the number of duplicate numbers, and you are back where you started.


The factorial is taking too much care of the fact that order does not matter. If I wanted to store four 1-bit numbers, ajkjk's math would suggest that I need log2(2^4/4!) = -0.58, i.e. a negative number of bits. You can't just divide by 4! because not all combinations of 4 numbers have 4! unique permutations.

> If you want to optimize storage of duplicates, you still have to store the number of duplicate numbers, and you are back where you started.

No, because you can shave off a few bits by ignoring the ordering of the non-duplicates.


You’re right, correction appreciated.


Is there a specific topic of study that taught you those formulas?


As cameldrv said, information theory. The late David MacKay has a good free book:

http://www.inference.org.uk/itprnn/book.html

The intro chapters cover the topic of measuring informational entropy.

Aside: One interesting application is to the problem of “use a balance scale only three times to determine which of twelve balls has the wrong weight (too heavy or light)”. You choose the weighings so as to maximize the entropy of the outcome, i.e. so there are many possible outcomes of equal probability.


Not the person you asked but

4! is the number of permutations of a list - it's commonly seen throughout CS theory and other forms of discrete math.

log_2(n), rounded up, is the number of bits needed to store an integer from 0 to n-1 - it's reasonably commonly used in CS theory education.


As others have mentioned, it comes up in information theory, or maybe in some subfields of combinatorics and probability. I learned it tangentially in various parts of physics.

But I might just recommend going to the source! which is Claude Shannon's seminal 1948 paper on coding: http://affect-reason-utility.com/1301/4/shannon1948.pdf . It's surprisingly readable and worth at least skimming if you like this stuff.


Try the combinatorics branch of mathematics.



Probability theory and information theory.


[deleted]


They're not ordered: that's the whole point.


My bad, I misread!


Well, the title literally says "a set of", not "a sequence of".


Yup so no ARGB5555 16 bit type :-)


But a 4-fragment MSAA type with 5 bits per fragment does fit in a 16-bit type.

Or MSAA4x ARGB5555 in 64-bit (instead of 80 bit)


Hmm, it's interesting to me that order would be expected. I nearly always think of storage and ordering as entirely separate problems.


Usually, when you store a value, you want to be able to get exactly that value back. If you store multiple values without keeping the ordering, you lose that capability. I.e. you can't replace all possible uses of four 5-bit variables with one 16-bit variable, only those where the variables are interchangeable.


Many common data structures fail to preserve order. Maps (aka dictionaries) based on hash tables come to mind.

Slightly adrift of the topic...

I have a HAMT (hash array mapped trie) implementation that uses a 32-bit unsigned int as a bitfield to indicate which of the 32 possible child nodes are populated. With this trick I could encode any node with 5 or fewer bits flagged with a 16-bit unsigned int instead.

I just checked and 1 million keys created roughly 1.3 million nodes (1 million keys, 300k internal nodes). Nearly 90% of the 300k internal nodes have 5 or fewer bits set. This trick would make the HAMT overhead nearly half the size.

Very useful!

Now, how about 6-bit values in one 32-bit value? I'd like to try this with a 64-bit HAMT. Can this be easily generalized?

Edit: Oops, that should be nodes with 4 or fewer bits set, since there are only four 5-bit values. Interestingly, that apparently doesn't make much of a difference for HAMTs: only about 2k internal nodes out of 300k have exactly 5 children. So the results remain the same.


The number of bags of k N-bit numbers is (2^N + k - 1) choose k:

https://en.wikipedia.org/wiki/Multiset#Counting_multisets

The number of bits required to uniquely represent such a bag is thus ⌈log2((2^N+k-1) choose k)⌉. For (N, k) we have:

(6, 6) -> ⌈log2(69 choose 6)⌉ = ⌈log2(119877472)⌉ = 27 bits

(6, 7) -> ⌈log2(70 choose 7)⌉ = ⌈log2(1198774720)⌉ = 31 bits

So you can store seven 6-bit integers without order in one int32. I don't know if it'll be slow, though. It should be possible to infer an encoding algorithm from the proof of the counting formula but it might be slow.
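
To make the generalization concrete, here's a quick sketch (my code, nothing official) that computes the bit cost in log space, so you can scan (N, k) pairs without overflowing:

  #include <cmath>
  #include <cstdio>

  // log2(C(2^N + k - 1, k)) via the product form prod_i (2^N + i) / (i + 1).
  double bits_for_bag(int N, int k) {
      double bits = 0.0;
      for (int i = 0; i < k; i++)
          bits += std::log2((std::pow(2.0, N) + i) / (i + 1));
      return bits;
  }

  int main() {
      for (int k = 1; k <= 8; k++)
          printf("N=6, k=%d -> %.2f bits\n", k, bits_for_bag(6, k));
      // k=6 -> 26.84, k=7 -> 30.16, k=8 -> 33.31:
      // seven unordered 6-bit values fit in 32 bits, eight don't.
  }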


My understanding is you can't tell order using this trick. I guess the bitmap is just an optimization to skip the hash lookup? Order doesn't matter for you?

Something I came up with is packing the pointers/values into a contiguous array and using the bitfield to tell not only if a value is present, but what offset it's at: (mask + popcnt) * sizeof(value). This eliminates wasted space due to unused buckets in the hash table. I'm sure it's been done before, but I haven't seen it anywhere.

How's your Go? Want a remote contract doing Go microservices that pays very well? Email is in my profile.


That's pretty much exactly how hash array mapped tries work.

Take the first 5 bits of a 32-bit hash and use them to set a flag in the bitfield; a populated flag indicates a populated child array element, and you find out which one using popcnt. When you find a conflict on insert, take a step down the tree and use the next 5 bits of the 32-bit hash; rinse and repeat.

A HAMT is the basis of a fully concurrent data structure for maps which can also do fast atomic snapshots called a ctrie.
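
In code, the child lookup is just a mask and a popcount. A minimal sketch (field and function names are mine, not from any particular HAMT library):

  #include <cstdint>
  #include <bit>   // std::popcount, C++20

  struct Node {
      uint32_t bitmap;  // bit i set => a child exists for 5-bit index i
      Node**   slots;   // compact array with one entry per set bit
  };

  Node* child(const Node& n, uint32_t hash, int depth) {
      uint32_t idx = (hash >> (5 * depth)) & 31;  // next 5 bits of the hash
      uint32_t bit = 1u << idx;
      if (!(n.bitmap & bit)) return nullptr;      // no child for this index
      // Number of set bits below ours = offset into the compact array.
      return n.slots[std::popcount(n.bitmap & (bit - 1))];
  }

The 16-bit trick discussed here would replace `bitmap` for the (very common) nodes with four or fewer children.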


I'll add HAMT and tries to my reading list, thanks.


I believe, based on the math above, that six 6-bit values fit easily (27 bits); it's seven that only barely fit in 32 bits.


Taking this to the extreme, you could "store" 65535 1-bit values in one 16-bit value. You just have to keep track of how many of them are equal to 1.
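
In code the entire "storage" is one counter; this sketch just makes the point explicit:

  #include <cstdint>

  struct BitBag {
      uint16_t ones = 0;              // the whole 16-bit value
      void add(bool b) { ones += b; } // up to 65535 ones
      // Reading the values back: 'ones' ones followed by zeros,
      // in whatever order you like, since order was never stored.
  };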


Thank you - this is a much better explanation of how this 'storage' works.


I don't think this is true for single bits this way.


For completeness: The number of possible sets that include 4 numbers from the range [0..15] is (16+4-1) choose 4, or 3876.

Let's assume that our set is (a, b, c, d). It is ordered, therefore a <= b <= c <= d. For example, a valid set is (2, 4, 4, 10).

Consider all possible numbers that may be in our set, laid out in order.

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

For each element in the set, we mark its position in this order by placing a marker to the left of the corresponding number.

For our example, we first mark element 2:

0 1 m 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Then 4:

0 1 m 2 3 m 4 5 6 7 8 9 10 11 12 13 14 15

Then, 4 again:

0 1 m 2 3 m m 4 5 6 7 8 9 10 11 12 13 14 15

And finally, 10:

0 1 m 2 3 m m 4 5 6 7 8 9 m 10 11 12 13 14 15

Now, out of these 16+4 places, only 19 of them could be a valid placement for a marker, as a marker should always be placed to the left of a number (or said differently, "15" will always be at the end).

Therefore, there are 19 choose 4 ways to pick the locations of the markers.
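
The same marker construction can be turned directly into an encoder, with no lookup table. A sketch in C++ (mine, not the repo's code):

  #include <cstdint>
  #include <algorithm>

  uint64_t choose(uint64_t n, uint64_t k) {
      if (k > n) return 0;
      uint64_t r = 1;
      for (uint64_t i = 1; i <= k; i++) r = r * (n - k + i) / i;
      return r;
  }

  // Maps a multiset of four values from [0,15] to a rank in [0, 3875].
  uint16_t rank4(uint8_t v[4]) {
      std::sort(v, v + 4);
      uint64_t r = 0;
      for (int i = 0; i < 4; i++)
          // v[i] + i is strictly increasing in [0,18] (the marker shift);
          // ranking 4-subsets of a 19-element universe is the
          // combinatorial number system, with C(19,4) = 3876 codes.
          r += choose(v[i] + i, i + 1);
      return (uint16_t)r;
  }

For the example above, (2, 4, 4, 10) becomes the strictly increasing (2, 5, 6, 13) and gets rank C(2,1) + C(5,2) + C(6,3) + C(13,4) = 2 + 10 + 20 + 715 = 747.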


Thanks. That was really helpful.


That is one big-ass makefile though. I realize it has some extra niceties but I have to ask, did you ever try

    make main
without any makefile at all?

If you haven't done so, delete the makefile now (you have it in version control anyway) and give it a try.


Heh, that big ass makefile is the generic makefile [1] I came up with a few years ago. Definitely a bit of overkill for a single file project, but some of us are lazy when it comes to such things.

[1]: https://github.com/mbcrawfo/GenericMakefile


I can see why you gave up on handling spaces :-) I made one over a year ago that handles spaces correctly so that I could use it on Windows (MSYS2), and it was quite the 'fun' project... =P


GNU Make has a number of implicit rules. A blank target rule (or no target rule) in the Makefile (or no Makefile at all!) will cause `make` to try to create the specified target `foo` from `foo.c` or `foo.cpp` (or others) if they exist. More complete explanation here:

https://www.gnu.org/software/make/manual/html_node/Catalogue...

Note: I was initially annoyed with parent's "do this and see what happens" post and its lack of substantive communication, and responded poorly. This comment is substantially edited.


Your response of linking the GNU Make manual comes off as even more self-impressed and coy than the parent.

The parent suggests removing the makefile entirely and running `make main`, and observing the result. That would be a learning experience (and much better than your "RTFM")! But, if you don't want to humor that, the thing to say is: The result is that it works, and correctly builds the program, without a makefile at all.

The explanation is that GNU Make has a catalog of built-in rules (which you linked to); but it's quite a leap from knowing that fact, and even making use of them, to realizing that it means in some cases you don't need a makefile at all.


> The parent suggests removing the makefile entirely and running `make main`, and observing the result. That would be a learning experience! But, if you don't want to humor that, the thing to say is: The result is that it works, and correctly builds the program, without a makefile at all.

Exactly my intention. Thank you, I was worried after the first response from the other person that I'd worded myself too poorly and that more people would think I was being "self-impressed and coy" when that was not the intention at all.


> self-impressed and coy

That's an odd thing to read into my comment. I am merely passing on information that was at one point given to me when I myself had a Makefile that was actually not needed because the program it was for was a single source file project like OP.

Also, in what way is it easier to read a long document without a reference to a specific section than it is to just delete the makefile and run make with the stem of the source file's name?

Edit: My second paragraph of this comment was written prior to parent commenter editing their comment.


I appreciate that you edited your comment to make it more informative. Now if you could also remove the snark, it would be a great comment.


There is no snark. I think the GGP's post adds nothing of value, and is in fact rude. It's not snark to point that out, any more than your critique of my post is snark.

You may feel differently. That's fine. But differing opinions being downvoted to oblivion is ... problematic, at best.

EDIT: After reading your response below, you were right and I was wrong. Editing.


From the HN Guidelines:

When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."

You may think that encouraging experimentation without giving an explanation is rude, but to me it's just a different way of sharing knowledge.


Is there a way to make it work with just make? I'm way too used to just typing make at this point and it's not too much work to just copy the same makefile everywhere.


Use a phony target that does whatever you want. Mine typically invoke cc to build the program ("cc -O3 -Wall -o main main.c", something like that?), then run the program ("./main"), which prints its stuff to stdout and stderr. Makes it very easy to build+run from within Emacs, where the default compile command is "make".


  echo "all: main" > Makefile
  make


No need to create a Makefile.

  make main
is enough.


Indeed, but it doesn't quite answer how to make it work with _just_ `make`


It is possible to do this in O(1) memory, for arbitrarily sized collections, efficiently.

The first trick is to make a function that can calculate the nth set with k elements from some universe U (in this case U = {0..2^5-1}), without order (no duplicates), directly. This is done using the https://en.wikipedia.org/wiki/Combinatorial_number_system. This is very efficient.

Then, to encode duplicates you use the stars and bars trick.
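
A sketch of the decode direction (calculating the nth set directly), using the greedy unranking that the combinatorial number system admits; variable names are mine:

  #include <cstdint>
  #include <vector>

  uint64_t choose(uint64_t n, uint64_t k) {
      if (k > n) return 0;
      uint64_t r = 1;
      for (uint64_t i = 1; i <= k; i++) r = r * (n - k + i) / i;
      return r;
  }

  // Returns the r-th k-element subset of {0..n-1}: at each step,
  // greedily pick the largest e with C(e, i) <= the remaining rank.
  std::vector<uint32_t> unrank(uint64_t r, uint32_t k, uint32_t n) {
      std::vector<uint32_t> out(k);
      for (uint32_t i = k; i >= 1; i--) {
          uint32_t e = i - 1;
          while (e + 1 < n && choose(e + 1, i) <= r) e++;
          r -= choose(e, i);
          out[i - 1] = e;
          n = e;  // remaining elements must be strictly smaller
      }
      return out;
  }

To decode a multiset, unrank in the expanded stars-and-bars universe and then subtract i from the i-th smallest element.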


Can you elaborate on the second part? Are you saying you're going to expand the size of the universe by a factor of 2^(k-1) to use as locations of the "bars"?


Yes, the universe is expanded.


I really like this mechanism. It seems like it should be the optimal way to store a set of size k.

Out of curiosity, do you know of an efficient way to calculate this without sorting first? Or even better, without knowing k ahead of time? If you can sort your input first it's trivial, but even though it seems like a super efficient way to store a set of numbers I don't know how to use it to solve, for example, the other problem in these comments of sorting 1M 32-bit integers in 2MB of RAM (you can store them with this mechanism in about 1.7MB of RAM, but only if they were already sorted).


I don't know of a way of doing this without sorting or knowing k ahead of time, sorry.


I'm pretty sure that computing N choose k requires O(k log N) memory to store the result (for N >> k). Neat trick though.


I should've said O(small), I'm comparing to the solution in the OP, which computes all possible combinations in a lookup table.


This seems obviously wrong, though I admittedly don't know the stars and bars trick. Let's say you wanted to store a number a in [0,k), for arbitrarily large k. All you would need to do is encode your value as 0 repeated a times, and 1 repeated k-a times. Because you can store arbitrary integers below k with this method, you would need at least log(k) space. If k can be arbitrarily large, you would require unbounded space.


Arithmetic coding will work too, without any special trick to code duplicates.


Can you elaborate? I am not very familiar with arithmetic coding, but I thought it required arbitrary precision floating point numbers, so not O(1) memory.


You don't need arbitrary precision, just enough precision. Specifically, in this case you only need 16 bits. You can start with an upper and lower bound, and narrow it down for each number.

Here's an example with numbers dumped from a quick prototype encoder. The exact range numbers aren't that important, but note how the range gets smaller for every symbol we encode:

Say we want to encode the ordered numbers 10, 15, 31, 31.

We start with the full range [0, 65536)

There are C(32 - 10 + 2, 3) = 2024 combinations that start with 10, out of 52360 possible, so that narrows it down to around 4% of the full range; we'll use [49702, 52236)

For the next number we only allow numbers 10 and above; 15 accounts for 153 of the 2024 possibilities, giving the range [51022, 51214)

For the next number, 31 is only 1 of the 153 possibilities, so it gets a tiny range, [51212, 51214).

And now 31 is the next number in 1 out of 1 cases, so the range is again [51212, 51214)

We can code the sequence as either 51212 or 51213, since we used a range that is slightly larger than needed some combinations have multiple codes. We could have started with the smaller range [0, 52360) instead to get a bijection.
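
Here's a compact sketch of that encoder (my reconstruction of the prototype described above, not the poster's code); it reproduces the ranges in the example:

  #include <cstdint>
  #include <cstdio>

  uint64_t choose(uint64_t n, uint64_t k) {
      if (k > n) return 0;
      uint64_t r = 1;
      for (uint64_t i = 1; i <= k; i++) r = r * (n - k + i) / i;
      return r;
  }

  // Multisets of size r drawn from {m..31}: C((32 - m) + r - 1, r).
  uint64_t bags(uint64_t m, uint64_t r) { return choose(32 - m + r - 1, r); }

  int main() {
      uint8_t v[4] = {10, 15, 31, 31};   // must be sorted ascending
      uint64_t lo = 0, hi = 65536, m = 0;
      for (int i = 0; i < 4; i++) {
          uint64_t r = 4 - i;                            // values left to code
          uint64_t total = bags(m, r);                   // continuations from m
          uint64_t below = total - bags(v[i], r);        // those starting < v[i]
          uint64_t here  = bags(v[i], r) - bags(v[i] + 1, r); // exactly v[i]
          uint64_t width = hi - lo;
          hi = lo + width * (below + here) / total;
          lo = lo + width * below / total;
          m  = v[i];                                     // later values >= v[i]
          printf("[%llu, %llu)\n",
                 (unsigned long long)lo, (unsigned long long)hi);
      }
      // [49702, 52236) [51022, 51214) [51212, 51214) [51212, 51214)
  }

Any integer in the final range (51212 or 51213) decodes back by running the same narrowing and picking, at each step, the value whose sub-range contains the code.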


I don't believe that it does, at least not without some additional trick. How do you handle the case where you repeatedly encode the largest case? E.g. {31, 31, 31, 31}.


I wrote a blog post several years ago that's basically about this same effect, although from a different angle: https://hips.seas.harvard.edu/blog/2013/06/05/compressing-ge...


The fact that it's using lookup tables to store the mapping between the set of numbers and the encoded 16-bit number makes it less interesting; it basically just enumerates the set of all possible combinations of numbers, and uses the ordinal as the encoding. A directly calculated scheme without a lookup would be niftier, though I suspect that in practice, given the way it steals bits from the redundancy of duplicates, it wouldn't be pretty.


It does seem like it should be calculable without the lookup table. Something with modular math perhaps.


This reminds me of the following old puzzle.

Ten people are standing in a line. Each person is wearing a black or a white hat and can only see the colours of the hats of the people in front of them. The goal is to guess the colour of one’s own hat. First, the last person in the line shouts out their guess (‘black’ or ‘white’), then the person in front of them shouts out their guess, etc. No other form of communication is allowed (no clapping, no touching, no texting – and no tricks involving, e.g., encoding information in the delay between it being ‘your turn’ and actually shouting out your guess). The group ‘wins’ if at most one person guesses wrong.

The group is allowed to discuss a strategy before taking part in this puzzle (i.e., before being given their hats). Which strategy should they choose to maximise the chance of winning?

They’re only allowed one wrong guess, so this is basically trying to encode 9 bits of information in 1 bit. And here the ordering of the bits actually matters. Still, it’s possible to choose a strategy which gives the group 100% chance of winning! Good luck trying to solve this one. :)


Seems impossible to me.


The way huftis stated it, it is indeed impossible.

In the popular, solvable version of the puzzle, the group also gets the information if each guess was wrong or right. It's not at all about encoding "9 bits of information in 1 bit".


No, you don’t need information on whether each guess is right. (And, of course, if the group is playing it correctly, every guess except possibly the first one is correct.)


Damn, you are right! I wonder why the first two instances of this puzzle I googled included that info.

Still, it's not about encoding "9 bits of information in 1 bit".


You’re of course right about this not being encoding ‘9 bits of information in 1 bit’. That would be impossible; to encode 9 (arbitrary) bits of information, you need 9 bits.

On the other hand, you must encode information on the colour of 9 hats, and only the last person in the line is free to provide some information. He can’t possibly know the colour of his own hat, so he can only guess – or use his turn to provide 1 bit of information to the rest of the group (while the other people have to shout out their correct hat colour). The difficulty lies in figuring out why the puzzle as stated does not need you to encode 9 bits using 1 bit …


Try doing it with just two people. Then extend the concept somehow. All the details of the puzzle are important.


Well it’s easy to do with two people because the first person can just guess the color of the second, since we are allowed one wrong guess. I don’t see how this can be extended.


I assure you that it's possible. To give you a hint without giving everything away: the guess of the first person must be derived from the colours of all the other people's hats.

You could think of it as a puzzle in coding theory, and then it's not implausible: if you sum up the number of hats that each person sees and the number of guesses they hear, you'll note that everybody has 9 bits of information available. And they're supposed to make 9 correct guesses. It adds up.

Another hint would be to think of the case of 3 persons first. That can be done with patient case analysis, and it's likely to get you an idea for generalisation.


Think about it as if each guess preceding a person is a bit of information, the goal is to come up with a strategy for using that information (you control encoding as well as decoding) to correctly infer another bit.


You can extend to three people if the first "guess" indicates if nr 2 and 3 have the same color or not. But doesn't go further because after the first all guesses must be 100% correct and can't be used for communicating additional information.


It does extend. It's just an XOR/parity trick. The last person shouts out the XOR of all the previous colors (if you prefer: whether the number of visible white hats is even or odd). From there, every other subsequent person knows their color by XORing the hats they see together with the guesses made so far.

For example, the second to last person knows their hat is black (0) if the last person's guess matches the XOR of the hats they see - if it doesn't, their hat flipped it, and thus is white (1). And so on and so forth. Your hat color is everything you see and everything that was said XORed together.


That’s correct. Or, to explain in (slightly) more non-mathematical terms: The last person (‘person 10’) shouts out (e.g.) ‘white’ if the number of white hats in front of him is an odd number (1, 3, 5, 7, 9) and ‘black’ otherwise. Let’s say he shouted ‘white’. The person in front of him looks at the people in front him. If it’s still an odd number of white hats, his must obviously be black; otherwise it must be white.

If his hat is black, he shouts out ‘black’, and the person in front of him knows that the rest of the line (8 people) must still have an odd number of white hats, and he applies the exact same logic. But if the hat of person 9 was white, he would shout out ‘white’, and person 8 would know that the rest of the line (including himself) should now have an even number of white hats.

So, basically the rule is: Define ‘the rest of the line’ to mean the people who have not yet shouted out a colour. When the first person (person 10) shouts out ‘white’, this means that ‘the rest of the line’ has an odd number of white hats (parity: odd). Whenever someone shouts out ‘white’, the parity (even/odd) of ‘the rest of the line’ is flipped.

Using this, each person only has to count the number of white hats of the people in front of them and observe if it is even or odd. If it matches the parity of ‘the rest of the line’, the person’s hat is black, otherwise it’s white. (This also works for the person in front of the line (person 1). He sees no (or 0) white hats, i.e. he sees an even number of white hats.)
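
Since the strategy is easy to get subtly wrong, here's a small simulation sketch (my own C++, not anyone's posted code) that checks the "everything you see XOR everything you heard" rule over random trials:

  #include <cstdio>
  #include <cstdlib>

  int main() {
      const int n = 10;
      srand(42);
      for (int trial = 0; trial < 100000; trial++) {
          int hats[n], guesses[n];
          for (int i = 0; i < n; i++) hats[i] = rand() & 1;

          // Person n-1 guesses first: the XOR of all hats they see.
          int g = 0;
          for (int i = 0; i < n - 1; i++) g ^= hats[i];
          guesses[n - 1] = g;

          // Everyone else: XOR of hats in front and guesses heard so far.
          for (int i = n - 2; i >= 0; i--) {
              int x = 0;
              for (int j = 0; j < i; j++) x ^= hats[j];
              for (int j = i + 1; j < n; j++) x ^= guesses[j];
              guesses[i] = x;
          }

          int wrong = 0;
          for (int i = 0; i < n; i++) wrong += guesses[i] != hats[i];
          if (wrong > 1) { puts("strategy broken"); return 1; }
      }
      puts("at most one wrong guess in every trial");
  }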


The title is deceptive. It should state that the goal is to store a set of four 5-bit values.


This is quite related to the problem of sorting a million 32 bit integers using only 2M of RAM (and no disk). It can be done.


A list of deltas should do it. In the worst case, the deltas would use log2[2^32/1000000] * 1000000 bits, so about 1.5 MB. Plus some space because of base 128 encoding (it increases size up to 37/32, rounded up per byte).

I got a worst case of exactly 2 MB (1.907 MiB) (all deltas being 4294, so the list is 0, 4294, 8588...), but maybe it's possible to get better than that.

It would be uber slow though, probably n^2.


The idea is good, but base128 won't work. The worst case scenario is around 250k offsets of 2^14 (requiring 3 bytes each) and 750k offsets of 2^7 (requiring 2 bytes each). That's 2.25 MB


That's true. A 5-bit header before each number, specifying how many bits the number uses, would be enough.


How are you encoding the deltas exactly?


Say the first number is 100, then the following one is 300, then 1000. You'd encode [100, 200, 700].
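
A sketch of that scheme in C++ (sort, delta, then base-128 varint, as discussed upthread; this is my illustration, not the poster's code):

  #include <cstdint>
  #include <vector>
  #include <algorithm>

  std::vector<uint8_t> encode(std::vector<uint32_t> v) {
      std::sort(v.begin(), v.end());
      std::vector<uint8_t> out;
      uint32_t prev = 0;
      for (uint32_t x : v) {
          uint32_t d = x - prev;  // delta; the first delta is the value itself
          prev = x;
          // base-128: 7 payload bits per byte, high bit = "more bytes follow"
          while (d >= 128) { out.push_back(0x80 | (d & 0x7f)); d >>= 7; }
          out.push_back((uint8_t)d);
      }
      return out;
  }

With all deltas equal to 4294 (13 bits), each varint takes 2 bytes, giving the 2 MB worst case mentioned above; the mixed 2^14/2^7 distribution in the sibling reply pushes base-128 to 2.25 MB.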


So what is the trick?


It's the opening problem from Programming Pearls: http://www.fusu.us/2013/06/bitmap-sort.html

(edit: better link)


How is this relevant? A bitmap of size 2^32 is 512MB in size and you only have 2MB of memory.


The trick is that you only need to store the unordered list of seen integers as your state while you sort. This takes about 1.5MB or so. The reduction in space comes from the fact that you don't have to store the order (for example, you could store the numbers sorted and only record offsets. That saves space because the offsets will be smaller)


If I only have 8 bytes of ram and a program counter I can do it...

  for (uint64_t i = 0; i < (1ULL << 32); i++)
    for (uint32_t j = 0; j < 1000000; j++)
      if (input[j] == i) print(i);

Shoddy runtime, but hey...


Yes, this is a problem that requires careful specification. You only get streaming access to your input, not random.


Couldn't you just do that with radix sort or am I missing something?


Yes, you're missing something. Radix sort, like most sorting algorithms, requires storing the intermediate results in memory. This problem seems impossible at first blush because storing 1 million 4-byte integers would seem to require 4M of RAM, and only 2M is available.


But if I pick my buckets right, I can store only the lsb in each bucket.

Or, more efficiently, dynamically construct an implicit trie/heap type thing, where each leaf is the count of times that int appears (and the int itself is encoded in the location)

That feels really, really handwavy but promising?


You're sort of on the right track, but of course you have to encode those counts very carefully because 1M counts will exceed 2MB of memory unless you work hard.


Right, you can't store them as ints. 1 million bits is the max you need, but like this whole thing is just super tricky because you can't just address things normally anywhere, since that's too big.


So I'm allowed to read/write only once to the backing media?

You're right, that does sound impossible. :O


Only if all your ints are unique I guess?


No, that's not a requirement, as it turns out, though of course it would decrease the entropy even further.


Sure, just pick a sorting algorithm that uses constant memory and the optimal O(n lg n) time complexity:

https://en.wikipedia.org/wiki/Sorting_algorithm#Comparison_o...


These need random access to use constant memory; he was talking about streaming access


Yeah, I didn't pay attention that you can't really store 1 million 32 bit integers in 2MiB without encoding/compressing them somehow.


Conversely: the order for 5 4-bit values itself has 4-bits of information.


The order for five four bit values has about 6.91 bits of information. log_2(5!)=6.91

More relevant to the problem, the order for four five-bit values has about 4.59 bits of information (log_2(4!)=4.59). But not all of the sequences actually have four distinct numbers - many have duplicates, e.g. the list [1 1 2 3] contains the set {1 2 3} - and on average that gets the order information down to about 4.32 bits (20 - log2(52360)), which is small enough to make this example work.

At least that's how I understand this. I could be wrong.


Ahh, just yesterday in HN: Operation Gunman

http://www.cryptomuseum.com/covert/bugs/selectric/index.htm

Quotes:

"The data from the 6 magnetometers (i.e. 6 bits) was somehow digitally compressed into 4-bit words and then stored in a magnetic-core buffer that could hold up to 8 such 4-bit data words." "It is unknown why and how the data was compressed from 6 to 4 bits, and the NSA report is very vague on this point. It is possible that the Soviets used 4-bit logic and had to spread the 6-bit data over more than one 4-bit data word, but it is more likely that they used frequency analysis"

"According to the NSA report, the Russians compressed the 6-bit data into a 4-bit frequency select word. Although the report doesn't explain what they mean by this, we can make a few educated guesses. The reason for compressing it into 4-bits, was probably the fact that the Russians only had access to 4-bit digital technology at the time. The problem with 4 bits however, is that each data word has just 16 possible combinations"

...

Maybe a similar "hack"?


Which reminds me of this old programming joke:

While developing a CAD system we saved a lot of RAM compared to our competitor's product. They were storing each (X1,Y1)(X2,Y2) as four integers. We figured out how to do the same thing in just 16 bits.

Yes, you guessed it (wait for it....) we could fit an edge in word-wise!

(Disclaimer: I didn't say it was a GOOD joke!)


I had to add `#include <algorithm>` to get it to compile.


Why is this first line true: "This works because there are 3876 possible unique values for a set of 4 4 bit values"

Each 4 bit number has 16 possible values. And you can order them in the set in 4! = 24 ways. So I thought you would get 24 * 16 = 384. Can someone explain?


They're taking the multiset [0].

So it's (2^4 + 4 - 1) choose 4 = 3876

In other words, there are 3876 multisets of cardinality 4 with elements taken from the set containing all 4 bit values.

[0] https://en.wikipedia.org/wiki/Multiset


Can you explain the top number (2^4 + 4 - 1)? I understand the bottom number 4 is probably because you're choosing 4 values out of the set of numbers from the top, but not sure how you got 2^4 + 4 - 1.


I haven't learned about compression of data, but from what I know, one has to make some assumptions to store n bit data into m bit space, where m < n. Here, the assumption seems to be that numbers are necessarily 5 bit, and that there are only 4 numbers.

What I fail to understand is how this helps to compress the data in this case. The code isn't really that nice, and the comments seem to assume prior knowledge in this field, which I don't have.

So can someone give an ELI-Noob?


It’s not perfect compression; it’s lossy. Per the top comment, this method treats it as a set i.e. ignore the order. So you’re throwing out some information.

https://news.ycombinator.com/item?id=16249092


This is a great little lesson in information entropy. I work on similar problems quite often in my work, and understanding this was a pleasant epiphany.


Could one practical use of the general case be to store factorial-based numbers efficiently? https://en.wikipedia.org/wiki/Factorial_number_system


how are you getting 3876 unique values for a set of 4 4 bit values (16, 16, 16, 16) ?


It's not the number of sets with exactly 4 distinct elements (which would be 1820), it's the number of bags with exactly 4 elements (i.e. the multiplicity of duplicate elements matters).

https://en.wikipedia.org/wiki/Bag_(mathematics)#Counting_mul...


He doesn't care about the order so 1, 2, 3, 4 is stored the same as 4, 3, 2, 1. That's where the savings is coming from.


I really don’t understand why not caring about the order would save space. Could you explain in simple terms please?


If you care about the order then

1, 2, 3, 4 has to be stored differently than 4, 3, 2, 1, as well as 1, 3, 2, 4, and so on.

If you don't care about the order all of those can be stored in exactly the same way. Think of it like lossy compression: if you don't care about some of the detail, you can ignore it (the order of the numbers in this case) and save some space.


Consider the simplest case where your set members are just bits.

If you have three bits and you care about the order, that's a 3-bit number -- it can take 8 different values.

If you have those same bits but you don't care about the order, the operation is a "bit count" -- you can only have 4 different values.


I don't understand either...


(n + k - 1) choose k

n = number of unique 4-bit values

k = size of multiset

(2^4 + 4 - 1) choose 4 = 3876


Any reason to store the last 4 bits as-is, rather than just making a LUT out of the 4 sorted 5-bit values?


Can someone explain this in very simple terms without math? An ELI5 example. Thank you.


If you’re playing Yahtzee or cards it doesn’t matter what order you get the dice/cards. It just matters if you got them at all.

If you have two values, you can arrange them two ways. Three values you can arrange six ways. Four values you can arrange 24 ways. 0-23 takes a little over four bits to represent.

If you have a set, you don’t care about order. So if you save the data without an order to it, it takes a little less space to hold it. About four bits worth.

With five values you can save almost seven bits (120) and with six it’s over 9 bits (720).


What's the practical use for this?


Saving space in data structures that use buckets, such as hash tables. I actually discovered this in a paper on hash tables that kind of just mentioned it was a thing, but I had a harder time finding out exactly how it was done.


Perhaps you could mention the motivation for the trick in the readme? My initial reaction was "either this is an april-fools style joke, or it's a cute but pointless trick", and it wasn't til I got to the bottom of the HN comments that I found an explanation of why it would be worth knowing...


done :)


Any chance you can remember which paper?



De Bruijn sequences


It's pretty cool and all but what are the uses for this? What set of 4 unordered 5-bit numbers would I need to store that I couldn't just store as 3 bytes? I waste only 4 bits while preserving the order, if need be, of the values I'm storing. I can think of a very small few occasions where 4 bits would matter over order, but nothing realistic. Again, it's a cool trick and I'm not trying to be a dick about it. I like cool little tricks like this even if there's no purpose. I'm honestly curious about some realistic use cases for this or some variation of this with larger numbers.


It's 4 5-bit values, not 5 4-bit values. 5 bits is enough to index any single bit in a 32-bit bitfield. This means any application that sets 4 or fewer flags out of a possible 32 need only use a 16-bit encoding.

As I said in another comment I just tested this with a 32-bit hash array mapped trie, and it would result in a significant reduction in the storage overhead of the trie (which is already more efficient than typical hash tables for map like data structures).



