Four 5-bit values in order have 20 bits of entropy, so cannot be stored in 16 bits.
Four 5-bit values without order have 20 - log2(4!) =~ 20 - 4.59 = 15.41 bits of entropy (corresponding to log(2^20/4!) possible configurations), and thus can fit in 16 bits of data if you're clever about it.
What the parent message did is compute log2(C(32+4-1, 4)), because the number of k-element combinations of n objects with repetition is C(n+k-1, k). One way to prove it: sort the objects (so the objects within a combination can be listed in order) and add k-1 dummy objects standing for "the second is the same as the first", "the third is the same as the second", ..., "the last is the same as the penultimate".
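A quick sanity check of that count (a sketch using Python's math.comb):

```python
from math import comb, log2

# 4-element multisets drawn from 32 values: C(n+k-1, k)
n, k = 32, 4
count = comb(n + k - 1, k)   # C(35, 4)
print(count)                 # 52360 configurations
print(log2(count))           # just under 16, so an index fits in 16 bits
```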
I followed the original link as well as this calculation. Why not use this to map them directly (instead of breaking it into 12 bits and 4 bits) ? We'll just have some unused headroom in 16 bits. This would seem more natural than thinking of the 12 + 4 solution.
The factorial is taking too much care of the fact that order does not matter. If I wanted to store four 1-bit numbers, ajkjk's math would suggest that I need log2(2^4/4!) = -0.58, i.e. a negative number of bits. You can't just divide by 4! because not all combinations of 4 numbers have 4! unique permutations.
> If you want to optimize storage of duplicates, you still have to store the number of duplicate numbers, and you are back where you started.
No, because you can shave off a few bits by ignoring the ordering of the non-duplicates.
The intro chapters cover the topic of measuring informational entropy.
Aside: One interesting application is the problem of “use a balance scale only three times to determine which of twelve balls has the wrong weight (too heavy or too light)”. You choose the weighings so as to maximize the entropy of the outcome, i.e. so there are many possible outcomes of equal probability.
As others have mentioned, it comes up in information theory, or maybe in some subfields of combinatorics and probability. I learned it tangentially in various parts of physics.
But I might just recommend going to the source, which is Claude Shannon's seminal 1948 paper on communication theory: http://affect-reason-utility.com/1301/4/shannon1948.pdf . It's surprisingly readable and worth at least skimming if you like this stuff.
Usually, when you store a value, you want to be able to get exactly that value back. If you store multiple values without keeping the ordering, you lose that capability. I.e. you can't replace all possible uses of four 5-bit variables with one 16-bit variable, only those where the variables are interchangeable.
Many common data structures fail to preserve order. Maps (aka dictionaries) based on hash tables come to mind.
Slightly adrift of the topic...
I have a HAMT (hash array mapped trie) implementation that uses a 32-bit unsigned int as a bitfield to indicate which of the 32 possible child nodes are populated. With this trick I could encode any node with 5 or fewer bits flagged using a 16-bit unsigned int instead.
I just checked and 1 million keys created roughly 1.3 million nodes (1 million keys, 300k internal nodes). Nearly 90% of the 300k internal nodes have 5 or fewer bits set. This trick would make the HAMT overhead nearly half the size.
Very useful!
Now, how about 6-bit values in one 32-bit value? I'd like to try this with a 64-bit HAMT. Can this be easily generalized?
Edit: Oops, that should be nodes with 4 or fewer bits set, since there are only four 5-bit values. Interestingly, that apparently doesn't make much of a difference for HAMTs: only about 2k of the 300k internal nodes have exactly 5 children, so the results remain the same.
So you can store seven 6-bit integers without order in one int32. I don't know if it'll be slow, though. It should be possible to infer an encoding algorithm from the proof of the counting formula but it might be slow.
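That claim checks out numerically; a quick sketch (assuming 6-bit values 0..63 and a signed 32-bit target):

```python
from math import comb

# Multisets of 7 values drawn from 0..63 (order ignored, duplicates allowed)
count = comb(64 + 7 - 1, 7)
print(count)              # 1198774720
print(count < 2**31)      # True: the index fits even in a signed 32-bit int
```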
My understanding is you can't tell order using this trick. I guess the bitmap is just an optimization to skip the hash lookup? Order doesn't matter for you?
Something I came up with is packing the pointers/values into a contiguous array and using the bitfield to tell not only whether an entry is present, but what offset it's at: popcnt(bitfield & mask) * sizeof(value). This eliminates wasted space due to unused buckets in the hash table. I'm sure it's been done before, but I haven't seen it anywhere.
How's your Go? Want a remote contract doing Go microservices that pays very well? Email is in my profile.
That's pretty much exactly how hash array mapped tries work.
Take the first 5 bits of a 32-bit hash and use them to set a flag in the bitfield; a set flag indicates a populated child array element, and you find out which one using popcnt. When you find a conflict on insert, take a step down the tree, use the next 5 bits of the 32-bit hash, and rinse and repeat.
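That lookup step can be sketched in a few lines (a hypothetical helper, not the poster's actual code; assumes 5-bit hash chunks and a densely packed child array):

```python
def child_slot(bitmap: int, hash32: int, level: int):
    """Offset of a child in the packed array, or None if absent.

    bitmap: 32-bit field where bit i set means the child for
    5-bit hash chunk i exists.
    """
    chunk = (hash32 >> (5 * level)) & 0x1F   # next 5 bits of the hash
    bit = 1 << chunk
    if not (bitmap & bit):
        return None                          # no child for this chunk
    # Children are stored densely; count set bits below ours for the slot.
    return bin(bitmap & (bit - 1)).count("1")
```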
A HAMT is the basis of a fully concurrent data structure for maps which can also do fast atomic snapshots called a ctrie.
For completeness: The number of possible multisets of 4 numbers from the range [0..15] is (16+4-1) choose 4, or 3876.
Let's assume that our multiset is (a, b, c, d), written in sorted order, so a <= b <= c <= d. For example, a valid multiset is (2, 4, 4, 10).
Consider all possible numbers that may be in our set, laid out in order.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
For each element in the set, we mark its position in this order by placing a marker to the left of the corresponding number.
For our example, we first mark element 2:
0 1 m 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Then 4:
0 1 m 2 3 m 4 5 6 7 8 9 10 11 12 13 14 15
Then, 4 again:
0 1 m 2 3 m m 4 5 6 7 8 9 10 11 12 13 14 15
And finally, 10:
0 1 m 2 3 m m 4 5 6 7 8 9 m 10 11 12 13 14 15
Now, out of these 16+4 places, only 19 could be a valid placement for a marker, as a marker must always be placed to the left of a number (or, said differently, "15" will always be at the end).
Therefore, there are 19 choose 4 ways to pick the locations of the markers.
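The marker construction is exactly the classic stars-and-bars bijection, and it gives a concrete encoder: shift the i-th element of the sorted multiset up by i to get a strictly increasing 4-subset of 0..18, then rank that subset in the combinatorial number system. A sketch (the function name is mine):

```python
from math import comb

def rank_multiset(ms):
    """Map a sorted 4-multiset over 0..15 to a unique index in [0, 3876)."""
    distinct = [v + i for i, v in enumerate(sorted(ms))]  # strictly increasing
    # Combinatorial number system rank: C(c1,1) + C(c2,2) + C(c3,3) + C(c4,4)
    return sum(comb(c, i + 1) for i, c in enumerate(distinct))

# Every 4-multiset gets a distinct code below C(19, 4) = 3876:
codes = {rank_multiset((a, b, c, d))
         for a in range(16) for b in range(a, 16)
         for c in range(b, 16) for d in range(c, 16)}
print(len(codes), min(codes), max(codes))  # 3876 0 3875
```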
Heh, that big ass makefile is the generic makefile [1] I came up with a few years ago. Definitely a bit of overkill for a single file project, but some of us are lazy when it comes to such things.
I can see why you gave up on handling spaces :-) I made one over a year ago that handles spaces correctly so that I could use it on Windows (MSYS2), and it was quite the 'fun' project... =P
GNU Make has a number of implicit rules. A blank target rule (or no target rule) in the Makefile (or no Makefile at all!) will cause `make` to try to create the specified target `foo` from `foo.c` or `foo.cpp` (or others) if they exist. More complete explanation here:
Note: I was initially annoyed with parent's "do this and see what happens" post and its lack of substantive communication, and responded poorly. This comment is substantially edited.
Your response of linking the GNU Make manual comes off as even more self-impressed and coy than the parent.
The parent suggests removing the makefile entirely and running `make main`, and observing the result. That would be a learning experience (and much better than your "RTFM")! But, if you don't want to humor that, the thing to say is: The result is that it works, and correctly builds the program, without a makefile at all.
The explanation is that GNU Make has a catalog of built-in rules (which you linked to); but it's quite a leap from knowing that fact, and even making use of them, to realizing that it means in some cases you don't need a makefile at all.
> The parent suggests removing the makefile entirely and running `make main`, and observing the result. That would be a learning experience! But, if you don't want to humor that, the thing to say is: The result is that it works, and correctly builds the program, without a makefile at all.
Exactly my intention. Thank you, I was worried after the first response from the other person that I'd worded myself too poorly and that more people would think I was being "self-impressed and coy" when that was not the intention at all.
That's an odd thing to read into my comment. I am merely passing on information that was at one point given to me when I myself had a Makefile that was actually not needed because the program it was for was a single source file project like OP.
Also, in what way is it easier to read a long document with no reference to a specific section than it is to just delete the makefile and run `make` followed by the stem of the source file's name?
Edit: My second paragraph of this comment was written prior to parent commenter editing their comment.
There is no snark. I think the GGP's post adds nothing of value, and is in fact rude. It's not snark to point that out, anymore than your critique of my post is snark.
You may feel differently. That's fine. But differing opinions being downvoted to oblivion is ... problematic, at best.
EDIT: After reading your response below, you were right and I was wrong. Editing.
Is there a way to make it work with just make? I'm way too used to just typing make at this point and it's not too much work to just copy the same makefile everywhere.
Use a phony target that does whatever you want. Mine typically invoke cc to build the program ("cc -O3 -Wall -o main main.c", something like that?), then run the program ("./main"), which prints its stuff to stdout and stderr. Makes it very easy to build+run from within Emacs, where the default compile command is "make".
It is possible to do this in O(1) memory, for arbitrary sized collections, efficiently.
The first trick is to make a function that can directly calculate the nth k-element set from some universe U (in this case U = {0..2^5-1}) without order (no duplicates). This is done using the https://en.wikipedia.org/wiki/Combinatorial_number_system. It is very efficient.
Then, to encode duplicates you use the stars and bars trick.
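For the first part, here's a sketch of unranking in the combinatorial number system, i.e. computing the nth k-element subset directly (the function name is mine):

```python
from math import comb

def unrank_subset(n: int, k: int):
    """Return the n-th k-element subset (sorted) under the
    combinatorial number system ordering."""
    out = []
    for i in range(k, 0, -1):
        # Find the largest c with C(c, i) <= n
        c = i - 1
        while comb(c + 1, i) <= n:
            c += 1
        n -= comb(c, i)
        out.append(c)
    return out[::-1]

print(unrank_subset(0, 4))     # [0, 1, 2, 3]
print(unrank_subset(3875, 4))  # [15, 16, 17, 18]
```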
Can you elaborate on the second part? Are you saying you're going to expand the size of the universe by a factor of 2^(k-1) to use as locations of the "bars"?
I really like this mechanism. It seems like it should be the optimal way to store a set of size k.
Out of curiosity, do you know of an efficient way to calculate this without sorting first? Or even better, without knowing k ahead of time? If you can sort your input first it's trivial, but even though it seems like a super efficient way to store a set of numbers I don't know how to use it to solve, for example, the other problem in these comments of sorting 1M 32-bit integers in 2MB of RAM (you can store them with this mechanism in about 1.7MB of RAM, but only if they were already sorted).
This seems obviously wrong, though I admittedly don't know the stars and bars trick. Let's say you wanted to store a number a in [0, k), for arbitrarily large k. All you would need to do is encode your value as 0 repeated a times, then 1 repeated k-a times. Because you can store arbitrary integers below k with this method, you would need at least log(k) space. If k can be arbitrarily large, you would require unbounded space.
Can you elaborate? I am not very familiar with arithmetic coding, but I thought it required arbitrary precision floating point numbers, so not O(1) memory.
You don't need arbitrary precision, just enough precision. Specifically, in this case you only need 16 bits. You can start with an upper and lower bound, and narrow it down for each number.
Here's an example with numbers dumped from a quick prototype encoder. The exact range numbers aren't that important, but note how the range gets smaller for every symbol we encode:
Say we want to encode the ordered numbers 10, 15, 31, 31.
We start with the full range [0, 65536)
There are C(32 - 10 + 2, 3) = 2024 combinations that start with 10, out of 52360 possible, so that narrows it down to around 4% of the full range; we'll use [49702, 52236).
For the next number we only allow numbers 10 and above; 15 accounts for 153 of the 2024 possibilities, giving range [51022, 51214).
For the next number, 31 is only 1 of the 153 possibilities, so it gets a tiny range, [51212, 51214).
And now 31 is the next number in 1 out of 1 cases, so the range is again [51212, 51214).
We can code the sequence as either 51212 or 51213, since we used a range that is slightly larger than needed some combinations have multiple codes. We could have started with the smaller range [0, 52360) instead to get a bijection.
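The combination counts in this walkthrough can be verified with the multiset formula C(n+k-1, k):

```python
from math import comb

print(comb(32 + 4 - 1, 4))   # 52360: all sorted 4-multisets over 0..31
print(comb(22 + 3 - 1, 3))   # 2024: remaining 3 values from 10..31 (22 choices)
print(comb(17 + 2 - 1, 2))   # 153: remaining 2 values from 15..31 (17 choices)
```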
I don't believe that it does, at least not without some additional trick. How do you handle the case where you repeatedly encode the largest case? E.g. {31, 31, 31, 31}.
The fact that it's using lookup tables to store the mapping between the set of numbers and the encoded 16-bit number makes it less interesting; it basically just enumerates the set of all possible combinations of numbers, and uses the ordinal as the encoding. A directly calculated scheme without a lookup would be niftier, though I suspect in practice the way it steals bits from redundancy of duplicates it wouldn't be pretty.
Ten people are standing in a line. Each person is wearing a black or a white hat and can only see the colours of the hats of the people in front of them. The goal is to guess the colour of one’s own hat. First, the last person in the line shouts out their guess (‘black’ or ‘white’), then the person in front of them shouts out their guess, etc. No other form of communication is allowed (no clapping, no touching, no texting – and no tricks involving, e.g., encoding information in the delay between it being ‘your turn’ and actually shouting out your guess). The group ‘wins’ if at most one person guesses wrong.
The group is allowed to discuss a strategy before taking part in this puzzle (i.e., before being given their hats). Which strategy should they choose to maximise the chance of winning?
They’re only allowed one wrong guess, so this is basically trying to encode 9 bits of information in 1 bit. And here the ordering of the bits actually matters. Still, it’s possible to choose a strategy which gives the group 100% chance of winning! Good luck trying to solve this one. :)
The way huftis stated it, it is indeed impossible.
In the popular, solvable version of the puzzle, the group also gets the information if each guess was wrong or right. It's not at all about encoding "9 bits of information in 1 bit".
No, you don’t need information on whether each guess is right. (And, of course, if the group is playing it correctly, every guess except possibly the first one is correct.)
You’re of course right about this not being encoding ‘9 bits of information in 1 bit’. That would be impossible; to encode 9 (arbitrary) bits of information, you need 9 bits.
On the other hand, you must encode information on the colour of 9 hats, and only the last person in the line is free to provide some information. He can’t possibly know the colour of his own hat, so he can only guess – or use his turn to provide 1 bit of information to the rest of the group (while the other people have to shout out their correct hat colour). The difficulty lies in figuring out why the puzzle as stated does not need you to encode 9 bits using 1 bit …
Well it’s easy to do with two people because the first person can just guess the color of the second, since we are allowed one wrong guess. I don’t see how this can be extended.
I assure you that it's possible. To give you a hint without giving everything away: the guess of the first person must be derived from the colour of all other person's hats.
You could think of it as a puzzle in coding theory, and it's not implausible. If you sum up the number of hats that each person sees and the number of guesses they hear, you'll note that everybody has 9 bits of information available. And they're supposed to make 9 correct guesses. It adds up.
Another hint would be to think of the case of 3 persons first. That can be done with patient case analysis, and it's likely to get you an idea for generalisation.
Think about it as if each guess preceding a person is a bit of information, the goal is to come up with a strategy for using that information (you control encoding as well as decoding) to correctly infer another bit.
You can extend to three people if the first "guess" indicates if nr 2 and 3 have the same color or not. But doesn't go further because after the first all guesses must be 100% correct and can't be used for communicating additional information.
It does extend. It's just an XOR/parity trick. The last person shouts out the XOR of all the previous colors (if you prefer: whether the number of visible white hats is even or odd). From there, every other subsequent person knows their color by XORing the hats they see together with the guesses made so far.
For example, the second to last person knows their hat is black (0) if the last person's guess matches the XOR of the hats they see - if it doesn't, their hat flipped it, and thus is white (1). And so on and so forth. Your hat color is everything you see and everything that was said XORed together.
That’s correct. Or, to explain in (slightly) more non-mathematical terms: The last person (‘person 10’) shouts out (e.g.) ‘white’ if the number of white hats in front of him is an odd number (1, 3, 5, 7, 9) and ‘black’ otherwise. Let’s say he shouted ‘white’. The person in front of him looks at the people in front him. If it’s still an odd number of white hats, his must obviously be black; otherwise it must be white.
If his hat is black, his shouts out ‘black’, and the person in front of him knows that the rest of the line (8 people) must still have an odd number of white hats, and he applies the exact same logic. But if the hat of person 9 was white, he would shout out ‘white’, and person 8 would know that the rest of the line (including himself) should now have an even number of white hats.
So, basically the rule is: Define ‘the rest of the line’ to mean the people who have not yet shouted out a colour. When the first person (person 10) shouts out ‘white’, this means that ‘the rest of the line’ has an odd number of white hats (parity: odd). Whenever someone shouts out ‘white’, the parity (even/odd) of ‘the rest of the line’ is flipped.
Using this, each person only has to count the number of white hats of the people in front of them and observe if it is even or odd. If it matches the parity of ‘the rest of the line’, the person’s hat is black, otherwise it’s white. (This also works for the person in front of the line (person 1). He sees no (or 0) white hats, i.e. he sees an even number of white hats.)
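The parity strategy is easy to simulate; this sketch (all names mine) checks that everyone except possibly the first guesser is always right:

```python
import random

def run(hats):
    """Play the XOR/parity strategy; return how many guesses were wrong.

    hats[i] is the colour (0/1) of person i; person n-1 guesses first
    and sees everyone in front (indices 0..n-2)."""
    n = len(hats)
    guesses = [0] * n
    guesses[n - 1] = sum(hats[:n - 1]) % 2        # parity of all visible hats
    for i in range(n - 2, -1, -1):                # guessing order: back to front
        heard = sum(guesses[i + 1:]) % 2          # every guess shouted so far
        seen = sum(hats[:i]) % 2                  # hats in front of person i
        guesses[i] = (heard + seen) % 2
    return sum(g != h for g, h in zip(guesses, hats))

random.seed(0)
trials = [[random.randint(0, 1) for _ in range(10)] for _ in range(1000)]
print(max(run(t) for t in trials))  # never more than 1 wrong guess
```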
A list of deltas should do it. In the worst case, the deltas would use log2[2^32/1000000] * 1000000 bits, so about 1.5 MB. Plus some space because of base 128 encoding (it increases size up to 37/32, rounded up per byte).
I got a worst case of exactly 2 MB (1.907 MiB) (all deltas being 4294, so the list is 0, 4294, 8588...), but maybe it's possible to get better than that.
The idea is good, but base128 won't work. The worst case scenario is around 250k offsets of 2^14 (requiring 3 bytes each) and 750k offsets of 2^7 (requiring 2 bytes each). That's 2.25 MB
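That arithmetic can be checked with a small varint length helper (a sketch, assuming LEB128-style encoding with 7 payload bits per byte):

```python
def varint_len(v: int) -> int:
    """Bytes needed to encode v with 7 payload bits per byte."""
    n = 1
    while v >= 128:
        v >>= 7
        n += 1
    return n

# ~250k deltas of 2**14 need 3 bytes each; ~750k deltas of 2**7 need 2 each
size = 250_000 * varint_len(2**14) + 750_000 * varint_len(2**7)
print(size)  # 2250000 bytes, i.e. 2.25 MB
```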
The trick is that you only need to store the unordered list of seen integers as your state while you sort. This takes about 1.5MB or so. The reduction in space comes from the fact that you don't have to store the order (for example, you could store the numbers sorted and only record offsets. That saves space because the offsets will be smaller)
Yes, you're missing something. Radix sort, like most sorting algorithms, requires storing the intermediate results in memory. This problem seems impossible at first blush because storing 1 million 4-byte integers would seem to require 4M of RAM, and only 2M is available.
But if I pick my buckets right, I can store only the lsb in each bucket.
Or more efficient, dynamically construct an implicit trie/heap type thing, where each leaf is the count of times that int appears (and the int itself is encoded in the location)
You're sort of on the right track, but of course you have to encode those counts very carefully because 1M counts will exceed 2MB of memory unless you work hard.
Right, you can't store them as ints. 1 million bits is the max you need, but like this whole thing is just super tricky because you can't just address things normally anywhere, since that's too big.
The order of five 4-bit values carries about 6.91 bits of information: log2(5!) ≈ 6.91.
More relevant to the problem, the order of four 5-bit values carries about 4.59 bits of information (log2(4!) ≈ 4.59). Not all of the lists actually have four distinct numbers, though. Many have duplicates (the list [1 1 2 3], say, only has the distinct values {1 2 3}), and duplicates have fewer distinct orderings, which brings the average order information down to 20 - log2(52360) ≈ 4.32 bits. That's small enough to make this example work.
At least that's how I understand this. I could be wrong.
"The data from the 6 magnetometers (i.e. 6 bits) was somehow digitally compressed into 4-bit words and then stored in a magnetic-core buffer that could hold up to 8 such 4-bit data words."
"It is unknown why and how the data was compressed from 6 to 4 bits, and the NSA report is very vague on this point. It is possible that the Soviets used 4-bit logic and had to spread the 6-bit data over more than one 4-bit data word, but it is more likely that they used frequency analysis"
"According to the NSA report, the Russians compressed the 6-bit data into a 4-bit frequency select word. Although the report doesn't explain what they mean by this, we can make a few educated guesses. The reason for compressing it into 4-bits, was probably the fact that the Russians only had access to 4-bit digital technology at the time. The problem with 4 bits however, is that each data word has just 16 possible combinations"
While developing a CAD system we saved a lot of RAM compared to our competitor's product. They were storing each (X1,Y1)(X2,Y2) as four integers. We figured out how to do the same thing in just 16 bits.
Yes, you guessed it (wait for it....) we could fit an edge in word-wise!
Why is this first line true: "This works because there are 3876 possible unique values for a set of 4 4 bit values"
Each 4 bit number has 16 possible values. And you can order them in the set in 4! = 24 ways. So I thought you would get 24 * 16 = 384. Can someone explain?
Can you explain the top number (2^4 + 4 - 1)?
I understand the bottom number 4 is probably because you're choosing 4 values out of the set of numbers from the top, but not sure how you got 2^4 + 4 - 1.
I haven't learned about compression of data, but from what I know, one has to make some assumptions to store n bit data into m bit space, where m < n. Here, the assumption seems to be that numbers are necessarily 5 bit, and that there are only 4 numbers.
What I fail to understand is how this helps to compress the data in this case. The code isn't really that nice, and the comments seem to assume prior knowledge is this field, which I don't have.
It’s not perfect compression; it’s lossy. Per the top comment, this method treats it as a set i.e. ignore the order. So you’re throwing out some information.
This is a great little lesson in information entropy. I work on similar problems quite often in my work, and understanding this was a pleasant epiphany.
It's not the number of sets of 4 distinct elements (which would be C(16,4) = 1820), it's the number of bags (multisets) with exactly 4 elements, i.e. the multiplicity of duplicate elements matters.
1, 2, 3, 4 has to be stored differently than 4, 3, 2, 1, as well as 1, 3, 2, 4, and so on.
If you don't care about the order all of those can be stored in exactly the same way. Think of it like lossy compression, if you don't care about some of the detail you an ignore it (the order of the numbers in this case) and save some space.
If you’re playing Yahtzee or cards it doesn’t matter what order you get the dice/cards. It just matters if you got them at all.
If you have two values, you can arrange them two ways. Three values you can arrange six ways. Four values you can arrange 24 ways, and distinguishing 24 possibilities takes a little over four bits.
If you have a set, you don’t care about order. So if you save the data without an order to it, it takes a little less space to hold it. About four bits worth.
With five values you can save almost seven bits (120) and with six it’s over 9 bits (720).
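The savings from ignoring the order of k values is log2(k!) bits; a quick tabulation:

```python
from math import factorial, log2

# Bits of ordering information for k values: log2(k!)
for k in range(2, 7):
    print(k, factorial(k), round(log2(factorial(k)), 2))
```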
Saving space in data structures that use buckets such as hash tables. I actually discovered this in a paper on hash tables that kind of just mentioned that it was a thing but I had a harder time finding out exactly how it was done.
Perhaps you could mention the motivation for the trick in the readme? My initial reaction was "either this is an april-fools style joke, or it's a cute but pointless trick", and it wasn't til I got to the bottom of the HN comments that I found an explanation of why it would be worth knowing...
It's pretty cool and all, but what are the uses for this? What set of 4 unordered 5-bit numbers would I need to store that I couldn't just store as 3 bytes? I'd waste only 4 bits while preserving the order of the values, if need be. I can think of very few occasions where 4 bits would matter more than order, and nothing realistic. Again, it's a cool trick and I'm not trying to be a dick about it. I like cool little tricks like this even if there's no purpose. I'm honestly curious about some realistic use cases for this, or some variation of this with larger numbers.
It's 4 5-bit values, not 5 4-bit values. 5 bits is enough to index each bit in a 32-bit bitfield. This means any application that might set 4 or fewer flags out of a possible 32 need only use a 16-bit field.
As I said in another comment I just tested this with a 32-bit hash array mapped trie, and it would result in a significant reduction in the storage overhead of the trie (which is already more efficient than typical hash tables for map like data structures).