Shortest bit sequence that's never been sent over the Internet (2017) (seancassidy.me)
193 points by wglb on April 19, 2018 | 117 comments



> We're looking for a specific sequence, though, not a specific number of heads in a row. We don't even know what the sequence is since it hasn't been sent yet. Is that a problem? Not at all! We're looking for some sequence of length n, and given that both 0 and 1 are equally likely, the sequence 00110 is equally likely as 11111.

Interestingly enough, this isn't true!

First, let's test on a small example: how likely are the substrings "11" and "10" to appear in binary strings of length 3? Here's a table with the matches marked.

        "11" "10"
    000
    001
    010        *
    011   *
    100        *
    101        *
    110   *    *
    111   *
"10" can appear in four ways, but "11" can only appear in three. Why is this?

Say you're scanning through a bit string, looking for "11111". You've seen "111" so far -- that's three matches. Now you encounter a "0". Your counter resets to zero matches until you see a "1" again.

Now say you're looking for "00110". You've seen "001" so far. Just like before, you encounter a "0". You still need to reset your counter, but this time the "0" may actually be the start of a new "00110". So your counter resets to one, not zero. This means "00110" matches are easier to find, and happen more frequently!
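
A quick brute-force check (just a sketch over all 2^20 strings of length 20, assuming uniformly random bits) bears this out: more of those strings contain "00110" than contain "11111", even though both patterns are equally likely at any single starting position.

  # Count how many 20-bit strings contain each pattern at least once.
  n = 20
  for pattern in ("11111", "00110"):
      hits = sum(1 for i in range(2 ** n) if pattern in format(i, "020b"))
      print(pattern, hits, hits / 2.0 ** n)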


> This means "00110" matches are easier to find, and happen more frequently!

They do not happen more frequently in a random bit stream. You are absolutely and completely wrong about this.

They only happen more frequently if you switch the problem from counting occurrences to counting bitstreams that contain these occurrences.

The reason is that the sequence 111 contains two substrings of 11. If this happens in a bitstream and you are counting bitstreams, you only get a count of 1, whereas if you are counting occurrences you would still get two.

This will occur any time a sequence can be overlapped with itself.
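
A tiny sketch over all length-3 strings makes the distinction concrete: both patterns have the same total number of occurrences, but different numbers of strings containing them.

  from itertools import product

  def occurrences(pattern, s):
      # Count overlapping occurrences of pattern in s.
      return sum(1 for i in range(len(s) - len(pattern) + 1)
                 if s[i:i + len(pattern)] == pattern)

  strings = ["".join(bits) for bits in product("01", repeat=3)]
  for pattern in ("11", "10"):
      total = sum(occurrences(pattern, s) for s in strings)
      containing = sum(1 for s in strings if pattern in s)
      print(pattern, "total occurrences:", total, "| strings containing it:", containing)
  # "11": 4 occurrences spread over 3 strings; "10": 4 occurrences over 4 strings.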


You're right, I didn't word that very clearly. I meant "more strings contain this substring", not "this substring occurs more times total" (which, as you say, would be the usual meaning of "matches happen more frequently").

Why care about strings containing a particular substring versus total number of occurrences? The article referenced this PDF: https://www.cs.cornell.edu/~ginsparg/physics/INFO295/mh.pdf, which is about how many coin flips it takes to see a run of N heads. That's really what the argument in the second half of my post is about, but it applies to the "how many strings contain this substring" problem too, and that one seemed simpler to draw a table for.


You've answered a different question:

How likely are the substrings "11" and "10" to appear in binary strings of length 3?

rather than:

Is the sequence '111' equally likely to appear as '010', given that 1s and 0s are equally likely to appear?


Reminds me of a question an old tutor used to ask potential thesis candidates:

If I throw two dice, what's the probability I throw at least one six?


Am I being stupid, or is the answer 11/36? I.e., slightly less than 1/3?

Out of 36 possible throws of a pair of dice, 10 of them have a single six and 1 more has both sixes.


Seems correct. The probability of at least one 6 is one minus the probability that both aren't a 6, which is 5/6 * 5/6 = 25/36; so the result is (36-25)/36 = 11/36.
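
A quick enumeration (just a sketch) agrees:

  from fractions import Fraction

  outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]
  at_least_one_six = [o for o in outcomes if 6 in o]
  print(Fraction(len(at_least_one_six), len(outcomes)))  # 11/36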


Why wouldn't it just be the probability of either die rolling a 6, which would be 1/6 + 1/6 = 1/3? Genuinely curious, I was never super good at probability calculations.

edit: clarification of problem.


Picture two coins and 2 flips: there are (2x2) = 4 possible outcomes, and ((2x2) x 1/2 x 2 flips) = 4 Heads in total across them.

HH, HT, TH, TT

However, if you count the sequences with at least one H, there are (4-1) = 3 of them (HH, HT, TH), as the HH eats up 2 H's.

Now picture 3-sided dice: that's (3x3) = 9 possible outcomes, and you will see ((3x3) x 1/3 x 2 rolls) = 6 2's in total across them.

00, 01, 02, 10, 11, 12, 20, 21, 22

However, if you count the sequences with at least one 2, there are (6 - 1) = 5 of them (02, 12, 20, 21, 22), as the 22 eats up an extra 2.

Now, what do you think happens with an N-sided die? What if N is 6?

PS: This seems less clear as I added more info...


Yeah, doing a brute force count of the possibilities, the (6,6) option is double counted by the 1/6 + 1/6 method as two distinct possibilities. Or that's what it looks like is happening. Been too long since my college combinatorics class.


Imagine doing that with 7 dice... 7/6 probability!


Which suggests the probability of the event not happening would be -1/6, which suggests the quasiprobability of the event not not happening. Obviously no ambiguity there. /s


Because those are just numbers that work. The only way to roll a 2 (for example) is to roll 2 1's - which is obviously 1/36, not 1/3.


Sorry the problem was probability of rolling at least one 6 not of rolling a 6. Edited my original to make that clearer without the original post.


Because you're not guaranteed to roll a 6, even if you roll 6 times: 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 6/6 = 100% chance? Nope. :)


Nope, you're being clever.


could you explain? I got the same thing...


I think it was meant to be a direct answer to "Am I being stupid [...]?"


Yup... shows the value of HN votes...


Is the tutor fair and unbiased?


I wrote a quick Python snippet to compute the probabilities for all sequences of length 3 over 20 flips:

  [(x, sum(1 for i in range(2**20) if x in '{:020b}'.format(i)) / 2.0**20) for x in map('{:03b}'.format, range(8))]
The output shows a 79% chance for "000" and a 91% chance for "010". "000" and "111" are actually the least likely substrings, with "010" and "101" in second place.

Furthermore, "001" has a 97% chance of occurring. It's pretty easy to see why "001" is so much more likely than the others. Unless your string perfectly alternates 0s and 1s, you're going to get repeating pairs. What happens after you get a "00"? If there's ever a 1 again, there were at least two 0s before it.

So you're almost guaranteed a "001" after "00", unless it's near the very end of the string. You only have a 50% chance of getting "000" from these though. A 1 completely resets the search for "000", but a 0 keeps the search alive for "001".


Again: that looks for whether that sequence is in each tested string, which is a different problem.

It tests if 'x' is 'in', rather than the number of times it is seen.


But the question being tested is whether a given string is in the bits that have been transferred over the internet. The number of times it is seen is not what we're talking about here.


Yes, but the sequence of bits that has been transmitted over the Internet is finite. That makes panic’s argument technically true.

Since we’re dealing with a sequence that’s zettabytes long, I’d guess it falls under the category of technically true but irrelevant.


Wow. That's really nonintuitive to me. Like the Monty Hall problem: you have something that's obviously an irrelevant detail that can't affect the probability… except… somehow it does.


There are problems that I find even more counterintuitive than Monty Hall, such as the sister's paradox.

    Your neighbour has two children, one of which is a girl.
    What is the probability that the other child is a girl?
A. One may say 1/2, but actually, using basic Bayes' theorem: P(2 girls | at least 1 girl) = P(2 girls AND at least 1 girl) / P(at least 1 girl) = (1/4) / (3/4) = 1/3.

I think this problem is actually equivalent to the Monty Hall problem. But it can get much weirder.

    Your neighbour has 2 children, one is a girl
    who was born on a Sunday. What is the
    probability that the other child is a girl?
A. Similarly, counting all possible cases, one does not get 1/2 or 1/3, but rather 13/27.
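
Both answers can be checked by brute-force enumeration of the equally likely cases (a sketch that takes the "at least one is a girl (born on a Sunday)" reading, with sexes and weekdays uniform and independent):

  from itertools import product

  # Two children, condition on "at least one girl".
  pairs = list(product("GB", repeat=2))
  cond = [p for p in pairs if "G" in p]
  print(sum(p == ("G", "G") for p in cond), "/", len(cond))            # 1 / 3

  # Two children, condition on "at least one girl born on a Sunday".
  child = list(product("GB", range(7)))                                # weekday 0 = Sunday
  pairs2 = list(product(child, repeat=2))
  cond2 = [p for p in pairs2 if ("G", 0) in p]
  both_girls = [p for p in cond2 if p[0][0] == "G" and p[1][0] == "G"]
  print(len(both_girls), "/", len(cond2))                              # 13 / 27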


  Your neighbour has two children, one of which is a girl.
  What is the probability that the other child is a girl?
This is oft-mentioned but with this wording it's ambiguous. Here are three possible interpretations:

1. Your neighbor has two children. We observed both and found that there was exactly one girl. What is the probability that both are girls? (Answer: 0)

2. Your neighbor has two children. We observed one and found that she was a girl. What is the probability that the other (unobserved) child is a girl? (Answer: 1/2)

3. Your neighbor has two children. We observed both and found that there was at least one girl. What is the probability that both are girls? (Answer: 1/3)

Interpretation #1 is kind of silly but it serves to illustrate the ambiguity of the plain English wording. Interpretation #2 is how most people interpret the problem, and it's bolstered by the use of the word "other" in your prompt. For there to be an "other" child, there must have been a specific girl observed originally, rather than a general observation that there was at least one girl. Interpretation #3 is how you intended the problem.


yup, you're right.


You are horribly butchering Bayes' Theorem. Both your examples are incorrect because your two events ('2 girls' and 'at least 1 girl') are not observed independently of one another, which is a necessary condition for using Bayes' Theorem with a valid result. Great tool but wrong job.


What? Bayes' theorem does not require anything about the events, even more so it is useful when the events are not independent. If events are independent, Bayes' theorem has no usefulness. https://en.wikipedia.org/wiki/Bayes%27_theorem#Statement_of_...

Edit: however, now that I look into it, I should rather have used the name "conditional probability" rather than Bayes'. I always mix the two up, my bad.


Looks like this is the Boy or Girl paradox [0], which has ambiguity problems.

[0] https://en.wikipedia.org/wiki/Boy_or_Girl_paradox


I love how Wikipedia helpfully informs us that 'Biologically, according to studies of the human sex ratio, it is more likely for any one person to be born into the male sex than the female sex.' as though it is somehow relevant or has anything to do with the actual paradox being discussed...

NOTE I've since edited the Wiki page, and deleted the sentence - since it isn't relevant and there was no citation for the claim!


> Your neighbours has two children, one of which is a girl. What is the probability of the other child to be a girl?

The way you have phrased the question, the probability is 1/2.


Actually, it's ever so slightly higher than 1/2 because there is a non-zero chance of them being identical twins. If we assume an identical twin incidence rate of 0.4%, then the actual probability would be 63/125 that the second child is also a girl.


In the first case, I guess what's happening is that when we pick two people randomly with uniform distribution, we expect, with equal probability, any combination of boys and girls: MM MF FF FM. When we observe a girl, we can exclude the case of two boys, so the remaining cases to consider are MF FF FM, where obviously only 1/3 of the cases is two girls.

So I guess what makes this counterintuitive is that we think of the example as independent variables actually affecting each other, but what happens is really that we're drawing from a collection without replacement, and obviously the mix of the collection is going to change as we do so.

This should also apply at a world-wide, all-of-humankind scale: if you have met three women in a row, the fourth person you meet is more likely to be a man. Gambler's fallacy! Or just drawing without replacement. A roulette table is drawing with replacement. :)


But MF and FM in this case are the same?


No because children are distinguishable.

If you flip two coins, you get HH, HT, TH, and TT with equal probability. Try it.


But the sequence is irrelevant.

HT is equal to TH, because in this situation we only care whether the two coins are equal to each other or different from each other.


It's the percent chance that matters, so it doesn't matter if you can distinguish them. The cases still exist with equal chance.


To elaborate, the situation as stated in the question is currently Hx, so HH and HT are the only possible results.


You can calculate 50/50 but it's not by ignoring the difference between HT and TH.


The first child is not an influencing factor on the second child. It's either a boy or a girl, 50/50.


My favourite quote from the wonderful David Spiegelhalter, Winton Professor of the Public Understanding of Statistics:

> Why do people find probability unintuitive and difficult? Because it is unintuitive and difficult.


Another counterintuitive probabilistic property that may be relevant here:

- If you make a distribution of the first digit of prices in a webshop you will notice that 1 appears more frequently than other digits.

- If you convert the prices to another currency and redo the distribution, the most frequent first digit will still be 1!

So it's not a given that you can assume the properties of a random bit stream will hold for the calculation in the article.

(As an aside: this weird property explains why you need slightly fewer coins in a currency that has quarters than those with 20 cent coins)
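
A sketch of that scale-invariance, using made-up log-uniformly distributed prices and a made-up exchange rate purely for illustration:

  import random
  from collections import Counter

  def first_digit(x):
      return f"{x:.6e}"[0]  # first significant digit of x

  random.seed(0)
  # Hypothetical prices, spread log-uniformly over four orders of magnitude.
  prices = [10 ** random.uniform(-1, 3) for _ in range(100000)]
  rate = 1.37  # made-up exchange rate

  print(Counter(first_digit(p) for p in prices).most_common(3))
  print(Counter(first_digit(p * rate) for p in prices).most_common(3))
  # '1' comes out as the most common leading digit in both currencies.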


The logic behind this is more or less the same idea behind the Knuth-Morris-Pratt algorithm (https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93P...), which might help with your intuition.
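
For instance, the KMP failure ("prefix") function makes the self-overlap of the two patterns explicit (a quick sketch):

  def prefix_function(pattern):
      # KMP failure function: pi[i] is the length of the longest proper prefix
      # of pattern[:i+1] that is also a suffix of it, i.e. how much of a
      # partial match survives a mismatch on the next character.
      pi = [0] * len(pattern)
      k = 0
      for i in range(1, len(pattern)):
          while k and pattern[i] != pattern[k]:
              k = pi[k - 1]
          if pattern[i] == pattern[k]:
              k += 1
          pi[i] = k
      return pi

  print(prefix_function("11111"))  # [0, 1, 2, 3, 4]
  print(prefix_function("00110"))  # [0, 1, 0, 0, 1]

When a 0 breaks a run while searching for "11111", the partial match collapses all the way to zero; when a 0 arrives after matching the "001" of "00110", the search falls back to a partial match of length 1, which is exactly the counter-reset behaviour described upthread.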


Aren't you miscounting though? 111 contains the 11 sequence twice.


Which is probably the key intuition here: the expected number of matches is exactly the same, but because you can have more matches for the purely-repeated sequence within the same string, the total number of strings containing those matches is smaller.


Sure, but the question is "how likely" not "how many times". Counting the expected number of times is a much easier problem because you can use the linearity of expectation.


But you're trying to find a bit-substring of length 2 that is least likely to have been transmitted.

If the bit strings of length 3 are chosen uniformly, 11 is less likely to appear than 10.


You are totally right. He is incorrect.


This is (almost) how Knuth–Morris–Pratt string search algorithm works: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93P...


Another possible detail this misses - I learnt about De Bruijn sequences from Samy Kamkar here:

http://samy.pl/opensesame/

(scroll about 1/3rd way down to the "(U) The OpenSesame Attack" section for the bits relevant to this discussion...)

You don't need to enumerate every n-bit sequence, you just need to enumerate a (shorter - by 62% in Samy's 12 bit case) sequence that contains all the n-bit sequences.
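
Here's a sketch of one simple way to build such a sequence (a greedy "prefer one" style construction); for n-bit codes it yields a string of length 2^n + n - 1 containing every n-bit sequence exactly once:

  def de_bruijn_string(n):
      # Start with n zeros; repeatedly append a 1 if the new n-bit suffix is
      # unseen, otherwise a 0 if that is unseen, otherwise stop.
      s = "0" * n
      seen = {s}
      while True:
          for bit in "10":
              suffix = (s[-(n - 1):] + bit) if n > 1 else bit
              if suffix not in seen:
                  seen.add(suffix)
                  s += bit
                  break
          else:
              return s

  print(de_bruijn_string(3))        # 0001110100: all 8 3-bit windows in 10 bits
  print(len(de_bruijn_string(12)))  # 4107 = 2**12 + 11

(For n = 2 it produces 00110, the very string discussed at the top of the thread.)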


Interestingly, there was an old USAMTS problem that had De Bruijn sequences in it, though I wasn't aware of it at the time: http://www.usamts.org/Tests/Problems_25_1.pdf


This is the heart of the Boyer-Moore string searching algorithm too.


Does this still apply when your search space is on the order of thousands of petabytes, and the smallest unseen sequences are, according to the linked post, around 75 bits?


Wow, how surprising and, in a strange way, kind of beautiful! Thanks for this.


I'm reminded of constructing DFAs that represent regular expressions for class.


Yeah, you can see finite automata explicitly used in the PDF that's linked in the article itself: https://www.cs.cornell.edu/~ginsparg/physics/INFO295/mh.pdf


Yes, google "Penney ante"


As the article notes, encryption does make a huge difference here. For global web use we've reached something like 70% of page loads over HTTPS:

https://letsencrypt.org/stats/#percent-pageloads

However, that doesn't cover 70% of bytes. For example, software updates are often downloaded over HTTP (hopefully with digital signatures!). Debian and Debian-derived distributions distribute most of their package updates over HTTP, authenticated with PGP signatures.

Most of those packages are nonetheless compressed, which increases the variety in bit sequences, but then most of the downloads are of identical compressed files, which decreases the variety.

On the other hand, video streaming is often encrypted now, but sometimes still isn't. Even when it isn't encrypted for confidentiality or authenticity, it's often encrypted in order to impose DRM restrictions. In any case, it's usually also heavily compressed, which again increases the variety of bit sequences transmitted even for unencrypted video.

To give a rough guess, I think the combination of encryption and compression means that the author is roughly correct with regard to information being transmitted today. Even when different people watch the same video on YouTube, YouTube is separately encrypting each stream, and nonrandom packet headers outside of the TLS sessions (and other mostly nonrandom associated traffic like DNS queries and replies) represent a very small fraction of this traffic.

It might be interesting to try to repeat the calculation if we made different assumptions so that compression was in wide use (hence the underlying messages are nearly random) but encryption wasn't (hence very large numbers of messages are byte-for-byte identical to each other). Can anybody give a rough approach to estimating the impact of that assumption?


Even if every single bit of traffic was encrypted, a huge portion of that would still be identical - TCP headers, for one thing, would share a lot of common bits for each packet. Although bandwidth reporting often ignores those, so I am not sure about the data source for world bandwidth.


Wouldn’t some combination of unique session keys, PFS algorithms, or block encryption modes make this false? When would the headers encrypt to the same ciphertext unless you were using ECB mode with the same key?


I think you misinterpreted cortesoft's point, which I could try to make clearer:

"Even if every single bit of payload traffic was encrypted, a huge portion of the traffic actually sent over the wire would still be identical - TCP headers, for one thing, would share a lot of common bits for each packet."

So I think you're in agreement here.


The original question was “sent over the Internet” not “sent over the wire”.


Anything in the IP header (perhaps excluding the TTL and checksum) and below is sent over the Internet.


If you’re looking at TCP headers, you might as well go all the way to the PHY layer. Your TCP headers will get encoded by an error correcting code (good, they’ll look random again) but then the data will be split into frames, and each frame begins with an identical preamble sequence of bits.


> good, they’ll look random again

They might look random, but they contain the same amount of entropy as the original data and are longer. Also, the encoding is deterministic.


The code itself is deterministic, but usually bits are interleaved and scrambled with a long, time-varying pseudorandom sequence. If you pass a short repetitive sequence in, it will look random coming out.


That's interesting. I know a bit about ECC in general, but not too much about what the current state of implementations is. Do you know what the codes in use are called?


The scrambler's actually a separate block, usually immediately before or shortly after the encoder. It's common in wireless modems, which I'm more familiar with, but it seems like it's rare in wireline PHYs. Ethernet apparently uses an 8b10b encoding to maintain a similar number of 0s and 1s.

I can't speak on entropy, but turbo codes use an internal interleaver and the codewords can be quite long, so a short repetitive input sequence turns into a very random looking output sequence. As you mentioned though, two identical input sequences would still map to identical output sequences.


All of this just means that the author's estimate is an upper-bound estimate.


I like this general approach, but presumably you don't want the point where 50% of the sequences have never been transmitted (on average, assuming randomly distributed data); you want the first point where at least one hasn't been transmitted. So you should be solving for the point where the probability of a sequence of that length not appearing, times the number of sequences of that length, equals 1.

That said, it's a logarithmic scale, so the length is probably not that much shorter than the one he came up with.


> the length is probably not that much shorter than the one he came up with

What's your intuition for what "not that much shorter" means here? I'm not seeing the math clearly, but it seems like solving for 1/2 vs solving for 1/(2^many) would make a big difference when many is large double digits. I'm not sure how big "big" is, but I'd guess it would be large enough to dwarf the chosen two decimal places and 2-bit difference over a 5 year projection.

Another factor that would move things in the same direction (of a shorter string being the correct answer) is that if we are presuming random bits, I think we can multiply the total number of bits transmitted by the length of the bit string. This would account for the fact that the match doesn't have to start on any particular boundary; rather, the target can be "swept through" the entire corpus bit-by-bit.


How can you ever do better than a probabilistic estimate for the shortest sequence length where one has never been transmitted?


Note: I'm going to use bit strings rather than bit sequences.

Given a total of B = 3.4067 x 10^22 bits sent over the internet, I don't think there is any reasonable way to say for sure what the length of the smallest string that has never been sent is, but we can say that there is definitely a 75 bit string that has never been sent.

Take all of the internet transmissions, and concatenate them, giving a string S of B bits.

Every string that has been transmitted is a substring of S.

There are at most B substrings of length n in S, and hence at most B distinct strings of length n that have been transmitted.

If we pick n such that 2^n > B, there must be at least one string of length n that has not been transmitted. 2^n > B whenever n >= 75.

Hence, if the value of B is correct, then there must be a 75 bit string that has never been sent.
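
A quick check of the arithmetic, using the estimate for B above:

  import math

  B = 3.4067e22                 # total bits sent, per the article's estimate
  print(math.log2(B))           # ~74.85
  print(2 ** 74 < B < 2 ** 75)  # True: 75 is the smallest n with 2**n > B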


This reasoning seems to better match the question as it was originally posed.


If you concatenate to get S you may introduce substrings that have never been themselves transmitted, though of course I agree that every sequence that has been transmitted is a substring of S.


Yeah, which means 74 is an upper bound on the lengths for which every possible bit sequence might have been transmitted already.


Neat! This analysis is consistent with (and more accurate than) the OP; he assumed that the bitstring is random and therefore arrives at the lower bound. If, somehow, we account for the bias in the sequence, we may find a larger n.


Relatedly, there must be a (much longer) shortest sequence of bits that it is impossible to send over the internet, because beyond a certain length the payload must be split over multiple packets, requiring additional header bits in between.

In theory, the largest IPv4 packet that can be sent is 65,535 bytes. IPv6 is the same, but also allows for "jumbograms" that can be up to 4GB long, yeesh. However, the practical limit is the maximum frame size of the underlying link. (Standard) Ethernet and PPP both top out at 1500, ATM can handle 4456. Non-standard Ethernet jumbo frames can go up as high as 9216 bytes.

But assuming you consider "send over the internet" to imply that most destinations on the open internet would be able to receive it, that means a frame size of 1500 bytes, giving you 12000 bits to play with. This leads to the curious result that it's impossible to send a sequence of more than 11870 "1"s.

The shortest header you could have is a 160-bit IPv4 header, of which the last 32 bits (the destination address) can't all be 1 because 255.255.255.255 isn't routable, and actually 127.255.255.255 isn't either, so the earliest place you can start is bit 130. After 11870 "1"s, you reach the 4-bit version field of the next packet, which can't start with a 1 because it would indicate IPv8 or higher, and I don't think that exists... or at least it hasn't seen widespread adoption.


The question that this person posed is

> At what value for X is there a 50% chance there [exists] a sequence of X-bits in length that hasn't been transmitted yet?

But if I understand correctly, the question that they answered is "At what value for X is there a 50% chance that, given some specific sequence of X bits, that sequence hasn't been transmitted yet?" Wouldn't these two questions have different answers?

Edit (now that I've thought about it more): On top of that, the expected value of a random variable isn't necessarily a value that it has a 50% chance to be. So even my interpretation of this result is wrong.


2^n is half of 2^(n+1), so you'd expect that getting 50% coverage of (n+1)-bit sequences would mean that you have 100% coverage of n-bit sequences. In other words, the answer to your question is at most 1 bit lower than the answer given.

But it is slightly more complicated, because this is an example of the Coupon Collector's Problem [1]. It takes about N ln N random samples to cover a set of size N. If there have been 2^74 samples chosen, then we haven't quite gotten enough to cover N = 2^68.

[1] https://en.wikipedia.org/wiki/Coupon_collector%27s_problem


>Wouldn't these two questions have different answers?

Yes, though in practice they're close.

To answer the actual question asked: we can say that there is a length n such that not all sequences of length n (call the number of them S) can be contained in our total bits transferred, B. This is a consequence of the pigeonhole principle.

Naively, this would be B = n * S, and S = 2^n. So B = n * 2^n. Note that we can actually compress things more though, if we assume that in the worst case every window in the internet bit-corpus is unique.

That this is possible isn't immediately obvious, but consider

01, 00110, 0001011100, 0000100110101111000, 000001000110010100111010110111110000 (which, interestingly, isn't in OEIS, but related sequences: https://oeis.org/A052944 and https://oeis.org/A242323 are).

These bitstrings contain, perfectly overlapping, every bitstring of length n, for n from 1-5. In general, it takes 2^n + n - 1 bits to convey every possible bitstring, if you manage to overlap them perfectly (if someone can prove this, please do. I thought it was Gray code/Hamiltonian path over hypercube related, but I unconvinced myself of that).

EDIT: Someone else mentioned De Bruijn sequences, which these are. And they are based on Hamiltonian paths, although not over hypercubes :(. And my sequence is on OEIS, as https://oeis.org/A166315, just converted to decimal. /EDIT

So the real answer is just B = 2^n + n - 1, but for n > 15 or so, n - 1 is so small we can ignore it. In other words, our length n is just log(B). Given the assumption in the article of 3.4067e22 bits, the base 2 log of that is...74.85. This is exactly 1 more than what the article says is the point where you'll have seen half the messages. This isn't a coincidence.

Which raises an interesting question that I leave as an exercise to the reader:

Why is the length for which you cannot have conveyed all bit strings exactly 1 bit more than the point where you've conveyed about half of them?


You’re giving the range for a 50% to 100% chance of non-transmission.

But in the spirit of the original question, I’d suppose we want to find n such that the probability of non-transmission is 2^-n, and the expected number of such transmissions is 1? Is that number similarly close?


Back of the envelope to double check:

Some estimates put the internet traffic at 2^70 bits per year and growing. For any short sequence of N bits, there are approximately 2^70 windows of that length transmitted per year. Let's assume that, since they are encrypted, they are uniformly distributed. So what is the sequence length for which our p(seen all) = 50%? Seems like 71 bits is about right.

The traffic growth curve is probably ~2x per year, so each year we add about 1 more bit. The sum of all internet traffic for all time, at present, is therefore worth about 1 more bit.

So ~73 bits.

Math checks out.


The way I think of this is:

- nearly everything is compressed

- many things are encrypted

Therefore,

For any given "random" ( "high entropy" ) string of length X, there's some non-negligible chance it's already been sent.

But it's far less likely that a ( partially degraded ) non random string is sent. Why ?

Consider this:

"the cat sat on the hat" ( probably sent )

"the cut sat on the hat" ( still probably sent )

"thx cut set19n the mkt" ( waaaay less likely to be sent )

"thKxc8t suts n x4e m-t" ( probably never sent ... until now :) )

My reasoning is like, all random strings are ( happy / random ) in the same way. They all look alike. High entropy, but low organization / structure. Because of compression and encryption, any random string probably has as good a chance to be sent as any other, so looking at random strings doesn't really get us anywhere ( but I do make a very very rough calculation at the end that says probably all 7 byte strings have been sent ).

It's going to be far easier in my opinion to find an "almost-language" string ( partially degraded, like the above examples ) that's never been sent.

Remember Googlewhacks? One search result. One tactic was putting together uncommon words. Another was misspellings.

Basically, the intuition I'm trying to convey is: pick any random high entropy string of a given short length, and pick any language string of the same short length, and they are both, in my opinion, more likely to have been sent than an "almost language" string of the same length. The more degraded you make it ( up to a point, heh ) the less likely it was ever sent.

Very rough calculation about random strings

So, assuming the question is for what X is p > 0.5, and assuming that at least 1 zettabyte has been sent through the net over its entire history, that's 8 x 10^21 bits, or roughly 2^73, which is comfortably more than 2^58, so roughly every 58 bit string has been sent.

So roughly every 7 byte string ever possible has been sent on the internet. Probably.


It'd be interesting to know the length of a sequence that will never be sent over the Internet. In B. Schneier's "Applied Cryptography" there is a discussion of 256-bit numbers, the minimal amount of energy and mass required to guess them and conclusion that "brute-force attacks against 256-bit keys will be infeasible until computers are built from something other than matter and occupy something other than space."

The length of a never-transmitted-on-the-internet sequence is probably much shorter than 256 bits, even 128 bits (as the article mentions).

More details here:

https://security.stackexchange.com/a/25392


>>> len(''.join(map(lambda x: bin(ord(x))[2:], "I'm wrong")))

61


Hee hee!

But your joke is probably more precise if you write it as

8 * len("I'm wrong")

since the leading 0 bits still get sent on printable ASCII characters.


Come on, you need to encode it if you're counting bytes not characters, even if that's equivalent for this particular string. :)

8*len("I'm wrong".encode('utf-8'))


It's a bit ironic, in a special paradoxical-math-like way (like [0]), that if someone discovers it and posts it online it'll instantly (cf. HTTPS vs. HTTP, other encryption, whether or not someone sees the page, etc.) be wrong by definition. It's like figuring out what to call a UFO that has been identified.

[0] - https://en.wikipedia.org/wiki/Russell%27s_paradox


I am suddenly nostalgic for the Googlewhack fad: https://en.wikipedia.org/wiki/Googlewhack

"A Googlewhack is a contest for finding a Google search query consisting of exactly two words without quotation marks that returns exactly one hit. A Googlewhack must consist of two actual words found in a dictionary.

Published googlewhacks are short-lived, since when published to a web site, the new number of hits will become at least two, one to the original hit found, and one to the publishing site."


Does anyone remember a comedian that did a comedy special many, many years ago centered around Googlewhacks? I remember he got a tattoo of a Texas ID on his arm.


Interesting to think about. Would probably make a good interview question just to see how people try to solve it.


If the method stands up and is considered sound, that means the shortest never-sent string is increasing at a rate of ~1 bit every 3 years.

That implies that if the Internet does not grow any more, we should be confident of a > 50% chance of a UUID not being universally unique in about 150 years.

Is this the next Y2K/2038 problem? :)


If the traffic doesn't go up significantly, the rate at which the shortest string grows will drop off quickly. Even with only the last 5 years, you can observe it's slowing down, and that's with a hefty amount of traffic increase to back it up. 128 bits is super safe for a long, long time!


Only if you expect the entirety of traffic on the internet to be delivering UUIDs. It doesn't matter when non-UUID data matches a UUID.


  This is how my intuition went: it's probably less than 128 
  bits because UUIDs are 128 bits, and they're universally 
  unique.
But what's in a name? There's no natural law constraining the UUID standard, such that they must be actually universally unique. And 128 bits isn't such an incredible bit space.

MD5 hashes are 128 bits, and prone to manipulating in favor of collisions.

Don't get me wrong, 3.402823669209385*10^38 is a huge number, and we haven't used enough passwords to occupy every value in that key space, but I still don't imagine 128 bits provides truly universally unique coverage, just pretty okay uniqueness coverage.


MD5 is prone to manipulation due to its logic, not the size of the output. It also has many very efficient collision attacks, but no feasible preimage attack (AFAIK); the best known preimage attacks shave just a few bits off 2^128 (which is still a lot): https://crypto.stackexchange.com/a/41865


MD5 collisions are not the result of 2^128 being too small. 2^128 is hilariously big.


At one time, UUIDs were made by combining a fine timestamp with MAC-style hardware ids, giving a value that was conceptually unique in time and space. IIRC, it was eventually realized that choosing a value at random (e.g. using true RNG) was just as good.

2^128 is about 300 undecillion. Roughly 800 billion values per nanosecond for the age of the universe.
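
A quick sanity check of that figure (a sketch):

  age_of_universe_ns = 13.8e9 * 365.25 * 24 * 3600 * 1e9  # ~4.4e26 nanoseconds
  print(2 ** 128 / age_of_universe_ns)                    # ~7.8e11, i.e. ~800 billion per ns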


Version 4 UUIDs are preferred not because random is as good (indeed, for some use cases it is worse), but typically because leaking information about the time of creation and the MAC address of the creator is undesirable.


Why not just hash it then?


Because the probability of a random collision for a hash function is the same as the probability of generating two colliding random bit strings of the same length as the hash.


I think you should take a gander at this: http://nowiknow.com/shuffled/



One thing that this shows is that many accepted cryptographically secure RNGs are in fact insufficient for shuffling decks of cards. I'm not sure how one would go about exploiting the resulting bias, but it is probably somehow exploitable.
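
The usual back-of-the-envelope argument (just a sketch): a fair shuffle has to be able to produce any of 52! orderings, which needs more entropy than many PRNG seeds provide.

  import math

  print(math.log2(math.factorial(52)))  # ~225.58 bits of entropy needed for a fair shuffle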


"This matched my intuition nicely!"

His intuition was that it would be between 48 and 128, and he's patting himself on the back that his calculation resulted in a number between those? Those goalposts are super far apart!


That's pretty close for what was basically an off-the-cuff estimation. He chose those two numbers because their use in the real world gives him a reasonable point of reference.

The calculated value of 74 bits is pretty close to the mid-point of his estimation.


How about this question: If all matter in the universe was dedicated to producing new bit sequences until the heat death of the universe, what is the length of the shortest sequence that would never be generated?



Yeah, I sort of assumed that the Berry paradox was going to be the point too: anyone who found the shortest bit sequence could never display it on a web page, since that would cause its transmission over the Internet, making it ineligible...


Displaying a series of bits on a web page doesn't cause that series of bits to be transmitted.


Yeah, I realise that, but what if there was a link to a binary with those bits, etc.


I'm a little surprised it is as high as it is. I am used to thinking "64 bits is enough for anyone."


Enough for anyone, but not necessarily enough for everyone.


hmm... I was expecting the answer to be zero.



