
> We're looking for a specific sequence, though, not a specific number of heads in a row. We don't even know what the sequence is since it hasn't been sent yet. Is that a problem? Not at all! We're looking for some sequence of length n, and given that both 0 and 1 are equally likely, the sequence 00110 is equally likely as 11111.

Interestingly enough, this isn't true!

First, let's test on a small example: how likely are the substrings "11" and "10" to appear in binary strings of length 3? Here's a table with the matches marked.

        "11" "10"
    000
    001
    010        *
    011   *
    100        *
    101        *
    110   *    *
    111   *
"10" can appear in four ways, but "11" can only appear in three. Why is this?

Say you're scanning through a bit string, looking for "11111". You've seen "111" so far -- that's three matches. Now you encounter a "0". Your counter resets to zero matches until you see a "1" again.

Now say you're looking for "00110". You've seen "001" so far. Just like before, you encounter a "0". You still need to reset your counter, but this time the "0" may actually be the start of a new "00110". So your counter resets to one, not zero. This means "00110" matches are easier to find, and happen more frequently!
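A quick brute-force sketch of that claim (my own check, not part of the original argument; the 20-bit string length is an arbitrary choice):

  n = 20
  patterns = ["11111", "00110"]
  counts = {p: 0 for p in patterns}
  for i in range(2 ** n):
      s = format(i, "020b")
      for p in patterns:
          if p in s:
              counts[p] += 1
  for p in patterns:
      # fraction of n-bit strings that contain the pattern at least once
      print(p, counts[p] / 2.0 ** n)

"00110" turns up in a noticeably larger fraction of strings than "11111", matching the counter-reset intuition above.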




> This means "00110" matches are easier to find, and happen more frequently!

They do not happen more frequently in a random bit stream. You are absolutely and completely wrong about this.

They only happen more frequently if you switch the problem from counting occurrences to counting bitstreams that contain these occurrences.

The reason is that the sequence 111 contains two occurrences of the substring 11. So if this shows up in a bitstream and you are counting bitstreams, you only get a count of 1, whereas if you are counting occurrences you would still get two.

This will occur any time a sequence can be overlapped with itself.


You're right, I didn't word that very clearly. I meant "more strings contain this substring", not "this substring occurs more times total" (which, as you say, would be the usual meaning of "matches happen more frequently").

Why care about strings containing a particular substring versus total number of occurrences? The article referenced this PDF: https://www.cs.cornell.edu/~ginsparg/physics/INFO295/mh.pdf, which is about how many coin flips it takes to see a run of N heads. That's really what the argument in the second half of my post is about, but it applies to the "how many strings contain this substring" problem too, and that one seemed simpler to draw a table for.
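For what it's worth, here's a tiny simulation sketch (mine, not from the PDF) of that "flips until a run of N heads" question; the well-known closed form for a fair coin is 2^(N+1) - 2:

  import random
  random.seed(1)

  def flips_until_run(n):
      # flip a fair coin until we've seen n heads in a row
      run = flips = 0
      while run < n:
          flips += 1
          run = run + 1 if random.random() < 0.5 else 0
      return flips

  N = 5
  trials = [flips_until_run(N) for _ in range(20000)]
  print(sum(trials) / len(trials), 2 ** (N + 1) - 2)  # both come out around 62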


You've answered a different question:

How likely are the substrings "11" and "10" to appear in binary strings of length 3?

rather than:

Is the sequence '111' equally likely to appear as '010', given that 1s and 0s are equally likely to appear?


Reminds me of a question an old tutor used to ask potential thesis candidates:

If I throw two dice, what's the probability I throw at least one six?


Am I being stupid, or is the answer 11/36? I.e., slightly less than 1/3?

Out of 36 possible throws of a pair of dice, 10 of them have a single six and 1 more has both sixes.


Seems correct. The probability of at least one 6 is one minus the probability that both aren't a 6, which is 5/6 * 5/6 = 25/36; so the result is (36-25)/36 = 11/36.
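Or as a quick enumeration check (my own sketch, same answer):

  from itertools import product
  rolls = list(product(range(1, 7), repeat=2))
  print(sum(1 for r in rolls if 6 in r), len(rolls))  # 11 36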


Why wouldn't it just be the probability of either die showing a 6, which would be 1/6 + 1/6 = 1/3? Genuinely curious; I was never super good at probability calculations.

edit: clarification of problem.


Picture flipping two coins and counting the H's: there are (2x2) = 4 outcomes, and ((2x2) x 1/2 x 2 flips) = 4 Heads in total.

HH, HT, TH, TT

However, if you count the outcomes with at least one H, there are (4-1) = 3 of them (HH, HT, TH), as the HH eats up 2 H's.

Now picture two 3-sided dice: that's (3x3) = 9 outcomes. You will see ((3x3) x 1/3 x 2 rolls) = 6 2's in total.

00, 01, 02, 10, 11, 12, 20, 21, 22

However, if you count the outcomes with at least one 2, there are (6 - 1) = 5: 02, 12, 20, 21, 22, as the 22 eats up an extra 2.

Now, what do you think happens with N-sided dice? What if N is 6?

PS: This seems less clear as I added more info...
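If it helps, here's the same counting written out as a quick sketch (mine) for an N-sided die:

  def compare(n, face):
      # all ordered outcomes of rolling two n-sided dice
      outcomes = [(a, b) for a in range(1, n + 1) for b in range(1, n + 1)]
      total_faces = sum((a == face) + (b == face) for a, b in outcomes)
      at_least_one = sum(1 for a, b in outcomes if face in (a, b))
      return total_faces, at_least_one

  print(compare(2, 1))  # (4, 3)   -- the coin example
  print(compare(3, 3))  # (6, 5)   -- the 3-sided example (counting the highest face)
  print(compare(6, 6))  # (12, 11) -- so P(at least one six) is 11/36, not 12/36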


Yeah, doing a brute-force count of the possibilities, the (6,6) option is double-counted by the 1/6 + 1/6 method as two distinct possibilities. Or that's what it looks like is happening, anyway. It's been too long since my college combinatorics class.


Imagine doing that with 7 dice... 7/6 probability!


Which suggests the probability of the event not happening would be -1/6, which suggests the quasiprobability of the event not not happening. Obviously no ambiguity there. /s


Because those are just numbers that work. The only way to roll a 2 (for example) is to roll 2 1's - which is obviously 1/36, not 1/3.


Sorry, the problem was the probability of rolling at least one 6, not of rolling a 6. Edited my original comment so it's clearer on its own, without the original post.


Because you're not guaranteed to roll a 6, even if you roll 6 times: 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 6/6 = 100% chance? Nope. :)


Nope, you're being clever.


could you explain? I got the same thing...


I think it was meant to be a direct answer to "Am I being stupid [...]?"


Yup... shows the value of HN votes...


Is the tutor fair and unbiased?


I wrote a quick Python snippet to compute the probabilities for all sequences of length 3 over 20 flips:

  [(x, sum(1 for i in range(2**20) if x in '{:020b}'.format(i)) / 2.0**20) for x in map('{:03b}'.format, range(8))]
The output shows a 79% chance for "000" and a 91% chance for "010". "000" and "111" are actually the least likely substrings, with "010" and "101" in second place.

Furthermore, "001" has a 97% chance of occurring. It's pretty easy to see why "001" is so much more likely than the others. Unless your string perfectly alternates 0s and 1s, you're going to get repeating pairs. What happens after you get a "00"? If there's ever a 1 again, there were at least two 0s before it.

So you're almost guaranteed a "001" after "00", unless it's near the very end of the string. You only have a 50% chance of getting "000" from these though. A 1 completely resets the search for "000", but a 0 keeps the search alive for "001".


Again: that looks for whether that sequence is in each tested string, which is a different problem.

It tests whether 'x' is 'in' the string at all, rather than counting the number of times it is seen.


But the question being tested is whether a given string is in the bits that have been transferred over the internet. The number of times it is seen is not what we're talking about here.


Yes, but the sequence of bits that has been transmitted over the Internet is finite. That makes panic’s argument technically true.

Since we’re dealing with a sequence that’s zettabytes long, I’d guess it falls under the category of technically true but irrelevant.


Wow. That's really nonintuitive to me. Like the Monty Hall problem: you have something that's obviously an irrelevant detail that can't affect the probability… except… somehow it does.


There are problems that I find even more counterintuitive than Monty Hall, such as the sister's paradox.

    Your neighbour has two children, one of which is a girl.
    What is the probability that the other child is a girl?
A. One may say 1/2, but actually, using basic Bayes' theorem: P(2 girls | at least 1 girl) = P(2 girls AND at least 1 girl) / P(at least 1 girl) = (1/4) / (3/4) = 1/3.

I think this problem is actually equivalent to the Monty Hall problem. But it can get much weirder.

    Your neighbour has two children, one of whom is a girl
    born on a Sunday. What is the probability that both
    children are girls?
A. Similarly, counting all possible cases, one does not get 1/2 or 1/3, but rather 13/27.
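Both answers are easy to confirm by brute force. A quick enumeration sketch of my own, assuming sexes and birth days are uniform and independent, with Sunday encoded as day 0:

  from itertools import product

  # Puzzle 1: two children, at least one is a girl.
  families = list(product(["boy", "girl"], repeat=2))
  with_girl = [f for f in families if "girl" in f]
  print(sum(f == ("girl", "girl") for f in with_girl) / len(with_girl))  # 1/3

  # Puzzle 2: two children, at least one is a girl born on a Sunday (day 0).
  kids = list(product(["boy", "girl"], range(7)))
  families = list(product(kids, repeat=2))
  with_sunday_girl = [f for f in families if ("girl", 0) in f]
  both_girls = sum(f[0][0] == "girl" and f[1][0] == "girl" for f in with_sunday_girl)
  print(both_girls / len(with_sunday_girl))  # 13/27, about 0.481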


  Your neighbour has two children, one of which is a girl.
  What is the probability that the other child is a girl?
This is oft-mentioned but with this wording it's ambiguous. Here are three possible interpretations:

1. Your neighbor has two children. We observed both and found that there was exactly one girl. What is the probability that both are girls? (Answer: 0)

2. Your neighbor has two children. We observed one and found that she was a girl. What is the probability that the other (unobserved) child is a girl? (Answer: 1/2)

3. Your neighbor has two children. We observed both and found that there was at least one girl. What is the probability that both are girls? (Answer: 1/3)

Interpretation #1 is kind of silly but it serves to illustrate the ambiguity of the plain English wording. Interpretation #2 is how most people interpret the problem, and it's bolstered by the use of the word "other" in your prompt. For there to be an "other" child, there must have been a specific girl observed originally, rather than a general observation that there was at least one girl. Interpretation #3 is how you intended the problem.


yup, you're right.


You are horribly butchering Bayes' Theorem. Both your examples are incorrect because your two events ('2 girls' and 'at least 1 girl') are not observed independently of one another, which is a necessary condition for using Bayes' Theorem with a valid result. Great tool but wrong job.


What? Bayes' theorem does not require anything about the events; if anything, it is most useful when the events are not independent. If the events are independent, Bayes' theorem isn't useful at all. https://en.wikipedia.org/wiki/Bayes%27_theorem#Statement_of_...

Edit: now that I look into it, though, I should have said "conditional probability" rather than Bayes'. I always mix the two up, my bad.


Looks like this is the Boy or Girl paradox [0], which has ambiguity problems.

[0] https://en.wikipedia.org/wiki/Boy_or_Girl_paradox


I love how Wikipedia helpfully informs us that 'Biologically, according to studies of the human sex ratio, it is more likely for any one person to be born into the male sex than the female sex.' as though it is somehow relevant or has anything to do with the actual paradox being discussed...

NOTE I've since edited the Wiki page, and deleted the sentence - since it isn't relevant and there was no citation for the claim!


> Your neighbour has two children, one of which is a girl. What is the probability that the other child is a girl?

The way you have phrased the question, the probability is 1/2.


Actually, it's ever so slightly higher than 1/2 because there is a non-zero chance of them being identical twins. If we assume an identical twin incidence rate of 0.4%, then the actual probability would be 63/125 that the second child is also a girl.


In the first case, I guess what's happening is that when we pick two people randomly with uniform distribution, we expect, with equal probability, any combination of boys and girls: MM MF FF FM. When we observe a girl, we can exclude the case of two boys, so the remaining cases to consider are MF FF FM, where obviously only 1/3 of the cases is two girls.

So I guess what makes this counterintuitive is that it looks as if independent variables are somehow affecting each other, but what's really happening is that we're drawing from a collection without replacement, and obviously the mix of the collection is going to change as we do so.

This should also apply at a world-wide, all-of-humankind scale: if you have met three women in a row, the fourth person you meet is (very slightly) more likely to be a man. Gambler's fallacy! Or just drawing without replacement. A roulette table is drawing with replacement. :)


But MF and FM in this case are the same?


No because children are distinguishable.

If you flip two coins, you get HH, HT, TH, and TT with equal probability. Try it.


But the sequence is irrelevant.

HT is equal to TH, because in this situation we only care whether the two coins are equal to each other or different from each other.


It's the percent chance that matters, so it doesn't matter if you can distinguish them. The cases still exist with equal chance.


To elaborate, the situation as stated in the question is currently Hx, so HH and HT are the only possible results.


You can calculate 50/50 but it's not by ignoring the difference between HT and TH.


The first child is not an influencing factor on the second child. It's either a boy or a girl, 50/50.


My favourite quote from the wonderful David Spiegelhalter, Winton Professor of the Public Understanding of Statistics:

> Why do people find probability unintuitive and difficult? Because it is unintuitive and difficult.


Another counterintuitive probabilistic property that may be relevant here:

- If you plot the distribution of the first digit of prices in a webshop, you will notice that 1 appears more frequently than the other digits.

- If you convert the prices to another currency and redo the distribution, the most frequent first digit will still be 1! (A quick simulation of both points is sketched at the end of this comment.)

So it's not a given that you can assume the properties of a random bit stream will hold for the calculation in the article.

(As an aside: this weird property explains why you need slightly fewer coins in a currency that has quarters than in one with 20-cent coins.)
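Here's that simulation, a sketch of my own that assumes hypothetical prices spread log-uniformly over a few orders of magnitude (which is what makes the Benford-like behaviour appear):

  import random
  from collections import Counter

  random.seed(0)
  # hypothetical prices spread over several orders of magnitude
  prices = [10 ** random.uniform(0, 4) for _ in range(100000)]

  def leading_digits(values):
      # values are all >= 1 here, so the first character is the leading digit
      return Counter(str(v)[0] for v in values)

  print(leading_digits(prices).most_common(3))
  print(leading_digits([p * 1.37 for p in prices]).most_common(3))  # arbitrary exchange rate
  # '1' is the most common leading digit both before and after conversion.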


The logic behind this is more or less the same idea behind the Knuth-Morris-Pratt algorithm (https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93P...), which might help with your intuition.
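In particular, KMP's prefix (failure) function is essentially the "how far does my counter fall back" table from the comment above. A minimal sketch (standard construction, my own code):

  def prefix_function(pattern):
      # pi[i] = length of the longest proper prefix of pattern[:i+1] that is
      # also a suffix of it, i.e. how much of a match survives a mismatch
      pi = [0] * len(pattern)
      for i in range(1, len(pattern)):
          k = pi[i - 1]
          while k > 0 and pattern[i] != pattern[k]:
              k = pi[k - 1]
          if pattern[i] == pattern[k]:
              k += 1
          pi[i] = k
      return pi

  print(prefix_function("11111"))  # [0, 1, 2, 3, 4] -- partial matches overlap heavily
  print(prefix_function("00110"))  # [0, 1, 0, 0, 1] -- only the trailing "0" can restart a match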


Aren't you miscounting though? 111 contains the 11 sequence twice.


Which is probably the key intuition here: the expected number of matches is exactly the same, but because the purely-repeated sequence can match multiple times within the same string, the total number of strings containing a match is smaller.
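That's easy to verify directly on the length-3 example from upthread (my own quick check):

  patterns = ["11", "10"]
  strings = [format(i, "03b") for i in range(8)]
  for p in patterns:
      occurrences = sum(s[i:i + 2] == p for s in strings for i in range(2))
      containing = sum(p in s for s in strings)
      print(p, occurrences, containing)
  # "11": 4 occurrences spread over 3 strings; "10": 4 occurrences over 4 strings.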


Sure, but the question is "how likely" not "how many times". Counting the expected number of times is a much easier problem because you can use the linearity of expectation.


But you're trying to find a bit-substring of length 2 that is least likely to have been transmitted.

If the bit strings of length 3 are chosen uniformly, 11 is less likely to appear than 10.


You are totally right. He is incorrect.


This is (almost) how Knuth–Morris–Pratt string search algorithm works: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93P...


Another possible detail this misses - I learnt about De Bruijn sequences from Samy Kamkar here:

http://samy.pl/opensesame/

(scroll about 1/3rd way down to the "(U) The OpenSesame Attack" section for the bits relevant to this discussion...)

You don't need to enumerate every n-bit sequence, you just need to enumerate a (shorter - by 62% in Samy's 12 bit case) sequence that contains all the n-bit sequences.
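For anyone curious, here's one standard recursive construction of a binary De Bruijn sequence (my own sketch, not code from Samy's page); for n = 4 it packs all sixteen 4-bit windows into 19 bits instead of the 64 you'd need by naively concatenating every 4-bit string:

  def de_bruijn(n):
      # nodes are (n-1)-bit strings; each n-bit string is an edge traversed once
      seen = set()
      out = []
      def dfs(node):
          for bit in "01":
              edge = node + bit
              if edge not in seen:
                  seen.add(edge)
                  dfs(edge[1:])
                  out.append(bit)
      dfs("0" * (n - 1))
      return "".join(out) + "0" * (n - 1)

  s = de_bruijn(4)
  print(s, len(s))  # 19 bits
  print(all(format(i, "04b") in s for i in range(16)))  # True: every 4-bit string appears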


Interestingly, there was an old USAMTS problem that had De Bruijn sequences in it, though I wasn't aware of it at the time: http://www.usamts.org/Tests/Problems_25_1.pdf


This is the heart of the Boyer-Moore string searching algorithm too.


Does this still apply when your search space is on the order of thousands of petabytes, and the smallest unseen sequences are, according to the linked post, around 75 bits?


Wow, how surprising and, in a strange way, kind of beautiful! Thanks for this.


I'm reminded of constructing DFAs that represent regular expressions for class.


Yeah, you can see finite automata explicitly used in the PDF that's linked in the article itself: https://www.cs.cornell.edu/~ginsparg/physics/INFO295/mh.pdf


Yes, google "Penney ante".



