[flagged] Musk: Twitter legal said I violated NDA by revealing bot check sample size = 100 (twitter.com/elonmusk)
24 points by pcbro141 on May 15, 2022 | 26 comments


A sample size of 100 is fairly reasonable, isn't it? The worst-case standard deviation is 5%, and 100 is a small enough number that humans can take the time to carefully inspect each account and actually determine whether it's a bot. So 100 is about the sample size I'd expect them to use here. Not sure why that would need to be a secret.

(The key thing is that the sample has to be truly random, which is usually very hard to do, but Twitter can easily pick 100 random accounts in their own database.)


I'm not sure that I follow. Why is the worst-case standard deviation 5%? And for which random variable?

I vaguely do remember the standard way of coming up with a reasonable sample size.

Assuming:

1) We want the commonly used confidence level of 95% (corresponding to a Z-score of ~1.96).

2) We want a margin of error of 1 percentage point (kinda reasonable, since their estimate is stated as a percentage without explicitly stating the margin of error).

3) We don't know anything about the expected percentage of bots a priori.

4) We ignore the error in determining an account is a bot or not.

Then, using n = z^2 * p(1-p) / E^2 with the worst case p = 0.5, we'll need a sample size of 1.96^2/(2*0.01)^2 = 9604.
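
The calculation above can be sketched in a few lines of Python. This is only a rough sketch under the usual normal approximation for a proportion, n = z^2 * p(1-p) / E^2. The function name `required_sample_size` is made up for illustration, and p = 0.5 is the worst case implied by assumption 3.

```python
import math

def required_sample_size(margin, z=1.96, p=0.5):
    """Smallest n whose normal-approximation margin of error is <= margin.

    Hypothetical helper: z=1.96 is the 95% z-score, p=0.5 is the worst case.
    """
    n_raw = z**2 * p * (1 - p) / margin**2
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(n_raw, 6))

# Margins of 1, 5, and 10 percentage points.
for margin in (0.01, 0.05, 0.10):
    print(margin, required_sample_size(margin))
```

Under these assumptions a 1-point margin needs 9604 accounts, while a sample of about 100 corresponds to a margin of roughly 10 points.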


(See below for two answers about the math that gets to 5%.)

> We want a margin of error of 1 percentage point (kinda reasonable, since their estimate is stated as a percentage without explicitly stating the margin of error)

If the question they want to answer is "give me the exact % of bots on Twitter", then yes, you probably want a standard deviation of less than 1%. That would take many thousands of samples, as you say - much larger than sample sizes used in presidential elections, even.

OTOH, my guess is that the question for them is "are bots a big part of the Twitter population?" In that case, 97% of accounts being real versus 96% doesn't matter much, since both show bots are a very small group. (But maybe bots write a disproportionately large # of tweets?)


I don't think 100 is terribly reasonable. For a company with the size and resources of Twitter, it should be trivial to check several thousand.

I'm not sure why the worst-case standard deviation is 5%. Presumably the accuracy of the bot decision plays into it. But I would imagine a couple thousand would get a far more accurate answer.


The goal is to estimate the proportion of accounts that are bots. Let that number equal p. The variance of a single bot/not-bot observation is p(1-p). The highest that can be is (0.5)(0.5) = 0.25, at p = 0.5. The standard deviation is then the square root of that, which is 0.5.

Now, we want to know the standard error for our estimate of the bot proportion. That is sqrt(p(1-p)/n). Suppose 50% of accounts are bots (I assume that would be very high), then our estimate of p would be 0.5 and our standard error with a sample of 100 would be 0.05. Hence, our 95% confidence interval is roughly 0.4–0.6 in the worst case (with a sample of 100).

If the proportion is under 0.1 (let's assume 0.05), then the standard error would be sqrt(0.05(1-0.05)/100) = 0.022. Our 95% confidence interval in this case would be roughly 0.01–0.09.

These seem like large ranges to me. Hence, I would expect them to use a larger sample too.
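
For anyone who wants to check those two intervals, here is a minimal sketch using the same normal-approximation formula, sqrt(p(1-p)/n). The helper name `proportion_ci` is hypothetical, and the approximation is known to be rough when the expected number of bots in the sample is small (about 5 out of 100 here), so treat the second interval as indicative only.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Standard error and normal-approximation confidence interval.

    Hypothetical helper; z=1.96 corresponds to a 95% interval.
    """
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    # Clamp to [0, 1], since a proportion cannot leave that range.
    lower = max(0.0, p_hat - z * se)
    upper = min(1.0, p_hat + z * se)
    return se, lower, upper

# The two cases discussed above, both with a sample of 100.
for p_hat in (0.5, 0.05):
    se, lower, upper = proportion_ci(p_hat, 100)
    print(f"p={p_hat}: se={se:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```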


The worst case is 5% because "is a bot" is a boolean variable. That has a standard deviation of 0.5 in the very worst case (50% on value 0, 50% on value 1, so the mean is 0.5, and the deviation can't be more than 0.5). And the standard deviation of a random sample scales like 1 over the square root of the sample size, so 0.5 divided by 10 => 0.05 (5%).

For presidential elections it's common to see sample sizes like 1,000 or so, which have standard deviations of around 1.5%. That's better than 5%, and it makes sense since elections are often won by just a few %. But here, IIUC, the goal was to see if bots are a big part of Twitter or not, and so the answer "there are fewer than 5% bots" is enough. That is, we don't care if the % of bots is 3.5% or 4.7%.
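
To see the 1-over-square-root scaling, a quick sketch (the function name `worst_case_se` is just for illustration, and it assumes the worst case of a 50/50 split):

```python
import math

def worst_case_se(n):
    """Worst-case standard error of a sampled proportion (p = 0.5)."""
    return 0.5 / math.sqrt(n)

# A Twitter-style sample of 100 versus poll-sized samples.
for n in (100, 1000, 10000):
    print(n, round(worst_case_se(n), 4))
```

Going from 100 to 1,000 samples only shrinks the error by a factor of about 3, which is why tight estimates get expensive quickly.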


The standard deviation of 50 ones and 50 zeros is 0.50.

What is your 5% value referencing?

https://www.calculator.net/standard-deviation-calculator.htm...


Didn't azakai just say that?

> That has a standard deviation of 0.5 in the very worst case

followed by

> the standard deviation of a random sample scales like 1 over the square root of the sample size, so 0.5 divided by 10 => 0.05 (5%).
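
The two numbers measure different things, which a small sketch makes concrete. It assumes a hypothetical sample of 50 bots (1) and 50 real accounts (0): the 0.5 is the spread of the individual values, and the 0.05 is the uncertainty of their average.

```python
import math
import statistics

# Hypothetical worst-case sample: 50 bots and 50 real accounts.
sample = [1] * 50 + [0] * 50

# Spread of the individual 0/1 values (what the linked calculator reports).
spread = statistics.pstdev(sample)

# Uncertainty of the sample mean, i.e. of the estimated bot fraction.
std_error = spread / math.sqrt(len(sample))

print(spread, std_error)
```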


100 gets you roughly a 90% confidence level w/ an 8 point margin of error. Not very good.
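
That figure can be reproduced with Python's standard library. This is a sketch under the worst-case assumption p = 0.5 and the normal approximation; `margin_of_error` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def margin_of_error(n, confidence, p=0.5):
    """Two-sided normal-approximation margin of error for a proportion."""
    # z-score that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)

# A sample of 100 at 90% and 95% confidence.
for confidence in (0.90, 0.95):
    print(confidence, round(margin_of_error(100, confidence), 3))
```

At 90% confidence the margin comes out at about 8 points, and at 95% it is closer to 10.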


This business deal is far too dramatic for me to continue caring


Yet you commented about not continuing caring...


I'm surprised the comments here aren't about violating the NDA.

I'm interested to hear how Musk violated the NDA, not whether his bot checking approach is good or bad.


I'm confused by your confusion. Pretty clearly, Musk, as acquirer, was made privy to the methodology Twitter uses to estimate and report the number of bot accounts. This methodology is a trade secret, and therefore was disclosed under an NDA. Therefore Musk cannot tell anyone the details (until he buys the company, then he can do whatever he wants).

People focused on the bot checking approach because the NDA question is pretty dull and whether he technically did or not depends primarily on how paperwork was filed.


Wait, you mean that the methodology of sampling 100 followers to check for bots is covered by the NDA?

This is the trade secret of Twitter's bot checking approach?

Wow, this is so mundane.


NDAs can cover material mundane, dull or arbitrary. A password could be a secret, and it's all those things. The number of users sampled could easily be a trade secret.

It's not like the NDA says "keep our secrets unless you think they're too boring to keep". It says "anything marked in the following way is a secret you cannot reveal." If Musk thought that was public knowledge, he could fight them in court. Of course, that would torpedo his "this is new information to me, I need to get out of the TWTR acquisition" claims.


I meant it was mundane in the sense that any new stats grad could have come up with this methodology.

If you give anyone 5 seconds to come up with a methodology to check for bots, they would probably come up with "yeah, let's sample 100 of our followers and check manually".

It is such an obvious, simple method that it surprises me Twitter can even call it a trade secret under NDA.


Here's an example of something more obvious. We have a new finance grad. He comes up with the methodology of "let's take all the corporate profit for the quarter, divide it by the outstanding shares, and distribute it". It's a bit simple, but it's fine. However, if you know that comes out to $7.22/share and announce that ahead of the announcement, you would breach your NDA. It's just a stupid number, but it means something.

Similarly, the fact that Twitter used 100, instead of 1000 or 2700 gives you error bars. It means something.

Semi-unrelated: a new stat grad would almost certainly not choose 100. They would feel compelled to run some math and come up with a different sample size that matched the results of some equations for power analysis, etc.


Parag just went on a long Twitter thread basically saying that this is not their approach to checking for bots.

So, I guess it is not that mundane.


Is there really a need to repost everything Elon Musk says on Twitter here?


You can easily skip the thread, no?

Downvote it if you like.


It’s not possible to downvote posts, only comments.


Then, skip it?

I'm interested in the discussion, so ....


Then let’s check it out and participate on Twitter! That is literally where it is happening.


Twitter discussion is markedly different from HN discussion.

Therefore, I'd prefer discussing it here.

Your opinion is certainly valid, so is mine.

If there was only some sort of democratised mechanism to settle this as a community and a way to easily hide the threads one isn't interested in.... I could only wish.


It's a discussion forum.


That’s what Twitter itself is.



