Here, I wrote it up to explain more clearly:

http://magarshak.com/blog/?p=318

Your script only tests samples drawn from ranges where the max is a power of 10.

I’m sorry to tell you this, but you inadvertently misled people with that empirical test. This just goes to show that we have to check our assumptions, as scientists or mathematicians trying to prove a statement. (Even with empirical tests :)

Hopefully this message will fix that at least for those people that are reading this thread! The rest will be confused. But that’s what happens in science all the time.

PS: I edited the original Wikipedia page with the explanation :)




You are artificially constructing a very limited range (with both an upper and a lower bound) in which, of course, the digits won't all be equally likely to appear as the first digit. This is nothing new. It does not contradict what other commenters have said about the uniform distribution. With large ranges, even if you exclude a power of 10 in the upper bound, it does not change the 11.11% chance of each digit being the first digit.

But let us accept your very limited range for a moment and go along with it. Then you say that the numbers in this range follow Benford's law. But clearly, they don't. None of the leading-digit probabilities in this range match the probabilities in Benford's law.

Someone needs to revert the dubious edit (https://en.wikipedia.org/w/index.php?title=Benford%27s_law&d...) you have made in Wikipedia.


This is simply wrong, and creating a new green account and downvoting me isn’t going to change that.

It is trivial to see that literally any range with min = 0 and max = any number other than a power of 10 makes it LESS likely that a 9 will come up as the first digit. For example, the range 0-300 has 1 and 2 come up as the first digit way more than the rest. Don't you think the same is true of 0-30000 and 0-300000000000000000000000? Making the range larger doesn't make your assertion, that for large ranges every leading digit begins to have an equal chance of appearing, any more true.

My point is that a uniform distribution from 0 to some max has to have that max somewhere. If we assume that the max itself is uniformly distributed, then we derive the proportions you find in Benford's law.

Look, to put it another way: Benford's law comes from the numbers that have the same number of digits as the max. The rest are evenly distributed, but those numbers are the most numerous at that point and they contribute the phenomenon. Ok?
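
Here is a minimal sketch of that two-step setup (the 1-10000 range for the max is just an arbitrary choice for illustration); it tallies the leading digits and prints them next to the log10(1 + 1/d) values from the Benford formula so you can compare:

    import math
    import random
    from collections import Counter

    def leading_digit(n):
        # first (most significant) digit of a positive integer
        return int(str(n)[0])

    counts = Counter()
    trials = 200_000
    for _ in range(trials):
        m = random.randint(1, 10_000)   # the max, itself drawn uniformly (assumed range)
        x = random.randint(1, m)        # a uniform sample between 1 and that max
        counts[leading_digit(x)] += 1

    for d in range(1, 10):
        print(d, round(counts[d] / trials, 3), round(math.log10(1 + 1 / d), 3))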

Are you convinced?

PS: There has got to be someone who figured this out before 2020. Come on. Someone post a link to this derivation.


> creating a new green account and downvoting me isn’t going to change that.

It is impossible on Hacker News for a new green account with fewer than 500 points to downvote someone else.


Do you admit this statement is wrong?

> With large ranges, even if you exclude a power of 10 in the upper bound, it does not change the 11.11% chance of each digit being the first digit.

The empirical test is cherry-picked, too.

If you don’t admit this then there is really no point to continue.


Can you provide a concrete example of a range of numbers that you think obeys Benford's law?


This is exactly the question I was going to ask.

I wrote: > The leading digits of a uniform distribution do not follow Benford's law.

And @EGreg wrote: > I’m sorry to tell you this, but you inadvertently misled people with that empirical test. This just goes to show that we have to check our assumptions, as scientists or mathematicians trying to prove a statement. (Even with empirical tests :)

So, what specific range of the uniform distribution yields leading digits that follow Benford's law?


Literally any range with min = 0 and where the max isn’t a power of 10.

For example 0-300

One third of the numbers have evenly distributed leading digits: 0-100

One third starts with 1: 100-200

One third starts with 2: 200-300

Do you understand?
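
For what it's worth, a quick tally of that 0-300 example (counting the integers 1 through 300) looks like this:

    from collections import Counter

    counts = Counter(int(str(n)[0]) for n in range(1, 301))   # leading digits of 1..300
    for d in range(1, 10):
        print(d, round(100 * counts[d] / 300, 1), "%")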


I understand.

There is a distribution of leading digits that looks like:

    d   P(d)
    1   30.1%   
    2   17.6%   
    3   12.5%   
    4   9.7%    
    5   7.9%    
    6   6.7%    
    7   5.8%    
    8   5.1%    
    9   4.6%    
 
As wikipedia says, "It has been shown that this result applies to a wide variety of data sets, including electricity bills, street addresses, stock prices, house prices, population numbers, death rates, lengths of rivers, physical and mathematical constants."

Neat! For each of those data sets you get the same distribution. Now, someone (I won't say who) says that it is also true for the uniform distribution.

But it isn't.

It simply isn't.

And I said as much when I said, "The leading digits of a uniform distribution do not follow Benford's law."

And your counterexample is that if you take a uniform distribution from 0-300, the leading digits come out to something like:

    d   P(d)
    1   36.7%   
    2   36.7%   
    3   3.7%   
    4   3.7%    
    5   3.7%
    6   3.7%    
    7   3.7%
    8   3.7%
    9   3.7%

Great, so I don't know how we can disagree at this point. The above distribution is not Benford's law.

> "The leading digits of a uniform distribution does not follow Benford's law." -- me

And you, directly disagreeing with that correct statement:

> This just goes to show that we have to check our assumptions, as scientists or mathematicians trying to prove a statement. -- EGreg

Indeed.


That's not Benford's law though. That's just a weird distribution due to a weird cutoff.

Benford's law is 1: 30.1%, 2: 17.6%, 3: 12.5%, etc.


For the record, you're moving the goalposts. The OP claimed that his example proves that the digits always have the same chance of appearing, which is clearly false.

When the max is uniformly distributed, Benford's law emerges. I mean, all you have to do is read the link, where I derive it.

What exactly is the law? Please don't handwave. If the law is those exact point values mentioned in the article, then I just showed you how we arrived at them.


What you are describing is not even the result of a uniform distribution. It's a two-step process involving two uniform distributions. The end result is some weird, non-uniform, downward-sloping distribution.


That's because we aren't trying to look at one specific uniform distribution. We were asking why Benford's law happens for almost all processes that follow a uniform distribution and record the result in positional notation with digits: namely, that 1 appears a lot more than 2, which appears a lot more than 3, etc., roughly in the proportion that 1 is twice as common as 2, which is about 1/3 more common than 3, and so on.

(Btw, it is NOT true for, e.g., dictionary words: an initial A doesn't appear more often than B. That should tell you something!)

And to understand the reason we just have to look at the family of uniform distributions, and see that for almost all of them, this proportion holds. Sure, for some of them, the 1,2,3 may be even MORE prevalent relative to 4-9 because the maximum value was 400 or 4000 or 40000. Ok? You can see this. For a uniformly distributed process that happens to have that as the maximum, Benford’s law will have the same proportions between 1,2,3 but then drop for 4-9 since they didn’t get that “boost”.

But if you keep sampling and this maximum keeps growing according to some continuous distribution that isn't perfectly synced with the decimal system, then it's as likely to be in the range 100-200 as it is to be in 200-300. And then as likely to be in 1000-2000 as in 2000-3000. Given that, we get something like Benford's law.

Now, perhaps it is ALSO TRUE for other distributions. I just explained why it’s true for uniform ones.
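
To make the "boost" point concrete, here is a small sketch that prints the leading-digit split for a few arbitrarily chosen maxima (300, 400, 900, and 2000 are just examples):

    from collections import Counter

    for top in (300, 400, 900, 2000):   # example maxima, chosen arbitrarily
        counts = Counter(int(str(n)[0]) for n in range(1, top + 1))
        split = [round(100 * counts[d] / top, 1) for d in range(1, 10)]
        print(top, split)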


If you took a random letter in the alphabet and then sampled from any letter before this letter, you would get more samples from the earlier letters of the alphabet. That is because the two-step process discards higher letters in the first step. This is not a uniform distribution and is not Benford's law. It's just a weird two-step process that over-samples earlier letters.


You are right, it would. But that's not how it works. People don't sample words by beginning with AAA and moving on to larger ones. So is your point just being put out there to "win"?


You just keep going round and round with handwaving that makes no sense. I read your link. I did not see Benford's law emerging anywhere in your link.

What does "max is uniformly distributed" even mean? If you think that the Benford's law holds good for a set of uniformly distributed numbers, why not simply provide that set? It would be so easy to prove your claim if you just provide an example set of numbers that obeys Benford's law.

All sets of numbers you have presented so far (0-300, 0-30000, 0-300000000000000000000000) do not follow Benford's law. It is very simple to show. In all these sets, the probability of the first digit being 1 is equal to the probability of the first digit being 2, which contradicts Benford's law.


That's because you aren't trying to find the probability of a digit given any SPECIFIC maximum; you are trying to sum the probability of the digit given that the maximum is in a certain range, over all ranges.

> With large ranges, even if you exclude a power of 10 in the upper bound, it does not change the 11.11% chance of each digit being the first digit.

That is JUST FALSE, ok? For pretty much any distribution you choose for the max, other than 100% chance that it is a power of 10 and 0% chance for other numbers, you'll get that the digit 1 comes up way more than 2, which would come up more than 3, etc. How much more? This comes from the fact that there are just as many numbers from 100 to 200 as there are from 0 to 100. Ok? And those are all 1s. Then you hit the 2s, and so on.

If the max happens to be anywhere in the range 100-1000 with equal probability, you get that result: Benford's law. If the max is distributed as some sort of continuous distribution (and not that ridiculous distribution of ONLY ever being powers of 10), then you likely get something similar.

What are you arguing about? If you are saying it's mysterious why the lower digits come up more than higher ones, well, the mystery is over. If you want an EXACT fit to the numbers in the article, then I think they come out whenever the max is uniformly distributed between 10^n and 10^(n+1). But there may also be a sort of "law of large numbers" thing where pretty much any continuous distribution of the max leads to this law. That part I can't tell you. What I can tell you is that OBVIOUSLY the lower digits will come out more frequently.
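
For the single-decade case just mentioned (the max uniformly distributed between 10^2 and 10^3), here is a sketch that averages the exact leading-digit split over every max from 100 to 999, each weighted equally, and prints it next to log10(1 + 1/d) so the fit can be judged directly:

    import math
    from collections import Counter

    probs = Counter()
    maxima = range(100, 1000)           # every max in one decade, each equally likely
    for m in maxima:
        counts = Counter(int(str(n)[0]) for n in range(1, m + 1))
        for d in range(1, 10):
            probs[d] += counts[d] / m / len(maxima)

    for d in range(1, 10):
        print(d, round(probs[d], 3), "benford:", round(math.log10(1 + 1 / d), 3))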


The numbers in the range 0-300 do not obey Benford's law. In base 10, a set of numbers obeys Benford's law if the leading significant digit d (0 < d < 10) occurs with probability log10(1 + 1/d). This isn't the case for the set of numbers between 1 and 300, inclusive.


Your assertion that for large ranges every digit has the same chance of appearing is very wrong. Your empirical test is rigged by choosing a very rare max, literally the only one where it would “prove” your assertion.

Benford's law appears when the max of your range is uniformly distributed.


If you present a weird distribution to begin with, it should not be surprising that every digit does not have the same chance of appearing. That's not the point. We are not talking about weird distributions here.

If we are going to argue like this, I might as well present a set of two numbers S = {1, 2} and claim that when we choose numbers uniformly from it, the probability of 3 occurring as the first digit is 0. Other commenters are not assuming weird distributions like this, because this kind of discussion does not provide any new insights and is just a waste of time.


You can create all the strawmen you want. I am going to quote from Wikipedia:

The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small.

I have explained why that happens for the vast majority of UNIFORMLY DISTRIBUTED VARIABLES.

The vast majority. That implies that there is a collection of all possible uniformly distributed variables, and in particular those that are sampled from real world processes.

As long as they are uniformly distributed, with 0 as the minimum and M as the maximum, the smaller leading digits will appear more commonly.

I explained it several times. Why are you still insisting that statements about the MAJORITY of uniform distributions are weird?

Yes, statements about collections of uniform distributions are not statements about ONE SPECIFIC uniform distribution. And?


Can you provide an example range of uniformly distributed integers that obeys Benford's law?


> The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small.

Pretty much all of them.


You are quoting from Wikipedia and that quote is an oversimplification.

If you redefine the law like that, then sure, I agree that there are many uniform distributions too where the 1st digit is likely to be small. Here is another simple example: Consider the distribution of positive integers from 1 to 2. If we pick a number at random from {1, 2} then the 1st digit is likely to be small. This kind of analysis is boring.

But (fortunately!) that's not what Benford's law says. Benford's law provides a specific formula. Check https://en.wikipedia.org/wiki/Benford%27s_law#Definition to see the specific formula that must hold for a set of numbers to be said to obey Benford's law. That's what makes Benford's law so interesting, whereas your example ranges are degenerate cases where nothing new, surprising, or interesting is going on.


Actually, if you look at examples in the real world, you don't always get that EXACT distribution, but rather the phenomenon that the smaller digits are way more prevalent.

And once again, this is because of a simple analysis. Let me state it a DIFFERENT way; maybe this is something that you will take note of: the only time all possible digits have an equal probability of being leading digits is when we have a uniform distribution with a max that is a power of 10. Literally every other distribution starts to exhibit that phenomenon. Now you can quibble as to which distributions lead to that EXACT curve fit. And there can be explanations for why power-law distributions do. But other distributions exhibit this same PHENOMENON, while not necessarily converging to that exact proportion. As I said before, any other uniform distribution would not have those exact proportions, but would exhibit the phenomenon.

Basically, every continuous distribution is highly unlikely to have a cliff at a power of 10. It is going to go down gradually, and therefore if it includes the range 8000-9000 then it will probably include numbers above 10000. And even a discontinuous one with a uniform distribution (with a cliff at the end) will exhibit the phenomenon. OK? So if you have the range 8000-9000 in there, that means 1s and 2s were a lot more prevalent, and if you have a continuous distribution then 10,000+ numbers will be there, but perhaps not numbers 90,000+.

Do you at least get the intuition behind this? As soon as you get numbers close to a power of 10, your distribution probably includes numbers in the next order of magnitude, i.e. a lot of leading 1s. The more numbers you get starting with 9, the less likely it is that you're right at the max of your distribution, with a cliff. Unless that happens to be the contrived "empirical test" that was linked to as "proof" that uniform distributions give every digit an equal chance of being the leading one.

The intuition is what matters. Now, maybe for UNIFORM distributions, or POWER-LAW distributions, that exact curve fit can be worked out. Perhaps you can show there is a "large" family of distributions for which the curve fits, kind of like the law of large numbers. But I didn't do anything quite as ambitious. I simply showed why it's not just true for normally distributed processes, but also for others that you would think are uniformly distributed. Because chances are, that process has more 1s than 2s, and 2s than 3s, in exactly the proportion that a uniform distribution with some max would have, and chances are the max isn't exactly a power of 10.

In practice, all this means is that the distribution may look like Benford's law for 1, 2, 3 and then drop to equally small values for 4, 5, 6, 7, 8, 9. The 1 is going to be 10x more prevalent than the 9. The 2 would be 5x more prevalent, or whatever, UNLESS the distribution had a cutoff right before 200 or 2000. Understand? And this DOES HAPPEN.
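
As a sketch of the "no cliff at a power of 10" idea, here is a simulation that draws from an exponential distribution (used purely as a stand-in for some continuous distribution that tapers off rather than cutting off; the mean of 5000 is arbitrary) and tallies leading digits next to the Benford values:

    import math
    import random
    from collections import Counter

    def leading_digit(x):
        # first significant digit of a positive number, via scientific notation
        return int(f"{x:e}"[0])

    counts = Counter()
    trials = 200_000
    for _ in range(trials):
        counts[leading_digit(random.expovariate(1 / 5000))] += 1

    for d in range(1, 10):
        print(d, round(counts[d] / trials, 3), "benford:", round(math.log10(1 + 1 / d), 3))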



