Benford's Law (wikipedia.org)
145 points by kjhughes on Feb 16, 2020 | 93 comments



Benford's law, Zipf's law and other power laws tend to fascinate, but they are a manifestation of something really simple. It's how things behave when they are evenly distributed over multiplicative operations instead of additive operations. By this I mean things are as likely to be x% bigger as x% smaller, instead of the more common definition of evenly distributed (things x units larger are as likely as things x units smaller).

Being evenly distributed over multiplication is natural for many measures. I think it is considered the most even, "maximum entropy" distribution for things that can't be negative.

The most even, maximum entropy distribution for an unbounded (-infinity to +infinity) prior is an unbounded uniform distribution but you can't apply it, for example, in the case of the size of objects since it would imply that negative sizes are as likely as positive sizes (and objects don't have negative sizes).

In general, position parameters tend to be uniformly distributed over addition/subtraction whereas scale parameters tend to be evenly distributed over multiplication (uniform over log(x)). It's the reason some graphs use a log axis. For Bayesians, it's a fairly straightforward concept. See also improper priors (https://en.wikipedia.org/wiki/Prior_probability#Improper_pri...)
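
A minimal sketch of that idea in Python (assuming nothing beyond the standard library): draw samples that are uniform in log space and compare the leading-digit frequencies with Benford's predicted log10(1 + 1/d).

  import math
  import random

  samples = 10**6
  counts = [0] * 10

  for _ in range(samples):
      # uniform over log10(x): as likely to be 10% bigger as 10% smaller
      x = 10 ** random.uniform(0, 6)
      counts[int(str(x)[0])] += 1

  for d in range(1, 10):
      print('{} => {:.3f} (Benford: {:.3f})'.format(
          d, counts[d] / samples, math.log10(1 + 1 / d)))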


The max entropy distribution over (-oo,+oo) is usually taken to be either Gaussian or Laplace (two-sided exponential), depending on moment conditions. There is no such thing as a uniform distribution over the real line.


Of course there is. It's an improper prior (doesn't integrate to one), see link above. Theoretically the Gaussian is only maximum entropy given a mean and stddev. An infinite uniform is maxent without any knowledge of mean and stddev.

In practice though you can use a wide Gaussian if it is more mathematically convenient and get about the same results.


The entropy of any improper prior diverges. There is no point in arguing over which one maximizes entropy; they all do, vacuously. Max entropy is not useful for discriminating amongst probability distributions unless you require them to actually be probability distributions.


Side note: great how you spell infinity. I've never seen this before. So simple and smart.


Came here to say the same thing. I certainly haven't seen infinity depicted that way before. But it's a nice way to do it. I am going to use it myself.


It's used as the symbol for infinity in SymPy, and probably other (and older) computer algebra systems.


You can listen to a paper on this topic here: https://www.youtube.com/watch?v=pGBcBeX1PSA :)


As the article explains, Benford's law applies to distributions over values that evolve over time with multiplicative fluctuations, not to multiplicative distributions themselves. A simple uniform distribution over [0,N] obeys Benford's law when averaged over any range of N.


> A simple uniform distribution over [0,N] obeys Benford's law

A simple uniform distribution does not obey Benford's law because each digit from 1 to 9 is equally likely to be the leading significant digit. Thus each digit occurs as the leading digit with 11.1% probability. See https://news.ycombinator.com/item?id=22340018 for an example.


What if N is 19?


It does not matter what N you choose. The experimental results are always an approximation of Benford's law. The approximation gets better and better as N becomes larger and larger. The fact that you do not see Benford's law working with N = 19 is not in any way a flaw in the way the experimental results are set up. It just means N is too small to approximate Benford's law closely.


The larger the value of N is, the more the experiments agree with Benford's law.


Of course it does, because there is going to be a maximum of the range [A, B]. Even if the maximum itself had a uniform distribution, all numbers drawn from the samples would still be below it.

So the probability that the number starts with a 5 (in decimal representation) is sort of like the sum over all N of:

P(Maximum = N) × P(number starts with 5 | Maximum = N)

The thing is that when N is anything other than 9,999 etc., the second probability IS NOT 1/9 for each digit. It is skewed towards the lower digits, because the most significant digit is less than 9 more frequently!

So this is the real explanation. And I just came up with it on the spot while writing this. Yay!

Hehe
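
A quick simulation of that sum, as a sketch that assumes the maximum is itself uniform on [1, 10000]:

  import random

  samples = 10**6
  counts = [0] * 10

  for _ in range(samples):
      maximum = random.randint(1, 10**4)   # the max of the range is itself random
      number = random.randint(1, maximum)  # then draw uniformly below that max
      counts[int(str(number)[0])] += 1

  for d in range(1, 10):
      print('{} => {:.3f}'.format(d, counts[d] / samples))

The small leading digits dominate, though the exact proportions depend on how the maximum is distributed.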


The leading digits of a uniform distribution do not follow Benford's law.

Look at: https://news.ycombinator.com/item?id=22340018 from elsewhere in this thread or https://news.ycombinator.com/item?id=21541264 of mine from a couple of months ago.


Here, I wrote it up to explain more clearly:

http://magarshak.com/blog/?p=318

Your script only tests samples which are from ranges where the max is a power of 10.

I’m sorry to tell you this, but you inadvertently misled people with that empirical test. This just goes to show that we have to check our assumptions, as scientists or mathematicians trying to prove a statement. (Even with empirical tests :)

Hopefully this message will fix that at least for those people that are reading this thread! The rest will be confused. But that’s what happens in science all the time.

PS: I edited the original Wikipedia page with the explanation :)


You are artificially constructing a very limited range (upper bound and lower bound) in which of course the likelihood of any digit being the first digit won't be equal. This is nothing new. This does not contradict what other commenters have said about uniform distribution. With large ranges, even if you exclude a power of 10 in the upper bound, it does not change the 11.11% chance of each digit being the first digit.

But let us accept your very limited range for a moment and go along with it. Then you say that the numbers in this range follow Benford's law. But clearly, they don't. None of the digit probabilities in this range match the probabilities in Benford's law.

Someone needs to revert the dubious edit (https://en.wikipedia.org/w/index.php?title=Benford%27s_law&d...) you have made in Wikipedia.


This is simply wrong, and creating a new green account and downvoting me isn’t going to change that.

It is trivial to see that literally any range with min = 0 and max = any number other than a power of 10 makes it LESS likely that a 9 will come up as the first digit. For example, in the range 0-300, 1 and 2 come up as the first digit way more than the rest. Don't you think the same is true of 0-30000 and 0-300000000000000000000000? The size of the range doesn't make your assertion (that for large ranges every leading digit has an equal chance of appearing) any more true.

My point is that, given a uniform distribution from 0 to a max, it has to have a max somewhere. If we assume that max itself is uniformly distributed then we derive the proportions you find in Benford’s law.

Look, to put it another way: Benford's law comes from the numbers that have the same number of digits as the max. The rest are evenly distributed, but those numbers are the most numerous at that point, and they contribute the phenomenon. Ok?

Are you convinced?

PS: There has got to be someone who figured this out before 2020. Come on. Someone post a link to this derivation.


> creating a new green account and downvoting me isn’t going to change that.

It is impossible on Hacker News for a new green account with fewer than 500 points to downvote someone else.


Do you admit this statement is wrong?

> With large ranges, even if you exclude a power of 10 in the upper bound, it does not change the 11.11% chance of each digit being the first digit.

The empirical test is cherrypicked also.

If you don’t admit this then there is really no point to continue.


Can you provide a concrete example of a range of numbers that you think obeys Benford's law?


This is exactly the question I was going to ask.

I wrote: > The leading digits of a uniform distribution do not follow Benford's law.

And @EGreg wrote: > I’m sorry to tell you this, but you inadvertently misled people with that empirical test. This just goes to show that we have to check our assumptions, as scientists or mathematicians trying to prove a statement. (Even with empirical tests :)

So, what specific range of the uniform distribution yields leading digits that follow Benford's law?


Literally any range with min = 0 and where the max isn’t a power of 10.

For example 0-300

One third of numbers are evenly distributed: 0-100

One third starts with 1: 100-200

One third starts with 2: 200-300

Do you understand?
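
Counting that example out exactly, as a sketch over the integers 1 to 300:

  counts = [0] * 10
  for n in range(1, 301):
      counts[int(str(n)[0])] += 1

  for d in range(1, 10):
      print('{} => {:.3f}'.format(d, counts[d] / 300))

Digits 1 and 2 each lead 111 of the 300 numbers (37%), while 4 through 9 each lead about 3.7% of them.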


I understand.

There is a distribution of leading digits that looks like:

    d   P(d)
    1   30.1%   
    2   17.6%   
    3   12.5%   
    4   9.7%    
    5   7.9%    
    6   6.7%    
    7   5.8%    
    8   5.1%    
    9   4.6%    
 
As wikipedia says, "It has been shown that this result applies to a wide variety of data sets, including electricity bills, street addresses, stock prices, house prices, population numbers, death rates, lengths of rivers, physical and mathematical constants."

Neat! For each of those data sets you get the same distribution. Now, someone (I won't say who) says that it also holds for the uniform distribution.

But it isn't.

It simply isn't.

And I said as much when I said, "The leading digits of a uniform distribution do not follow Benford's law."

And your counterexample is: if you take a uniform distribution from 0-300, the leading digits go to something like:

    d   P(d)
    1   36.7%   
    2   36.7%   
    3   3.7%   
    4   3.7%    
    5   3.7%
    6   3.7%    
    7   3.7%
    8   3.7%
    9   3.7%
Great, so I don't know how we can disagree at this point. The above distribution is not Benford's Law.

> "The leading digits of a uniform distribution do not follow Benford's law." -- me

And you, directly disagreeing with that correct statement:

> This just goes to show that we have to check our assumptions, as scientists or mathematicians trying to prove a statement. -- EGreg

Indeed.


That's not Benford's law though. That's just a weird distribution due to a weird cutoff.

Benford's law is 1: 30.1%, 2: 17.6%, 3: 12.5%, etc.


For the record, you're moving the goalposts. The OP claimed that his example proves that the digits always have the same chance of appearing, which is clearly false.

When the max is uniformly distributed then Benford’s law emerges. I mean, all you have to do is read the link - where I derive it.

What exactly is the law — please don’t handwave. If the law is those exact point values mentioned in the article then I just showed you how we arrived at them.


What you are describing is not even the result of a uniform distribution. It's a two-step process involving two uniform distributions. The end result is some weird non-uniform, downward-sloping distribution.


That’s because we aren’t trying to look at one specific uniform distribution. We were asking why Benford’s law happens for almost all processes that follow a uniform distribution and record the result as positional notation with digits — namely that 1 appears a lot more than 2, which appears a lot more than 3, etc. Roughly in the proportion that 1 is twice that of 2, which is 1/3 more than 3, etc.

(Btw it is NOT true for dictionary words, for example; an initial A doesn't appear more often than an initial B. That should tell you something!)

And to understand the reason we just have to look at the family of uniform distributions, and see that for almost all of them, this proportion holds. Sure, for some of them, the 1,2,3 may be even MORE prevalent relative to 4-9 because the maximum value was 400 or 4000 or 40000. Ok? You can see this. For a uniformly distributed process that happens to have that as the maximum, Benford’s law will have the same proportions between 1,2,3 but then drop for 4-9 since they didn’t get that “boost”.

But if you keep sampling and this maximum keeps growing by some continuous distribution that’s not perfectly synced with the metric system, then it’s as likely to be in the range 100-200 as it is to be in 200-300. And then as likely to be in 1000-2000 as in 2000-3000. Given that, we get something like Benford’s law.

Now, perhaps it is ALSO TRUE for other distributions. I just explained why it’s true for uniform ones.


If you took a random letter in the alphabet and then sampled from any letter before this letter, you would get more samples from the earlier letters of the alphabet. That is because the two-step process discards higher letters in the first step. This is not a uniform distribution and is not Benford's law. It's just a weird two-step process that over-samples earlier letters.


You are right, it would — but that's not how it works. People don't sample words by beginning with AAA and moving on to larger ones. So is your point just being put out there to "win"?


You just keep going round and round with handwaving that makes no sense. I read your link. I did not see Benford's law emerging anywhere in it.

What does "max is uniformly distributed" even mean? If you think that Benford's law holds for a set of uniformly distributed numbers, why not simply provide that set? It would be so easy to prove your claim if you just provided an example set of numbers that obeys Benford's law.

All the sets of numbers you have presented so far (0-300, 0-30000, 0-300000000000000000000000) fail to follow Benford's law. It is very simple to show: in all these sets, the probability of the first digit being 1 is equal to the probability of the first digit being 2, which contradicts Benford's law.


That’s because you aren’t trying to find the probability of a digit given any SPECIFIC maximum, you are trying to sum the probability of the digit given that the maximum is in a certain range, over all ranges.

> With large ranges, even if you exclude a power of 10 in the upper bound, it does not change the 11.11% chance of each digit being the first digit.

That is JUST FALSE, ok? For pretty much any distribution you choose for the max, other than 100% chance it is a power of 10 and 0% chance for other numbers, you'll get that the digit 1 comes up way more than 2, which comes up more than 3, etc. How much more? This comes from the fact that there are just as many numbers in 100-200 as there are in 0-100. Ok? And those are all 1s. Then you hit the 2s, and so on.

If the max happens to be anywhere in the range 100-1000 with equal probability, you get that result. Benford’s law. If the max is distributed as some sort of continuous distribution — and not that ridiculous distribution of ONLY ever being powers of 10 — then you likely get something similar.

What are you arguing about?? If you are saying it’s mysterious why the lower digits come up more than higher ones, well the mystery is over. If you want an EXACT fit to the numbers in the article then I think they come out whenever the max is uniformly distributed between 10^n and 10^(n+1). But they may also have a sort of “law of large numbers” thing where pretty much any continuous distribution of the max leads to this law. That part I can’t tell you. What I can tell you is OBVIOUSLY the lower digits will come out more frequently.


The numbers in the range 0-300 do not obey Benford's law. In base 10, a set of numbers obeys Benford's law if the leading significant digit d (0 < d < 10) occurs with probability log10(1 + 1/d). This isn't the case for the set of numbers between 1 and 300, inclusive.


Your assertion that for large ranges every digit has the same chance of appearing is very wrong. Your empirical test is rigged by choosing a very rare max, literally the only one where it would “prove” your assertion.

Benford’s law appears when the max of your range is uniformly distributed


If you present a weird distribution to begin with, it should not be surprising that every digit does not have the same chance of appearing. That's not the point. We are not talking about weird distributions here.

If we are going to argue like this, I might as well present a set of two numbers S = {1, 2} and claim that when we choose numbers from a uniform distribution, the probability of 3 occurring as the first digit is 0. Other commenters are not assuming weird distributions like this because this kind of discussion does not provide any new insights and is just a waste of time.


You can create all the strawmen you want. I am going to quote from Wikipedia:

> The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small.

I have explained why that happens for the vast majority of UNIFORMLY DISTRIBUTED VARIABLES.

The vast majority. That implies that there is a collection of all possible uniformly distributed variables, and in particular those that are sampled from real world processes.

As long as they are uniformly distributed, with 0 as the minimum and M as the maximum, the smaller first digits will appear more commonly.

I explained it several times. Why are you still insisting that statements about the MAJORITY of uniform distributions are weird?

Yes statements about collections of uniform distributions are not statements about ONE SPECIFIC uniform distribution. And?


Can you provide an example range of uniformly distributed integers that obeys Benford's law?


> The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small.

Pretty much all of them.


You are quoting from Wikipedia and that quote is an oversimplification.

If you redefine the law like that, then sure, I agree that there are many uniform distributions too where the 1st digit is likely to be small. Here is another simple example: Consider the distribution of positive integers from 1 to 2. If we pick a number at random from {1, 2} then the 1st digit is likely to be small. This kind of analysis is boring.

But (fortunately!) that's not what Benford's law says. Benford's law provides a specific formula. Check https://en.wikipedia.org/wiki/Benford%27s_law#Definition to see the specific formula that must hold for a set of numbers to be said to obey Benford's law. That's what makes Benford's law so interesting, whereas your example ranges are degenerate cases where nothing new, surprising, or interesting is going on.


Actually if you look at examples in the real world, you don't always get that EXACT distribution, but rather the phenomenon that the smaller digits are way more prevalent.

And once again, this is because of a simple analysis. Let me state it a DIFFERENT way, maybe this is something that you will take note of: the only time all possible digits have equal probability of being leading digits is when we have a uniform distribution with a max that is a power of 10. Literally every other distribution starts to exhibit that phenomenon. Now you can quibble as to what distributions lead to that EXACT curve fit. And there can be explanations for why power law distributions do. But other distributions exhibit this same PHENOMENON, while not necessarily converging to that exact proportion. As I said before, any other uniform distribution would not have those exact proportions, but would exhibit the phenomenon.

Basically, every continuous distribution is highly unlikely to have a cliff at a power of 10. It is going to go down gradually, and therefore if it includes the range 8000-9000 then it will probably include numbers above 10000. And even a discontinuous one with a uniform distribution (with a cliff at the end) will exhibit the phenomenon. OK? So if you have the range 8000-9000 in there, that means 1s and 2s were a lot more prevalent, and if you have a continuous distribution then 10,000+ numbers will be there, but perhaps not numbers 90,000+.

Do you at least get the intuition behind this? As soon as you get numbers close to a power of 10, your distribution probably includes numbers in the next order of magnitude, i.e. a lot of leading 1s. The more numbers you get starting with 9, the more it is highly unlikely you're right at the max of your distribution, with a cliff. Unless that happens to be the contrived "empirical test" that was linked to as "proof" that uniform distributions lead to equal chances for every digit to be leading.

The intuition is what matters. Now, maybe for UNIFORM distributions, or POWER distributions, that exact curve fit can be worked out. Perhaps you can show there is a "large" family of distributions for which the curve fits. Kind of like the law of large numbers. But I didn't do anything quite as ambitious. I simply showed why it's not just true for normally distributed processes, but also for others which you would think are uniformly distributed. Because chances are, that process has more 1s than 2s, and 2s than 3s, in exactly the proportion that a uniform distribution with some max would have, and chances are the max isn't exactly a power of 10.

In practice, all this means is that the distribution may look like Benford's law for 1, 2, 3 and then drop to equally small for 4, 5, 6, 7, 8, 9. The 1 is going to be 10x more prevalent than the 9. The 2 would be 5x more prevalent or whatever, UNLESS the distribution had a cutoff right before 200 or 2000. Understand? And this DOES HAPPEN.


Time doesn't have to be a component. Also data distributed uniformly in a range would not obey Benford's law. Each digit would appear with equal frequency.


Quoted from The New York Times, Tuesday, August 4, 1998 http://www.rexswain.com/benford.html (linked in Wikipedia):

To illustrate Benford's Law, Dr. Mark J. Nigrini offered this example: "If we think of the Dow Jones stock average as 1,000, our first digit would be 1.

"To get to a Dow Jones average with a first digit of 2, the average must increase to 2,000, and getting from 1,000 to 2,000 is a 100 percent increase.

"Let's say that the Dow goes up at a rate of about 20 percent a year. That means that it would take five years to get from 1 to 2 as a first digit.

"But suppose we start with a first digit 5. It only requires a 20 percent increase to get from 5,000 to 6,000, and that is achieved in one year.

"When the Dow reaches 9,000, it takes only an 11 percent increase and just seven months to reach the 10,000 mark, which starts with the number 1. At that point you start over with the first digit a 1, once again. Once again, you must double the number -- 10,000 -- to 20,000 before reaching 2 as the first digit.

"As you can see, the number 1 predominates at every step of the progression, as it does in logarithmic sequences."

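That walkthrough is easy to simulate, as a sketch assuming steady 20% annual growth compounded monthly: follow the index across several orders of magnitude and count how much time it spends on each leading digit.

  counts = [0] * 10
  dow = 1000.0

  while dow < 10**9:              # track the index across six orders of magnitude
      counts[int(str(int(dow))[0])] += 1
      dow *= 1.2 ** (1 / 12)      # 20% per year, applied monthly

  total = sum(counts)
  for d in range(1, 10):
      print('{} => {:.3f}'.format(d, counts[d] / total))
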

I used this (well, the idea that numbers often tend to be distributed evenly in exponential space) in my MEng thesis. The idea was that because additions and other arithmetic operations in computers so often involve at least one small number, most of the switching activity in an adder should be concentrated in the lower-order bits.

So the plan was to make the low order bits of the adder slower and more power efficient. Make the high order bits faster and less power efficient. Meet the same overall clock rate target given by the worst case but use less power in the average case.

Sadly memory operations, which look a lot like pure entropy, were sufficiently common that the gains just weren't worth the design effort.


This law actually led to Bernie Madoff being caught. I worked at a hedge fund at the time and we applied this to his returns and immediately knew it was a scam.


Many people got wise to Madoff before his scheme collapsed, notably Harry Markopolos. But Madoff was essentially never caught. Rather, his scheme ended the way most Ponzi schemes end: he no longer had enough money to pay withdrawal requests from the money that was coming in.

https://en.wikipedia.org/wiki/Bernie_Madoff#Investment_scand...


I’ve heard of this before, but part of me thinks also: How many people are getting away with fraud because they are utilizing Benfords law in order to make their data appear more normal? Or is there another attribute that can’t be accounted for when aiming for that?


I've read that they have multiple "tells" for catching fraudsters. There are almost 100 of them, and not all of them are as well known as Benford's law. At that point, it's just easier to run a legit business.


Any pointers to details or research?


A lot of them probably deal with timing and some correlation-type analysis.


Sources?


Not to refute you, but my understanding was that many people simply assumed something was wrong, if only because the consistency of the returns he claimed was pretty much impossible.

IIRC at one point his numbers showed losses only 4% of the time on a month-to-month basis.


Can’t people use this law to defraud? I bet the more sophisticated ones already are



If anyone's interested, this is the breakdown of the leading digits of the number of comments for these threads:

{'1': 0.6896551724137931, '2': 0.13793103448275862, '3': 0.06896551724137931, '4': 0.06896551724137931, '5': 0.0, '6': 0.0, '7': 0.0, '8': 0.034482758620689655, '9': 0.0}

It's not exactly Benford's distribution. It doesn't span more than two orders of magnitude, but it's somewhat Benfordian, and it is what you might expect from a distribution of comments.


Now do lengths of comments.

Also, please don't report 16 digits when your data set only has 1 sigfig.


Some threads have hundreds of comments, so 3 sig figs. The single digits are just categories/labels, not data.


You're right, the insignificant digits were misleading.


ASCII bar chart or get out.


    i: 10  min: 0.000000  max: 0.689655  range: 0.689655  width: 70  scale: 101.500000
    1   0.69 **********************************************************************
    2   0.14 **************
    3   0.07 *******
    4   0.07 *******
    5   0.00
    6   0.00
    7   0.00
    8   0.03 ***
    9   0.00


Thanks for doing these on topics that cyclically come up. It's nice being able to see what has been previously said, in addition to seeing shifts in how topics are approached over time.


One could wonder why it isn’t automated.


The "past" link on topic posts will provide that.

Though it doesn't indicate how many previous discussions there were, or what discussion occurred. Dang's comments (or occasionally, those by others, I've been known to add similar myself) manifest that information into the thread in a more immediately tangible way.

That's ... an interesting property of UI/UX and simple perception that itself leads to some interesting questions. I've been looking at the notion of manifest vs. latent functions (Bronislaw Malinowski and Robert K. Merton), which implies the related notion of manifest vs. latent perceptions or awareness (I'm not finding any ... manifest ... discussion of this, though I'm sure it exists).

And there's a strong psychological element here.

Dang's moderation often hinges on this -- retitling of posts and detaching of threads all change the degree of manifestation of certain aspects of a topic. Dang's guidance is generally toward promoting productive discussion, and of avoiding derailment.

Which might start approaching a latent answer to your question ;-)


I wasn’t suggesting that Dan G’s entire job could be automated. Merely that scanning for historical matches with sufficient karma and posting an algolia link would be a few lines of trivial Python.

Clearly even the mere suggestion touched the community’s proverbial nerve. I wonder if Dan G has life aspirations beyond HN moderator and how the community shall survive without his touch.


And I wasn't shooting down the suggestion, but noting what parts of it are implemented, how that does/doesn't seem to work, and potential considerations in potentially automating the process.

Of dang's future aspirations I know not.


I don't think I've seen any kind of bots on HN.


I'm certain this was reposted because they talked about this law on today's Radiolab broadcast (a repeat of a previous episode): https://www.wnycstudios.org/podcasts/radiolab/episodes/91697...


Isn't there a bot for posting job threads every month?


Interestingly enough, it's also a decent fit for fundamental physical constants, as one can read off from a table of them. [0]

0: https://physics.nist.gov/cuu/pdf/all.pdf


That's interesting indeed. Any idea why physical constants obey Benford's law? Did the universe pick the physical constants from an exponential distribution?


In this particular case it's mostly reflecting our choices of units, and the fact that changing the units can change the constants by many orders of magnitude. Benford's law comes from a distribution which is invariant under that kind of change.

However, it's an interesting question whether the dimensionless fundamental constants have a deeper explanation, since they don't depend on unit choices. The fermion masses are roughly evenly distributed logarithmically, for example (which leads to Benford's law). There isn't a clear reason why.
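
A sketch of that invariance, assuming log-uniform samples as a stand-in for the constants: rescaling every value by an arbitrary unit-conversion factor leaves the leading-digit frequencies essentially unchanged.

  import random

  def leading_digit_freqs(values):
      counts = [0] * 10
      for v in values:
          while v < 1:     # shift into [1, 10) to read off the leading digit
              v *= 10
          while v >= 10:
              v /= 10
          counts[int(v)] += 1
      return [round(c / len(values), 3) for c in counts]

  values = [10 ** random.uniform(0, 6) for _ in range(10**5)]
  print(leading_digit_freqs(values))                       # close to Benford
  print(leading_digit_freqs([v * 3.281 for v in values]))  # metres to feet: still close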


Spent some time in college studying the use of this to detect election fraud. A really intriguing concept, but then the question becomes: what next? Is the statistical data enough to trigger a deeper investigation?


You are phrasing your question like it might be controversial. Why wouldn't 'statistical data' be enough to trigger a deeper investigation?


Everything in politics is controversial. "Let's feed kids more veggies in school lunches" somehow became socialism.


Where does Benford's law apply in elections? Not in vote counts (unless the set of precinct sizes is largely fabricated), nor in the insignificant digits of vote counts.


It is not straightforward, but look into the second-digit Benford's law (2BL) test.
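
For reference, the second-digit probabilities come from summing the two-digit Benford probabilities log10(1 + 1/(10k + d)) over the possible first digits k; a quick sketch:

  import math

  # P(second digit = d), summing over first digits 1..9
  for d in range(10):
      p = sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
      print('{} => {:.3f}'.format(d, p))

The curve is much flatter than for first digits (about 12.0% for 0 down to 8.5% for 9), which is one reason second-digit tests need more data.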


VSauce did a pretty fun video on the same topic (sort of)

https://www.youtube.com/watch?v=fCn8zs912OE


I thought I could try this really easily in Python.

Here's what I did:

Open IDLE and type:

   from random import seed
   from random import random
   seed(1)
You can then try typing random() and get values:

   >>>random()
   0.13436424411240122

   >>>random()
   0.8474337369372327

   >>>random()
   0.763774618976614
I actually forgot that random() doesn't just return an integer, like in C where rand() gives you a random integer. But these numbers looked great to me. I could easily imagine the first one, 0.13436424411240122, just being 13436424411240122, and so on. I thought it was a great way to get the first digit of random numbers of unlimited size, so I decided to go ahead on this basis.

So with the above basis, I want to look at the distribution of the first nonzero digit after the decimal point.

So next, put the following function in. (Basically it just gets a random number as above, but keeps multiplying by 10 as long as the first digit of the string conversion is a "0". This returns the first non-zero digit on the left.)

    def getone():
       floaty = random()
       while(str(floaty)[0] == '0'):
          floaty *= 10
       return int(str(floaty)[0])
This now gets us the leftmost digit, check it out:

   >>>getone()
   3
   >>>getone()
   2
   >>>getone()
   4

   >>>getone()
   2
All right. Now let's just do this a few million times and keep track of buckets. Make some buckets:

   buckets = [0,0,0,0,0,0,0,0,0,0] # buckets for: 0,1,2,3,4,5,6,7,8,9
This next line will run for up to a few minutes:

   for tries in range(10000000):
      buckets[getone()] += 1
Now print the results:

   for result in range(10):
      print (result, ":", buckets[result])
This gets me....

   0 : 0
   1 : 1112927
   2 : 1110821
   3 : 1109741
   4 : 1110793
   5 : 1111604
   6 : 1112178
   7 : 1111272
   8 : 1110402
   9 : 1110262
Clearly Benford's law does NOT apply here.

Why not?


Your experiments show that when we pick numbers from a uniform distribution, each digit from 1 to 9 is equally likely to be the leading significant digit. This is perfectly fine. Numbers picked from a uniform distribution indeed do not obey Benford's law.


Thanks for the analysis! I thought I must have made a mistake.


If we counted in hex, how would it vary? It appears to me to be "not much", because most things scale to be expressed as the first non-zero value in the number system times an arbitrary number of powers of the base. So... it's Fermi arithmetic: 0, 1, 10, 100 ...


In base b, the probability of the leading significant digit to be d (where 0 < d < b) is log_b(1 + 1 / d).

That's why for base 10, the probability of the leading significant digit to be 1 is log_10(1 + 1/1) = 0.301. In base 2, the probability of the leading significant digit to be 1 is, quite trivially, log_2(1 + 1/1) = 1.0. Of course, all numbers in binary must begin with the digit 1.
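
Tabulating that formula for a few bases, as a small sketch, also answers the hex question above: base 16 flattens the curve somewhat, but 1 still leads by a wide margin.

  import math

  for base in (2, 8, 10, 16):
      probs = ['{:.3f}'.format(math.log(1 + 1 / d, base)) for d in range(1, base)]
      print(base, probs)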


Benford's law works in any base.


Except base 1.


The law definitely has a human psychology component to it. When stock prices are close to, say, 1000, they will likely get pushed over to the psychological level.


There are a number of fun random generators online - worth googling.


Experimental evidence for uniform distribution:

  # uniform.py

  import random

  n = 10**4
  samples = 10**6
  counts = [0] * 10

  for i in range(samples):
      random_number = random.randint(1, n)
      first_digit = int(str(random_number)[0])
      counts[first_digit] += 1

  for i, count in enumerate(counts):
      print('{} => {:.3f}'.format(i, counts[i] / samples))
Output:

  $ python3 uniform.py
  0 => 0.000
  1 => 0.111
  2 => 0.111
  3 => 0.111
  4 => 0.111
  5 => 0.111
  6 => 0.111
  7 => 0.111
  8 => 0.111
  9 => 0.111
Experimental evidence for exponential distribution:

  # exponential.py

  import random

  n = 10**4
  samples = 10**6
  counts = [0] * 10

  for i in range(samples):
      random_number = 2 ** random.randint(1, n)
      first_digit = int(str(random_number)[0])
      counts[first_digit] += 1

  for i, count in enumerate(counts):
      print('{} => {:.3f}'.format(i, counts[i] / samples))
Output:

  $ python3 exponential.py
  0 => 0.000
  1 => 0.301
  2 => 0.176
  3 => 0.125
  4 => 0.097
  5 => 0.079
  6 => 0.067
  7 => 0.058
  8 => 0.051
  9 => 0.046


You chose N in such a way as to make your hypothesis true.


The variable 'n' has to be sufficiently large. The larger 'n' is, the more it agrees with Benford's law. For example, with n = 10 (as opposed to n = 10000 in my original code), we get results that disagree with Benford's law quite a bit:

  0 => 0.000
  1 => 0.301
  2 => 0.200
  3 => 0.100
  4 => 0.100
  5 => 0.100
  6 => 0.100
  7 => 0.000
  8 => 0.100
  9 => 0.000
In fact, the results for digits 7 and 9 are in complete disagreement with Benford's law. That's not surprising because there is no integer n between 0 and 10 such that 2^n has 7 or 9 as its first digit. Therefore, no matter how many trials we perform, the frequency of the first digit as 7 or 9 would always be 0. The first power of 2 for which we get 7 as the leading digit is 2^46. Similarly, the first power of 2 for which we get 9 as the leading digit is 2^53. So unless the powers are much larger than these numbers, we won't get frequencies for 7 and 9 that agree with Benford's law.
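
Those two thresholds are easy to verify with a quick sketch:

  for target in '79':
      n = next(n for n in range(1, 100) if str(2**n)[0] == target)
      print('first power of 2 with leading digit {}: 2^{}'.format(target, n))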

With n = 100, the experimental results become closer to Benford's law:

  0 => 0.000
  1 => 0.300
  2 => 0.170
  3 => 0.130
  4 => 0.100
  5 => 0.070
  6 => 0.070
  7 => 0.060
  8 => 0.050
  9 => 0.050
But with n = 1000, the results become even closer to Benford's law:

  0 => 0.000
  1 => 0.301
  2 => 0.176
  3 => 0.125
  4 => 0.097
  5 => 0.079
  6 => 0.069
  7 => 0.056
  8 => 0.052
  9 => 0.045


Instead of performing trials in an experiment, another way to look at this problem is to simply analyze the frequency distribution of the leading significant digit for a function that grows exponentially. Like before, let us choose 2^n as that function.

Here is the code:

  N = 10
  counts = [0] * 10

  for n in range(N):
      first_digit = int(str(2**n)[0])
      counts[first_digit] += 1

  for i, count in enumerate(counts):
      print('{} => {:.3f}'.format(i, counts[i] / N))
For N = 10, i.e., 0 <= n < 10, we get:

  0 => 0.000
  1 => 0.300
  2 => 0.200
  3 => 0.100
  4 => 0.100
  5 => 0.100
  6 => 0.100
  7 => 0.000
  8 => 0.100
  9 => 0.000
For N = 100, i.e., 0 <= n < 100, we get:

  0 => 0.000
  1 => 0.300
  2 => 0.170
  3 => 0.130
  4 => 0.100
  5 => 0.070
  6 => 0.070
  7 => 0.060
  8 => 0.050
  9 => 0.050
For N = 1000, i.e., 0 <= n < 1000, we get:

  0 => 0.000
  1 => 0.301
  2 => 0.176
  3 => 0.125
  4 => 0.097
  5 => 0.079
  6 => 0.069
  7 => 0.056
  8 => 0.052
  9 => 0.045
For N = 10000, i.e., 0 <= n < 10000, we get:

  0 => 0.000
  1 => 0.301
  2 => 0.176
  3 => 0.125
  4 => 0.097
  5 => 0.079
  6 => 0.067
  7 => 0.058
  8 => 0.051
  9 => 0.046
These results are consistent with the experimental results in my previous comment. As we can see, the larger N is, the more the frequencies of the leading digit obey Benford's law. Similar results can be reproduced with many other exponentially increasing functions too, e.g., 3^n, 5^n, Fibonacci numbers, Lucas numbers, and even factorials that grow faster than exponential functions.


You keep choosing N to be an integer power of 10. Make it a power of 2 or 7 or pretty much any number, and you'll see different results.


I do not see different results:

  N = 8192
  counts = [0] * 10

  for n in range(N):
    first_digit = int(str(2**n)[0])
    counts[first_digit] += 1

  for i, count in enumerate(counts):
    print('{} => {:.3f}'.format(i, counts[i] / N))
Output:

  0 => 0.000
  1 => 0.301
  2 => 0.176
  3 => 0.125
  4 => 0.097
  5 => 0.079
  6 => 0.067
  7 => 0.058
  8 => 0.051
  9 => 0.046
I still see results that agree with Benford's law.


I did something similar (except you're a better coder, haha), but when I looked up how to get a random number in Python, the first thing I saw was how to get one between 0 and 1, so I went with it. Here is my comment (very similar in spirit to yours):

https://news.ycombinator.com/item?id=22341416

Any comments?


Replied on your comment thread.



