This is a really common web app flaw as well; there's almost always something your app does that reveals how many customers you have (or how much business you're doing) to any stranger with an account.
You can use it to your advantage too. Start your IDs off at a big number, and the first few savvy customers who see their ID will think you're a lot bigger than you really are.
So instead of knowing they're user #8, they think they're user #14221, and are a lot more likely to trust you with their money.
It works well for throwing off the competition too. "Holy Crap! They have 150k users and they're only 3 months old! We're screwed!!!"
When I started consulting, I initially made the date part of the invoice number, e.g., 04100501 was the first invoice sent on October 5th of 2004. This sidestepped the problem of showing customers how many invoices I had already sent. But it did show them how many I sent that day, which was usually not many.
I eventually switched to a sequential monotonic invoice numbering scheme for other reasons, mostly because it made it easier on the webapp I use for invoicing.
yymmdd<seq-#> is monotonically increasing. Though you could also just combine the first 6 or so digits of unix time and tack on a sequence number to the end of that.
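For what it's worth, a quick sketch of both schemes in Python (the function names and the per-day counter are just made up for illustration):

    from datetime import date
    import time

    def date_based_invoice_number(seq_today):
        # e.g. on October 5th, 2004 the first invoice would be "04100501"
        return date.today().strftime("%y%m%d") + "%02d" % seq_today

    def unix_time_invoice_number(seq_today):
        # first 6 digits of the unix timestamp plus a sequence number;
        # also monotonically increasing, just less readable
        return str(int(time.time()))[:6] + "%02d" % seq_today

Either way you still have to track the per-day (or per-period) sequence number somewhere, e.g. in the invoicing app's database.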
I had always understood it to be the other way around. Companies won't accept checks below a certain number, so you just start your check run at an "acceptably high" number.
Perhaps your comment was meant to be light-hearted and I am just being pedantic, but all an interested party has to do is come back a few more times over the course of a few days and note the values of new IDs. At that point they know the rate at which you are (adding new customers/adding new content/whatever) and can project backwards from there (completely ignoring the absolute value of the initial ID.)
Would a repeat customer really care what your rate is? If this is about impressions, the fact that they are returning indicates that you made a good enough impression to far overshadow your invoicing rate.
What a coincidence! Just today I used the sequential ID numbers of a company's deployment to work out their user numbers; it's interesting to compare things like that with reported figures. I wonder what the best approach is for protecting those numbers. Maybe something like Twitter, where they take a chunk of numbers and then stagger deployment through them (if I remember correctly; I'm not all that familiar with their system).
One thought I've had is encrypting sequential IDs with DES. The block size is 64 bits which conveniently maps to int types and the token you get is half the size of a UUID.
DES' key is short enough to brute-force (see below for time/cost). It's not tremendously difficult to obtain some output samples: make the first account, you now know DES(key, 0), where key is the actual key used by the site. Then run through all 2^56 keys k until DES(k, 0) = DES(key, 0); for some extra assurance, you could also check DES(k, 1) = DES(key, 1) by making another account.
Once you have key, the proposed scheme is no better than sequential account numbers.
This gets slightly more difficult if you don't know your account id; in that case, simply create a couple of accounts immediately after each other (script it), and check whether DES(k, i) = DES(key, counter), DES(k, i + 1) = DES(key, counter + 1) etcetera, where again key is the real key, counter is the real counter at the time of creation of the first account. You now have to bruteforce counter (i) as well as key (k), but that's still doable.
Brute-forcing DES is not easy on a desktop, but http://www.sciengines.com/copacobana/faq.html offers a $10,000 off-the-shelf solution that can break DES in 8.7 days (source: their presentation at CHES 2006). Note that this uses components from 2006 and that it's easy to trade speed for cost (i.e. just buy half as many boards).
The above is the most obvious attack; I'm in no way saying there are no others. It's not impossible to make schemes like the above work (e.g. using the pair (AES(key1, id), HMAC(key2, AES(key1, id))) and a good library); but even the scheme proposed is more complex than just picking random account numbers.
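For completeness, here's a rough Python sketch of that encrypt-then-MAC idea, using the third-party cryptography library (the key handling, names, and encoding are illustrative only, not a vetted design):

    import hmac, hashlib, os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    ENC_KEY = os.urandom(16)   # in a real app these would be fixed, stored secrets
    MAC_KEY = os.urandom(32)

    def token_for_id(account_id):
        # AES on one 16-byte block holding the sequential id, then a MAC over the ciphertext
        enc = Cipher(algorithms.AES(ENC_KEY), modes.ECB()).encryptor()
        ct = enc.update(account_id.to_bytes(16, "big")) + enc.finalize()
        tag = hmac.new(MAC_KEY, ct, hashlib.sha256).digest()
        return (ct + tag).hex()

    def id_for_token(token):
        raw = bytes.fromhex(token)
        ct, tag = raw[:16], raw[16:]
        if not hmac.compare_digest(tag, hmac.new(MAC_KEY, ct, hashlib.sha256).digest()):
            raise ValueError("bad token")
        dec = Cipher(algorithms.AES(ENC_KEY), modes.ECB()).decryptor()
        return int.from_bytes(dec.update(ct) + dec.finalize(), "big")

But even then, generating a random account number at signup and storing it on the row is simpler and has fewer ways to go wrong.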
Surely for a startup this sort of information is not worth $10,000? It just seems like you've got so many more important things to worry about than whether someone's going to spend thousands of dollars figuring out how many users you have (which is likely an effect, not a cause, of any competitive advantage).
If Moore's law holds and you're willing to wait a month instead of ~9 days, it costs <$1500 in hardware, and you can keep/resell that. Or, again, rent time. Or just use a distributed.net-esque approach and bruteforce using (a lot of) standard PCs. Or...
Seriously, "people aren't going to hack us" is the new "premature optimization is the root of all evil": a convenient excuse to do it horribly wrong.
I wouldn't call that a "good" solution per se, it's highly vulnerable to pre-computed table cracking (or merely discovering and making use of the algorithm).
You're pretty much stuck with doing that for invoicing, at least in Europe, where invoice numbers must be sequential. OK, you'll have to spend some money to receive an invoice. But then, at least in the UK, limited companies' accounts are effectively public anyway.
I had an idea for an online business a few years back that I never really pursued, but I did work out a scheme to assign key objects random ids so that when those ids appeared in a URL, you couldn't tell how many of those had been created. I keep meaning to do a write up, but, until now, I've never heard anyone else notice the potential to leak information, so I just assumed that no one else cared.
The django-extensions project provides several simple ways to avoid this - just make URLs reference UUIDs or slugs. I imagine RoR has something similar.
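If you'd rather not add a dependency, stock Django can do the UUID half of this on its own; a minimal sketch (the model, field, and view names are just examples):

    import uuid
    from django.db import models

    class Order(models.Model):
        # opaque public identifier used in URLs instead of the sequential pk
        public_id = models.UUIDField(default=uuid.uuid4, editable=False, unique=True, db_index=True)

    # urls.py:  path('orders/<uuid:public_id>/', views.order_detail)
    # view:     get_object_or_404(Order, public_id=public_id)

The sequential primary key stays around for joins and internal use; it just never appears in a URL.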
For RoR, you can override to_param on the model in question, something like:
class User < ActiveRecord::Base
  def to_param
    self.login
  end
  ...
end
This assumes self.login is reliably unique. I've found this can still have some follow-on effects that need to be handled in a situation similar to the above. There are also some heavier-weight, more feature-filled options, of course.
Some of the heavyweight ones are nice as far as human readability goes. Consider the AutoSlugField (in Django). Say you have an object, Item(name='Phillip Lim Pleated printed silk-chiffon dress', description='...'). A slug gives you a URL like /items/phillip-lim-pleated-printed-silk-chiffon-dress/, whereas a UUID gives you something like /items/f81d4fae-7dec-11d0-a765-00a0c91e6bf6/.
Funny! I've come to associate slug-type URLs (like the former, with no numbers or anything in the URL) with content-free websites that have been SEO'd. The latter is also scary though, because it's too MANY random-ass numbers. I have sort of come to expect a combination of a numeric identifier and readable words for reputable sites.
The second link, although ugly, is less "guessable" if you want to skimp on the authorization check. That aside, sometimes an entity might not have a natural key that can be slugified (e.g. an invoice), in which case a UUID is better. Also, generating/looking up a UUID might be faster?
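For the slug side of that tradeoff, the Item from the example above might look roughly like this with django-extensions (assuming its AutoSlugField and populate_from behavior; details are from memory):

    from django.db import models
    from django_extensions.db.fields import AutoSlugField

    class Item(models.Model):
        name = models.CharField(max_length=200)
        description = models.TextField()
        # 'Phillip Lim Pleated printed silk-chiffon dress'
        #   -> 'phillip-lim-pleated-printed-silk-chiffon-dress'
        slug = AutoSlugField(populate_from='name')

The URL pattern then looks the item up by slug instead of by id.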
On the other hand, in many cases this information is revealed to no apparent ill effect.
Look at Linode: a display of how many servers they have is publicly available. They are giving away how much business they do in a day, but is that hurting them? If anything, I think it helps them.
Several other popular companies here on HN post their revenue numbers publicly.
So yeah, while you should be aware of such things, and make a conscious choice, well, for many of us, this isn't data that really needs to be kept secret.
Worse yet, this sort of behavior can make security flaws much worse. Imagine a comparatively minor security flaw that allows an attacker to view otherwise secret information for customers, orders, members, etc. If you send plain-text sequential ids around then it becomes trivial to exploit that security hole to gain access to a lot of juicy data. However, if you hash or obfuscate every identifier then the problem for the attacker becomes much, much harder, since now the search space is vastly larger.
Defense in depth, belt and suspenders, and all that.
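One cheap way to get that extra layer with nothing but the standard library is to derive an opaque public identifier from the internal one and only ever expose that. A rough sketch in Python (the secret and the names are made up; in practice the secret comes from config, not source):

    import hmac, hashlib

    SECRET = b"long-random-server-side-secret"   # illustrative only

    def public_id(internal_id):
        # deterministic opaque identifier; store it on the row and look rows up
        # by it, instead of exposing the raw auto-increment value
        return hmac.new(SECRET, str(internal_id).encode(), hashlib.sha256).hexdigest()[:20]

Now an attacker who finds an authorization hole still has to guess an 80-bit string per record rather than just trying id+1, id+2, and so on.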
Obviously the company and the nature of the site will dictate whether leaking that information is considered a "flaw" and how serious it is.
Depending on how user sessions are tracked, being able to predict other valid "user ids" based on your own is an important first step to attacking other accounts.
It isn't unusual to find other flaws that will enable you to pull more (potentially sensitive) information about users or even "impersonate" users when armed with knowledge of someone else's valid user id.
Non-public companies certainly don't have many obligations to publish information on the number of customers they have, the number of transactions they are doing, etc. Even public ones won't break a lot of that out.
One of tptacek's strangers (competitors?) being able to tell how many paying online subscribers a newspaper has signed up would probably make someone in management squirm.
Likewise with being able to tell how many transactions an Internet Banking application is pumping.
I found out about the story via Twitter. Someone I follow (cited in the original post here) mentioned it as an aside in a blog post of his.
I write a free daily email newsletter of interesting things I come across -- that's at http://dlewis.net/nik. The original post here is the issue from, I believe, Tuesday, but it's been a long week :)
I submitted the link to reddit (http://www.reddit.com/r/todayilearned/comments/dstpi/til_tha... -- please pardon my misuse of "reverse engineered") and it went over very well, hitting either the front page or the second page. So a lot of people noticed it. I'm not surprised there are other submissions. I've seen pickup across a number of sites.
The title and timing?
Well, the title was straightforward -- I just took the one that worked so well on reddit and whittled it down to 80 characters.
The timing was accidental. I was queuing up today's issue late last night my time, and I was crediting the guy behind http://hackernewsletter.com for tipping me off to the story. Then I had an "a ha" moment and figured, hey, this would be welcome on HN, too. So I submitted it.
Just a nitpick, but the article didn't explain how this piece of intel gave them a big strategic advantage, it just explained that they got this piece of intel.
Had to look this up on Wikipedia for a more thorough explanation.
My impression is that it is a strategic advantage of course, but not a "big advantage". Knowing that the enemy is weaker than expected is good, but it's not directly useful like decrypting Enigma communications or acquiring research material. If it really affected the war, I would have liked to see why.
Here are a few possibilities of the advantages afforded by knowing an enemy makes 255 tanks/month:
- You may figure it's worth your time to go out of your way to bomb a few extra tank factories, as the payoff will be 5.5x greater than it would if they were making 1400 tanks/month.
- You may save money on R&D. Instead of devoting tons of resources to anti-tank weaponry, your more realistic estimate will lead you to spend money countering more important threats.
- You'll have more realistic estimates of the number and type of German forces at each battle, and thus will be less likely to over (or under) commit your own forces, and also less likely to bring ill-suited weaponry and tech.
- You'll be less likely to avoid or back down from winnable battles you would have previously considered unwinnable.
Etc. Fighting a war (or doing anything for that matter) without proper intelligence turns it into a guessing game. It's always to your advantage to know rather than guess :)
One thing to remember is that German tanks generally outclassed the western Allies' tanks by the time of D-day. The M4 would lose going one-on-one with almost any of the main-line German tanks, unless it had very favorable terrain.
The number of Panthers that Germany could field in the west mattered a great deal to the success of the D-day invasion. Had the Allies had solid intelligence that Germany had several times as many Panthers and Tigers, it's entirely possible that they would have done D-day completely differently.
I intentionally left it out -- it would have made the email too long. The fact that it's valuable, I think, is obvious (450 tanks versus 1400). How valuable? Not as important.
For the same reason, I also left out that the Allies used the same mathematical tools to estimate supply lines. If you're seeing a "rate of production" in Berlin of 100 units/mo, and for the same unit type, only 25 units/mo in Paris, that's probably evidence that it's taking a really long time for that unit to get to Paris.
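If you want the gist of the math without clicking through: it's the classic "German tank problem" estimator. With k captured serial numbers whose maximum is m, the estimate of the total produced is m + m/k - 1. A quick sketch in Python (the sample serial numbers are made up):

    def serial_number_estimate(serials):
        # frequentist estimate of population size from observed serial numbers
        k = len(serials)
        m = max(serials)
        return m + m / k - 1

    # e.g. five captured tanks with these serial numbers
    print(serial_number_estimate([61, 19, 56, 24, 16]))   # ~72.2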
"the Germans produced 255 tanks per month -- a fraction of the 1,400 estimate produced by conventional intelligence. (Want to see the math? Click here.) And it turns out, this method worked best: after the War, internal German data put the number at 256 tanks per month. "
Obviously, we were counting from 0. Stupid off by one errors. :)
Your tables almost always have a better unique key to use as a primary key than an auto-increment column (which sorts your table by insertion order, almost always useless outside of a blog).
Getting out of the habit of beginning the design of your tables with ID int autoincrement is a good thing. Even better if web frameworks stopped depending on it and setting it as the default primary key and ID used in views/routes.
Autoincrement is good sometimes: we had some problems with slow inserts in a production SQL server.
It turned out that the problem was that the primary key was set up as a clustered index, and we were doing inserts in a non sequential way, so the server had to rearrange the B-tree every time a row was inserted.
So yes, sorting your index sequentially can have important performance benefits (at least for Microsoft SQL Server and clustered indexes, and a table with lots of inserts).
Edit: I read the above link; it only says not to use a URL with identifiers (I can agree with that), and that MySQL with InnoDB has some problems with clustered indexes. The title is very misleading.
The article linked here makes a bad case against auto-incrementing. In fact, I take the opposite view: every record should have an ID that has nothing to do with the data being stored. That is, the data in the record should not double as the (in the case of MySQL) primary key.
Take the traditional users table.
username
password
email
You might suggest that username can be the key. And sure, it will be a unique key. But, internally, I don't want to operate on the username. I'd rather operate on an internal value that is not related to the rest of the record. So I add an ID.
This ID doesn't have to be exposed to the user. However, should it ever come to pass that I need to let users change their username, I can do so easily by editing a single entry's field.
Basically, your application logic shouldn't impact your business logic. Referencing record 12345 should always reference the same record, regardless of what else is changed.
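A tiny sqlite illustration of that point (the table and column names are just for the example): the order row below keeps pointing at the same user even after the username changes, because it references the surrogate id rather than the username.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE users  (id INTEGER PRIMARY KEY, username TEXT UNIQUE, password TEXT, email TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id), total REAL);
    """)
    db.execute("INSERT INTO users (username, password, email) VALUES ('alice', 'hash', 'a@example.com')")
    db.execute("INSERT INTO orders (user_id, total) VALUES (1, 9.99)")

    # the user renames herself; nothing that references user id 1 has to change
    db.execute("UPDATE users SET username = 'wonderland' WHERE id = 1")

    print(db.execute(
        "SELECT users.username, orders.total FROM orders JOIN users ON users.id = orders.user_id"
    ).fetchall())   # [('wonderland', 9.99)]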
Maybe. If he was, it wasn't clear, especially with "Your tables almost always have a better unique key." I took that as meaning using some element of the record.