Hacker News
A Bad Privacy Bug (blog.pinboard.in)
110 points by stilist on Aug 1, 2012 | hide | past | favorite | 46 comments



As much as anything else, this is another great example of MySQL letting you shoot yourself in the face by default. I also love that you can save "y" to an integer field without complaint. It seems obvious that accurate data validation and comparison would be the cornerstone of a successful relational database, yet MySQL defaults fail to provide any domain-relevant value whatsoever.


bin is not an integer field. It's a binary field - it can save anything.

And no, you can not save 'y' to an integer field, so your rant is misplaced.


I think tdavis is referring to this example from maciej's post, where inserting 'y' into a tinyint field silently converts it to 0 instead of throwing an error:

     mysql> create table demo(int tinyint(1), bin binary(1));
     mysql> insert into demo(int, bin) values ('y', 'y');
     mysql> select * from demo;

     +------+------+
     | int  | bin  | 
     +------+------+
     |    0 | y    |
     +------+------+


It's not silent:

    mysql> insert into demo values ('y', 'y');
    Query OK, 1 row affected, 1 warning (0.01 sec)

    mysql> show warnings;
    +---------+------+------------------------------------------------------+
    | Level   | Code | Message                                              |
    +---------+------+------------------------------------------------------+
    | Warning | 1366 | Incorrect integer value: 'y' for column 'n' at row 1 |
    +---------+------+------------------------------------------------------+
If you want this type of warning to be treated as an error then adjust the setting:

    mysql> SET sql_mode = 'TRADITIONAL';
    mysql> insert into demo values ('y', 'y');
    ERROR 1366 (HY000): Incorrect integer value: 'y' for column 'n' at row 1
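To make strict behaviour the server default rather than a per-session setting, the same mode can also go in the MySQL configuration file (a sketch; the file path varies by platform):

```ini
# my.cnf / my.ini (path varies by install)
[mysqld]
sql_mode = TRADITIONAL
```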


Which brings us right back to what tdavis said in the first place: "As much as anything else, this is another great example of MySQL letting you shoot yourself in the face by default."


You are aware that this whole thing is theoretical since he set the field to bin (i.e. binary)? No database at all would have prevented this.

MySQL chose to coerce fields to fit the type. You don't like that choice, but other people find it useful. And MySQL gives you a choice to change it if you wish.

Judging by how popular and successful they are, I think they chose correctly.


This article is a success story for MySQL because maciej is using MySQL.

This article is a failure story for maciej. MySQL may have chosen correctly for its own popularity (although that's debatable; this choice might not be a success factor) but it did not choose correctly for the success of its users, with this article being a clear example.


It's not theoretical. The type coercion MySQL allows by default, out of the box, is nonsensical. Sure, type coercion can be nice, but I'd prefer none at all over coercion that silently turns 'y' into 0 instead of failing.

Attributing their success to their choice of DB is insulting at best.


MySQL chose correctly, not maciej.

Coercing bin('y') to 0 makes no sense though - I agree with you there.


Those two statements conflict with each other, and I don't see how you reconcile them without placing blame on MySQL.


Every product has bugs.

The philosophy of attempting to work when possible (coercion and other things) is good.

They clearly messed up here, but it doesn't mean they should never try to coerce values.

Even the warning when you try it makes no sense (what double?):

   1292: Truncated incorrect DOUBLE value: 'y  '
It's clearly a bug, and not something fundamental to MySQL.


Sure, I think we agree. I agree it's a mistake here and should be corrected, I wasn't advocating removing type coercion by any means, nor do I think that this one-off (?) incident is worthy of making broad statements about MySQL.


MySQL's warnings are silent, in the sense that the web application will never act on them. Guessing instead of failing early is a poor choice for a relational database, and is only explained by MySQL favouring ease of use and porting over data integrity.
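That application-side silence can be fixed in application code, though: after each write, ask the server for its warnings and treat any as fatal. A minimal sketch (the stub cursor is a stand-in so the example runs without a MySQL server; a real DB-API cursor would be used the same way):

```python
class MySQLWarningError(Exception):
    """Raised when a statement succeeded but produced warnings."""

def execute_strict(cursor, sql, params=()):
    """Run a statement, then promote any MySQL warnings to a hard error."""
    cursor.execute(sql, params)
    cursor.execute("SHOW WARNINGS")
    warnings = cursor.fetchall()  # rows of (level, code, message)
    if warnings:
        raise MySQLWarningError(warnings)

# Stub cursor standing in for a real DB-API cursor, so the sketch is
# runnable without a MySQL server; it pretends the INSERT warned.
class StubCursor:
    def __init__(self):
        self._warnings = []
    def execute(self, sql, params=()):
        if sql.startswith("INSERT"):
            self._warnings = [
                ("Warning", 1366, "Incorrect integer value: 'y' for column 'n' at row 1")
            ]
    def fetchall(self):
        return self._warnings

try:
    execute_strict(StubCursor(), "INSERT INTO demo VALUES ('y', 'y')")
except MySQLWarningError as exc:
    print("refused:", exc.args[0][0][2])
```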


  Instead of automated test suites, I use checklists before deploying major changes,
  performing a series of actions (like creating and editing bookmarks)
  to make sure everything works as expected.
Somehow I always shiver when I read stuff like this. Some manual smoke testing is never bad, but I really can't understand how you can feel comfortable about not having automated test suites on (larger) projects.

Stuff like this is so easy to catch in a unit or integration test.
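For example, a minimal test of the privacy invariant (table schema and names are hypothetical, and sqlite3 stands in for MySQL here) would have caught this class of bug:

```python
import sqlite3

def public_bookmarks(conn):
    """Return the URLs that should be visible to other users."""
    return [row[0] for row in
            conn.execute("SELECT url FROM bookmarks WHERE private = 0")]

def test_private_bookmarks_stay_private():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookmarks (url TEXT, private INTEGER NOT NULL)")
    conn.execute("INSERT INTO bookmarks VALUES ('http://example.com/a', 0)")
    conn.execute("INSERT INTO bookmarks VALUES ('http://example.com/b', 1)")
    # Only the public bookmark may appear in a public listing.
    assert public_bookmarks(conn) == ['http://example.com/a']

test_private_bookmarks_stay_private()
```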


Genuinely curious - if you find a project in that sort of condition, how do you go about fixing it? Are there respected guides to introducing proper unit and integration tests to established projects (that may not be structured amenably to them)?


The first step is to create an empty test suite with at least one test that runs. This will mean that when you go to add a test later you have no barriers — you already have a suite. 

You then need to follow two rules. First, write tests for all new features. Second, add a failing test that exposes the bug before you fix any bug. 

By doing these two things you will end up testing two very important areas of your code — new code (which will always be buggiest) and code that has recently proven to be buggy. 


This is so incredibly important. So many people do not spend time upfront getting projects structured and set up for builds, testing, deployment and the appropriate automation for each. I should not be copying and pasting out-of-date scripts from random folders on a network share to build. Or to package something. Or grabbing an EXE from one place and a DLL from another to run my unit tests.

People don't like writing tests not only because it's not as fun as writing code, but also because it requires a different setup, configuration and flow to get them executing. Figuring that out and making it easy to execute those tests is important in getting people to want to write them.


  > Figuring that out and making it easy to execute
  > those tests is important in getting people to
  > want to write them.
This is so incredibly important. Anyone on your team should be able to run the test suite from the command line by invoking a single command. The easier and the quicker a test suite runs, the more people will run it.

Also, spending a day to setup a CI server (Jenkins, circleci.com or similar) is a good use of your time. You want to have the CI run the test suite after every commit to your mainline branches in SCM.

Another key aspect to testing is generating clear and concise error output. One thing I find very frustrating is when a CI build fails, but it is not clear why. I find most CI systems are very good at listing all the passing tests, or when a few tests fail in the expected way. When tests fail hard (exceptions thrown, processes failing to start or hanging) I find that I often need to go reading through build logs to find out what went wrong and why.


Essentially, my understanding of best practice is to write high-level functional tests for the features that appear to work and then use them to ensure there are no regressions as a result of your changes. Some people even define legacy code as "code without tests".

http://www.amazon.com/Working-Effectively-Legacy-Michael-Fea...


There were a couple Destroy All Software screencasts on that topic [1].

In a nutshell, you start by writing a lot of high level integration tests, which will serve you as some guarantee you don't screw up. Then you start refactoring, writing fine-grained unit tests along the way. Once you settle with code structure, you might want to start removing/rewriting test cases you wrote initially in case they seem redundant.

[1] https://www.destroyallsoftware.com/screencasts/catalog/untes... and further parts 2-4. This screencast is done mostly in ruby.


I find introducing high level integration tests to be much harder than introducing lower level unit tests as code starts getting modified. In theory, yes, that's the way to go, but in practice, you probably can't get there without significant refactoring.


> Stuff like this is so easy to catch in a unit or integration test.

Unit and integration tests are great, but you can't rely on them for everything. There is always a temptation to trust the tests instead of the real world, and that's where you get in trouble. The caption for this photo might as well have read 'all tests are green' http://25.media.tumblr.com/tumblr_m7v0yrMrJ21rzupqxo1_500.jp...


I agree, that's why I said that smoke testing is never bad and neither is a good QA guy.

But as a developer, unit and integration tests are what give me trust in what I ship. I wouldn't feel comfortable shipping something without having a proper test suite, let alone making a change to such a codebase.


  But I do this from my own Pinboard account, which has 
  superpowers. Specifically, it lets me see private bookmarks
  on all accounts. And since I already see everything, I don't
  notice when everybody else starts to see everything, too.
This is not okay.


I'm a drooling Pinboard fanboy, so I'm probably cutting him more slack here than I would for other services. However: it's been obvious to me since the beginning that "private" in Pinboard-context just means, "this bookmark is not publicly available on my account page." I've had absolutely no expectation that it wasn't accessible by a site admin -- although that would be nice too.

I can't think of a way to store retrievable data, like a bookmark, on his servers without leaving some way for him to access the data if he wanted to.


Yes, retrievable data is obviously accessible by a site admin somehow. But the way he's got things set up:

1) It's trivial for him to inadvertently see something deeply personal to someone just by browsing the 'recent' list or doing a search.

UPDATE: I overstated this one - Maciej let me know by email that he can only access private data on the search / recent page if he intentionally masquerades a user. He can only inadvertently see private data when viewing individual user pages.

2) If his account's ever compromised (let's hope he's not reusing that password elsewhere!) then someone else gets that ability as well, accessible from any browser anywhere.

It's one thing when you have to ssh into a server somewhere and do a SQL query to access someone's private information. It's another thing to set up your admin account so you're casually exposed to it.

I like Pinboard's service too, but this isn't remotely cool.


That's a pretty convincing argument you have there. I'll go along with that.


I'm a paying user and this won't make me stop using the service, but there should be a reasonable expectation of privacy here. Obviously the database has to store this information in a retrievable way, but to expose it so carelessly on a regular basis is completely unnecessary.


He recently said on Twitter that correcting that is on his to-do list.


As an admin, having the ability to masquerade as a particular user is invaluable for debugging & support.


this is... a hard choice. I'm not saying I know which way is right, but I think that either way, revealing to your users which choice you make is the right thing.

The choice, really, is "do I look at my user's personal stuff and have a good idea that my service is working for them" vs. "do I not look at my user's stuff, and have much less of an idea if the service is working for them."

(yes, yes, good automated tests are the right way to solve the problem, but good automated test are difficult, especially when a broad range of behaviors could be 'correct')

In my own case, there are two places where I'm erring on the "don't look at the user's data, even if it would allow you to provide better service" side. First? I don't mount a user's disk. It's a block device, as far as I am concerned.

As you might imagine, this makes backups way, way more difficult. Moves, too. It's possible that if I were to ignore this rule, I'd have enough spare disk bandwidth to do at least half-assed backups (I'm not doing any backups of customer data right now, which is pretty scary for me, because I'm in a business where if you lose the customer's data, you lose the customer in the worst way.)

Now, the right way to solve this problem is some kind of 'snapshot over the network' like zfs has, or some other mechanism that only transfers the change in block devices. Unfortunately, Linux has no such tools. (yes, yes, ZFS on linux is a possibility. So is a NAS. Rsync or something rsync-like won't work because the problem is not network bandwidth but disk bandwidth. I mean, it's a problem that can be solved, a better way, but solving it the better way is more work.)

Next, from a technical support perspective? I could provide dramatically better service if I logged all the serial consoles. Dramatically better service. but last time I asked, people seemed uncomfortable with it, so I don't. (the thing is, it's only easy to log all consoles or log none of the consoles; I'd have to spend time and effort building a 'log consoles except for users that opt out' mechanism, which is the right thing to do here, but due to time constraints that hasn't been done.) (to be clear, when a customer asks for help, or even if something happens with a domain and I'm not sure it's working, I look at the serial console as is. My job would be nigh impossible without serial consoles. Also, by default I log the serial consoles on dedicated servers, though unlike Xen, it is easy for me to opt people out of that. Conserver is the tool I use there.)

Personally, I think I made the right choice when it comes to block devices (I have had good luck with my RAIDs, and treating disks as block devices means my customers can use whatever freaky filesystem they like.) but I think I made the wrong choice on the serial console; there really isn't that much personal stuff that comes up on the serial console, and it gives me a lot of clues.

I guess what I'm saying is that from a testing perspective? being able to see the user's bookmarks has a lot of advantages. It's a valid choice; and he's sharing that choice with the customers, which is the right thing to do.

(the interesting thing here, of course, was that if his account couldn't see the user's bookmarks, then while usually that would have given him much less diagnostic information, it would have given him more diagnostic information this time.)


Of course it's a tradeoff, but there's only one choice guaranteed not to piss off your users. It's an easy decision. Don't look at private data.


Being down also pisses off your customers, so there is no 'guaranteed' choice here. (at least so long as it is a choice between good privacy and good testing.)

I think the key really is communication; letting the user know what you are doing and reading the feedback. It wouldn't occur to me that some people would think serial console logs are super private things, but turns out? many people think they are.

I mean, from that perspective, it seems like pinboard ought to take this feedback into consideration and maybe change that policy. I'm just saying that there were reasonable reasons why that original choice was made, even if it sounds like the decision may have been the wrong one.


I respect maciej more for the candor about this incident. My wife had private bookmarks exposed, but I recommended she continue to use the service. These particular mistakes are not ones I would have made myself, but the important thing is that he's examining them intelligently and taking steps to prevent their recurrence.


The schema should really be:

  private bit not null default 0
If you use something like tinyint and have a range of values that are acceptable, bake this into the database with a check constraint:

  myfield tinyint not null default 0 check (myfield between 0 and 3)
Regardless of how comprehensive your unit tests are, you should prevent as much bad data as you can at the DB layer.
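Worth noting: MySQL accepted but silently ignored CHECK constraints until version 8.0.16, so at the time of this thread the constraint above would have parsed and then enforced nothing. Belt and braces is to validate in the application layer as well (a Python sketch with a hypothetical column name):

```python
def validate_private_flag(value):
    """Accept only a literal 0/1 integer for the hypothetical 'private' column."""
    # Reject non-ints outright, and reject bools explicitly, since
    # isinstance(True, int) is True in Python.
    if not isinstance(value, int) or isinstance(value, bool) or value not in (0, 1):
        raise ValueError("private must be the integer 0 or 1, got %r" % (value,))
    return value

validate_private_flag(0)       # fine
validate_private_flag(1)       # fine
# validate_private_flag('y')   # would raise ValueError instead of becoming 0
```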


I don't think I saw it mentioned anywhere, but another thought that occurs to me is it might have been initially useful when he created the table to have the "private" field actually be called "public" instead.

The idea being that if the field is likely to be initialized to a default, you want the safer default. (IE, in that case things would be created as accidentally private, rather than as accidentally public)

Kudos to the author for the transparency. I actually have never used pinboard, but that blog post actually makes me more likely to use it rather than less.


Man, the more I see and experience things like this in MySQL, the more impressed (horrified?) I am at how it ever became popular. So glad I've made the switch to Postgres.


This isn't really a MySQL issue though. He used bin instead of int, and MySQL did exactly what he asked - stored the raw binary data as is.


Why are rows with 'y' in them being returned when he asked for that value equal to zero?


No idea. I would expect it to convert the 0 to a binary format (since that's the format of the column) but it doesn't.

It does raise a warning:

   1292: Truncated incorrect DOUBLE value: 'y  '
   1292: Truncated incorrect DOUBLE value: 'y  '
(Yes, twice.)

Which makes no sense to me.


> So even though the row has (from our perspective) a non-zero value, MySQL sees it as matching zero.

What? Why does MySQL return the row with 'y' in it (in a 'binary' column) when you ask for "where row = 0"? wtf is going on here?

That seems to be the real culprit here. I assume it's MySQL's 'helpful' conversion routines again, but I'm just assuming, I don't understand the details.
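The details are in MySQL's comparison rules: when a string is compared with a number, the string is cast to DOUBLE by reading whatever numeric prefix it has, and 'y' has none, so it becomes 0.0 with a "Truncated incorrect DOUBLE value" warning - hence 'y' = 0 is true. A rough Python imitation of that cast (an approximation of the documented behaviour, not MySQL's actual code):

```python
import re

def mysql_string_to_double(s):
    """Mimic MySQL's lenient string-to-DOUBLE cast: read the longest
    numeric prefix, falling back to 0.0 when there isn't one."""
    m = re.match(r'\s*[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?', s)
    return float(m.group(0)) if m else 0.0

print(mysql_string_to_double('y'))      # 0.0  -- so 'y' = 0 compares true
print(mysql_string_to_double('3abc'))   # 3.0  -- so '3abc' = 3 compares true
```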


If only other companies would handle issues with this level of transparency...


So you browse "private" bookmarks of your users on a daily basis...


I was affected by this. I received the email notification from Maciej and was offered compensation for the mistake. I declined the compensation and instead upgraded my account to include archival service. I did this partially because I've been considering the upgrade for a while, but mainly because I was so impressed by the way the incident was handled.

Keep in mind that I don't have state secrets bookmarked in Pinboard. I'm just an anti-social type of user as their tag line notes and I appreciate how Maciej runs his business. 


I recently watched this video which is mostly about Postgres but also pokes some fun at MySQL's weirdness. I didn't realise how scary MySQL can be.

http://vimeo.com/m/43536445


Implicit type conversion can be evil: http://bhati.livejournal.com/3352.html



