Nice. Here's a similar personal story with a PSA that sometimes blurring is NOT sufficient.
A friend of mine posted on Instagram a picture of a U.S. visa (or something similar; it was probably five years ago) to announce her trip to the U.S., and she took care to blur out sensitive information such as her passport number. But a Gaussian blur is easy to reverse and I successfully unblurred it and told her my discovery. I didn't use any specialized software; it was just Mathematica with its built-in ImageDeconvolve function with guessed parameters for the Gaussian kernel.
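For anyone curious, the same trick can be sketched without Mathematica: a Wiener filter in plain NumPy inverts a known (or guessed) Gaussian kernel. This is a toy sketch under strong assumptions (noiseless, circular blur, correctly guessed kernel size and sigma); a real screenshot with JPEG noise and an unknown kernel is harder, but the principle is the same.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 2-D Gaussian kernel (the blur we guess)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def embed(kernel, shape):
    """Place the kernel in an image-sized array, centred at (0, 0) for the FFT."""
    out = np.zeros(shape)
    kh, kw = kernel.shape
    out[:kh, :kw] = kernel
    return np.roll(out, (-(kh // 2), -(kw // 2)), axis=(0, 1))

def blur(img, kernel):
    """Circular Gaussian blur via FFT -- our model of what the editor did."""
    H = np.fft.fft2(embed(kernel, img.shape))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * H))

def wiener_deconvolve(blurred, kernel, eps=1e-6):
    """Wiener filter: multiply by conj(H) / (|H|^2 + eps) in frequency space."""
    H = np.fft.fft2(embed(kernel, blurred.shape))
    G = np.conj(H) / (np.abs(H) ** 2 + eps)
    return np.real(np.fft.ifft2(np.fft.fft2(blurred) * G))

# Demo: a white square on black, blurred, then mostly recovered.
img = np.zeros((32, 32))
img[10:22, 10:22] = 1.0
k = gaussian_kernel(7, 1.5)
blurred = blur(img, k)
recovered = wiener_deconvolve(blurred, k)
```

In practice you don't know sigma, so you sweep plausible (size, sigma) pairs and eyeball the outputs; that is what "guessed parameters for the Gaussian kernel" amounts to.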
I personally recommend blacking out (add a black rectangle) instead of blurring, and if it is a PDF, convert to an image afterwards because too many PDF editors use non-destructive operations to add a new object instead of changing what's underneath.
Your advice is good, and I agree that you didn't use specialized software to reverse the blur, but this
> I didn't use any specialized software; it was just Mathematica with its built-in ImageDeconvolve function with guessed parameters for the Gaussian kernel.
is one of the most HN comments I've come across recently :)
> is one of the most HN comments I've come across recently :)
That gave me a laugh. I don't have any experience with Mathematica, but every time I see it mentioned (usually on HN) I'm amazed at the sheer breadth of what the system is capable of. The number of use cases and possibilities blows my mind.
The other answers are also very clever and interesting. There are quite a few ways to determine whether the goat is up or down, and some are very simple.
If it is in the installable version now, it will be in Wolfram Alpha in 5 years if you can guess the right command, and in 10 years Wolfram Alpha will just automatically select the blurred part and make a fake unblurred version of the JPEG.
> I personally recommend blacking out (add a black rectangle) instead of blurring, and if it is a PDF, convert to an image afterwards because too many PDF editors use non-destructive operations to add a new object instead of changing what's underneath.
We had a similar issue in Australia as well.
Politicians' phone bills are published on the government website in summary form.
Someone in 2017 decided to blank out their phone numbers by changing the phone number text colour to white (same as the background).
End result: hundreds of politicians and former prime ministers had their phone numbers leaked.
I used to work in IT for a state based police force in Australia. Traffic reports can be requested by those involved in traffic accidents, which includes parties to the accident and their details.
People used to be able to get the personal information of police officers if they were involved, intentionally or not, in a traffic accident with a police car. They would request for the traffic accident report, and that included the personal information (including home address) of the police officers in the car. I was in QA and I tested the change when it was fixed. It now includes the address of Police HQ when a police officer is involved in a traffic incident.
With Gaussian kernels, besides deconvolution you can sometimes also dictionary-attack them, if you have the original font and the kernel is properly normalized (i.e. most Gaussian blurs).
Although I haven't tried, I think there may even be neural-network-based techniques that perform better than a dictionary attack.
Separately, if the image editing tools added sufficient random noise to their mosaic filters they might be able to thwart most of these attacks, or at least make them significantly harder.
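A toy version of the dictionary attack, under the stated assumptions (known font, known normalized kernel): render every candidate string, apply the same blur, and keep the candidate whose blurred rendering is closest to the target. The 8-sample "font" and 4-digit secret below are made up for illustration.

```python
import numpy as np
from itertools import product

# Made-up 1-D "font": each digit is a fixed 8-sample intensity pattern.
FONT = {
    "0": np.array([1, 1, 1, 0, 0, 1, 1, 1], dtype=float),
    "1": np.array([0, 0, 1, 1, 1, 1, 0, 0], dtype=float),
    "2": np.array([1, 1, 0, 0, 1, 1, 0, 0], dtype=float),
}

def render(s):
    return np.concatenate([FONT[c] for c in s])

# Normalized Gaussian kernel -- the same one the victim's editor used.
x = np.arange(-4, 5)
kernel = np.exp(-x**2 / (2 * 1.5**2))
kernel /= kernel.sum()

def blur(sig):
    return np.convolve(sig, kernel, mode="same")

secret = "2101"
target = blur(render(secret))      # the published, blurred "image"

# Brute-force every candidate string and keep the best match.
guess = min(("".join(p) for p in product(FONT, repeat=len(secret))),
            key=lambda s: np.sum((blur(render(s)) - target) ** 2))
```

Real text has anti-aliasing, subpixel positioning and compression noise, so a real attack scores candidates rather than expecting an exact hit, but the blur being deterministic is what makes the search work at all.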
Interesting, thank you for the link. I had a hunch this should be possible but I wasn't aware that it was already proven. I used a similar trick on image recognition: turn images into a single 32 bit word by heavy pixelation and then look up a matching description. It's interesting how often that will work once you feed it with enough data. After all, that gives you 4 billion inputs mapped onto 4 billion descriptions, and plenty of those will contain the Eiffel tower with various cloudy backgrounds apparently recognized perfectly.
It's a total cheat but it is funny how close that can get you to something that might be actually useful.
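The pixelation-to-32-bit-word trick above can be sketched like this: shrink the image to a small grid of block means, quantise each mean to 4 bits, and pack the nibbles into one 32-bit key for a dictionary lookup. The 2x4 grid size and the descriptions are invented for the example.

```python
import numpy as np

def pixel_hash(img):
    """Heavy pixelation: a 2x4 grid of block means, 4 bits each -> one 32-bit key."""
    h, w = img.shape
    blocks = img.reshape(2, h // 2, 4, w // 4).mean(axis=(1, 3))  # 2x4 grid
    levels = np.clip((blocks * 16).astype(int), 0, 15)            # 8 nibbles
    key = 0
    for v in levels.ravel():
        key = (key << 4) | int(v)
    return key

# Toy "database": key -> description, built from labelled images.
img_a = np.linspace(0.0, 1.0, 64).reshape(8, 8)   # stand-in photo A
img_b = 1.0 - img_a                                # stand-in photo B
table = {pixel_hash(img_a): "Eiffel Tower, cloudy",
         pixel_hash(img_b): "something else entirely"}

description = table[pixel_hash(img_a)]
```

As the comment says, it's a cheat: the hash has no tolerance at all, so only an image seen before (or one quantising to the same key) gets a description. With billions of labelled inputs, collisions start doing the "recognition" for you.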
I wonder if you could use adaptive optimal kernels (AOK) [0]? I had used this for work on multiphase flow recognition from electrical capacitance tomography (ECT) as a proxy for void fraction. We wanted to tinker with time-frequency representations.
Yes, that is cool. I had just come back from an internship in Wireline at Schlumberger, where I was exposed to tools like one that did nuclear magnetic resonance (NMR) thousands of metres below ground. Pretty sweet tech. I transitioned to ECT for that project, then ECG for anomaly detection on anonymized hospital patient data. I will never underestimate the effect hair and sweat have on data. That was a cool year, with lessons that served well later.
I once had to provide my employer copies of court documents proving something or other in order to qualify for the benefits plan I was attempting to enroll in. The part of the document that contained the info they required also contained other information I did not want them to have, and I was more than irked at having to do this in the first place. I used Photoshop to draw a 99%-black box as the redaction, but then typed a nasty little message in a 100%-black font color. Nobody was ever going to see it, but just knowing that if they did, it would be a shock. I qualified for the package.
> and if it is a PDF, convert to an image afterwards because too many PDF editors use non-destructive operations to add a new object instead of changing what's underneath.
You'd be surprised at how many times this happens on Government documents with redaction.
If it is hard to get wrong, is it still dumb? Being able to verify with your own eyes that the redacted parts are indeed redacted is a pretty strong benefit to that process. You'll need to train staff to properly black out stuff (no idea what they do, heavy cardboard cut-outs or cutting out the censored content and using a black background for the scan?), but once that process is in place, it works.
With software you either need vetted and approved, very expensive software, or you have to accept a much higher error rate, because the operator cannot verify the results of the process with certainty.
I think the correct solution is a machine that prints out both a human- and machine-readable representation of the vote. The voter can confirm that the human-readable representation is correct, and you can randomly hand-count a few boxes of ballots to check that the hand-count matches the machine-count.
An election doesn't need to be tamper-proof; we just need to be able to detect tampering well enough to make tampering a loser's game.
You could do such a hybrid system, but honestly purely paper based systems seem to work well enough in practice. Eg Germany uses paper and human counting, and the results are usually available fairly quickly.
The problem with randomly hand-counting a few boxes of ballots is that you then need to convince people that the random selection was uniform and fair and actually random.
There are methods to do that, but they are at least as complicated and full of cryptographic finesse, so they ain't simpler than vetting an electronic voting system in the first place.
Having said that: human counting isn't foolproof and is still open to abuse and tampering.
It's mainly that any village idiot can in theory audit the human-run system, and that it would take a conspiracy with lots of people to engage in widespread tampering.
The more people involved, the harder it is to prevent leaks.
Right, otherwise the problem would be trivial. If it wasn't clear, the plan was that the printed ballot would anonymously go into a box to be machine-counted.
Yup, but they can do so with old-fashioned paper ballots too. Any security measures for paper ballots will also work with my idea, and the machine could also do fancier things like printing out a timestamp and a signature of the timestamp. I really want things to be simple, though: if the system of voting is too complex, then it will be distrusted, and distrust in the voting system is toxic to democracy.
What they can't trivially do with any system including paper ballots is remove ballots. Compare that to digital voting machines, where you can add e.g. -100 votes to candidate A and 100 votes to candidate B, thus ensuring that the total-votes field is correct while advantaging candidate B -- this was actually demonstrated by a security researcher on a Diebold touch-screen machine.
FOIA reports usually have a small textbox over the redacted information with a reference to the reason for redaction, likely made in Adobe PDF. Then the docs are either printed and scanned or just converted to an image only PDF.
Then they use the big multifunction networked printer’s built-in scanner, which saves a copy to the “little” hard drive they all tend to have in them now, and forget to ensure these things get wiped/destroyed... Years later they sell the printer once the lease ends, and the surprise inside is months to years of raw scanned documents the new owner gets access to with very little effort.
Why don't they convert the PDF to an image and convert it back? This approach seems a lot more efficient, and less prone to other types of human error (e.g. a missing page). Is there still an attack vector?
The Japanese train system utilizes similar concepts, IIRC. When I first read about this I was astonished at how effective it was [0] (up to 85% error reductions)!
If you do that, look at the document, hit CTRL+Z, then look at the document again, it will likely look identical, thanks to the fact that rendering a PDF to a JPEG with 70-90% quality... at ~600DPI... then scaling it back out to a 75-150DPI screen... is going to look visually lossless.
So, not only do you have the energy-investment thing noted in the/a sibling comment, you have the issue that there's no giant "THIS IS AN IMAGE" or "THIS HAS TEXT IN IT" that you can just Look At and know that yeah the document is okay. There's no lowest-common-denominator provability thing. You have to hyperspecifically know what to look for (render to image) then know how to verify whether it's an image or not.
And... how do you verify if it's an image? I don't have any PDF authoring/editing software on this machine, so the only thing I can think of is checking the Undo menu for "convert to image" or similar.
There will be no CTRL+Z, as the conversion can only save to a new document (just like scanning).
Under the hood, you create a new document, rasterize the original document page by page as JPEGs, and insert the JPEGs into the new document.
You can even create a fake "printer" that outputs a PDF with rasterized images as pages, so you don't have to teach the office clerks anything extra.
To me, it seems to be indistinguishable from printing and scanning.
PS: It's pretty easy to verify that a page contains nothing but an image, programmatically, especially if you also wrote the software that rasterizes it in the first place.
> It's pretty easy to verify that a page contains nothing but an image, programmatically, especially if you also wrote the software that rasterizes it in the first place.
It's pretty easy for a computer to verify any of this; the point is making it idiot-proof. You don't have to be much of an idiot, if you process hundreds of documents a year with no way to visually verify the difference between a badly redacted document and a well redacted document, to screw up once. Especially when the difference between them is that you remembered to push the "redact correctly" button, and if you forgot that, remembered to push the "verify it is redacted correctly programmatically" button before hitting send.
What you do is create a ritual where you have to walk across the room and use a physical machine. You'll remember doing that. And if you don't, since the output will look a bit crap, you can confirm it trivially.
Creating a process that has to be done perfectly every time or it fails catastrophically, and has few indications of failure during the process, is worse than having no process at all.
Even when the black box is done right, sometimes there are quasi-side-channel leaks from its size. The box covering a name, for instance, may be discoverable if there are only a few names possible and it's a small box, meaning it's the shortest name.
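As a toy illustration of that size side channel, assuming you can measure the box and render each candidate name in the same font (the pixel widths below are made up):

```python
# Hypothetical rendered widths (px) of every name that could appear here.
widths = {"Li": 18, "Kim": 27, "Jonathan": 74, "Alexandra": 86}

box_width = 20      # measured width of the black rectangle
tolerance = 4       # slack for padding around the redacted text

candidates = [name for name, w in widths.items()
              if abs(w - box_width) <= tolerance]
```

With only a few possible names, even a crude measurement like this can narrow the field to one. Padding every redaction box to a fixed width closes the channel.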
A friend of mine once had to review a (Swedish) court document with redacted witness names. It was a Word document with history intact. Just undoing a few steps was all it took.
One of my lecturers did that back at university - they generated an Excel spreadsheet containing everyone's marks, then for each student, deleted all but that student and saved as a different file.
Document history was turned on and anyone who hit ctrl+z got the full class marks.
(The same lecturer initially failed me because they forgot to add my final exam score to my assignments score, and then took four months to fix it. They weren't very competent.)
"Information to be withheld should be black highlighted using a tool such as the word highlighter tool like this ⬛⬛⬛⬛⬛ and then printed off. This print out should then be scanned in and saved as a PDF."
...shred cut out parts, burn remains, mix with water, encase in cement, explode, divide rubble into four parts, disperse one part each in Lake Superior, Pacific Ocean, Atlantic Ocean, and the Great Salt Lake; assume an alias, move to Alaska...
This is how military redactions have been done forever. If a soldier writes home to his family and includes classified details (“I watched the sun rise over Mt Vesuvius yesterday but today we are moving west”) the censors just cut out the text with a knife.
> I personally recommend blacking out (add a black rectangle) instead of blurring
I've seen people use image editors on mobile and they'll "scribble" out sensitive information, but one of the problems is that if you pick the wrong pen it'll blend your strokes so it's not 100% opacity (but on a casual glance it's close enough). You can zoom in and change the contrast of a photo that has been redacted this way and recover information.
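That failure mode is easy to demonstrate. In this sketch a "pen" at 98% opacity is composited over bright text pixels; the result looks black at a glance, but a simple contrast stretch recovers the original exactly (toy grayscale values, made up for illustration).

```python
import numpy as np

text = np.zeros((4, 16))
text[:, 3:6] = 1.0                 # the "sensitive" bright pixels

alpha = 0.98                       # pen stroke that is not fully opaque
scribbled = (1 - alpha) * text     # black pen composited over the text

# Looks redacted: every pixel is at most 0.02 out of 1.0.
# But normalizing the tiny remaining range restores the text.
lo, hi = scribbled.min(), scribbled.max()
recovered = (scribbled - lo) / (hi - lo)
```

Real photos add noise and JPEG artefacts, so recovery is partial rather than exact, but anything short of 100% opacity leaves signal behind.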
> I personally recommend blacking out (add a black rectangle) instead of blurring
Real-life document workflows can be really tricky. What if one is required to print or photocopy the obscured document? Devastating for the printer's toner or cartridge lifetime... In some cases an opaque greyish rectangle does the job.
I found many years ago that my pay statements suffered from the last item you mentioned. My personal info had a black box over things like the SSN...but if I just moved the window around the black box followed slower than the document so everything was visible. ADP never acknowledged the problem when I brought it to their attention, but they did eventually fix it.
Did the blog author actually un-blur the booking reference though? He states he tried to un-blur the barcode, was unsuccessful and then realized the booking reference was right there in the picture. Nothing about un-blurring it.
The original image was not blurred, he simply read off the plaintext booking reference. (After first trying and failing to scan the also unblurred bar code.)
It's lossy, but not destructive, and a 'sharpen' operation is technically the same as blur but in reverse. So you won't end up pixel-perfect after doing an 'unblur' but you will be able to make out more than you could before.
Also, if you have multiple pixelated/blurry images, that helps you reconstruct it more easily; e.g. if different newspapers print pixelated pictures of the "suspect", you can reconstruct it pretty accurately.
The thing to remember here is that the only way to hide (real-world) data in an image is to reduce the amount of data in the picture... a blur or swirl leaves most if not all of the data in the picture (although distorted). Any filter that removes data (such as pixelation or blacking out / whiting out) can be used to safely hide it... Just remember to also strip out any unwanted metadata (EXIF data) and do not use layers but a 'flattened' version of the picture.
Pixelation is also attackable. Generate input (e.g. GAN) and apply pixelation until it converges. Probably won't be super accurate but enough to probably ID someone.
Black/delete (and flatten/rebroadcast) is the only way.
I'd worry about hallucinations when applying a GAN to a pixellated image. You'll get out a face, but who's to say that it's the correct face? Lots of people look similar.
"I personally recommend blacking out (add a black rectangle) instead of blurring, and if it is a PDF, convert to an image afterwards because too many PDF editors use non-destructive operations to add a new object instead of changing what's underneath."
I have this at work, with engineering drawings. With mobile equipment we're often not dealing with engineering companies per se, and they won't, or don't know how to, get us CAD models of their equipment. And we often don't have the equipment on hand at the time we need to make drawings.
But if you have a PDF with vector drawings, often a manual, and one or two good dimensions you can make a reasonably accurate model. AutoCAD even makes this easy with the PDFIMPORT function.
More often than I would expect, there's a whole other drawing view either covered by a white box or off-page. Once it looked like it had been drawn over with a white paintbrush tool, and of course the path of that was also visible.
Why not use a randomized blur so people who like to do such things can waste time trying to figure it out when it's actually nothing but random numbers and has none of the original info?
Sometimes a black bar or even cropping isn't sufficient. You still have to trust the editing software.
There was a scandal around 2003, when a TV host took a topless photo, cropped it, and shared the cropped photo online. Unfortunately, the software she used to crop the photo (Photoshop, I think CS3) stored the original photo as metadata if you didn't change the original filename. The original (uncropped) photo could be seen in the "Open File" preview dialog when opening the cropped version.
Also, don't cut it out so that it becomes transparent, since this may still preserve the colour component of the RGBA pixels, even if it is invisible and blended with a black background.
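A quick sketch of that RGBA pitfall: zeroing the alpha channel makes a region render as background, but the colour channels still carry the data and can be read back directly, or by just resetting alpha.

```python
import numpy as np

# Tiny RGBA image; channels are R, G, B, A in [0, 1].
img = np.zeros((2, 8, 4))
img[..., :3] = 0.7        # the "sensitive" colour data
img[..., 3] = 1.0         # fully opaque

img[:, 2:5, 3] = 0.0      # "erase" columns 2..4 via transparency only

# The region renders as background, but the RGB values are untouched:
leaked = img[:, 2:5, :3]

# An attacker simply flips alpha back on:
restored = img.copy()
restored[..., 3] = 1.0
```

The safe version zeroes the colour channels too (or flattens to a format without alpha) before saving.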