A bit tangential, but here are a few unix utils you can use to inspect text:
`vis`: display non-printable characters in a visual format
`cat -e`: Display non-printing characters and display a dollar sign (`$') at the end of each line.
`hexdump -c`: Display the input offset in hexadecimal, followed by sixteen space-separated, three column, space-filled, characters of input data per line.
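If none of those are at hand, a rough pure-Python equivalent is only a few lines (a sketch; the escape style is my own choice):

```python
import sys
import unicodedata

# In the spirit of vis/cat -e: echo stdin with control and format
# characters (Unicode categories Cc/Cf) shown as \uXXXX escapes,
# and a '$' marking the end of each line.
for line in sys.stdin:
    out = []
    for ch in line.rstrip("\n"):
        if unicodedata.category(ch) in ("Cc", "Cf"):
            out.append(f"\\u{ord(ch):04x}")
        else:
            out.append(ch)
    print("".join(out) + "$")
```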
Further tangent: a colleague and I talked about creating a derivative of one of the programmer fonts, replacing smart/curly quotes and other such nuisances with unmistakable indicators.
This would be a useful feature for any programming font. I can totally imagine exploits written with zero-width characters: an attacker who can smuggle them into interpreted strings can slip past scanners/parsers and open a security hole. Input can of course come from many places, but this would at least make one vector visible.
What is cat-v.org? (Obviously I visited that link and read the about page - I'm asking for your summary; it's still not clear to me after my initial visit.)
I'm sure there are better people to answer this question, but cat-v.org is the personal blog & collection of interesting stuff by former plan9 personality uriel (sadly no longer with us).
It hosts random thoughts and lots of interesting stuff (including man pages) about plan9 and early unix systems, and miscellaneous other unix trivia.
It really is a treasure. Take the time to browse, it's well worth it if you're somewhat interested in unix history and not overly familiar with the subject already.
Yeah that's me :) At least the second part, because with Emacs I don't really need to be invidious.
TBH I do actually like Unix and Plan9, but having read a big part of the cat-v website in the past, most of the views there are fanatical, and more bitter, cynical and disingenuous than I could ever force myself to be. The way incredibly useful and important OSS projects are denigrated and misrepresented is especially annoying (see the harmful stuff section, for example).
Well, I apologize and withdraw the 'bitterly invidious', you aren't.
One part is a matter of taste, and having been around both styles, I like the 'keep it simple' approach better. Too many functions/options overload my simple brain ;-)
Do you have any useful commands like these that can handle multi-byte characters in UTF-8? For instance, handling the zero-width space U+200B, which in UTF-8 takes up more than one byte. I've got some custom scripts that do, I was wondering if there was already something out there.
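(For reference, the core of what my scripts do is roughly this; exactly which characters to strip is a judgment call. Python 3 handles the multi-byte UTF-8 encoding for you once the text is decoded:)

```python
import re

# The usual invisible suspects: ZWSP, ZWNJ, ZWJ, word joiner, BOM/ZWNBSP.
INVISIBLES = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")

def strip_invisibles(s: str) -> str:
    return INVISIBLES.sub("", s)

print(strip_invisibles("zero\u200bwidth"))  # -> "zerowidth"
```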
I agree with other users who said that it would be nice if this happened automatically instead of on click--I think after the initial novelty wore off, I'd probably forget to click it. It might be cleaner to just strip the non-printing characters and display a number of characters stripped, similar to how ad blockers display the number of ads blocked.
I think this extension has a very specific use case, namely that you want to leak confidential information without it tracing back to you.
As others have mentioned, I could see this being used regularly by someone who plays Eve.
If you're an English speaker who doesn't often handle languages that benefit from zero-width characters, you could just have a listener scan your clipboard for zero-width characters, silently strip them, and then re-populate the clipboard.
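A minimal sketch of that listener, assuming the third-party pyperclip package for clipboard access:

```python
import time
import pyperclip  # third-party: pip install pyperclip

# Map each zero-width code point to None so str.translate deletes it.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

last = None
while True:
    text = pyperclip.paste()
    if text and text != last:
        cleaned = text.translate(ZERO_WIDTH)
        if cleaned != text:
            pyperclip.copy(cleaned)  # silently re-populate the clipboard
        last = cleaned
    time.sleep(0.5)  # crude polling; a real clipboard hook would be event-driven
```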
I've actually been thinking about building a clipboard filter for Windows a lot lately; I'm getting tired of copying text to the address bar or Notepad to strip formatting information.
Zero width characters can also be used to force long strings of text to wrap correctly, so this may break the layout of sites that allow things like URLs in UGC text.
In my experience, adding a zero width character into something that someone might copy/paste is generally a bad idea because it'll lead to some very irritated users having to manually remove them from copied data.
In Office, you can hit Ctrl after pasting, then "t" to select paste without formatting, which is nicer than switching from keyboard to mouse. The obnoxious thing is that if the source text did not actually have formatting, then no menu opens on Ctrl, and "t" is inserted verbatim instead. Wreaks havoc with my muscle memory, since I need to keep track of where I copied text from when pasting.
That just removes formatting. Zero-width characters are not formatting, so they wouldn't be removed. They are as real as the letter A - you wouldn't want Ctrl+Shift+V to remove all A's from your text, would you?
It's starting to seem like the universe has some fundamental order for things that we can escape temporarily, but that are inescapable in the long run.
1. Photo and video evidence was a game-changer for establishing facts and chronologies. As we've seen, it's becoming harder and harder to distinguish fake photos and videos from real ones. It's not a stretch to predict a time when we'll only be able to make probabilistic statements about the veracity of photos or videos. (E.g., "60% likelihood of being undoctored.") This is probably already true of photos, although the expense of making a perfect fake is still pretty high, in terms of expertise.
I'd argue the "natural state" is one where word-of-mouth and first-hand accounts are the most authoritative evidence we can have (other than physical evidence like DNA left behind). And even physical evidence left behind can't tell us what the person did while there or how an event transpired.
2. Tracking communications. In the digital age, some people have come to assume that all digital text is untrackable and anonymous: your "11001110" is the same as mine. Historically, it was pretty difficult to transcribe information without leaving traces of its origin. These zero-width characters, all the other text-fingerprinting methods, and ubiquitous tracking in communication logs make it nearly impossible, again, to communicate with others without leaving a trail. And then there's writing-style analysis, which makes it tough to write anything without leaving telltale fingerprints.
So, I'm proposing that we are returning to the "natural state" of things. Probably overstating things a bit, but still an interesting thought to consider.
First-hand accounts are quite literally one of the least reliable sources of truth regarding many kinds of events. Brains fill in a lot of gaps with heuristics.
> ... the "natural state" is one where word-of-mouth and first-hand accounts are the most authoritative evidence we can have (other than physical evidence like DNA left behind)
I think they're proposing more that these accounts are authentic, not necessarily reliable. It's common sense that eyewitness testimony isn't 100% reliable.
Photographers have been begging the big camera makers (Nikon/Canon) to add cryptographic signatures to photos from their cameras for years, but so far they've resisted doing so.
Canon/Nikon's main targets are people who shoot fast, shoot a lot, and publish very quickly (e.g. sports photographers), and people who spend an awful lot of time post-processing (wedding, art, commercial shoots).
I'm not sure there's enough benefit to justify spending processing time, effort and resources on cryptography on any of these fronts.
Smartphone apps that take "secure" pictures have been around for a while, though, for use cases like crash-site photos for insurance claims. I think they do fill that niche.
Like others pointed out, the "natural state" is even further from ideal. I'm not an expert on cryptography at all, but I think signatures would help in certain situations (and /u/jonahhorowitz brings this up above me). More advanced media - the next step beyond videos, maybe involving 3D or VR - would make fabrication harder. I'm sure there are solutions out there.
Have you thought about revealing things like non-breaking space, thin space, hair space, and so on? I think at least Fecebutt replaces non-breaking spaces with regular spaces in comments, perhaps precisely to frustrate such steganography. (Ironically, last I recall, they do preserve the difference between double and single spaces, e.g. after periods, which is even more invisible in HTML.)
I second the recommendations of vis, cat -e (or cat -vte), and od -a. I didn't know about hexdump; I always used xxd.
The project is clearly motivated by "Be careful what you copy: Invisibly inserting usernames into text"[0] (posted 12 hours ago on HN), and many people here might think zero-width characters (or rather, any esoteric Unicode characters) are the only (or the most prominent) way to watermark text, which is wrong. The whole topic of watermarking is worth at the very least a lengthy blog post, maybe even several articles, so keep in mind that this comment is just a very brief introduction to the different methods[x]:
(from trivial to more complex ones)
----
1. Using Invisible Characters
The linked[0] article basically describes a very basic version of it, which should be enough to get the idea. Of course, more sophisticated techniques will use more than just two characters, take the position of each invisible character into consideration while encoding & decoding watermarks, ensure the payload is uniformly distributed throughout the paragraphs, etc.
Can be defeated by simply removing invisible characters.
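A toy version of the basic scheme (my own sketch, not the article's exact method): ZWSP encodes a 0 bit, ZWNJ a 1 bit, and the whole tag is tucked in after the first space:

```python
ZERO, ONE = "\u200b", "\u200c"  # ZWSP = 0, ZWNJ = 1

def encode(text: str, tag: str) -> str:
    bits = "".join(f"{ord(c):08b}" for c in tag)
    payload = "".join(ZERO if b == "0" else ONE for b in bits)
    return text.replace(" ", " " + payload, 1)

def decode(text: str) -> str:
    bits = "".join("0" if c == ZERO else "1" for c in text if c in (ZERO, ONE))
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

marked = encode("meet me at noon", "u42")
print(marked == "meet me at noon")  # False, yet the two render identically
print(decode(marked))               # u42
```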
----
2. Using Unicode Characters That Look Alike
The same working mechanism as the IDN homograph attack[1].
Can be defeated by simply removing "out-of-place" letters/characters, after determining the language of a given text/paragraph/sentence etc.
----
3. Using Unicode Equivalence
> Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet).
Can be defeated by simply normalising the text, and line endings!
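In Python, for instance, normalisation is a single call:

```python
import unicodedata

a = "n\u0303"  # 'n' followed by combining tilde
b = "\u00f1"   # precomposed 'ñ'
print(a == b)                                # False: different code points
print(unicodedata.normalize("NFC", a) == b)  # True: canonically equivalent
```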
----
(Now it's getting harder!)
4. Changing the Layout of Documents
You can change (the rendering of):
(a) the margins
(b) the ligatures
(c) the space between
(i) specific characters [kerning]
(ii) consequent words/lines/paragraphs
to embed fingerprints. This is especially dangerous because documents are often leaked as screenshots or photocopies, which are secure against Unicode tricks, but not against these.
Can be defeated by copying the plaintext, pasting it into a text editor, and applying steps 1, 2 and 3.
Also, bear in mind that you can still (unintentionally) leak information:
(a) when you use a document editor (LibreOffice Writer, Microsoft Word), as the layout engine might act differently depending on your software version, platform, file format, etc.
(b) through your choice of paper size (A4, US Letter, ...)
----
5. Substituting with Synonyms
If characters can be replaced by their equivalents, why not replace words, or even whole sentences? Words can be substituted with their synonyms by an algorithm that creates fingerprints accordingly, and I presume even sentences can be rephrased with the latest advancements in AI/ML.
Also, some of the "substitutions" will be unintentional: a leaker's own writing habits can give them away all by themselves (different spellings in American and British English, ways of writing date & time, decimal separators, etc.).
Can be defeated by paraphrasing.
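A toy sketch of the idea, with each hand-picked synonym pair encoding one bit (real systems would use far larger dictionaries and positional encoding):

```python
PAIRS = [("big", "large"), ("quick", "fast"), ("begin", "start")]

def encode(bits: str) -> str:
    words = [PAIRS[i][int(b)] for i, b in enumerate(bits)]
    return "the {} dog made a {} move to {} the race".format(*words)

def decode(sentence: str) -> str:
    return "".join(str(pair.index(w)) for pair in PAIRS
                   for w in sentence.split() if w in pair)

print(encode("101"))          # the large dog made a quick move to start the race
print(decode(encode("101")))  # 101
```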
----
The list could probably be extended even further, but this was all I could remember off the top of my head. =)
[x]: Not that I work in a related field, but I once researched watermarking & fingerprinting techniques for a similar but much more extensive project: a tool to help journalists/whistle-blowers detect fingerprinting/watermarking in documents.
Let me know if you are interested and we can collaborate!
Excellent list. For completeness's sake, I'd include intentional typos as an instance of category 5. These can be hard to catch if the typo is in a name. A logical extension of that would be entirely made-up names for non-essential people/places.
At the age of 5, Hawking visited Oxford, incidentally passing through Fakenamingham-on-Watermarkshire.
Dictionaries/encyclopedias have been known to insert entirely fake entries as a way of proving ownership (http://articles.chicagotribune.com/2005-09-21/features/05092...). In the age of ebooks and print-on-demand, those could be tailored to the individual licensee.
Anything like this for VSCode? Just accidentally typing option-space (a non-breaking space) in a Ruby file (after an `end`) will cause Ruby to explode and not tell you anything. Randomly deleting code until it ran was my only fix.
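In the meantime, a few lines of Python will at least point at the offender (a rough sketch):

```python
import sys
import unicodedata

# Print file:line:col for every suspicious (non-ASCII or control) character.
path = sys.argv[1]
with open(path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        for col, ch in enumerate(line.rstrip("\n"), 1):
            if ch != "\t" and not 32 <= ord(ch) <= 126:
                name = unicodedata.name(ch, "<unnamed>")
                print(f"{path}:{lineno}:{col}: U+{ord(ch):04X} {name}")
```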
It feels like you're trying to make a point on that site, but I can't work out what that point is. Are you saying zero-width spaces are bad? Are you educating people that they exist? It's not at all clear.
Also I disagree with the following:
> If you're working on a website there's a good chance that someone will want to copy/paste it, automate using it, etc. Zero-width spaces will create no end of hassle to those people.
I could work around that with literally just one line of code, something along these lines (Python here, but any regex flavour will do):
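```python
import re

text = "user\u200b generated\u200c content"  # whatever you're about to render
text = re.sub("[\u200b\u200c\u200d\ufeff]", "", text)  # the actual one line
```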
But these days many HTML parsers will tidy up the output for you so depending on the frameworks you're using, you might not even need to use the above line of code.
If I could have registered "zerowidth.character" I would have. You've got to handle different characters and such.
I should say that the site is mainly a joke because we had to use a particular tool which had a bunch of zero width spaces in it, and it got on our nerves.
> If I could have registered "zerowidth.character" I would have. You've got to handle different characters and such.
Ahhh, I didn't get that from the site. I thought it was focused on one specific type of zero-width character.
> I should say that the site is mainly a joke because we had to use a particular tool which had a bunch of zero width spaces in it, and it got on our nerves.
Suppose, for example, that you want the emoji "Man Facepalming Medium Skin Tone". There is no single Unicode code point for that. Instead, it uses five code points:
U+1F926 FACE PALM
U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4
U+200D ZERO WIDTH JOINER
U+2642 MALE SIGN
U+FE0F VARIATION SELECTOR-16
The first two compose to make the "facepalm" with desired skin tone. The last two make "man". The zero-width joiner composes the first two and last two together, so that the whole sequence renders as a single emoji.
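You can verify the breakdown from Python; the names come straight out of the Unicode character database:

```python
import unicodedata

seq = "\U0001F926\U0001F3FD\u200D\u2642\uFE0F"  # man facepalming: medium skin tone
for ch in seq:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(seq, len(seq))  # renders as one glyph, but len() reports 5 code points
```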
Here's a list of emoji which use the ZWJ this way:
It's a pretty obscure edge case. You could retain ZWJs when they sit between two code points with the Symbol or Emoji Symbol classes, but on further reflection, it would probably be more honest to just strip the ZWJs everywhere and decompose the emojis.
Does anyone remember how many years of pain were caused by different line endings -- "\n" vs. "\r" vs. "\r\n"? Unicode is 10x that, and the pain is just beginning.
Probably, since it's an omnishambles on a scale I won't pretend to understand. Eight-fingered and two-thumbed humans can't type it, but they've managed to make it a joke (see zalgo text), and it only gets more ridiculous with time (see how flags are done, or emoji skin-tone modifiers). It's only a matter of time before Metafont will be easier to read and write than the hieroglyphics we're supposed to call "text."
I know it's pretty basic stuff, but it does the job. The encoder outputs both to stdout and to a file so you can copy/paste it more easily (from Sublime Text, for example).
`od -a`: Output named characters.