I'd posit the idea that we primarily use the punctuation that we find on the keyboard of the device we use.
On Windows/Linux, I use a 101-key keyboard. We all know the symbols. And those are the ones I use. There are cases in which I can use an interrobang or kanji-related emoticons, but then I'm doing a Google search and copying and pasting. But long story short, I don't use these other symbols because they're not easily available.
Sure, I could remap my keys, at least on Xorg. (No clue how to do that on Windows.) But then, those symbols just aren't used. It's also a reason I don't bother with APL, because the language uses dozens of characters that aren't normally on a keyboard. I'll mess with ones I can type.
Now, phones... Oh my. A phone will auto-substitute picture emoticons for things like :D. Technically, these are a Unicode character set with per-device implementations. It means even though the pic is a 32x32 image, it's usually only a few bytes (Unicode is fucky, urgh, so many bugs). I'm more leery of using Unicode because of character handling and badness, but things seem to be moving in that general direction.
As for the wider issue of Unicode, these "failed symbols" are right in there. But good luck exploring and finding them, since characters can be from 1-6 bytes, or 2^48 space... Alas, back to searching Google and copying and pasting.
The major exception for me is em-dashes, which I use all the time.
There are also a few punctuation marks on a standard keyboard that don't get used very much outside of programming and other specialized contexts: { } \ |
Another interesting thing related to punctuation is that the use of some punctuation is really overloaded. Quotation marks in particular can mean literally quoting someone, a scare quote, an indication that something is a term of art, titles of certain types of works, and probably some other variants.
On some keyboard layouts (including Danish, which I use), we also have the ¤ on shift-4.
Nobody I know has ever used ¤ in any context at all; it seems like such an odd addition. I've used it sometimes as a footnote marker, since it's highly unlikely to clash with formatting, unlike an asterisk.
The interrobang and a few other "novel" punctuation marks are there, but many of the examples from the article haven't made it that far, AFAIK. (And probably never will.)
> characters can be from 1-6 bytes, or 2^48 space
It's a minor detail, but Unicode characters are limited to the range U+0000..U+10FFFF, so they fit in 21 bits and require no more than four bytes (in the UTF-8 encoding form) or two 16-bit code units (in UTF-16) in the usual string representations.
(No argument with your original statement regarding what we primarily use, anyhow!)
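For anyone who wants a quick check, here's a throwaway Python sketch (nothing project-specific, just the standard encode behaviour):

  # Compare storage of a BMP character and the largest valid code point.
  for ch in ("\u203D", "\U0010FFFF"):          # interrobang, then U+10FFFF
      utf8 = ch.encode("utf-8")
      utf16 = ch.encode("utf-16-be")           # big-endian, no BOM
      print(f"U+{ord(ch):04X}: {len(utf8)} UTF-8 bytes, "
            f"{len(utf16) // 2} UTF-16 code unit(s)")
  # U+203D: 3 UTF-8 bytes, 1 UTF-16 code unit(s)
  # U+10FFFF: 4 UTF-8 bytes, 2 UTF-16 code unit(s)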
> Unicode characters are limited to the range U+0000..U+10FFFF, so they fit in 21 bits and require no more than four bytes (in the UTF-8 encoding form) or two 16-bit code units (in UTF-16)
The private use ranges U+F8000..U+FFFFF and U+100000..U+10FFFF can be used as the high and low surrogate ranges in a surrogate-pair encoding scheme similar to that of UTF-16. That would give a range of U+00000000..U+7FFFFFFF, which is in 2^31 space. If the Unicode Consortium removed their arbitrary upper limit of U+10FFFF for codepoints, the UTF-8 and UTF-32 schemes as presently defined would naturally fit into this space without needing to use any surrogation. Only UTF-16 would require any surrogation -- its present scheme and those private use planes for a 2nd tier -- but hopefully the use of UTF-16 would be well on its way out by then.
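A rough Python sketch of how such a pairing could work (purely hypothetical scheme, just following the ranges above; the noncharacter code points at the end of each plane are ignored for simplicity):

  HI_BASE = 0xF8000     # hypothetical high surrogates: 2^15 code points
  LO_BASE = 0x100000    # hypothetical low surrogates:  2^16 code points

  def encode_pair(cp):
      # 2^15 * 2^16 = 2^31 pairs, enough for U+00000000..U+7FFFFFFF.
      assert 0 <= cp <= 0x7FFFFFFF
      return HI_BASE + (cp >> 16), LO_BASE + (cp & 0xFFFF)

  def decode_pair(hi, lo):
      return ((hi - HI_BASE) << 16) | (lo - LO_BASE)

  pair = encode_pair(0x7FFFFFFF)    # top of the old 31-bit UTF-8 range
  assert decode_pair(*pair) == 0x7FFFFFFF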
So does the posting of this have anything to do with the recent Interrobang episode on the 99% Invisible podcast‽
(99pi is an awesome podcast, btw!)
After listening to this episode, I even added a binding to the "C-x 8" map in Emacs[0]. This key map is mainly used for inserting various Unicode characters. So now in my Emacs, "C-x 8 ?" inserts ‽.
These are great. And they are expressions of type: "This is a different type of thing from a !. This type of thing is a (string of words that is represented by the new punctuation mark)".
Expressions of type give us new technology and increased leverage: We can further our inquiries because we can communicate those inquiries more objectively (i.e. more easily cross the subjectivity barrier). This objectified information is now information-technology-tooling. Though subtle, the use of these tools can eventually lead to new learning experiences and highly refined and promising new fields of work.
One of my current areas of interest is not so much punctuation, but what I call type-marks. These are unique markers that communicate the type of information that follows, or what is meant to be done with it. I find it's best if these symbols already exist in Unicode tables. An example from bullet journaling is the hollow, upward-pointing triangle, which represents a meeting or event. I use other type-marks to indicate things like potential new models or leverage points in new models that may lead to useful new technologies. I use U+2295 or &oplus; for that one in particular.
Anyway thanks to OP for posting the article--I love stuff like this.
I frequently use the ~ in front of numbers or words to denote some fuzziness about their accuracy.
It's a tremendously handy symbol for this purpose, as it otherwise requires a lot more thought or typing to get across a similar meaning (e.g. even one or two 'approximately's or 'roughly's will rapidly reduce readability).
Denoting sarcasm would be a retrograde step -- part of the thrill & delight of being sarcastic is that the target is oblivious to it. One worry would be that accidentally inserting a single character (which already has a well understood meaning for the literate amongst us) becomes quite a big risk if it changes the entire tone of the preceding sentence / paragraph / thought. Plus of course there's the risk of rapidly escalating double-bluffs. Would ~~ be unsarcastic, or doubleplussarcastic?
It only just now occurred to me that I have absolutely no idea what the tilde was originally for. I only know it as a shortcut for "user's home dir". I'm sure it must have had some kind of pre-computer use, but I can't even imagine what it might have been.
I guess I could check Wikipedia, but I almost don't want to know. Like, finding out would take some of the mystery out of the world‽
My wild guess would be that it started as a composition character on typewriters to write things such as ã.
For those unfamiliar with how accents work on a typewriter: There'd be a couple special accent marks such as ~ and ` and whatnot that'd put down their accent but not advance the carriage. You could then hit a letter to get e.g. à or ē. It's a pretty clever system to prevent having to have a key for every accent.
The problem isn't the keyboard, it's the USB HID spec. You can easily make keyboards that can send arbitrary Unicode characters, but they still have to talk to the computer in the same outdated language of scancodes.
Don't be so quick to dismiss something you don't understand as outdated and in need of replacement.
If keyboards sent characters instead of scancodes, you would lose the ability to switch keyboard layouts in software (or it would be an ugly hack, which it was before we started to use scancodes). Many people use their computers with multiple languages and find switching keyboard layouts on-the-fly very useful.
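To make the split concrete, here's a toy Python sketch (the position codes and layout tables are made up for illustration, not the real HID usage tables):

  # The keyboard reports which physical key was pressed; the OS decides,
  # per active layout, which character that position means.
  LAYOUTS = {
      "us": {0x1C: "y", 0x1D: "z"},
      "de": {0x1C: "z", 0x1D: "y"},   # QWERTZ swaps Y and Z
  }

  def key_event_to_char(scancode, active_layout):
      return LAYOUTS[active_layout].get(scancode, "?")

  # Same physical key press, different character, chosen purely in software:
  print(key_event_to_char(0x1C, "us"))  # y
  print(key_event_to_char(0x1C, "de"))  # z

Send characters from the keyboard instead, and that per-layout choice has to move into the keyboard's firmware.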
The Apple Lisa was kind of halfway in this direction. The Lisa ROM contained the matrix layout for each keyboard language, but it was the keyboard itself that would return both the scancode and the fixed language of the keyboard being used. And IIRC plugging in a keyboard of a different language would automatically change the system software to use that language.
While there might be a handful of people who switch keyboard layouts, I imagine that different keyboard layouts causes far more problems, lost time, and engineering effort. Overall it would still be better if keyboards sent Unicode characters.
I assume you don't work or live in a multilingual environment, do you? That's the kind of shortcut that brings huge problems to people all around the world.
Of course it means more engineering effort. That's what we're here for.
I routinely type English and Japanese on my one keyboard. We're not talking about whether keyboards should have different configurations in different parts of the world, just whether any particular keyboard should be able to be remapped so that it sends keys which no longer correspond to what's written on the keycaps.
I contend that almost no one does that (obviously outside the 2 people on this page who are so insistent on the feature), so the engineering effort is not worthwhile. And it must be weighed against all the times everyone has to select the right keyboard layout manually when installing operating systems, or they type a particular key but what shows up on the screen is different from the keycap and they have no idea how to fix it.
I understand your opinion, and surely you have way more experience than me on this matter. Still, it seems that there should be a better way. Why aren't keyboards able to communicate a default layout? I've been reading about the current one and it's true it's complex and full of legacy decisions, so why isn't there a "better scancodes" protocol?
It's true, using a keyboard layout different from the one printed on the keyboard is unheard of in many parts of the world, but in others it's almost a necessity. Continuously installing operating systems and manually configuring keyboard layouts is uncommon, too, and it's mostly solved by automatic layout recognition, aka "write the words you see".
My experience comes from continental Europe. Most people I know here switch at least between their local national keyboard layout and the US layout.
The US layout lacks the characters necessary to write in the local language. European layouts often make it awkward to access punctuation common in English (e.g. you need AltGr combos to access them). Many people need to use both English and their local language (and perhaps a third European language!) in day-to-day business, hence the need for switching between layouts.
You say you imagine different layouts cause problems and sending Unicode characters would be better, but you did not give a single reason.
I regularly use different keyboard layouts and it has never been a problem. It does not involve any loss of time or engineering effort. When you regularly have to communicate in different languages (which is common in much of the world), it is very useful.
I have also built several keyboards, and I cannot imagine the pain it would be having to define a key for every Unicode character.
I gave several reasons: Every time I install a new computer I have to select the keyboard layout, or the layout is wrong and I cannot type properly. Think of all the user and engineering effort wasted to support this corner case of switching keyboard layouts (so the keycaps don't even match the keys sent), used by a tiny number of users.
I personally know how much work went into getting QEMU to support keyboard layouts. It has to convert characters back to scancodes so the guest OS can convert the same scancodes back to characters. It was a nightmare to get it right. Such a lot of effort for such a silly feature.
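To give a feel for why that reverse direction is painful, here's a toy Python sketch (not QEMU's actual code; the key codes and table are made up): going from a character back to a scancode means guessing which key, plus which modifiers, would produce it under whatever layout the guest believes it has.

  # Toy reverse lookup: character -> (scancode, needs_shift) for one layout.
  US_LAYOUT = {
      0x04: ("a", "A"),     # (unshifted, shifted)
      0x33: (";", ":"),
      0x2D: ("-", "_"),
  }

  def char_to_scancode(ch, layout):
      for code, (plain, shifted) in layout.items():
          if ch == plain:
              return code, False    # send the key as-is
          if ch == shifted:
              return code, True     # must also synthesize a Shift press
      raise ValueError(f"no key produces {ch!r} in this layout")

  print(char_to_scancode(":", US_LAYOUT))   # (0x33, True)

If the guest is configured for a different layout than the one the host guessed, the round trip produces the wrong character.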
> I have also built several keyboards, and I cannot imagine the pain it would be having to define a key for every Unicode character.
What? There's no reason you would define a key for every Unicode character.
Many users do not even know that keyboard layouts can be changed. I think you are hugely overestimating the effort spent in this. I understand this could be tricky for QEMU, but emulators are always tricky. They are exceptional, not the norm.
And what I don't understand is how it would work in your ideal world. Let's assume we do not allow users to change keyboard layouts. How do I input characters for different languages? Am I supposed to have a different keyboard for every language, or do I need a keyboard that can change layouts? Would those keyboards be included in laptops? How is this simpler than keyboard layouts in the OS?
I am not saying you would have to define a key for every Unicode character. But if the OS is expecting Unicode, either you have to define some way to send it from the keyboard, or you implement some mapping in the OS; at that point you would just be using Unicode characters as scancodes, and we would be back at square one.
They don't make it clear at that point, but the section on the interrobang makes it clear we're talking about old typewriters: typing ", then [backspace], then - would result in the - being struck directly underneath the " on a monospace typewriter.