Xterm(1) now UTF-8 by default on OpenBSD (undeadly.org)
149 points by protomyth on March 8, 2016 | 137 comments



I and others have pushed changes into XTerm to improve mouse support of terminal-based applications. All terminal emulators should implement XTerm's command set, especially these:

Bracketed paste mode: allows the editor to determine that text comes from a mouse paste rather than being typed in. This way, the editor can disable auto-indent and other things which can mess up the paste. Libvte now supports this!
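
For anyone curious, the mechanics are tiny; this is a minimal sketch (not the xterm or libvte source) of how an editor opts in, assuming a terminal that implements the xterm sequences:

    /* Minimal sketch: once bracketed paste is enabled, the terminal wraps
       pasted text in ESC [ 2 0 0 ~ ... ESC [ 2 0 1 ~ so the editor can
       tell it apart from typed input. */
    #include <stdio.h>

    int main(void)
    {
        printf("\x1b[?2004h");   /* enable bracketed paste            */
        fflush(stdout);

        /* ... read input; on ESC [ 2 0 0 ~ turn auto-indent off,
           on ESC [ 2 0 1 ~ turn it back on ... */

        printf("\x1b[?2004l");   /* restore normal paste behaviour    */
        return 0;
    }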

Base64 selection transfer: this is a further enhancement which allows the editor to query or submit selection text to the X server. This allows editors to fully control the selection process, for example to allow the selection to extend through the edit buffer instead of just the terminal emulator's contents.
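
The control sequence behind this is OSC 52. A rough sketch of setting the selection from a program (the base64 payload here is just "hello"; note that xterm keeps this disabled unless you loosen its allowWindowOps/disallowedWindowOps resources, if I remember right):

    /* Rough sketch: OSC 52 sets (or, with a "?" payload, queries) the
       selection as base64.  "c" targets the clipboard; "p" would be the
       primary selection. */
    #include <stdio.h>

    int main(void)
    {
        /* "aGVsbG8=" is base64 for "hello"; "\x1b\\" is the ST terminator */
        printf("\x1b]52;c;aGVsbG8=\x1b\\");
        fflush(stdout);
        return 0;
    }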

One patch of mine didn't take, but I think it's still needed: allow mouse drag events to be reported even if the coordinates extend beyond the xterm window frame. Along with this is the ability to report negative coordinates if the mouse is above or to the left of the window. Why would this be needed? Think of selecting text which is scrolled off the window. The distance between edge and the mouse controls the rate of selection scrolling in that direction.

BTW, it's fun to peruse xterm's change log. For example, you can see all the bugs and enhancements from Bram Moolenaar for VIM. http://invisible-island.net/xterm/xterm.log.html

Thomas Dickey maintains a lot of other software as well, in particular ncurses, vile and lynx: http://invisible-island.net/


Bracketed paste mode is also useful for IRC, to prevent misfiring a huge paste into a channel.


Yes, Thomas Dickey's been maintaining xterm and a lot else for donkey's years now. A lot of people owe him a big "thank you" for all his hard work. Thanks Thomas.


Every time some link or headline reads "now UTF-8 by default", the only reasonable response in 2016 is "about time".


That's not why this article is interesting. Rather, it highlights how profoundly not UTF-8 ready the (terminal) world is.

(It does work in practice, but in-band signaling over a channel carrying complex data that receiver and sender interpret according to settings that do not appear in the protocol at all is, predictably, terrible.)


This reminded me of a Rob Pike comment. I can't find the text, but it was along the lines of, "I recently tried Linux. It was as if every bug I fixed in the 1980s had reverted."


That was baseless posturing. A famous study and its follow-up found that the utilities on GNU/Linux are more robust, and that was twenty years ago:

ftp://ftp.cs.wisc.edu/paradyn/technical_papers/fuzz-revisited.pdf [1995]

"This study parallels our 1990 study (that tested only the basic UNIX utilities); all systems that we compared between 1990 and 1995 noticeably improved in reliability, but still had significant rates of failure. The reliability of the basic utilities from GNU and Linux were noticeably better than those of the commercial systems."

I doubt there has been much improvement in those commercial Unixes; they are basically dead. (What would be the business case for fixing something in a userland utility on a commercial Unix?)

The maintainers of the free BSD's have been carrying that torch, but they don't believe in features.

Stepping into a BSD variant is like a trip back to the 1980's. Not exactly the real 1980's, but a parallel 1980's in which Unix is more robust---but the features are all rolled back, so it's just about as unpleasant to use.


> The maintainers of the free BSD's have been carrying that torch, but they don't believe in features.

I used Linux for more than a decade before switching to OpenBSD precisely because Linux developers believe in features to the point where how well they're implemented is no longer relevant.

The arrogant, know-it-all kids that we so lovingly nurtured in the Linux community grew up to be its forefront developers today. It shows.

Edit: I was hesitant to write this because it always leads to very unproductive results, but what the hell, I'll bite.

Systemd was the last straw for me, not because something something Unix philosophy (after all, OpenVMS disdained the Unix philosophy, and it worked better than Linux ever has) but because it's so bug-ridden that its interface's simplicity and elegance are next to useless.

Maintaining a non-trivial network & filesystem setup (e.g. I have a few dozen Qemu VMs, because writing software for embedded systems is like that) became a nightmare. It broke with every other update. Great if you're doing devops and this is expected and part of your job, terrible if you're doing actual development and you want an infrastructure that doesn't break between compiles.

I ragequit one afternoon, put FreeBSD on my workstation and OpenBSD on my laptop. I have not touched anything in my configuration in almost a year now and it works flawlessly. I don't think I've had it work for a whole month without having to fiddle with whatever the fuck broke in systemd, somethingsomethingkit or God knows what other thing bolted on top of the system via DBus. I can write code in peace now and that's all I want.

These are all great technologies. Systemd in particular was something I enthusiastically used at first, precisely because after Solaris' SMF -- XML-based as it is -- even OpenRC seemed like a step back to me. But, ya know, I'd actually want it to work.


The basic problem, as I see it, is that the Gnome/Freedesktop people got hold of the user space reins, and turned what used to be a kernel-up development process into a desktop-down development process.


I don't know if it ever was a "kernel up" development process. Gnome (and KDE) both had their own, pretty complex stack, even before Freedesktop.org came up (e.g. KDE had DCOP and a bunch of other services). And they weren't exactly bug-free, either -- but at least they covered a lot less surface.

I don't think it's a simple problem, and I don't think all the blame should be laid on Freedesktop.org, where a lot of good projects originated. I do think a lot could be solved by developers being a little more modest.


Not in a planned sense. But until Freedesktop, I had the impression that you had the kernel, then CLI user space, and then X and the DEs, which wrapped the CLI tools in certain ways, with the CLI tools in turn talking to the kernel.

Thus you could go from bare kernel, to CLI to GUI in a layered manner (and fall back when a higher layer had issues).

With Dbus etc the CLI has been sidelined. Now you have a bunch of daemons that talk kernel at one end and dbus at the other.

Never mind that they tried to push a variant of dbus directly into the kernel. And as that failed, they are now cooking up another take that is yet again about putting some kind of DE RPC/IPC right in the kernel.


I wouldn't have a problem with a bunch of daemons talking D-Bus to each other, if their interfaces were properly documented, if D-Bus were properly documented and, in general, if D-Bus weren't quite a mess. I mean, if it weren't for this obscurity, issuing commands over D-Bus wouldn't be that vastly different than issuing them on a console, only more verbose.

Unfortunately, there is a lot of weird interaction between all these processes. It's often badly (or not at all) documented, and what plugs where is extremely unclear. It's very messy, and it doesn't stand still long enough for someone to fix it. They just pile half-done stuff over more half-done stuff.

It's really unfortunate because the Linux kernel is, pragmatically, probably the best there is. It may not excel in specific niches (e.g. security), but overall, it does a lot of things better than, or at least about as well as, the BSDs, and it runs on systems where not even NetBSD boots.


This seems an almost inevitable consequence of interfaces based on asynchronous message passing. Collaboration scenarios emerge which become impossible to trace. "If this secret handshake is done that way among these three processes, then such and such a use case is satisfied."

One problem with message passing as such is that messages are like function calls, but you can't put a breakpoint into the system and see a call stack!

If we call a function f on object a, which calls b.g(), a breakpoint in b.g() tells us that something called a.f() which then called b.g(). If we send a message f to a, which then sends a message g to b, a breakpoint in b on the receipt of message g tells us diddly squat! The g message came in for some reason, evidently from a. Well, why did a do that? Who knows; a has gone off and is doing something else.
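
A toy sketch of what I mean (all names made up for illustration): once the handler runs, the only caller on the stack is the dispatcher, and whoever enqueued the message is long gone.

    /* Toy sketch: g() is reached via a queue, so a breakpoint in g() shows
       only dispatch() on the stack; the a_f() that actually caused the
       message has already returned. */
    typedef void (*handler)(void);

    static handler queue[16];
    static unsigned head, tail;

    static void send(handler h) { queue[tail++ % 16] = h; }

    static void g(void)   { /* breakpoint here: caller is dispatch(), not a_f() */ }
    static void a_f(void) { send(g); /* the "why" disappears once this returns */ }

    static void dispatch(void)
    {
        while (head != tail)
            queue[head++ % 16]();
    }

    int main(void)
    {
        a_f();        /* compare with a direct a_f() -> g() call chain */
        dispatch();
        return 0;
    }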


What's so cool about OpenVMS? You're not the first person to praise it but nobody ever explained why.


The comparison may be a little anachronistic but... well, in order to understand why OpenVMS made such a dent on computer history, you have to put it in context first.

OpenVMS's days of glory more or less coincided with the Unix wars. Unix was brilliantly hacker-friendly, but a lot of basic things that we now take for granted in Linux -- virtual memory, high-speed I/O and networking -- were clunky and unstandardized. Others (like Files-11, VMS's excellent filesystem) were pretty much nowhere to be found on Unices (or, if they were, they were proprietary and very, very expensive). A Unix system included bits and pieces hacked together by people from vastly different institutions (often universities) and a lot of the design of the upper layers was pretty much ad-hoc.

OpenVMS had been a commercial project from the very beginning. It had a very well documented design and very sound engineering principles behind it. I think my favourite feature is (well, technically, I guess, was) the DLM (Distributed Lock Manager), which was basically a distributed locking system that let you coordinate concurrent access to resources (such as, but not only, files) in a clustered system. I.e. you could acquire locks on remote resources -- this was pretty mind-blowing at the time. You can see how it was used here: e.g. http://www3.sympatico.ca/n.rieck/docs/openvms_notes_DLM.html .

Also, the VAX hardware it ran on rocked. The whole thing was as stable and sturdy as we used to boast about Linux in comparison to Windows 98, except at a time when many Unices crashed if you did the wrong thing over NFS.


It was quite a long time ago, so my memory is fuzzy, but one thing that was quite cool about OpenVMS is the automatic versioning of files. Also, the tools were quite robust: I remember a word processor, DECwrite(?); it was labelled 'beta' but it was far, far more robust than Word (if less capable).


> The maintainers of the free BSD's have been carrying that torch, but they don't believe in features. Stepping into a BSD variant is like a trip back to the 1980's. Not exactly the real 1980's, but a parallel 1980's in which Unix is more robust---but the features are all rolled back, so it's just about as unpleasant to use.

Sorry, but that's all BS. FreeBSD is definitely a modern system, has lots of features and is, emphatically, millions of times better than a 1980's Unix. Linux may have a larger community than the BSDs, but saying the BSDs are like stepping into the 1980's is rather disingenuous.


He wants: games, systemd, complicated control interfaces, and most of all a fully interactive desktop environment which mounts his drives for him and plays Flash videos..

.. Without taking much time to configure it.

This is fair enough, but it's not what I want in a machine. OpenBSD might be "behind", but it feels complete, supported, sustainable and, most of all, "very well thought out". FreeBSD is also exceedingly good, but makes trade-offs in how clean the implementation of the OS feels in order to keep up with Linux.

Or at least it feels like this to me. But to say the BSDs aren't modern is deluded; there's a reason they're known to have the fastest software networking stack in the world.


I do not want any of that stuff.

I found FreeBSD to be unusable simply in the command line environment. I was using only a text console login. I simply wanted a decent shell and editor.

Heck, FreeBSD wouldn't even scroll back with the de facto standard Shift-PgUp.

> mounts his drives for him

That amazing advancement in Unix usability can be achieved by something called the "automount daemon" which was introduced in the late 1980's in SunOS (the real, pre-Solaris one).

Tom Lyon developed the original automount software at Sun Microsystems: SunOS 4.0 made automounting available in 1988. [https://en.wikipedia.org/wiki/Automounter]

You basically just wrote a comment which paints a 1988 commercial Unix feature as a Linux frill that BSD people don't need.

FreeBSD has caved in and has autofs as of 10.1: https://www.freebsd.org/cgi/man.cgi?query=autofs&sektion=5

That was released in November 2014, only some 26 years after Sun rolled out the feature. Better late than never, I suppose.


> I found FreeBSD to be unusable simply in the command line environment. I was using only a text console login. I simply wanted a decent shell and editor.

If you don't like a command line interface, install a desktop environment; if you want a different shell, install one; and if you want a different editor, again, install a different one.

Nothing you have written suggests FreeBSD is unusable. Apparently you prefer systems with this stuff already installed, which is fine, but it doesn't mean you should knock the BSDs because you are unwilling or unable to install a couple of new packages.


I like a command line environment.

> If you want a different shell, install one; and if you want a different editor, again, install a different one.

I didn't want to customize the FreeBSD environment because I was only using it to maintain a port of a specific program. I wanted that to build in the vanilla environment and not have any dependency on some customizations.

Dealing with FreeBSD was just a hassle, even for the three or four minutes once in a while (at release time) to fire it up, pick up new code, build and go through the regression suite, then roll binaries.

The last straw was when I switched some parsing to reentrant mode, requiring a newer version of Flex than the ancient version on FreeBSD. There was no obvious way to just upgrade to a newer version without building that from sources. That's okay, but it means anyone else wanting to reproduce my steps would be forced to do the same. Everyone else has newer flex: no problem with the GNU-based toolchains on Mac OS, Solaris, and elsewhere. MinGW, Cygwin, you name it. On Ubuntu, old flex is in a package called flex-old, which is mutually exclusive with a package called flex.

I just said to heck with it; I'm just not going to actively support that platform.

Actually, that was the second to last straw. The BSD people also don't understand how compiler command line feature selection macros (cc -D_WHATEVER) are supposed to work.

If you don't have any feature selection, then you get all the symbols. The presence of feature selection macros acts in a restrictive way: intersection rather than union semantics. If you say -D_POSIX_SOURCE it means "don't give me anything but POSIX source", and so if you combine that with another such option, you get the set intersection, which is useless. I ended up using -D__BSD_VISIBLE, which is something internal that you aren't supposed to use (hence the double underscore) which has the effect of making traditional BSD functions visible even though _POSIX_SOURCE is in effect.

On GNU and some other systems, you just add -D_BSD_SOURCE and you're done: those identifiers are added to what you already selected.
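
A sketch of that additive behaviour, assuming glibc (which nowadays spells _BSD_SOURCE as _DEFAULT_SOURCE; -D on the command line and #define before the first include come to the same thing):

    /* Sketch: requesting POSIX does not take away the ability to also
       request the BSD extras on glibc. */
    #define _POSIX_C_SOURCE 200809L
    #define _DEFAULT_SOURCE             /* formerly -D_BSD_SOURCE */

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* strsep() is a BSD-ism that a bare _POSIX_C_SOURCE would hide */
        char buf[] = "a:b", *p = buf;
        printf("%s\n", strsep(&p, ":"));
        return 0;
    }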

This is how POSIX says feature selection works: see here:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2...

"Additional symbols not required or explicitly permitted by POSIX.1-2008 to be in that header shall not be made visible [by _POSIX_C_SOURCE], except when enabled by another feature test macro."

Except when enabled by another feature test macro: they are additive, damn it!

The BSD stance is that "just don't put anything on the command line and you get all the library symbols. What's the problem?" (Like, some of them are specific to your OS and they clash with some of mine?)


I don't think Rob meant stability. Rob was probably referring to the reality that modern Linux hasn't innovated itself past SVR4 by any appreciable amount.

We are still using X, still using terminals powered by control codes, etc.

Rob probably sees things like LANG and LC_ALL as bugs. His fix was UTF-8 everywhere, always. Where is Linux? Still in bag-of-bytes-o-rama.


>Rob probably sees things like LANG and LC_ALL as bugs. His fix was UTF-8 everywhere, always

The problems solved by LANG or LC_ALL are not solved by UTF-8 alone. Even if you use UTF-8 for all your input and output, there is still the question of how to format numbers and dates for the user and how to collate strings.

These things are dependent on country and language, sometimes even varying between different places in a single country (in Switzerland, the German-speaking parts use "." as the decimal separator, while the French-speaking part prefers ",").

These things are entirely independent of the encoding of your strings and they still need to be defined. Also, because it's a very common thing that basically needs to happen in every application, this is also something the user very likely prefers to set only once, in one place.

Environment variables don't feel too bad a place.
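
To make it concrete, here's a small illustration (glibc assumed, and the named locales have to be installed) of the same number and date coming out differently depending on which locale is picked up:

    #include <locale.h>
    #include <stdio.h>
    #include <time.h>

    static void show(const char *loc)
    {
        char day[64];
        time_t now = time(NULL);

        setlocale(LC_ALL, loc);
        strftime(day, sizeof day, "%x", localtime(&now));
        /* the ' flag groups digits using the locale's thousands separator */
        printf("%-12s %'14.2f  %s\n", loc, 123456.78, day);
    }

    int main(void)
    {
        show("de_CH.UTF-8");    /* something like 123'456.78 */
        show("fr_CH.UTF-8");    /* something like 123 456,78 */
        return 0;
    }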


Here in ex-USSR we have those problems too. Why not standardize decimal separators altogether worldwide? We're not dealing with feeking paper handwriting! If a number is printed on a computer display, it must look like 123456.78, not like "123 456,78"! Same goes for datetime representation.

This localization BS has spawned an entire race of nonsense, where, for example, CSV files are not actually CSV in some regions, because their values are not COMMA-separated (as the name implies), but semicolon-separated. And we, programmers, have to deal with it somehow, not to mention some obsolete Faildows encodings like CP1251 still widely used here in lots of tech-slowpoke organizations.

So: one encoding, one datetime format, one numeric format for the world and for the win. Heil UTF-8!


>not to mention some obsolete Faildows encodings like CP1251 still widely used here in lots of tech-slowpoke organizations.

as we're talking encodings: The worst file I ever had to deal with combined, within one file, UTF-8, cp437 and cp850.

I guess they had DOS and Unix machines but no Windows boxes touching that file.

This is a problem that won't go away. Many developers are not aware of how character encoding, let alone Unicode, actually works and, what's worst about this mess, they can often get away without knowing.


> If a number is printed on a computer display, it must look like 123456.78, not like "123 456,78"!

Humans find thousands separators useful. You're asking humans to give up useful things because they're hard to program.

That said, I idly wonder whether they could be implemented with font kerning. The bytes could be 123456.78, but the font could render it with extra space, as 123 456.78.

I don't know if it's possible with current font technology, and there are probably all sorts of problems with it even if it is, but it might be vaguely useful.


Humans should agree on the decimal and separator symbols, the same way that they agreed on the Indo-Arabic numerals, and symbols like + (plus) and - (minus).


Like we all agreed to the metric system already?


I get neither how a thousands separator is useful, nor what "genius" came up with the idea of making the comma a decimal separator in computers. I have nothing against either of these things in handwriting (though I personally never separate thousands), but in computing?..

I agree though that this can (and should) be solved at font-rendering level, not at an application level.


Given that the world hasn't yet agreed on if a line ends by carriage return or carriage return-line feed I would not hold out much hope on this front (although with the death of just line feed some progress on this front has been made).

See also paper sizes and electrical power outlets.


> Given that the world hasn't yet agreed on if a line ends by carriage return or carriage return-line feed I would not hold out much hope on this front (although with the death of just line feed some progress on this front has been made).

Your point's correct, but linefeed hasn't died: it's still the line ending on Unixes. Old Macs used carriage return; Windows uses carriage return plus line feed; Unix uses linefeed. I don't know what Mac OS X uses, because I stopped using Macs before it came out.


I also don't get why are you still using those miles and pounds when the rest of the world agreed on kilometres and kilograms.


I live in Canada. Before that I grew up in a metric country. Though Canada is metric, I use imperial measures here and there.

I use miles for the sport of running. This is because 1609 meters is close to 1600. Four laps around a standard 400 meter track is about a mile and everything follows from that. All my training is based on miles. I think of paces per mile. If I'm traveling abroad and some hotel treadmill is in kilometers and km/h, it annoys the heck out of me.

However, paradoxically, road signs and car speedometers in miles and miles/hour also annoy the heck out of me; though at least since I use miles for running, at least I'm no stranger to the damn things.

For laying out circuit boards, I use mils, which are thousandths of an inch: they are a subdivision which gives a metric air to an imperial measure. This is not just personal choice: they are a standard in the electronics industry. The pins of a DIP (the old-school large one) are spaced exactly 100 mils (0.1") apart, and the rows are 300 mils apart. So you generally want a grid in mil divisions. (The finer-grained DIPs are 0.05" -- 50 mils.)

There is something nice about a mil in that when you're working with small things on that scale, it's just about right. A millimeter is huge. The metric system has no nice unit which corresponds to one mil. A micron is quite small: a mil is 25.4 microns. (How about ten of them and calling it a decamicron? Ha.)

Inches themselves are also a nice size, so I tend to use them for measuring household things: widths of cabinets and shelves and the like. Last time I designed a closet shelf, I used Sketchup and everything in inches.

Centimeters are too small. Common objects that have two-digit inch measurements blow up to three digits in centimeters.

Centimeters don't have a good, concise way to express the precision of a measurement (other than the ridiculous formality of adding a +/- tolerance). In inches, I can quote something as being 8 1/16 inch long. This tells us not only the absolute length, but also the granularity: the fact that I didn't say 8 2/32 or 8 4/64 tells you something: that I care only about sixteenth precision. The 8 1/16 measurement is probably an approximation of something that lies between 8 1/32 and 8 3/32, expressed concisely.

In centimeters, a measurement like 29 cm may be somewhat crude. But 29.3 cm might be ridiculously precise. It makes 29.4 look wrong, even though it may be the case that anything in the 29.1-29.5 range is acceptable. The 10X jump in scale between centimeters and millimeters is just too darn large. The binary divisions in the imperial system give you 3.3 geometric steps inside one order of magnitude, which is useful. For a particular project, you can choose that it's going to be snapped to a 1/4" grid, or 1/8" or 1/16", based on the required precision.

So for these reasons, I have gravitated toward inches, even though I was raised metric, and came to a country that turned metric before I got here. (And of course, the easy availability of rulers and tape measures marked in inches, plus support in software applications, and the enduring use of these measures in trade: e.g. you can go to a hardware store in Canada and find 3/4" wood.)


Some things are traditionally measured in inches even worldwide, like screen diagonals or pipe diameters or, as you have noticed, mil grids. But in other cases, seeing feet, yards, miles and pounds in internet resources presumably made for an _international_ audience annoys the heck out of me. In our country (tip: Ukraine), hardly any ruler or tape measure even has inch marks; they are optional here, while centimetre marks are a must. But as soon as I see a video about something "that's 36.5 feet tall", I have to run a conversion to find out what it is in metres. Pretty much the same as the case with some foreign, non-universal character encoding (when everything I see is just garbled letters and/or squares).

P.S. And yes, my ruler is made from aluminium, not aluminum.


Aluminium is an English word used in the UK.

Both the words "aluminium" and "aluminum" are British inventions. Both derive from "alumina", a name given in the 1700's to aluminum oxide. That word comes from the Latin "alumen", from which the word "alum" is also derived.

"Aluminum" was coined first, by English chemist Sir Humphry Davy, in 1808. He first called it "alumium", simply by adding "-ium" to "alum" (as in, the elemental base of alum, just like "sodium" is the elemental base of soda), and then added "n" to make "aluminum". In 1812, British editors replaced Davy's new word with "aluminium", keeping Davy's "n", but restoring the "-ium" suffix which coordinated with the other elements like potassium.

North Americans stuck with Davy's original "aluminum".

In Slovakia, we have a nice word for it: hliník, derived from hlina (clay).


LC_* is 1980's ISO C design that is unaware of things like, oh, threads. What if I want one thread to collate strings one way, and another to do it another way? Could easily happen: e.g. concurrent server fielding requests from clients in different countries.

Also, how on earth is it a good idea to make the core string routines in the library be influenced by this cruft? What if I have some locale set up, but I want part of my program to just have the good old non-localized strcmp?
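
For what it's worth, strcmp() itself stays a plain byte comparison; it's strcoll(), strcasecmp() and the <ctype.h> functions that follow the locale. A quick sketch (glibc assumed, locale installed) of the same two strings comparing differently once LC_COLLATE is set:

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *a = "a", *b = "B";

        printf("strcmp:  %d\n", strcmp(a, b));    /* > 0: plain byte order */

        setlocale(LC_COLLATE, "en_US.UTF-8");
        printf("strcoll: %d\n", strcoll(a, b));   /* typically < 0 here    */
        return 0;
    }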

The C localization stuff is founded on wrong assumptions such as: programs can be written ignorant of locale and then just localized magically by externally manipulating the behavior of character-handling library routines.

Even if that is true of some programs, it's only a transitional assumption. The hacks you develop for the sake of supporting a transition to locale-aware programming become obsolete once people write programs for localization from the start, yet they live on because they have been enshrined in standards.


I still don't understand how encodings find their way into the localization. I understand that date/time/number formatting is localizable. I don't understand why "LC_TIME=en_GB.UTF-8" would be a different option from just "en_GB".

Can I really expect it to work if I set

"LC_TIME=en_GB.encA" and "LC_MONETARY=en_GB.encB"

How would the two encodings be used? How would they be used in a message consisting of both monetary and datetime?

Should the setting not be one for encoding (selected from a range of encodings), then settings for formatting and messages (selected from ranges of locales), then finally a setting for collation which is both a locale and an encoding? Or is the linux locale system simply using these as keys, so in reality there is no difference in LC_TIME whether you use encA or encB, it will only use the locale prefix en_GB?


>How would the two encodings be used? How would they be used in a message consisting of both monetary and datetime?

Full month names would be encoded in encA. Currency symbols in encB. Is it a good idea? No.

>Should the setting not be one for encoding (selected from a range of encodings), then settings for formatting and messages (selected from ranges of locales), then finally a setting for collation which is both a locale and an encoding?

I would argue an encoding setting should not be there to begin with or at most be application specific because that really doesn't depend on system locale (as long as the signs used by the system locale can be represented in the encoding used by the application).

I was just explaining why LC_* should exist even on a strictly UTF-8 everywhere system. I never said storing the encoding in the locale was a good idea (nor is it part of the official locale specification - it's a posix-ism)


What I hate is that the locales assume that date & number preferences are specific to one's physical location. I live in America, but I prefer German (9. March 2016 or 09.03.16) or British (9 March 2016 or 9/3/16) dates.

It's even worse when things assume that my date preferences reflect my unit preferences. I prefer standard units (feet, pounds, knots &c.) and British/Continental dates: I don't want to use French units, nor do I want to use American dates. And yet so much software assumes that it's all or nothing.


I agree with Rob's "UTF-8 everywhere". I took this approach in the TXR language. Its I/O streams output and input UTF-8, and only that. Period. (There is no virtual switch for alternative encodings.) Internally, everything is a wide character code point. I do not call the "fuck my C program function" known as setlocale, and no behavior related to character handling or localization is influenced by magic environment strings.

LANG and LC_ALL are the work of ISO C and POSIX; they are not the fault of Linux. Linux has these in the name of compliance; they were foisted upon the free world, essentially.


That and getting rid of the TTY altogether.

We aren't using punched cards

EDIT: people hate when I say this, which amuses me. The TTY must die !!!!


> The TTY must die!!!!

Being sight-impaired, I have to disagree strongly! The TTY is the only thing that lets me adjust the font size of all programs running in it without going through lots of trouble.

(BTW: didn't downvote your comment.)


Plan9 has an environment variable $font that sets the system font for everything. The windowing system can even start inside one of its own windows so you can have different fonts for different sets of programs, all settable at runtime.

The TTY must die.


Browsers let you do that. KDE also does, so do other environments. For a quick hack, set your screen DPI to 50.


Set your minimum font size to 32 points and browse the web for a while! Let me know how it feels!


Browsers can set default zoom, not just font size.


Doesn't help. When the zoom factor is big enough, you have to scroll sideways while reading.

Anyway, I have tried a lot of things over the years and nothing even comes close to using a text interface.

To name a few nuisances: controls moving outside of the screen, overlapping elements in web content, unreadable buttons, unclickable input fields, tiny fonts in menus, etc. Nothing of this happens with text interfaces.

Thanks for your input, though!


For reading, consider installing beeline reader (yes the name is stupid-ish), in plugin or bookmarklet form.


Seems like your mouse never broke (or, if you have a wireless one, the battery in it has never died, and if it did, you could immediately replace it).

Or are you the type that does everything on a touchscreen? Because, judging from your logic, traditional computer controls must die too...


The high frequency of mouse failure, the high cost of downtime and the low price of mice suggest that having spare mice makes sense.

By your logic, I would be stranded at the side of the road wishing I had a spare tyre.


That's not quite a correct analogy, because modern systems allow you to do things without a mouse. Still, there are some individuals who obviously strive to take those possibilities away and make the mouse as essential as a tyre. I'm really happy that's not the case yet.


Personally I like doing text-only work (such as programming) in a TTY with Emacs. I'd like to have a machine with no X at all, but unfortunately there aren't many good framebuffer-friendly graphical tools (fbi is good for viewing images, but doesn't support gifs; vlc is ok for playing videos, but the ncurses UI is too buggy; most modern websites are barely usable with links2, while netsurf is too mouse-oriented for my tastes).


How do you operate a command line without something similar to the tty? Windows doesn't have a tty-style interface, and as a result its command prompt has been even more primitive.

(Powershell ISE is something else .. once it actually loads)


Then the commercial systems must have been horrible. Take a look at GNU code, then at Plan 9 code. Combine a few GNU core utils, and you have more code than the whole Plan 9 kernel. Granted, Plan 9 came out a little past 1990.


Only by not using subordinate clauses did you just avoid saying "plan 9" and "commercial" in the same sentence! Where is Plan 9 deployed? Who are the customers?

Plan 9 is a strawman representative of "commercial Unix".

> Combine a few GNU core utils, and you have more code than the whole plan 9 kernel.

When you actually sit down and think of the cases that can occur, that translates into code. Handle this, handle that, and the KLOC's just pile up.

Speaking of kernel size, what makes a kernel code base big? Why, device drivers for lots of hardware.


While it is only a few:

Sydney Olympics lighting system was Plan9 based.

Inferno was used by NASA JPL projects

Lucent use a real time version of plan9 in phone masts

Coraid use Plan9 on their NAS servers

Researchers at LANL and IBM use plan9 on Blue Gene and other supercomputers

I have worked for two plan9 based companies - ok they didn't survive but we tried :)

The international plan9 conferences drew about 30 people. People from commercial enterprises used plan9 in their workflows. Plan9 was my desktop while building a successful recruitment website.


Lonely comment is lonely.

Literally halfs of dozens of research projects and ones of promotional installations served! Nearly threes of dozens attended conferences, at which twos of booths were no doubt tabled, perhaps both by you, one of the only persons who apparently used Plan 9 commercially.

I'm feeling nostalgic enough to go launch an inferno instance now just on principle.


It's unadopted, but this does not mean it is bad. GNU/Linux is the worst of all and survives only because it's widely adopted and better marketed. Many who turn to the Unix world first encounter GNU/Linux. GNU/Linux is, quality-wise, inferior to both Plan9 and the BSDs; it's a big hack, but it came before, and got adopted first.

Now I downvoted all your comments in this thread for they are unconstructive both in the negative and the positive directions. This is a fanboy-like attitude, where you ignore the fact I explained above, and attack other comments. You take quantity over quality.

The BSDs and other systems have their user bases. Those may be small, but they exist. Both GNU/Linux and the BSDs are inferior to the ideal system in which most legacy cruft would be gone, but in order to reach that ideal system we should develop the research projects, the ones with little-to-no use. E.g. Plan9. Or microkernels. The all-UTF-8 approach is perfect, but it can't easily propagate to the mainline unless it is tested for a long time in research projects, while the ecosystem adapts in that timeframe. So we'd rather not attack them, but let them happen. They'll always be better than the mainline, though lesser-adopted, and when they die, the good parts of them will propagate to GNU/Linux, BSD, etc. Take ZFS, for example: it was developed at Sun, it's not widely adopted, but it's now on FreeBSD and Linux (btrfs follows the same concept), for you to enjoy. Or the research in functional languages: many of those are not adopted, but many of their features are now propagating to mainstream languages.


Linux came before BSD?

Please become better informed: https://en.wikipedia.org/wiki/Unix_wars#BSD_and_the_rise_of_...

"BSD purged copyrighted AT&T code from 1989 to 1994. During this time various open-source BSD x86 derivatives took shape, starting with 386BSD, which was soon succeeded by FreeBSD and NetBSD."

BSD was an OS long before 1989; the open-source BSD's weren't new projects written from scratch, but made possible by purging AT&T copyrighted code from the code base.

Linux (the kernel) only started in 1991, from scratch. The GNU parts that go into a "GNU/Linux" --- the GNU C compiler and utilities from the GNU project --- started in 1984. But that is still later than BSD. 1BSD was released in 1978: [https://en.wikipedia.org/wiki/Berkeley_Software_Distribution...]


Oh thanks. And who invented vi? Bram Moolenaar!

Seriously, it should be obvious that the BSDs mean, in the context of my comment, the modern BSDs. The GNU/Linux environment was practically usable before those were. Your comment is pure evil rhetoric.


Plan9 was only ever an experiment, labelled as a Research OS.

I would still say it was a successful experiment.


Hey, this isn't really a direct response to any of the points you made, but I've been thinking about your get-rid-of-the-tty comment, and wanted to ask.

What is your take on syntax highlighting in a plan9 world? I've read a list of questions somewhere about p9, and remember this one being asked and having a typically abrasive response, along the lines of "just don't". And I often have a lot of time for those kind of arguments - embrace minimalism. But I regularly (daily) find syntax highlighting to be super-useful for highlighting small errors. What's your take? It seems like a regex-ey problem. Could it be done in a way that was within the spirit of such a system?


Just replying to point out that the grayed out comment above is the correct one.

There is ugliness in coreutils, but it is mature, functional, proven ugliness. A lot of it is even there for a reason.

It's not difficult to make an elegant toy in isolation.


Check out the different implementations of echo across various operating systems: "https://gist.github.com/dchest/1091803".


Do you even know what coreutils is? It is not openbsdutils. It is not plan9utils. It is not even linuxutils. It builds and runs on virtually every deployed vaguely POSIX-ish environment and then some.

Can I take the OpenBSD userland and untar, configure, build and run it in Cygwin? Nope. You have proven my point. Nobody uses the little SysV version.

One hint as to why the GNU version is so "long" and "messy":

  /* System V machines already have a /bin/sh with a v9 behavior.
     Use the identical behavior for these machines so that the
     existing system shell scripts won't barf.  */
  bool do_v9 = DEFAULT_ECHO_TO_XPG;
It has to run in environments that others do not in order to provide full functionality, so it has to implement that functionality.

And Unix Fifth Edition, imbued with the cleanliness of Ken Thompson's ghost? Yeah, that's lovely, but not only is it, again, of limited utility and portability across idiosyncratic modern environments, it's also full of bugs dating to an era where simplicity was valued above handling all inputs. In 1983, crashing on a bad input wasn't even generally understood to BE a bug, much less extremely dangerous, especially in core system utilities.


I've used the -e option in GNU's echo many times. The various other versions are strictly less useful to me.

Does the option really belong in echo? Who knows, but it's certainly been useful to me.

UNIX fifth edition goes for absolute minimalism. Echo in Plan 9 is apparently used enough that it's worthwhile to optimize the number of times write is called. FreeBSD echo looks like someone just learned about writev. OpenBSD's seem like the sanest of the minimalists.

What's the takeaway for you?



I've used -e on GNU's `echo` quite a bit as well. But FreeBSD also supports `printf` (as does Linux, by the way), so the missing -e flag on FreeBSD's `echo` isn't a great inconvenience.

    $ printf "This\tis an\nexample\n"
    This    is an
    example
You can also use the usual C-style printf format specifiers:

    $ printf "This is %-20s example\n" another
    This is another              example


To delimit URLs, use square brackets: [https://gist.github.com/dchest/109180].

There is an RFC standard way of quoting URLs and addresses, namely angle brackets. HN doesn't implement it, though:

<https://gist.github.com/dchest/109180>.

See? The closing > is included in the URL, stupidly.

The convention first appeared in [https://www.ietf.org/rfc/rfc1738.txt] in 1995, with Tim Berners-Lee as the top author.


As a result, the specification for echo is a ridiculous mess which deviates from the other POSIX utilities. It doesn't have the normal command argument handling, and also has implementation-defined behaviors. It's a good command to avoid in serious scripting, other than for outputting fixed, alphanumeric strings.

That said, it's useful to have features like C escape sequences for control characters or arbitrary characters. That feature should be in the shell language. Someone needed it and hacked it into some version of echo. Others elsewhere didn't and so now it's implementation defined whether or not you get backslash processing.


Browsers seem to include the end quote when clicked. It must be removed for the URL to work.


That link is not working for me, Github is throwing a 404.


The link has an extraneous `"` character at the end. It works fine if you remove it.


Last I checked, a call to fgetwc(3) on Linux crashes as soon as I actually enter a non-ASCII character, with a locale of en_US.UTF-8.
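
For reference, the usual setup I'd test against looks like this (a minimal sketch, glibc assumed); without the setlocale() call the stream stays in the C locale and multibyte UTF-8 input fails with EILSEQ rather than decoding:

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        wint_t wc;

        setlocale(LC_CTYPE, "");     /* pick up e.g. en_US.UTF-8 from the env */
        while ((wc = fgetwc(stdin)) != WEOF)
            wprintf(L"U+%04X\n", (unsigned)wc);
        return 0;
    }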


I've been trying to teach myself some unicode code points because I'm getting sick and tired of continually Googling them and copying and pasting the result or bringing up a symbol character table.

In fact, I'd say keyboards are woefully out of date.

Specifically, I keep looking up † dagger (U+2020) and ‡ double-dagger (U+2021) for footnotes, black heart (U+2665) to be romantic, black star (U+2605) to talk about David Bowie's last album and ∞ to talk about actual non-finite entities.

I only found out recently that Ctrl+Shift+u followed by the Unicode hexadecimal outputs these in Ubuntu, presumably all Linuxen. AltGr+8 is great for diaeresis while we're at it, so you can go all hëävÿ mëtäl really easily.

edit: black heart and star are not making it through, why Lord, why?!


I have a stupid little 'clip' program I wrote that has a dictionary of common texts that I can call by name and have added to the clipboard.

    $ clip lod
    $ pbpaste
    ಠ_ಠ
Maybe you can do the same without needing to remember code points. Something like TextExpander would accomplish the same thing.


I don't think that's stupid. I think that's a great idea. I might have to use 'xclip' to go about making something equivalent.
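
A rough xclip-based equivalent might be as small as this (a sketch, assuming xclip is installed; it reads the new selection contents from stdin):

    #include <stdio.h>

    int main(void)
    {
        /* xclip reads the new clipboard contents from stdin */
        FILE *p = popen("xclip -selection clipboard", "w");
        if (!p)
            return 1;
        fputs("ಠ_ಠ", p);
        pclose(p);
        return 0;
    }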


> I might have to use 'xclip' to go about making something equivalent.

Right now my code just shells out to pbcopy on Mac, but you may be interested in pyperclip[1] which provides cross-platform access to the clipboard.

[1] https://github.com/asweigart/pyperclip



On OS X, if you type Command+Control+Space, it brings up a character insertion menu where you can search by character name. I can get both daggers, black star and black heart quite quickly that way.


You can also set up custom text macros in the keyboard preferences, which are a bit faster to input. I have :lod: mapped to ಠ_ಠ...


Also on OS X, † is option-t, and ‡ is option-shift-7.


OK, on Linux I have found ‡ and †

† is AltGr-Shift-%, and ‡ AltGr-Shift-:

I'll never remember them :(


U+2020 (†) and U+2021 (‡) aren't that hard to remember for the sake of a few extra key-presses and wider compatibility.


Draw them onto the keyboard next to the % and : keys. I did that with Korean characters until I got the hang of them


There's a program called gucharmap which does something similar on Linux; hotkeys probably vary, though.

http://paste.click/MYRVrF


Another really handy thing is the Compose key. If you're using GNOME it's under Keyboard Settings, under Shortcuts / Typing. I have it set to Right Alt. The idea is there's just a whole bunch of memorable key sequences for various common Unicode characters. For example, Alt + o + o = °; < + 3 = black heart, < + " = “, etc. It doesn't have all of the ones you like, but it's helpful :)


Three years ago, I wrote a review of three programs that simulate a Compose key on Windows. I included some history behind that key, as well.

https://windows.appstorm.net/roundups/utilities-roundups/add...


Very cool, thanks. I found a page that shows all the key combinations:

[0]: https://help.ubuntu.com/community/GtkComposeTable


I usually map my insert key to compose -- I never use 'insert' for the default functionality, but I also don't type funny characters often enough to justify getting rid of an alt or control key.


I love reading about how different people map different things. For me, it would be a disaster to map insert since I use that for pasting (shift+insert--I'm a lefty and it's a bad habit I grew attached to years ago).

My choice for Compose is the right Windows key, which I think I eventually settled on because I use the left one in some keybinds (winkey+s for shell, etc.) and like you, I couldn't part with an alt or ctrl. I've often wondered what other folks tend to use.

To the grandparent: I'm sometimes amused by what Compose defines. There's ∞ (compose + 8 + 8), (compose + # + #), and oddly (compose + C + C + C + P). I think it may depend on system configuration, but I believe libX11 is responsible. (On my system, Arch, the key combinations appear to be documented under /usr/share/doc/libX11/i18n/compose for my locale.)


That Ctrl+Shift+u hint is nice. Now I can type all the time I want without having to browse for the emoji page to copy it.

And it sucks that I have to use so much that I know the code point for it (1F4A9) off the top of my head. :-(

Edit: I'm definitely putting in U+1F4A9 (the PILE OF POO character), but apparently hacker news strips it out. I'm guessing it's filtering everything that has a symbol character class?


Yes! The Ctrl+Shift+u hint is nice. I can't believe I only just learnt it. How many _years_ have I been Googling unicode characters for? I am ashamed to think.

I am glad PILE OF POO does not work for you.

does (U+2603) snowman work?

edit: noooo, no snowman


And still U+5350 works. Better than a thousand words sometimes.


A few months ago, I had the idea to remake the old Space Cadet keyboard. One change was to make the bucky bits (e.g. control, alt, meta, super, etc.) allow you to type unicode characters instead of APL characters. Other than that and having lower case parentheses (not needing to use shift to type ( or ) ), the keyboard would be like any other mechanical keyboard.


Did you follow through on it? I have wanted a modern space cadet keyboard for a long time, but the current trend in keyboards seems to be to have fewer keys, not more (a trend that just doesn't make much sense to me).


I think I came up with the idea about 6 months ago. It is on the back burner until I can find a way to mount the switches (Cherry MX Blues (had enough of those lying around)) [0] without resorting to a PCB. Any ideas on that front are welcome.

Regarding the number of buttons, the Space Cadet had 100 buttons and no number-pad [1], whereas most modern keyboards have 104 buttons. I suppose I could add a number-pad to my design (117 buttons), but then I could also use that area for extra user-definable buttons (20 buttons in a 4x5 grid -> 120 buttons). The Space Cadet is a bit larger than most IBM-style keyboards, so more keys means more real-estate; this is not to mention yet further divergence in design from the original Space Cadet keyboard.

Beyond hardware issues, there are software issues to resolve, like whether to include the macro functionality of the original. I can't find any documentation on how it worked, so I get to start from fresh.

[0] = I really wanted to use some hall-effect switches, but nobody makes them anymore, because they are allegedly the most luxurious switches ever. I would probably have to tear apart an original Space Cadet keyboard to get some. Thus, I would probably just use Cherry MX switches.

[1] = https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/Sp...


That's slightly ambiguous. Is your definition of a "modern space cadet" keyboard that it has more modifier types? Or that it has more keys? (Bear in mind that one could go back to a 101-key U.S. PC/AT keyboard and still have more keys than the 100-key space cadet keyboard.)

For keyboards with more keys (and yet no vendor-defined private HID usages), one can look at the keyboards available in Brazil. Some of the "multimidia" keyboards from the likes of Multilaser, C3 Tech, and Leadership have the 107-key Windows ABNT2 physical layout, with anywhere up to 20 further multimedia keys.

But these keyboards don't have keys engraved with more modifier types beyond the usual five.


I am referring to a keyboard with a multitude of extra keys. The layout doesn't have to be exactly like the original space cadet keyboard of course, but having some 50 extra keys surrounding the normal qwerty portion, as well as some 10 modifier keys is basically what I'm looking for.


As someone who programs in modern APL (no joke, check out Dyalog), I think this is terrible. Besides, the APL characters are just the Greek characters, and all sorts of things use those.


The original keyboard design was meant to be used to type the characters present in the APL code page [0]. My intention was that one would still be able to type the same (or perhaps mostly the same) characters as the original keyboard, but using Unicode instead of the APL encoding (which is based on EBCDIC, yuck).

My design may or may not contain the same keys, because I am not sure how many would want the original APL character set. APL keyboards are available, so the market exists [1], but that doesn't mean much. I plan to have the micro-controller user-configurable and replaceable, so one could change what symbols were type-able with the same keyboard. As I expect to use UTF-8, this keyboard could be used to type any Unicode character.

[0] = https://en.wikipedia.org/wiki/APL_(codepage)

[1] = https://geekhack.org/index.php?action=dlattach;topic=69386.0...


You can make your own keyboard layouts with X11.

This is mine:

https://github.com/jleclanche/dotfiles/blob/master/X11/xkb/s...


> In fact, I'd say keyboards are woefully out to date.

I wrote a virtual terminal subsystem a while ago. I gave it keyboard layouts with the ISO 9995-3 common secondary group. No daggers, alas. But ISO 9995-3 does have pretty much all of the combining diacritical marks. <Group2> <Level3>+D05 is combining diaeresis. In practice I find myself not appreciating that as much as I appreciate being able to type U+00A7 as <Group2> <Level2>+C02.


I have an Alfred workflow that fuzzy-searches through all unicode characters by name and inserts the character when selected. All it takes is a good interface to make it fluid.


XCompose is your friend: black heart suit is Compose-< 3, and with additional configuration[1] dagger can be Compose-| -, double dagger Compose-| = and black star Compose-S S.

[1] https://github.com/cofi/dotfiles/blob/master/XCompose
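
For completeness, the corresponding ~/.XCompose entries would look roughly like this (untested sketch; keysym names from memory):

    include "%L"
    <Multi_key> <bar> <minus> : "†"  U2020  # DAGGER
    <Multi_key> <bar> <equal> : "‡"  U2021  # DOUBLE DAGGER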


Ctrl-Shift-u also works in GIMP, even on Windows. I guess it's a GTK feature.


Wouldn't it be better if all those dangerous escape sequences (like the Application Program Command, redefining function keys, alternate character sets, etc.) were disabled by default in xterm? Anyone using the obsolete software that uses them could enable them if they wish.


Repeat after me: UTF-8 is the sane default in this day and age. This is a good change.

The whole "the ISO 6429 C1 control code 'application program command'" thing is a bit surprising though. (I'm guessing this change doesn't actually avoid this directly? If you sent an APC it'd still do it, it's just that APC is multiple bytes in UTF-8, and hopefully a bit rarer?)

> Reinterpreting US-ASCII in an arbitrary encoding

This will likely work — at least, I thought so. The vast majority of encodings are a superset of ASCII, so reinterpreting ASCII as them is valid. The only one I know of that isn't is EBCDIC, and I've never seen it used. (Said differently, non-superset-of-ASCII codecs are incredibly rare to encounter, so the above assumption usually holds.) (The reverse, reinterpreting arbitrary data as ASCII, is not going to work out as well.)

Though it is rather horrifying how easy it is to dump arbitrary data into a terminal's stream. Unix does not make this easy for the program. The vast majority of programs, I'd say, really just want to output text. Yet they're connected to a terminal. Better, perhaps, if a program could say "I'm outputting arbitrary binary data", or even "I'm outputting application/tar+gzip"; the terminal would then know immediately not to interpret this input. And in the case of tar+gzip, it would have the opportunity to do something truly magical: it could visualize the octets (since trying to interpret a gzip as UTF-8 is insane); it could even just note that the output was a tar and list the tar's contents like tar -t. If the program declares itself aware, like "application/terminal.ansi", then okay, you know: it's aware; interpret away.

But it doesn't, so it can't. Part of the difficulty is probably that the TTY is both input and output (not that the input can't also declare a mimetype or something similar). And the vast majority of programs don't escape their user input before sending it to a terminal; it's like one giant "terminal-XSS" or "SQL-injection-for-your-terminal". And it is probably unreasonable to expect it; I don't really know of any good libraries around terminal I/O; most programs I see that do it assume the world is an xterm and just encode the raw bytes, right there, and pray w.r.t. user input.

Catting the Linux kernel's gzip into tmux can have consequences ranging from "lol" to "I guess we need a new tmux session".

It was also just today that I discovered that neither GNU's `ps` nor `screen` supports Unicode, at least for characters outside the BMP.


UTF-16 isn't a superset of ASCII, for one. Doesn't seem that anyone uses a native UTF-16 terminal, but if you're trying to use grep or whatnot on a UTF-16 encoded file, it'll happily silently not do what you want...


畂桳栠摩琠敨映捡獴!


唀吀䘀ⴀ㄀㘀 戀礀琀攀猀眀愀瀀猀 愀爀攀 愀氀猀漀 昀甀渀.


This is really great! Just a few days ago I got very confused when I saw tofu characters in xterm and had to switch to uxterm to see them (or set some locale flag in my home dir).


UTF-8 must be the default and only encoding. Why does anything else still exist?


Because you don't want to give the geniuses who came up with stuff like "Han Unification" a monopoly on encoding.


Yes, but UTF-8 with or without a byte order mark? ;-)


Without. A BOM (when used for UTF-8) is obsolete crap invented by necrosoft in order to make their software incompatible with everything normal.


It's not a Microsoft invention, and MS's use of it is really quite sensible. They had a problem of distinguishing UTF-16, UTF-8 and non-Unicode (possibly a single-byte "extended ASCII" type encoding, possibly some multi-byte monstrosity) text files. Since UTF-8 and ASCII-compatible encodings look similar when there aren't many >U+007F characters in use, and identical if none are in use, they could get confused. Prepending a Byte Order Mark solves this problem, in that it makes a file unambiguously UTF-8 (or UTF-16, for that matter).
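
For reference, the byte signatures involved are just the encodings of U+FEFF, so the sniffing amounts to something like this sketch (my own illustration, not anyone's actual code):

    #include <stdio.h>
    #include <string.h>

    static const char *sniff(const unsigned char *b, size_t n)
    {
        if (n >= 3 && !memcmp(b, "\xEF\xBB\xBF", 3)) return "UTF-8 (BOM)";
        if (n >= 2 && !memcmp(b, "\xFF\xFE", 2))     return "UTF-16LE (BOM)";
        if (n >= 2 && !memcmp(b, "\xFE\xFF", 2))     return "UTF-16BE (BOM)";
        return "no BOM: could be anything";
    }

    int main(void)
    {
        unsigned char buf[4];
        size_t n = fread(buf, 1, sizeof buf, stdin);
        puts(sniff(buf, n));
        return 0;
    }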


How do you have a BOM in the shell?


Some masochistic M$ fan could invent even that, just to justify the difference from the civilized world.


ANSI must be the default and only encoding. Why does anything else still exist?


Because, you see, not everyone in the world uses Latin characters. UTF-8 must become the new standard instead of that whole obsolete encoding zoo.


And in 20-30 years we'll likely be saying the same about UTF-8.

I figured that "ANSI" would give away that I wasn't being serious since it's not actually an encoding.


> And in 20-30 years we'll likely be saying the same about UTF-8.

Well... If we will, why not? But the thing is that in 20-30 years we won't be able to invent any new writing systems that UTF-8 won't cover. Single-byte encodings were doomed because of their single-byteness. The same awaits fixed two-byte encodings like UCS-2 - we already have extended code points for something that glamour hipsters call "emoji". A variable-length encoding will never become obsolete.


> But the thing is that in 20-30 years we won't be able to invent any new writing systems that UTF-8 won't cover.

I think you underestimate humanity's aptitude at creating things that don't fit into well defined standards.

My (admittedly poorly stated) point wasn't that we shouldn't be moving everything over to UTF-8. I personally use it wherever possible just because it makes life easier. My point was that there are decades of things that use US-ASCII or another one of the overlapping but incompatible encodings, because they were the RightThing™ to use at the time, and there's no way we're going to get rid of everything non-UTF-8 any time soon.

In 20-30 years we'll be saying "Why isn't everything in FutureText-64, it should be the only encoding. Why does anything else even exist?", and it'll be because we're saying the same about UTF-8 now.


I think you miss the point. When CP1251, KOI8-R and other crazy incompatible things came around, they came around because there was a need: ASCII didn't provide a necessary character set. Now that we have Unicode, which embodies virtually all character sets existing on Earth, we don't _really_ need non-Unicode encodings, or even the fixed-width UTF variants. So a move to any hypothetical FutureText-64 would actually give no practical gain, unlike the move from single-byte encodings to, for example, UCS-2, and then from UCS-2 to UTF-8.

But my main point is a different one: eliminate the whole zoo of single-byte and fixed-width encodings and leave one universal encoding. When (if ever) it's time to replace it, we'll do it all at once, without those crazy iconvs everywhere.


Unicode is currently limited to 21 bits for compatibility with UTF-16. Eventually we might manage to exhaust all available codepoint space, and with that we'd have to move to yet another encoding with a whole new kind of surrogate pairs. Though UTF-8 could originally handle 31 bits, that's no longer the case.


So I see 2 steps here: dropping UTF-16 altogether (well, already, because there are plenty of extended codepoints above 0xFFFF), and when approaching the 31-bit limit - inventing something like "zero-width codepoint joiner" to compose codes of arbitrary length.

For example, in a hypothetical alien language, a hypothetical character "rjou" would have a code 0x2300740457 (all the previous codes are exhausted). We can't express this with a single code, so actually we split it into 2-byte parts and write "#" (0x0023), joiner, "t" (0x0074), joiner and "ї" (cyrillic letter yi, 0x0457). As we have a joiner between these codes, we know that we must interpret and display them not as a "#tї" sequence but as a single alien "rjou" character. I think you get the idea.


> dropping UTF-16 altogether (well, already, because there are plenty of extended codepoints above 0xFFFF)

UTF-16 can handle stuff above U+FFFF just fine, it encodes that with surrogate pairs. Are you thinking about UCS-2?

The 21-bit limit for Unicode comes from the limits of UTF-16's surrogate pairs.
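
A worked example of that derivation, since it's short (my own sketch, using U+1F4A9 from upthread):

    #include <stdio.h>

    int main(void)
    {
        unsigned cp = 0x1F4A9;                 /* the PILE OF POO from upthread */
        unsigned v  = cp - 0x10000;            /* 20 payload bits               */
        unsigned hi = 0xD800 | (v >> 10);      /* high surrogate                */
        unsigned lo = 0xDC00 | (v & 0x3FF);    /* low surrogate                 */

        printf("U+%X -> %04X %04X\n", cp, hi, lo);   /* U+1F4A9 -> D83D DCA9 */
        /* ceiling: 0x10000 + (1 << 20) - 1 == 0x10FFFF, hence the 21 bits */
        return 0;
    }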


Great! Now just drop the embarrassing man(1) page reference, and you can call it modernized.

Wow, I'm surprised that the people whose buttons this pushes are able to make(1) a HN account, let alone have enough points to downvote.

Think about it. There is only one man page for xterm. If you type "man xterm" with no section number, you get that man page. If there existed an xterm(7) page, you'd still get the xterm(1) man page by default. So why the hell write the (1) notation every time you type the word xterm?

Man page section numbers are not useful or relevant, by and large, and mentioning them only adds noise to a paragraph.

Even stupider is when the worst of the Unix wankers write man page section numbers after ISO C function names. Example sentence: "Microsoft's malloc(3) implementation is found in MSVCRT.DLL". #facepalm#


>Think about it. There is only one man page for xterm. If you type "man xterm" with no section number, you get that man page. If there existed an xterm(7) page, you'd still get the xterm(1) man page by default. So why the hell write the (1) notation every time you type the word xterm?

Because the convention exists to define the type of the component. It's a handy convention, and I'm betting there are a few people reading this who have never used anything other than GNOME Terminal, so appending the section number immediately helps the reader place the component; otherwise they'd have to look it up, etc.


So, if I don't know anything but Gnome terminal, and don't know what xterm is, if I see "xterm", I have to look it up. However, if I see "xterm(1)", I don't have to look it up?

Strange.

(And how did I get to the situation in which I know what (1) means, yet I only know Gnome terminal and don't know what xterm is?)

(What about the fact that xterm(1) is also a hyperlink in the submitted page? You could change the anchor text to "xterm(foo)" and it would still navigate to the correct man page with one click.)


Unix has got much bigger problems than this.


OpenBSD's malloc(3) implementation is found in sys/kern/kern_malloc.c, and OpenBSD's malloc(9) implementation is found in lib/libc/stdlib/malloc.c


The reverse actually...


Of course the reverse; why would the traditional (3) section be suddenly taken over by kernel functions, and libc stuff moved to (9)?


Huh. I always thought those parenthesized numbers after unix commands were version numbers.


Don't take the downvotes personally, it's just uninteresting content getting moderated.



