
> I use lots of characters that look like ASCII but are in fact not ASCII but nonetheless accepted as valid identifier characters.

Clever, I was wondering how the : was done, but it's an abomination :-/

With some simple improvements to the language, about 99% of the C preprocessor use can be abandoned and deprecated.



Walter, D has conditional compilation, versioning and CTFE without a preprocessor, so I guess that covers the 99% "sane" functionality. Where do you draw the line between that and the 1% abomination part, i.e. your thoughts on, say, compile time type introspection and things like generating ('printing') types/declarations?


The abomination is using the preprocessor to redefine the syntax and/or invent new syntax. Supporting identifier characters that look like `:` is just madness.

Of course, I've also opined that Unicode supporting multiple encodings for the same glyph is also madness. The Unicode people veered off the tracks and sank into a swamp when they decided that semantic information should be encoded into Unicode characters.


What other kind of difference should be encoded into Unicode characters? For example, the glyphs for the Latin a and the Cyrillic а, or the Latin i and the Cyrillic (Ukrainian, Belarusian, and pre-1918 Russian) і look identical in practically every situation, and the Latin (Turkish) ı and the Greek ι aren’t far off. At least not far off compared to the Cyrillic (most languages) д and the Cyrillic (Southern) g-like version (from the standard Cyrillic cursive), or the Cyrillic т and the several Cyrillic (Southern) versions that are like either an m or a turned m (from the cursive, again). Yet most people who are acquainted with the relevant languages would say the former are different “letters” (whatever that means) and the latter are the same.

[Purely-Latin borderline cases: umlaut (is not two dots in Fraktur) vs diaeresis (languages that use it are not written in Fraktur), acute (non-Polish, points past the letter) vs kreska (Polish, points at the letter). On the other hand, the mathematical “element of” sign was still occasionally typeset as an epsilon well into the 1960s.]

Unicode decides most of these based on the requirement to roundtrip legacy encodings (“have these ever been encoded differently in the same encoding?”), which seems reasonable, yet results in homograph problems and at the same time the Turkish case conversion botch. In any case, once (sane) legacy encodings run out but you still want to be consistent, what do you base the encoding decisions on but semantics? (On the other hand, once you start encoding semantic differences, where do you stop?..) You could do some sort of glyph-equivalence-class thing, but that would still give you no way to avoid unifying a and а: everyone who writes both writes them the same.

None of this touches on Unicode “canonical equivalence”, but your claim (“Unicode supporting multiple encodings for the same glyph is [...] madness”) covers more than just that if I understood it correctly. And while I am attacking it in a sense, it’s only because I genuinely don’t see how this part could have been done differently in a major way.


It's a good question. The answer is straightforward. Let's say you saw `i` in a book. How would you know if it is Latin or Cyrillic?

By the context!

How would a book distinguish `a` as in `apple` from `a` as in `a+b`? (Unicode has a separate letter a from a math a.)

By the context!

This is what I meant by Unicode has no business adding semantic content. Semantics come from context, not from glyph. After all, what if I decided to write:

(a) first bullet point

(b) second bullet point

Now what? Is that letter a or math symbol a? There's no end to semantic content. It's impossible to put this into Unicode in any kind of reasonable manner. Trying to do it leads one into a swamp of hopelessness.

BTW, the attached article is precisely about deliberately misusing identical glyphs in order to confuse the reader, because the C compiler treats them differently. What better case that semantic content for glyphs is a hopelessly wrongheaded idea?


I'm obviously not Walter, but I have a succinct answer that may upset a few people, but avoids a lot of confusion at the same time.

The idea of a letter in an alphabet and a printable glyph for that letter are two different ideas. Unicode could have and probably should have had a two-layer encoding where the letters are all different but an extra step resolves letters to glyphs. Where one glyph can represent more than one letter, a modifier can be attached to represent the parent alphabet so no semantic information is lost. Comparison for "same character" would be at the glyph level without modifiers, and we could have avoided a bunch of different Unicode equivalence testing libraries that have to be individually written, maintained, and debugged. Use in something like a spell checker, conversion to other character sets, or stylization like cursive could have used the glyph and source-language modifier both.
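
To make that two-layer idea concrete, here's a rough sketch in C of what such a representation might look like (purely hypothetical; the names and the comparison rule come from the description above, not from how Unicode actually works):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical two-layer character: a glyph id shared by all look-alike
       letters, plus a modifier naming the parent alphabet. */
    enum alphabet { LATIN, GREEK, CYRILLIC };

    typedef struct {
        uint32_t glyph;          /* shape, e.g. one id for the "A" shape  */
        enum alphabet alphabet;  /* which alphabet the letter belongs to  */
    } two_layer_char;

    /* "Same character" comparison as proposed: glyph level, no modifiers. */
    static bool same_character(two_layer_char a, two_layer_char b) {
        return a.glyph == b.glyph;
    }

    int main(void) {
        two_layer_char latin_a    = { 0x41, LATIN };
        two_layer_char cyrillic_a = { 0x41, CYRILLIC };  /* same glyph id */
        /* Equal as "characters", yet a spell checker or a converter to a
           legacy character set can still see which alphabet each came from. */
        printf("%d\n", same_character(latin_a, cyrillic_a));  /* prints 1 */
        return 0;
    }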


(I expect Walter probably has better things to do than to reply to random guys on the ’net, but we can always hope, and I was curious :) )

First off, Unicode cursive (bold, Fraktur, monospace, etc.) Latin letters are not meant to be styles, they are mathematical symbols. Of course, that doesn’t mean people aren’t going to use them for that[1], and I’m not convinced Unicode should have gotten into that particular can of worms, but I think you can consistently say that the difference between, for example, an italic X for the length of a vector and a bold X for the vector itself (as you could encounter in a mechanics text) is not (just) one of style. Similarly for the superscripts and modifier letters—a [ph] and a [pʰ] or a [kj] and a [kʲ] in an IPA transcription (for which the modifiers are intended) denote very different sounds (granted, ones that are unlikely to be used at the same time by a single speaker in a single language, but IPA is meant to be more general than that).

(Or wait, was this a reply to my point about Russian vs Bulgarian d? The Bulgarian one is not a cursive variant, it’s derived from a cursive one but is a perfectly normal upright letter in both serif and sans-serif, that looks exactly the same as a Latin “single-storey” g as in most sans-serif fonts but never a Latin “double-storey” g as in most serif fonts, and printed Bulgarian only uses that form—barring font problems—while printed Russian never does. I guess you could declare all of those to be variants of one another, even if it’s wrong etymologically, but even to a Cyrillic user who has never been to Bulgaria that would be quite baffling.)

As to your actual point, I don’t think the comparison you describe could be made language-independent enough that you wouldn’t still end up needing to use a language-specific collation equivalence at the same time (which seems to be your implication IIUC). E.g. a French speaker would usually want oe and œ to compare the same but different from o-diaeresis, but a German speaker might (or might not) want oe and o-umlaut to compare the same, while every font renders o-diaeresis and o-umlaut exactly the same. French speakers (but possibly not in every country?) will almost always drop diacritics over capital letters, and Russian speakers frequently turn ё (/jo/, /o/) into е (/je/, /e/) except in a small set of words where there’s a possibility of confusion (the surnames Chebyshev and Gorbachev, which end in -ёв /-of/, are well-known victims of this confusion). Å is a stylistic variant of aa in Norwegian, but a speaker of Finnish (which doesn’t use å) would probably be surprised if forced to treat them the same.

And that’s only in Europe. What about Arabic, where positional variants can make what speakers think of as a single letter look very different? Even in Europe, should σ and ς be “the same glyph”? They certainly have the same phonetic value, and you always have to use one or the other...

Of course, we already have a (font-dependent) codepoint-to-glyph translation in the guise of OpenType shaping, but it’s not particularly useful for anything but display (and even there it’s non-ideal).

[1] https://utcc.utoronto.ca/~cks/space/blog/tech/PeopleAlwaysEx...


> printed Bulgarian only uses that form

This is a total pedantitangent, but I don't think that's actually true. These Wikipedia pages don't talk about it directly, but I think they give a bit of the flavour/related info that suggests it's not nearly that set in stone:

https://bg.wikipedia.org/wiki/%D0%91%D1%8A%D0%BB%D0%B3%D0%B0...

https://bg.wikipedia.org/wiki/%D0%93%D1%80%D0%B0%D0%B6%D0%B4...

The second one, in particular, says early versions of Peter I's Civil Script had the g-looking small д, so these variants have been used concurrently for some time.


I made no mention of collation, alternate compositions, or fonts. All I'm saying is that Unicode from the beginning could have made capital alpha and capital Latin 'A' the same glyph, with a glyph-part representation and a separate letter-part representation making clear which was which. O-with-umlaut and o-with-diaeresis could have been handled the same way. Since you've mentioned fonts, I'll carry on into that topic: rather than having two code points with two different entries in every font, we could have treated the glyph and the parent alphabet as two pieces of data and had one entry in the font for the glyph.


Ignoring Unicode and focusing just on C: if the glyph matches a glyph used in any existing C operator maybe it shouldn't be legal as an identifier character.


I’m not defending either standard Unicode identifiers or C Unicode identifiers (which are, incidentally, very different things, see WG14/N1518), no :) The Agda people make good use of various mathematical operators, including ones that are very close to the language syntax (e.g. colon as built-in type ascription and equals as built-in definition, but Unicode colon-equals as a substitution operator for a user-defined type of terms in a library for processing syntax), but overall I’m not convinced it’s worth it at all.

As a way to avoid going ASCII-only, though, excluding only things that look like syntax might be simultaneously not going far enough (how are homograph collisions between user-defined identifiers any better?) and too far (reliably transplanting identifiers between languages that use different sets of punctuation seems like it’d be torturously difficult).


That ship sailed long before Unicode. Even ASCII has characters with multiple valid glyphs (lower case a can lose the ascender, and lower case g is similarly variable in the number of loops), not to mention multiple characters that are often represented with the same glyph (lower case l, upper case I, digit 1).


That's a font issue with some fonts, not a green light for blessing multiple code points with the exact same glyph.

In fact, having a font that makes l I and 1 indistinguishable is plenty of good reason to NOT make this a requirement.


> The Unicode people veered off the tracks and sank into a swamp when they decided that semantic information should be encoded into Unicode characters.

As if that weren't enough, they also decided to cram half-assed formatting into it. You've got bold letters, italics, various fancy-style letters, superscripts and subscripts for this and that... all for the sake of legacy compatibility. Unicode was legacy right from the beginning.


The "fonts" in Unicode are meant to be for math and scientific symbols, and not a stylistic choice. Don't use them for text, as it can be a cacophony in screen readers.

Unicode chose to support lossless conversion to and from other encodings it replaces (I presume it was important for adoption), so unfortunately it inherited the sum of everyone else's tech debt.


Unicode did worse than that. They added code points to esrever the direction of text rendering. Naturally, this turned out to be useful for injecting malware into source code, because having the text rendered backwards and forwards erases the display of the malware, so people can't see it.

Note that nobody needs these code points to reverse text. I did it above without gnisu those code points.


Yeah, where do you stop when you start adding fonts to Unicode?


If #include <𝒸𝓊𝓇𝓈𝒾𝓋𝑒.h> is wrong, I don't want to be right.


𝐼 𝘭𝘪𝘬𝘦 𝑢𝑠𝑖𝑛𝑔 𝐛𝐨𝐥𝐝 𝑎𝑛𝑑 𝘪𝘵𝘢𝘭𝘪𝘤 𝑡𝑒𝑥𝑡 𝑗𝑢𝑠𝑡 𝑡𝑜 𝒎𝒆𝒔𝒔 𝑤𝑖𝑡ℎ 𝑎𝑛𝑦𝑜𝑛𝑒 𝑤ℎ𝑜'𝑠 𝑡𝑟𝑦𝑖𝑛𝑔 𝑡𝑜 𝑢𝑠𝑒 𝑡ℎ𝑒 𝑠𝑒𝑎𝑟𝑐ℎ 𝘧𝘶𝘯𝘤𝘵𝘪𝘰𝘯 𝑖𝑛 𝑡ℎ𝑒𝑖𝑟 𝑏𝑟𝑜𝑤𝑠𝑒𝑟 𝑜𝑟 𝑒𝑑𝑖𝑡𝑜𝑟. Sᴍᴀʟʟ ᴄᴀᴘs ᴡᴏʀᴋ ғᴏʀ ᴀ sɪᴍɪʟᴀʀ ᴇғғᴇᴄᴛ. ℑ𝔱 𝔤𝔢𝔱𝔰 𝔢𝔵𝔱𝔯𝔞 𝔣𝔲𝔫 𝔦𝔫 𝔯𝔦𝔠𝔥 𝔱𝔢𝔵𝔱 𝔢𝔡𝔦𝔱𝔬𝔯𝔰 𝔴𝔥𝔢𝔯𝔢 𝔶𝔬𝔲 𝔠𝔞𝔫 𝔪𝔦𝔵 𝔲𝔫𝔦𝔠𝔬𝔡𝔢 𝔰𝔱𝔶𝔩𝔢𝔰 𝔞𝔫𝔡 𝔯𝔢𝔞𝔩 𝔰𝔱𝔶𝔩𝔢𝔰. pᵁrⁿoͥvͨiͦdͩeͤs oͤpⁿpͩoˡrͤtˢuˢnities cᶠoͦnͬfusing aᵖnͤdͦᵖˡᵉ sͭoͪfͤtͥwͬare.


Use Markdown if you want italics.


We have emojis so we’re probably not far from Unicode characters that blink.


To clarify, what is needed are:

1. static if conditionals

2. version conditionals

3. assert

4. manifest constants

5. modules

I occasionally find macro usages that would require templates, but these are rare.
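
For readers less used to C, here is a sketch of the preprocessor idioms those five features would replace (the header and names are illustrative, not from any particular codebase):

    #include <assert.h>            /* 3. assert is itself a macro in C      */

    #ifndef MYLIB_H                /* 5. include guard: the preprocessor    */
    #define MYLIB_H                /*    standing in for real modules       */

    #define BUFFER_SIZE 4096       /* 4. manifest constant                  */

    #ifdef _WIN32                  /* 2. version (platform) conditional     */
    typedef void *mylib_handle;
    #else
    typedef int mylib_handle;
    #endif

    #if BUFFER_SIZE > 1024         /* 1. compile-time conditional, the role */
    #define MYLIB_LARGE_BUFFERS 1  /*    a `static if` would fill           */
    #endif

    #endif /* MYLIB_H */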


One other thing that would be great, and that people sometimes use the preprocessor for, is having the names of variables/enums available as runtime strings. Like, if you have an enum and a function to get the string representation for debug purposes (i.e. the name of the enum value as it appears in the source code):

    typedef enum { ONE, TWO, THREE } my_enum;

    const char* getEnumName(my_enum val);
you can use various preprocessor tricks to implement getEnumName such that you don't have to change it when adding more cases to the enum. This would be much better implemented with some compiler intrinsic/operator like `nameof(val)` that returned a string. C# does something similar with its `nameof`.


> you can use various preprocessor tricks to implement getEnumName such that you don't have to change it when adding more cases to the enum.

For those who don’t know: the X Macro (https://en.wikipedia.org/wiki/X_Macro, https://digitalmars.com/articles/b51.html)


Hey, even an article written by Walter, that's a fun coincidence! :)

This is slightly different from the form I've seen, but same idea: in the version I've seen, you have a special file that's like "enums.txt" with contents like (warning, not tested):

    X(red)
    X(green)
    X(blue)
and then you write:

    typedef enum {
        #define X(x) x,
        #include "enums.txt"
        #undef X
    } color;

    const char* getColorName(color c) {
        switch (c) {
            #define X(x) case x: return #x;
            #include "enums.txt"
            #undef X
        }
        return "unknown";  /* only reached for values outside the enum */
    }
Same idea, just using an #include instead of listing them in a macro. Thinking about it, it's sort-of a compile time "visitor pattern".
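
For completeness, the "listing them in a macro" form looks roughly like this (a sketch with the same color names; COLOR_LIST, AS_ENUM, and AS_CASE are made-up names):

    #define COLOR_LIST(f) f(red) f(green) f(blue)

    #define AS_ENUM(x) x,
    typedef enum { COLOR_LIST(AS_ENUM) } color;
    #undef AS_ENUM

    #define AS_CASE(x) case x: return #x;
    const char* getColorName(color c) {
        switch (c) {
            COLOR_LIST(AS_CASE)
        }
        return "unknown";
    }
    #undef AS_CASE

Adding a new color then means touching only the COLOR_LIST line.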


As an update, I removed all use of the X macro in my own code.


I like that ONE == 0.


Did not even think about that :) Just so used to thinking of enums like that as opaque values.


> With some simple improvements to the language, about 99% of the C preprocessor use can be abandoned and deprecated.

Arguably the C feature most used by other languages is the C preprocessor's conditional compilation, e.g. for targeting different OSes. It's used by languages from Fortran (yes, FPP exists now - for a suitable definition of 'now') to Haskell (yes, `{-# LANGUAGE CPP #-}`).


In C++, anyway. C’s expressiveness, on the other hand, is pretty weak, and a preprocessor is very useful there.

A better preprocessor (effectively a C code generator) would be a simple program that interprets the <% and %> brackets or similar, by “inverting” them. It is a very powerful paradigm.
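
To make the “inverting” idea concrete, here is a sketch (my own illustration, not an existing tool): literal template text becomes output calls, and the code inside <% %> passes through verbatim.

    /* Hypothetical template input:
     *
     *   enum color { <% for (int i = 0; i < n; i++) printf("COLOR_%d, ", i); %> };
     *
     * The generator "inverts" it into roughly the following C program, which
     * prints the generated source when run: */
    #include <stdio.h>

    int main(void) {
        int n = 3;                          /* assumed template parameter */
        fputs("enum color { ", stdout);     /* literal text before <%     */
        for (int i = 0; i < n; i++)         /* code from inside <% %>     */
            printf("COLOR_%d, ", i);
        fputs("};\n", stdout);              /* literal text after %>      */
        return 0;
    }

Since the code between the brackets is ordinary C, the full language is available to drive the generation.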


You're talking about metaprogramming. I've seen C code that does metaprogramming with the preprocessor.

If you want to use metaprogramming, you've outgrown C and should consider a more powerful language. There are plenty to pick from. DasBetterC, for example.


But the <%-preprocessor would be the most powerful metaprogramming tool, would it not? Simply because the programmer would have at their disposal the power of the entire real programming language as opposed to being limited to whatever template language happens to be built in. For instance, if I want to generate a piece of code to define an enum and, at the same time, have a method to serialize it (say, into XML), then with <% it is a trivial task, whereas in C# I need to define and use some weird "attribute" class, while C++ offers me no way whatsoever to accomplish this, with all its metaprogramming power. Is D different in this regard?


D can indeed do it. But that is way too advanced for what C is.

I'll repeat that if metaprogramming is important to you, you need a more advanced language. Why are you using C if you want advanced features?


Because it seems to me that metaprogramming/preprocessing/code generation is orthogonal to how advanced or complex the language is.



