>1. There is no real reason that trigraphs should be expanded inside a comment.
trigraphs are 100% substitutes for unrepresentable characters. they absolutely positively ALWAYS should be replaced by the character. Pretend it takes place before the character even arrives inside the comment, because it does.
it's very much like the #define/#include c-preprocessor step, it happens first, that's what keeps it clean, understandable, manageable. (Sure you can have more complex macro systems, but they are... complex, they can get very ugly)
if you know how to process a unix shell commandline, you know that there are layers to it. Trigraphs are just like that. If you don't know how a unix shell commandline is processed, learn it, it's worth knowing.
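To make the ordering concrete, here is a rough sketch of trigraph replacement as its own pass, run over the raw text before anything knows what a comment or a string is. The helper names `trigraph_char` and `replace_trigraphs` are made up for illustration; a real compiler folds this into its first translation phase rather than using a standalone function like this.

```c
#include <assert.h>
#include <stddef.h>

/* Map the third character of a trigraph to its replacement, or 0 if none.
   Covers the nine trigraphs defined by pre-C23 C. */
static char trigraph_char(char c)
{
    switch (c) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Hypothetical helper: copy src to dst, replacing every trigraph.
   dst must hold at least as many bytes as src, terminator included.
   Returns the length of the result. Note there is no comment or
   string-literal state here: this pass is deliberately context-free. */
static size_t replace_trigraphs(const char *src, char *dst)
{
    char *start = dst;
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        char r;
        if (src[0] == '?' && src[1] == '?' &&
            (r = trigraph_char(src[2])) != 0) {
            *dst++ = r;          /* three source characters become one */
            src += 3;
        } else {
            *dst++ = *src++;     /* everything else is copied unchanged */
        }
    }
    *dst = '\0';
    return (size_t)(dst - start);
}
```

Because the pass has no idea what a comment is, a `//` comment ending in `??/` comes out ending in a backslash, and the later line-splicing phase then joins the next source line onto the comment. That is the surprise being argued about in this thread.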
I'm talking about why trigraphs had to behave in such a way, not how. C and C++ have a concept of a source character set and an execution character set, which can diverge. Let's say trigraphs are indeed for unrepresentable characters; then in which character set are they unrepresentable? If the answer is the source character set, an alternative spelling is sufficient, and trigraphs inside comments should ideally have no effect, or users will be confused. If the answer is the execution character set, why do other characters have no equivalent?
Also, you should be aware that macro expansion in C/C++ is not a literal string replacement. `#define FOO bar` doesn't turn `BAREFOOT` into `BAREbarT` or `"OH FOO'S SAKE"` into `"OH bar'S SAKE"`. (Some extremely old preprocessors did do so, by the way.) `#define FOO(x) FOO(x)` doesn't make `FOO(bar)` recurse infinitely, because `FOO` is not expanded again while `FOO` itself is already being expanded. There are certainly some layers, but they are not what you seem to think.
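A small sketch of those three cases, in case anyone wants to try them. The function names (`object_like`, `literal`, `self_reference`, `check_examples`) and the values are mine, chosen only for illustration.

```c
#include <assert.h>

/* Object-like macro: replaces only the whole identifier token FOO. */
#define FOO bar

static const int bar = 1;
static const int BAREFOOT = 2;   /* a different token; FOO is not found inside it */

static int object_like(void)
{
    return FOO + BAREFOOT;       /* expands to bar + BAREFOOT */
}

static const char *literal(void)
{
    return "OH FOO'S SAKE";      /* a string literal is one token, left untouched */
}

#undef FOO

/* A real function that happens to share the macro's name. */
static int FOO(int x)
{
    return x + 1;
}

/* Self-referential macro: the inner FOO is not expanded again. */
#define FOO(x) FOO(x)

static int self_reference(void)
{
    return FOO(41);              /* one expansion, then a plain call to the function FOO */
}

static void check_examples(void)
{
    assert(object_like() == 3);
    assert(literal()[3] == 'F');
    assert(self_reference() == 42);
}
```

If expansion were plain text substitution, `BAREFOOT` would be an undeclared identifier, the string would change, and the last macro would never terminate. None of that happens because the preprocessor works on tokens and refuses to re-expand a macro inside its own expansion.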
you want to be able to convert source code from one system to another and back again, and you want the rules to be simple so that everybody who writes such a converter gets it right, and you also don't want to think about a zillion edge cases. If the trigraphs exist on the wrong side of the conversion, flag them. otherwise, it's a very simple process.
I was not talking about how the preprocessor is implemented, I was talking about the layering. You keep wanting to mix layers because you think you know better; thar be dragons.
Layering is only valuable when it serves its goals well. I don't see any reason to have an additional layer in the language here. If you are thinking about a strict separation between preprocessor and parser, that has been known for decades to be suboptimal for compilation performance. (As a related example, the traditional Unixy way of separating archiving from compression is also known to be inefficient; a combined compressing archiver is a better design.)
I disagree with the downvotes here. C language "layers" are tricky to get right, a source of footguns and a potential backdoor (especially the trigraph that started this comment chain), and overall a band-aid invented when there were no better solutions (like modules, or Unicode). Trigraphs are a weird archaic quirk of the C language (and of no other modern language), and I'm glad to see them gone.
And since we're thinking about layers, character encoding hacks should be entirely outside of a programming language's responsibility. Now that would be proper layering.