Whence '\n'? (rodarmor.com)
338 points by lukastyrychtr 54 days ago | 128 comments



The first place I read about this idea (specifically newlines, not trusting trust in general) was day 42 in https://www.sigbus.info/how-i-wrote-a-self-hosting-c-compile...

"For example, my compiler interprets "\n" (a sequence of backslash and character "n") in a string literal as "\n" (a newline character in this case). If you think about this, you would find this a little bit weird, because it does not have information as to the actual ASCII character code for "\n". The information about the character code is not present in the source code but passed on from a compiler compiling the compiler. Newline characters of my compiler can be traced back to GCC which compiled mine."


I was hoping GCC would do the same, leaving the decision about the value of '\n' to GCC's compiler, but apparently it hardcodes the numeric values for escapes[1], with options for ASCII or EBCDIC systems.

[1] https://github.com/gcc-mirror/gcc/blob/8a4a967a77cb937a2df45...


But these numeric values are also just ASCII representations of numbers in the source, rather than the actual byte that is written to the output. Maybe there is hope still. Where do the byte values for those numbers come from when the compiler writes its output?


The C standard (see C23 5.2.1p3) requires the values of '0' through '9' to be contiguous, so '7' - '0' == 7 no matter the character set. Strictly speaking this isn't necessary just for round-tripping, but it makes parsing and printing decimal notation very convenient. Notably, in both ASCII and EBCDIC 'A'..'F' and 'a'..'f' are also contiguous, so parsing and printing hexadecimal can be done much the same way as decimal.
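
A small sketch of the kind of parsing this enables (digit_val is a made-up helper, not taken from any particular compiler); it relies only on those contiguity properties, so it works unchanged under either character set:

    int digit_val(int c) {
        if (c >= '0' && c <= '9') return c - '0';        /* contiguous per the standard */
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;   /* contiguous in ASCII and EBCDIC */
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;                                       /* not a hex digit */
    }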


Maybe the assembler?


> This post was inspired by another post about exactly the same thing. I couldn't find it when I looked for it, so I wrote this. All credit to the original author for noticing how interesting this rabbit hole is.

I think the author may be thinking of Ken Thompson's Turing Award lecture "Reflections on Trusting Trust".


Although that presentation does point out that the technique is more generally used in quines. Given that there is a fair amount of research, papers and commentary on quines, it's possible that the author may have read something along those lines.

https://en.wikipedia.org/wiki/Quine_(computing)


Also have a read of this fabulous short web story from 2009: https://www.teamten.com/lawrence/writings/coding-machines/


I don't think so. I too recall seeing a post about this exact piece of trivia ('\n' in rust) years ago, but I couldn't find the source anymore.


It might have been https://research.swtch.com/nih ?


There's nothing in that article about Rust?


I totally missed that bit when reading the OP, but it definitely made me think of that paper, so maybe.


Interesting that 10 hours in there are no thread hits for EBCDIC.

All theories being bandied about should account for the fact that early C compilers appeared on non-ASCII systems that did not map \n "line feed" to decimal 10.

https://en.wikipedia.org/wiki/EBCDIC

As an added wrinkle EBCDIC had both an explicit NextLine and an explicit LineFeed character.

For added fun:

    The gaps between letters made simple code that worked in ASCII fail on EBCDIC. For example for (c = 'A'; c <= 'Z'; ++c) putchar(c); would print the alphabet from A to Z if ASCII is used, but print 41 characters (including a number of unassigned ones) in EBCDIC.

    Sorting EBCDIC put lowercase letters before uppercase letters and letters before numbers, exactly the opposite of ASCII.
The only guarantee in the C standard re: character encoding was that the digits '0'-'9' mapped in contiguous ascending order.

In theory*, a simple C program (say, one that printed 10 lines of "Hello World") should have the same source whether compiled on an ASCII or an EBCDIC system, and produce the same output.

* many pitfalls aside


> As an added wrinkle EBCDIC had both an explicit NextLine and an explicit LineFeed character.

Despite EBCDIC having a newline/next line character (NEL), it is rarely encountered on many EBCDIC systems. Early on, most EBCDIC systems (e.g. MVS, VM/CMS, OS/400, DOS/VSE) did not store text as byte stream files, but instead as record-oriented files – storing lines as fixed-length or variable-length records. With fixed-length records, you'd declare a record length when creating the file (80 or 132 were the most common choices); every line in the file had to be of that length: shorter lines would be padded (normally with the EBCDIC space character, which is 0x40, not 0x20), and longer lines would either be truncated or a continuation character would be used. With variable-length records, each record was prefixed with a record descriptor word (RDW) which gave its length (and a couple of spare bytes that theoretically could be used for additional metadata). However, in practice the use of variable-length records for text files (including program source code) was rather rare; fixed-length records were the norm.

So even though NEL exists, it wasn't normally used in files on disk. Essentially, newline characters such as NEL are "in-band signalling" for line/record boundaries, but record-oriented filesystems used "out-of-band signalling" instead. I'm not sure exactly how stdio was implemented in the runtime libraries of EBCDIC C compilers – I assume \n did map to NEL internally, but then the stdio layer treated it as a record separator, and then wrote each record using a separate system call, padding as necessary.
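
Purely as an illustration, a hypothetical sketch (not IBM's actual runtime code) of what the record-writing side might look like, with write_record standing in for whatever record-level I/O call the real runtime makes:

    #include <string.h>

    #define LRECL 80              /* declared fixed record length */
    #define EBCDIC_SPACE 0x40     /* the EBCDIC blank, not 0x20 */

    /* One call to write_record per '\n'-delimited line. */
    void put_line(const char *line, size_t len,
                  void (*write_record)(const unsigned char *, size_t)) {
        unsigned char rec[LRECL];
        memset(rec, EBCDIC_SPACE, sizeof rec);           /* pad short lines with blanks */
        memcpy(rec, line, len < LRECL ? len : LRECL);    /* truncate long ones */
        write_record(rec, sizeof rec);
    }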

Later on, most of these operating systems gained POSIX compatibility subsystems, at which point they gained byte stream files as exist on mainstream systems. IBM systems generally support tagging files with a code page, so the files can be a mix of EBCDIC and ASCII, and the OS will perform translation between them in the IO layer (so an application which uses EBCDIC at runtime can read an ASCII file as EBCDIC, without having to manually call any character encoding conversion APIs, or be explicitly told whether the file to be read is EBCDIC or ASCII). Newer applications make increasing use of the POSIX-based filesystems, but older applications still mostly store data (even text files and program source code) in the classic record-oriented file systems.

From what I understand, the most common place EBCDIC NEL would be encountered in the wild was EBCDIC line mode terminal connections (hard copy terminals such as IBM 2741 and IBM 3767).


This is a fascinating post. It reads to me like some kind of cross between literate programming and poetry. It's really trying to explain the idea that when you run `just foo`, the very 0x0A byte comes from possibly hundreds of cycles of code generation. Back in the day, someone encoded this information into the OCaml compiler, somehow, and years later, here on my computer, that 0x0A is stored because of that history.

But the way in which this phenomenon is explained is via actual code. The code itself is beside the point, of course; it's not like anyone will ever run or compile this specific code, but it's there for humans to follow the discussion.


I wondered if clang has the same property, but it's explicitly coded as 10 (in lib/Lex/LiteralSupport.cpp):

    /// ProcessCharEscape - Parse a standard C escape sequence, which can occur in
    /// either a character or a string literal.
    static unsigned ProcessCharEscape(const char *ThisTokBegin,
                                      const char *&ThisTokBuf,
                                      const char *ThisTokEnd, bool &HadError,
                                      FullSourceLoc Loc, unsigned CharWidth,
                                      DiagnosticsEngine *Diags,
                                      const LangOptions &Features) {
      const char *EscapeBegin = ThisTokBuf;
      // Skip the '\' char.
      ++ThisTokBuf;
      // We know that this character can't be off the end of the buffer, because
      // that would have been \", which would not have been the end of string.
      unsigned ResultChar = *ThisTokBuf++;
      switch (ResultChar) {
    ...
      case 'n':
        ResultChar = 10;
        break;
    ...


Similarly in GCC it is hardcoded, with a choice of ASCII or EBCDIC (in gcc/libcpp/charset.cc):

    /* Convert an escape sequence (pointed to by FROM) to its value on
       the target, and to the execution character set.  Do not scan past
       LIMIT.  Write the converted value into TBUF, if TBUF is non-NULL.
       Returns an advanced pointer.  Handles all relevant diagnostics.
       If LOC_READER is non-NULL, then RANGES must be non-NULL: location
       information is read from *LOC_READER, and *RANGES is updated
       accordingly.  */
    static const uchar *
    convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
                    struct _cpp_strbuf *tbuf, struct cset_converter cvt,
                    cpp_string_location_reader *loc_reader,
                    cpp_substring_ranges *ranges, bool uneval)
    {
      /* Values of \a \b \e \f \n \r \t \v respectively.  */
    #if HOST_CHARSET == HOST_CHARSET_ASCII
      static const uchar charconsts[] = {  7,  8, 27, 12, 10, 13,  9, 11 };
    #elif HOST_CHARSET == HOST_CHARSET_EBCDIC
      static const uchar charconsts[] = { 47, 22, 39, 12, 21, 13,  5, 11 };
    #else
    #error "unknown host character set"
    #endif

      uchar c;
    
      /* Record the location of the backslash.  */
      source_range char_range;
      if (loc_reader)
        char_range = loc_reader->get_next ();
    
      c = *from;
      switch (c)
        {
    ...
        case 'n': c = charconsts[4];  break;
    ...


So that should answer the question of where the compiler knows it from.


I remember a similar article for some C compiler, and it turned out the only place the value 0x0a appeared was in the compiler binary, because in the source code it had something like "\\n" -> "\n"


This is over my head. Why did we need to take a trip to discover why \n is encoded as a byte with the value 10? Isn't that expected? The author and HN comments don't say, so I feel stupid.


The point is to ask "who" encoded that byte as the value of 10. If you're writing a parser and you parse a newline as the escape sequence `\n`, then where did the value 10 come from? If you instead parse a newline as the integer literal `10`, then where does the actual binary value 1010 come from?

The ultimate point of this exercise is to alter your perception of what a compiler is (in the same way as the famous Reflections On Trusting Trust presentation).

Which is to say: your compiler is not something that outputs your program; your compiler is also input to your program. And as a program itself, your compiler's compiler was an input to your compiler, which makes it transitively an input to your program, and the same is true of your compiler's compiler's compiler, and your compiler's compiler's compiler's compiler, and your compiler's compiler's compiler's compiler's compiler, and...


All right Stan, don't belabour the point.


The interesting point is how the value of 10 is not defined in Rust’s source code, but passed down as “word of mouth” from compiler to compiler.


Ohh, a lot of spoken and unspoken things in the article made me think this was supposed to be a bug/quirk investigation. This makes much more sense.


It is a kind of bug/quirk: it implies that the compiler is not buildable from scratch in its current form. It depends on a binary that knows how to translate \n.


Right, I just mistook which thing was the bug/quirk.


The earliest reference on my shelf is Kernighan and Ritchie, 1978. Like you say, it's probably just been passed on from generation to generation. And it's easier to say "just like C" than to make up a new convention. Python uses the same convention.


The article is not about the convention, but about the mechanics of how the compiler came to 'learn' about the convention.


If you had to rebuild the rust compiler from scratch, and all you had was rustc's source code, there's nothing in the source code to tell you what '\n' actually maps to.

It's an interesting real-world example of the Ken Thompson hack.


The thing is, why 10? Why not 9 or 11? The code says "if you see 'string of newline character', output 'newline character'". How does the compiler know what a newline character is? Its code in turn just says "if you see 'string of newline character', treat it as 'newline character'"...

As a human I can just Google "C string escape codes", but that table is nowhere to be found inside the compiler. If C 2025 is going to define Start of Heading as \h, is `'h' => cooked.push('\h')` going to magically start working? How could it possibly know?

Clearly at some point someone must've manually programmed a `'n' => 10` mapping, but where is it!?
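
To make the question concrete, here is a hypothetical sketch (in C, not rustc's actual code) of the two ways such an escape handler can be written; the second is the "word of mouth" situation the article describes:

    /* First version: a human visibly wrote the value down. */
    unsigned char escape_hardcoded(char c) {
        switch (c) {
        case 'n': return 10;     /* value entered by a human */
        case 't': return 9;
        default:  return (unsigned char)c;
        }
    }

    /* Second version: the value is whatever the compiler that compiles
       *this* code thinks '\n' is, i.e. inherited from the previous
       generation of compilers. */
    unsigned char escape_inherited(char c) {
        switch (c) {
        case 'n': return '\n';
        case 't': return '\t';
        default:  return (unsigned char)c;
        }
    }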


> The thing is, why 10? Why not 9 or 11? The code says "if you see 'string of newline character', output 'newline character'". How does the compiler know what a newline character is? Its code in turn just says "if you see 'string of newline character', treat it as 'newline character'"...

You can look into the ascii table then.

10 is the line feed. '\n' is there to help humans because it makes more sense than '10'.

Nobody is asking why 'a' is equal to 0x61.

From the codebase, you know that '\n' is a char. A char is a value between 0 and 255, if you explicitly convert '\n' to int then you happen to find the ascii value and you are good to go and there is no need to pretend there is any poetry in this.

Requoting you:

> "if you see 'string of newline character', output 'newline character'"

It simply becomes "if you see 'the arbitrary symbol for the new line', output 'the corresponding ascii value'".

I read the quote as "if you see 'a', output 'a' in ascii code." which is not mysterious in any kind of way.

IMO, the article does not make sense at all, because it pretends to wonder where a hexadecimal value is coming from, but with any other kind of symbol it would be the exact same article: you enter 'a' and find some weird hexadecimal value, and you won't be able to correctly trace it from the source code.

It would make sense if 'a' displayed as 'a' but '\n' displayed as some hexadecimal garbage in a text editor.


> From the codebase, you know that '\n' is a char. A char is a value between 0 and 255, if you explicitly convert '\n' to int then you happen to find the ascii value and you are good to go and there is no need to pretend there is any poetry in this.

But how does the computer know which int to output when you "explicitly convert '\n' to int"? As humans, we can clearly just consult the ASCII table and/or the relevant language standard, but the computer doesn't have a brain and a pair of eyes, instead it must store that association somewhere. The purpose of this article is to locate where the association was originally entered into the source code by some human.

The question is less interesting for ordinary characters like 'a', since the codes for those are presumably baked into the keyboard hardware and the font files, and no further translation is needed.


> The question is less interesting for ordinary characters like 'a', since the codes for those are presumably baked into the keyboard hardware and the font files, and no further translation is needed.

It's true that the question is less interesting for regular characters, but your explanation why is way off base.

Consider a computer whose only I/O is a serial console. It is concerned with neither a keyboard nor a font file.


For a computer whose only I/O is a serial console, I'd say that it has no character 'a', but only a character 0x61 with such-and-such properties and relationships with other characters. It's when we press the 'A' key on our keyboard and get 0x61 out, or put 0x61 in and see the glyph 'a' on a display or a printout, that the code becomes associated with our human concept of the letter.

That is, suppose I design a font file so that character 0x61 has glyph 'b' and 0x62 has glyph 'a', and I accordingly swap the key caps for 'A' and 'B' on my keyboard. If I write a document with this font file and print it off, then no one looking at it could tell that my characters had the wrong codes. Only the spell-checker on my computer would complain, since it's still following its designers' ideas of what the character codes 0x61 and 0x62 are supposed to mean within a word.


I understand and share the excitement about this subtle topic, but it only exists at the source code level. There's a chain of source trees, linked in time by compilation processes, that eventually leads back to a numeric literal entered by a human.

But the physical computers always knew what to insert, because there was a 0x0a somewhere in the binary every time.


Of course our physical computers know what to insert, since it was embedded in the binary. But it hasn't always been embedded "every time": there was a point in the past where someone's physical computer didn't know what to insert, and so they had to teach it by hand. Without the source code (or some human-entered code) at the end of the chain, we'd have to insist that the code was embedded from the very dawn of time, which would be rather absurd. Personally, I like how this article and other such projects push back on some of the mysticism around bootstrapping that sometimes floats around.


You're operating on different levels. Of course we know ASCII 10 is newline. The comment you're replying to is asking: yes but how does the CPU know that, when it runs the compiler? Obviously it sees the number 10 in the machine code. Where did that 10 come from - what is the provenance of that particular byte? It didn't come from the source code of the compiler, which just says \n, and it didn't come from the ASCII table, because that's just a reference document for humans, which the computer doesn't know about.


> It simply becomes "if you see 'the arbitrary symbol for the new line', output 'the corresponding ascii value'".

> I read the quote as "if you see 'a', output 'a' in ascii code." which is not mysterious in any kind of way.

Only, it's not like that.

It's like:

> If you see a backslash followed by n, output a newline.

There's no 'newline character' in the input we are parsing here.


Here's another way to think of the inspiration for the article. You're creating a file to use as input to another computer program (in this case, rustc). Your text file contains the ascii strings 'a', and '\n'. The rustc computer, when reading the text file, reads the corresponding byte sequences - 39, 97, 39, and 39, 92, 110, 39, respectively. The first byte sequence contains the 97 that you desire, but the second sequence does not contain a 10. Yet, rustc somehow knows to generate a 10 from 39, 92, 110, 39. How?
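
The same observation can be reproduced in a few lines of C (assuming an ASCII/UTF-8 source file): none of the bytes the compiler reads for the token is a 10, yet a 10 comes out:

    #include <stdio.h>

    int main(void) {
        const char token[] = "'\\n'";              /* the four bytes in the source file: ' \ n ' */
        for (const char *p = token; *p; p++)
            printf("%d ", (unsigned char)*p);      /* prints: 39 92 110 39 (no 10 in sight) */
        printf("\n%d\n", '\n');                    /* ...yet the compiled constant is 10 */
        return 0;
    }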


> You can look into the ascii table then.

I suggest reading the article, to find out just how badly you’re missing the point.


> The thing is, why 10? Why not 9 or 11?

Check the first Unicode codepoints, 10 is defined there:

000A is LINE FEED (LF) = new line (NL), end of line (EOL)

It was already 10 in ASCII too (and the first 128 codepoints of Unicode are mostly the same as ASCII [I think there are a few tiny differences in some control characters]).

So to answer your question: it's neither 9 nor 11 because '\n' stands for "new line" and not for "character tabulation" or "line tabulation" (which is what 9 and 11 respectively stand for).

> Clearly at some point someone must've manually programmed a `'n' => 10` mapping

I don't disagree with that.


> Check the first Unicode codepoints, 10 is defined there

The point is how the compiler knows that. Read the article.



I always thought, maybe because of C, that \0??? is an octal escape; so in my mind \012 is \x0a or 0x0a, and \010 is 0x08.

So I find this quite confusing; maybe OCaml does not have octal escapes but decimal ones, and \09 is the Tab character. I haven't checked.


There's some truth in that direction, but it's not related to backslash escapes (which are symbolic/mnemonic: \n is "[N]ewline", \r is "carriage [R]eturn", \t is "[T]ab", and so on).

Instead, consider the convention of control characters, such as ^C (interrupt), ^G (bell), or ^M (carriage return). Those are characters in the C0 control set, where ^C is 0x03, ^G is 0x07, and ^M is 0x0D. You're seeing a bit of cleverness that goes back to pre-Unix days: to represent the invisible C0 characters in ASCII, a terminal prepends the "^" character and prints the character AND-0x40, shifting it into a visible range.

You may want to pull up an ASCII table such as https://www.asciitable.com to follow along. Each control character (first column) is mapped to the ^character two columns over, on that table.

That's why \0 is represented with the odd choice of ^@, the escape key becomes ^[, and other hard-to-remember equivalents. These weren't choices made by Unix authors, they're artifacts of ASCII numbering.


Minor nitpick: the terminal prints the character OR'ed with 0x40 (not AND'ed).


Oop. I always make the same mistake in code too.
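
For the record, a minimal sketch of the mapping with the OR:

    #include <stdio.h>

    /* Print a byte the way a terminal would: C0 controls get the caret
       treatment, i.e. OR with 0x40 to land in the printable range. */
    void show(unsigned char c) {
        if (c < 0x20)
            printf("^%c", c | 0x40);   /* 0x03 -> ^C, 0x0D -> ^M, 0x00 -> ^@, 0x1B -> ^[ */
        else
            putchar(c);
    }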




Yeah backslash-decimal character escapes are really rare, the only string syntaxes I know of that have them are in O’Caml, Lua, and DNS


Is O’Caml an Irish fork of OCaml? :)


It was the spelling when I first learned the language, eg https://arxiv.org/pdf/2006.05862 https://caml.inria.fr/pub/old_caml_site/FAQ/stephan.html


Oh, interesting. Thanks for sharing


The incorrect capitalization made me think that, perhaps, there's a scarcely known escape sequence \N that is different from \n. Maybe it matches any character that isn't a newline? Nope, just small caps in the original article.


If you do view source, it’s actually \n, but it’s not displayed as such because of this CSS rule:

  .title {
    font-variant: small-caps;
  }


So, the HN title is wrong.


The original title is.


No, the original title is correct, small caps are just an alternate way of setting lowercase letters.


When have you ever seen small caps in use on this website?


In addition to what others have said about smallcaps being a stylistic rendering, if you copy & paste the original title, you'll get

  Whence '\n'?


There is actually.

Many systems use \N in CSVs or similar as NULL, to distinguish from an empty string.

I figured this is what the article was about?


Python has a \N escape sequence. It inserts a Unicode character by name. For example,

  '\N{PILE OF POO}'
is the Unicode string containing a single USV, the pile of poop emoji.

Much more self-documenting than doing it with a hex sequence with \u or \U.


That is in fact why I clicked this article. Oh well. Still a fun read. :)


I'm guessing the “other post” that inspired this might be: https://research.swtch.com/nih


Discussed here:

Running the "Reflections on Trusting Trust" Compiler - https://news.ycombinator.com/item?id=38020792 - Oct 2023 (67 comments)


A more interesting question: what would our code look like if ASCII (or strings in general) didn't have escape codes?


If ASCII didn't have control sequences then teletypes wouldn't have worked.

How do you make the carriage (a literal thing on a chain being driven back and forth) "return" to the start of the line without a "carriage return" code?

How would you make the paper feed go up one line without a "line feed" code?

Same for ringing the bell, tabs, backspace etc.

A "new line" on a teletype was actually two characters, a CR and a LF.

Unfortunately we didn't use one of the other sequences (eg RS, for "Record Separator"), which would have saved billions of CPU cycles and countless misinterpreted text files spent dealing with the CRLF, CR, LF, and LFCR sequences all meaning "new line".


>what would our code look like if ASCII didn't have escape codes?

ASCII is the layer that doesn't have escape codes (although it does have a single code for ESC); ASCII is just a set of mappings from 7-bit numbers to/from mostly-printable characters.


At least a full fourth of the ASCII code points, 0x00-0x1F, are not printable. A bit more, as we should add del (0x7F) and, according to some though more controversially, space (0x20).


The full fourth may not be printable, if you define "printable" as using toner and ignoring positioning. But they aren't escape codes.


What would you call the lower 32 or so codes then? Control codes?


Codes 0x00 to 0x1F are called control codes (and in some contexts (such as use of ASCII control codes with TRON character set) 0x20 is also considered to be a control code). The code 0x7F is also a control code. (The PC character set (which is a superset of ASCII) has graphics for those codes as well, although in plain ASCII there are no graphics for the control codes.)


They are explicitly defined as escape codes in terms of inputting them, but not in terms of receiving them. They are not by any stretch escape codes. See ANSI (which ASCII is a part of) if you want to learn about ESCAPE codes.


It would depend more on what we are intending to do: are we controlling a terminal, or are we writing to a file (with a specific format)?

Terminal control is fairly easy answer, there would be some other API to control cursor position, so the code would need to call some function to move the cursor to next line.

For files, it would depend on what the format is. So we might be writing just `<p>hello world</p>` instead of `hello world\n`. In fact I find it a bit weird that we are using a teletype (and telegraph, etc.) control protocol (which is what ASCII mostly is) as our "general purpose text" format; it doesn't make much sense to me.


Except that terminals didn't exist when ASCII was created; printers did, either with a "carriage" that carried the actual print head (like IBM Selectric typewriters, or the original ASR33 teletypes) or "chain printers" that printed an entire line at a time with a set of rods for each character position that had the entire character set on them.


This is actually a valid problem when writing quines[1]: you need to escape string delimiters without using their literal form.

This is what `chr(39)` is for in the following Python quine:

    a = 'a = {}{}{}; print(a.format(chr(39), a, chr(39)))'; print(a.format(chr(39), a, chr(39)))
[1]: https://en.wikipedia.org/wiki/Quine_(computing)


You don't need these gymnastics of knowing ASCII in Python quines:

    a = '''a = {}
    print(a.format(repr(a)))'''
    print(a.format(repr(a)))

(OK, it's not a quine yet, because the output has slightly different formatting. But the formatting very quickly reaches a fixed point, if you keep feeding it into a Python interpreter.)

The actual quine is the slightly less readable:

    a = 'a = {}\nprint(a.format(repr(a)))'
    print(a.format(repr(a)))
https://github.com/matthiasgoergens/Quine/blob/c53f187a4b403... has a C quine that doesn't rely on ASCII values either.


In PHP you see a lot of print('Hello World' . PHP_EOL); where the dot is string concatenation (underrated choice imho) and PHP_EOL is a predefined constant that maps to \n or \r\n depending on platform. You could easily extend that to have global constants for all non-printable ascii characters.


The font you use could choose to display something for control characters, so they would have a visible shape on top of having a meaning.

Perhaps like [0] (Unicode notation).

[0]: https://rootr.net/im/ASCII/ASCII.gif


One rule of programming I figured out pretty quickly: if there are two ways of doing something, with a 50/50 chance of either being the correct one, chances are you will get it wrong the first time.


The USB rule.

First time is the wrong way up

Second time is also the wrong way up

Third time works


It's like the Two Generals' Problem embedded in a single connector.

You never really know it's right until you take it out and test the friction against the other orientation.


It's actually super easy and, at least for me, was always intuitive. Most USB cables have their logo or something else engraved on the "top", the side with the air gap. And since the ports are mostly arranged the same way, there is rarely any problem. Maybe I am just too dumb to understand jokes, but it always confused me :(


I can't find a reference now. But from what I remember the logo is supposed to be on top facing the user when plugging a device in. This was part of the standard that defined the size/shape/etc of what USB is.


People don't always have perfect sight, lighting etc to see it. Or know about that tip. Or remember what it signifies. Often you're fumbling, doing 2 things at once.


> lighting

This explains the Apple connector branding...


I was ecstatic when I learned this piece of information a few years ago, and find it very useful when I'm able to use it. But I've come to learn that at least 50% of the time they either have no logo at all, or have something engraved on both sides in too small a size to be unambiguous. USB multiport hubs, mobile data cables, a lot of these little cables still need the 3 step process.


Even if marked correctly, it still doesn’t go in until you look at the socket.

USB-A connection is not a classic system, you have to collapse the wavefunction before the connectors can match.

The proper way is to always look first.


It's really only the sideways ones which give people trouble, especially if it's sideways on the back of a computer (or TV) so you can't really see what you're doing.


Desktop computers are fairly easy too. The vast majority of towers have the motherboard on the right-hand side, so that can be treated as the "down" direction USB-wise.


Intel added satirical text about the rule, with a double-tongued depiction, in one of their whitepapers around the USB3 publication for a reason. Sadly, I couldn't find it.


Doesn't work with vertical slots.


It's because of the quantum properties of USB connectors. They have spin 1/2.


I thought it was because USB connectors occupy 4 spatial dimensions.


That's good, because otherwise we'd never be able to find them when we need them.


Instead we always find a USB type mini B when needing a micro B, a micro B when needing a type C, and a type C when needing an extended micro B. If you reveal a spare extended micro B whilst rummaging around, then it will additionally transpire that the next cable needed will be a mini B, irrespective of any prior expectation you may have held about the device in question.

A randomly occurring old-school full-size type B may be encountered during any cable search, approximately 1% of the time, usually at the same moment your printer jams.

What I really don't understand, however, is why I keep finding DB13W3s in my closet


Just 3, plus 1 imaginary.


I boosted my USB plugged-in-successfully-on-first-try rate when I imagined the offset block in the cable's male USB connector as being heavy, so it should be below the centerline when plugged into a laptop's female USB connector. (Only works when the connector is horizontal, but better than nothing.)


USB-C changed that to "It'll physically fit the first time, but good luck figuring out if it's going to work!"


Does it ever happen that USB-C works one way but not the other?


No, it's a reference to the fact that USB-C is really a meta-protocol, which can (and is) implemented in many different ways by manufacturers - and thanks to variation in power handling, it also carries the odd chance of frying your devices!

... Isn't progress wonderful?

(Jk, it's so much better than USB-A. But the wtf moments are real)


15 types of cables all called USB-C

I’m sure it’s an improvement.


if only this went into where the ocaml escape came from :)



But this doesn’t really explain anything: ‘\010’ isn’t really any more primitive than ‘\x0a’: they’re just different representations of the same bit sequence


But it is more primitive than '\n', and can be rendered into binary without any further arbitrary conversion steps (arbitrary in that there's nothing in '\n' that says it should mean 10). It's just "transform the number after the backslash into the byte with that value".


Yeah, but what maps the characters 1 and 0 to the numerical value 10? I bet it’s the same chain of compilation, at some point leading to an assembly compiler, and that itself leading to some proto binary implementation, maybe even going into bios/hardware implementation?


The logic can be implemented entirely in a parser written in a high level language. For example,

    //const int ZERO = '0';
    const int ZERO = 0x30;

    int convert(const char *s){
        int ret = 0;
        for (; *s; s++){
            ret *= 10;
            ret += *s - ZERO;
        }
        return ret;
    }
After that you just dump the value to the output verbatim. Any high level language can handle this, it's just a change of representation like any other. There's no need to do any further delegation.


Replying to myself because I can't edit the comment anymore. I think I see what you mean now. The parser for the language that compiles your parser also needs to know that "10" means 10, and that "0x30" means 48. I guess that also is knowledge that is passed down through compiler generations. I wonder if you could write a simple parser that doesn't contain a single numerical constant.
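
For what it's worth, a toy sketch of such a parser; every number it needs is derived from character constants, which of course just moves the question of who knows what '0' is worth one layer further down:

    /* Toy sketch: a decimal parser with no numeric literal in it, relying
       on '0'..'9' being contiguous. */
    int parse_decimal(const char *s) {
        int ten = ('9' - '0') + ('1' - '0');   /* 10, without writing 10 */
        int ret = '0' - '0';                   /* 0, without writing 0 */
        for (; *s; s++)
            ret = ret * ten + (*s - '0');
        return ret;
    }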


This is very interesting. I don't know if it was the intent, but Jon Blow doesn't plan to self-host the Jai compiler anytime soon, and I wonder if the rationale is at least in part due to things like this. It's interesting how he doesn't see self-hosting as any sort of immediate goal, quite unlike Rust and Zig.


This is fascinating and terrifying.


wait, Rust was bootstrapped via OCaml?


Yep

Also, the original Rust developers were OCaml devs, Rust borrowed design features (like matching and additive types) from OCaml, the syntax for lifetimes in Rust is the syntax for polymorphism in OCaml, and they borrowed semantics such as latent typing.


That is very interesting!


Backslash escape codes are a convention. They're so pervasive that we sometimes forget this. It could just as easily be some other character that is special and treated as an escape token.


Why backslash?


Because backslash is a modern invention with no prior meaning in text. It was invented to allow writing the mathematical "and" and "or" symbols as /\ and \/.


Hm. According to Wiki, "As of November 2022, efforts to identify either the origin of this character or its purpose before the 1960s have not been successful."

While your rationale was used to argue for its inclusion in ASCII, as an origin story however it is very unlikely, as (according to wiki again): "The earliest known reference found to date is a 1937 maintenance manual from the Teletype Corporation with a photograph showing the keyboard of its Kleinschmidt keyboard perforator WPE-3 using the Wheatstone system."

The Kleinschmidt keyboard perforator was used for sending telegraphs, and is not well equipped with mathematical symbols, or indeed any symbols at all besides forward slash, backslash, question mark, and equals sign. Not even period!


"Why character X for Y?" has usually a universal answer: character X wasn't used commonly for the given scope at the time, was available widely, and felt sensible enough for Y.


Since you mentioned Y, why was the Y combinator named Y?

Were A...X already taken?


A lot LOT better content than his stupid wanko manko perv post


I thought this was going to be about '\N' but there's only '\n' here.


It's in the html doc title but the article doesn't deliver.


this is a nothingburger of an article



Cool, but actually it was just 0x0A all along! The symbolic representation was always just an alias. It didn’t actually go through 81 generations of rustc to get back to “where it really came from”, as you’d be able to see if you could see the real binary behind the symbols. Yes I am just being a bit silly for fun, but at the same time, I know I personally often commit the error of imagining that my representations and abstractions are the real thing, and forgetting that they’re really just headstuff.


The point of the article, which a lot of people in the comments seem to be missing, is that this alias is not defined in the SOURCE CODE of foo. And not in the Rust compiler. And the Rust compiler is compiled... using the Rust compiler! So if you follow the source code, there's no definition anywhere; that is, unless you have knowledge (which isn't there in the source code, at least not in the latest revision) that the Rust compiler wasn't always compiled with the Rust compiler (duh!), and you know where it was compiled. So the definition of this alias is a piece of information that is somewhat "parasitic", secondary, embedded implicitly, like non-DNA information inherited from your parents, i.e. epigenetics (not memes).


Yeah, but my point is, it’s 0A in the compiler’s binary. ‘\n’ is just its name. If you could somehow see inline where the name maps to in the binary, you’d be able to go look at it.



