Why would you expect char to be signed?

If you mean just because it's signed on x86, fair enough; but it sounds as if you think signed is just a more natural option. My intuition goes the other way, for what it's worth.

Anyway, here is one possible reason. It used to be that if you wanted to load a single byte from memory on ARM, the only way to do it always treated it as unsigned. So if you wanted to work with signed chars, you needed explicit extra instructions to do the sign-extension. This isn't true for more recent versions of the ARM ISA -- there's an LDRSB instruction to go with the older LDRB -- but it may be one reason why that choice was originally made.
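To make the cost concrete, here's a rough sketch of the two loads in C; the instruction notes in the comments are an assumption about what a compiler for an older ARM core (before LDRSB existed) would emit:

    int load_unsigned(const unsigned char *p) {
        return *p;   /* LDRB: the byte is zero-extended for free */
    }

    int load_signed(const signed char *p) {
        return *p;   /* on old ARM: LDRB plus explicit sign extension
                        (e.g. a shift-left / arithmetic-shift-right pair);
                        a single LDRSB on later ISA versions */
    }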




I'd expect char to be signed, because all other integer types are signed by default.


"char" has something else unusual going on. If you compare it to a "unsigned char" in an if-condition, and compile with -Wall -Wextra, you'll get a warning about casting. If you compare it to a "signed char" on the same system, you'll still get the warning! In fact "char" is not considered to be exactly the same as "signed char" or "unsigned char", it has 3 variants! Even though logically it must be one or the other on a particular platform. So you could think of char as mostly characters, whether ascii, or iso8859-1, or utf8 code-units ...

Functions like toupper() take an int but say the value must be representable as an unsigned char (or be EOF), so technically you need to cast to "unsigned char" on most platforms. But I doubt anybody uses these functions for anything but ASCII. They may work for single-byte locales like ISO 8859-1 if you have the locale env vars set right, but they won't work for non-ASCII chars encoded as UTF-8, which is generally what you want to use these days. (There's towupper(), which works with 2-byte locales like UCS-2 -- i.e. UTF-16 without surrogate pairs, which can't represent all the newer Unicode chars -- but you probably don't want to go there; you want a modern Unicode library that works properly with UTF-8 or UCS-4.)
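The usual defensive idiom, for what it's worth, is to route the value through unsigned char yourself (a sketch, assuming single-byte input; the function name is made up):

    #include <ctype.h>

    /* The cast matters on platforms where plain char is signed: a byte
       like 0xE9 would otherwise reach toupper() as a negative int,
       which is undefined behaviour. */
    char upper(char c) {
        return (char) toupper((unsigned char) c);
    }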


towupper() works with wide chars, which aren't necessarily 2 bytes. In fact on UNIX-like systems wchar_t tends to be 32 bits and wide chars are usually UCS-4.
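Easy enough to check on any given box (illustration only; this typically prints 4 on Linux/macOS and 2 on Windows):

    #include <stddef.h>
    #include <stdio.h>

    int main(void) {
        printf("wchar_t is %zu bytes here\n", sizeof(wchar_t));
        return 0;
    }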


I honestly didn't know that ... I've never actually used the libc wchar_t functions :)


Good, because nobody should use wchar_t at all. It's the API that was thought up by people who got drunk and asked themselves "how can we make this char situation even worse?" wchar_t is widely recognized as one of those huge mistakes from the 90s, along with UCS-2. Today you should store strings as bytes using UTF-8, and if you need to handle them in a fixed-width format you would choose an explicit 32-bit-wide type.
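Roughly what that looks like as a sketch (the names are made up, and the u8/U literal prefixes assume a C11 compiler):

    #include <uchar.h>   /* char32_t */

    /* "\u00e9" keeps the example independent of the source file's encoding. */
    const char greeting_utf8[]    = u8"h\u00e9llo";  /* stored as UTF-8 bytes */
    const char32_t greeting_u32[] = U"h\u00e9llo";   /* fixed-width code points */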


Except when you’re on Windows and have to use WCHAR to handle Unicode characters, because it uses UCS-2, not UTF-8.


On Windows WCHAR is defined to hold a 16-bit unicode character, and is defined to be unsigned. In standard C wchar_t can be any damn thing, and isn't even guaranteed to be wider than char. It can be signed or unsigned. It is useless.


I know you're technically right, but it still seems bizarre to me to ever use char as an integer. If I wanted a byte-sized integer, I would use int8_t/uint8_t (today, anyway).

The only use for char that ever seemed intuitively reasonable to me was to hold ascii characters.


char has another use. I've always figured char pointers were the proper way to provide byte-level access to other objects, since char pointers are allowed to alias other pointer types. That is, they're not bound by strict aliasing rules. I don't think int8_t or uint8_t have the same special exception.

This means you could implement your own versions of memcpy, fread and fwrite by casting the void * arguments to char *, but if you cast them to uint8_t *, your code might not be correct.
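Something like this minimal sketch, where the whole point is that the character-type pointer is allowed to inspect any object's bytes (the name my_memcpy is made up):

    #include <stddef.h>

    /* The casts from void * to a character type are the exemption being
       described: these pointers may read and write the bytes of any
       object without violating strict aliasing. */
    void *my_memcpy(void *dst, const void *src, size_t n) {
        char *d = dst;
        const char *s = src;
        while (n--)
            *d++ = *s++;
        return dst;
    }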


That rule also applies to unsigned char and signed char. In practice I think uint8_t and int8_t are usually just typedefs for these respectively, but in principle they needn't be, so you're correct that the aliasing exemption might not apply to those.

I would tend to prefer explicitly using unsigned char or signed char rather than plain char though, partly to signal that I am treating the bytes as integers rather than characters. (Actually I would still use uint8_t even though I just learnt it might not be unsigned char, because it looks clearer in my eyes, but I'm not sure I should admit it here...)


Prior to C99 adoption, char was usually how you got an int8_t. ASCII characters are technically 7 bits, so should the extra bit be a sign bit?

That's all just in fun, though; I like the consistency argument, and the fact (is it a fact?) that char is signed on most platforms.


> char was usually how you got a int8_t

Not sure if that's just a typo, but you would use a signed char, which is a different type to char even on implementations where char is signed. Part of the reason for this, of course, is because char can be unsigned so if you want a signed integer you have to specify that. But more philosophically, unsigned char and signed char are numerical types that are not meant to be characters (despite their names), whereas char is a character type that just happens to be backed by an integer.

Indeed I believe that int8_t is almost always just a typedef for signed char (but I still would use int8_t where available for clarity).


I'd expect them to go by the standard. "The C Programming Language" has this to say:

> Whether plain chars are signed or unsigned is machine-dependent, but printable characters are always positive.

Kernighan, Brian W. The C Programming Language (p. 36). Pearson Education. Kindle Edition.


Well, char has two semantic meanings. Either as a raw byte, or as an ASCII value. Both are represented as unsigned values (at least conceptually), so making them unsigned-by-default is fairly reasonable. Integers in mathematics and common usage are signed, so making them signed-by-default is also fairly reasonable.

But if you think of char as a typedef for [u]int8_t, then I do get the consistency argument.


> I'd expect char to be signed, because all other integer types are signed by default.

But why would you expect char to represent a generic integer in the first place?

It's just a wrapper for bytes, which by nature are just bits, devoid of traditional mathematical, numerical value.



