How does Base32 (or any Base2^n) work exactly? (ptrchm.com)
70 points by pchm on Dec 17, 2023 | 45 comments



Hey Piotr/pchm, I'm not sure I follow your argument that Base32 is less popular because it's not a standard (there is a standard - RFC4648 as you mention).

Not implementing the RFC means not implementing Base32: changing the order, or using 32 emoji, does not make it Base32. Put another way, you can change the order of characters in Base64, or use a different dictionary, and indeed there are several variants of that too (BinHex4, Uuencoding, Base64Url, B64), each with its own implementation details to worry about.

Base64 won out as a reasonably dense way to encode binary data in 7-bit safe ASCII for use in email, and later HTTP headers (where spacing and line length may be modified in transit, and some ASCII characters are prohibited, e.g. 0x00/null). Part of the reason is that bit-grouping makes encode/decode simpler (you can use bit shifting). Something like ASCII85/Base85 is a denser encoding, and close to the maximum you can get in 7-bit safe ASCII (94 characters, 33-126, if space is important; 95 if space quantity can be preserved), but you have to use multiply/divide instructions. The intersection of bit-shift speed (power of 2) and 7-bit safe ASCII characters (max 94 values) is: binary, base4, octal, hexadecimal, base32, and base64.
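
To make the bit-shifting point concrete, here is a minimal Python sketch of an unpadded Base32 encoder that uses only shifts and masks, with no multiply or divide. The function name `base32_encode_shift` is made up for illustration; the alphabet is the RFC 4648 one, and the result is compared against the standard library's `base64.b32encode` with the `=` padding stripped.

```python
import base64

RFC4648_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"

def base32_encode_shift(data: bytes) -> str:
    """Encode bytes as unpadded RFC 4648 Base32 using only shifts and masks."""
    out = []
    buffer = 0  # bits read from the input but not yet emitted
    bits = 0    # how many valid bits are currently in buffer
    for byte in data:
        buffer = (buffer << 8) | byte
        bits += 8
        while bits >= 5:
            bits -= 5
            out.append(RFC4648_ALPHABET[(buffer >> bits) & 0x1F])
        buffer &= (1 << bits) - 1  # keep only the bits not yet emitted
    if bits:
        # Pad the final partial group with zero bits on the right.
        out.append(RFC4648_ALPHABET[(buffer << (5 - bits)) & 0x1F])
    return "".join(out)

sample = b"Hello, world!"
print(base32_encode_shift(sample))
print(base64.b32encode(sample).decode("ascii").rstrip("="))
```

A Base85 encoder cannot be written this way, because 85 is not a power of two and each group has to be split with division and remainder.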

For human readability, especially verbal communication, hexadecimal or base32 are advantageous in that they are more dense than decimal, can be generated via bit-shifting vs more complex processor instructions, but you needn't also communicate the character's case (unlike Base64).


You make some good points. What I was trying to say is that even though there is the RFC, it's quite common to modify the alphabet or use other variants like Crockford's (mainly to avoid random profanity, e.g. in the URL identifiers).

When you see a Base64 string, you can be pretty certain that it's the standard version. With Base32, it's not obvious which variant was used.

Many languages don't provide a stdlib Base32 implementation (Ruby doesn't), but Base64 is pretty much always included. Maybe this influenced my perception of the lack of a universal standard.

Anyway, I should work on that section to communicate my point better.


I believe the technical term is “Schelling point”: something that people can decide on without communication.

Base64 is very close to the Schelling point of Base62 i.e. [A-Za-z0-9], requiring only a couple more additional decisions to be made: which two extra characters to add.

Unfortunately the original Base64 inexplicably got this wrong and chose + and / instead of the more sensible choice of - and _


In some cases (luck of the data, but often when encoding ASCII without padding) you won't see the non alphanumeric characters (62nd and 63rd place) in Base64 either. So you can't always tell the difference between Base64, Base64Url, Xxencode, or B64.

"Hello, world!" = `SGVsbG8sIHdvcmxkIQ` (base64, base64url), `BG4JgP4wg65RjQalY6E` (Xxencode), or `G4JgP4wg65RjQalY6E` (b64). A legitimate reason for choosing B64 over Base64 would be: it maintains ASCII sort-order.

Any language that has to deal with HTTP (or MIME) has to encode/decode Base64 in order to support some headers (eg Basic auth) and features (binary data from a form submission). There is no similar HTTP need for Base32, so perhaps it's less surprising it's not in the standard library?
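
A quick way to check the "Hello, world!" case with Python's standard `base64` module. The helper name `b64_variants` is only for illustration. The second input is chosen so that the 62nd and 63rd symbols do show up, which is the only time standard Base64 and Base64Url can be told apart.

```python
import base64

def b64_variants(data: bytes):
    """Return (standard, url-safe) Base64 encodings with '=' padding stripped."""
    standard = base64.b64encode(data).decode("ascii").rstrip("=")
    url_safe = base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")
    return standard, url_safe

# ASCII text: no symbol 62 or 63 appears, so both variants are identical.
print(b64_variants(b"Hello, world!"))  # ('SGVsbG8sIHdvcmxkIQ', 'SGVsbG8sIHdvcmxkIQ')

# Bytes chosen to hit indexes 62 and 63: the variants now differ.
print(b64_variants(b"\xfb\xff"))       # ('+/8', '-_8')
```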


If anyone is interested, here is my article about optimizing the encoding/decoding of u128 to base62 (non power of 2): https://dev.to/rsk/optimization-of-u128-to-base62-encoding-3...


A few other bases that are interesting:

Base36: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ

Good encoding for binary data in textual contexts, such as where you have parameter inputs or database fields that are constrained and only accept certain characters. The lack of spaces means that it can be used on the command-line easily. Example use: IRC channel names.

Base64: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

Same as above but it adds lower-case alphabet characters. This is important because the more you restrict the set of allowed characters, the longer the encoded string gets. With more characters the coding is more efficient. Example use: YouTube video ids.

Base92: "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~!@#$%^&*()_+{}\"<>?`-=[];',./|"

Base92 is every character you can make on a standard key-board (I've replaced space with pipe here.) It includes many characters that have special meanings on the command-line or may be used as delimiters in text-based protocols. So while this offers a more 'efficient' encoding scheme for binary data it may break in some contexts. It's best where the input allows for typical formatting. Example use: forum / chat messages.

BaseN encoding schemes are interesting because they allow you to use standard text-fields in many systems to store binary data. The most well-known here is base64 which allows browsers to embed whole files as text and store them directly in the HTML. Some sites use these for optimization hacks.
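
To put numbers on "more characters means shorter strings", here is a small Python sketch that counts how many symbols are needed to write any 128-bit value (a GUID, say) in a given base. The function name `symbols_needed` is hypothetical, and it treats the input as one big integer, so it ignores padding and block effects of real stream encoders.

```python
def symbols_needed(base: int, bits: int) -> int:
    """Smallest n such that base**n >= 2**bits, using exact integer math."""
    n = 0
    capacity = 1
    while capacity < 2 ** bits:
        capacity *= base
        n += 1
    return n

for base in (16, 32, 36, 62, 64, 92):
    print(base, symbols_needed(base, 128))
# 16 -> 32, 32 -> 26, 36 -> 25, 62 -> 22, 64 -> 22, 92 -> 20
```

The growth is logarithmic: going from 36 symbols to 92 saves only 5 characters on a 128-bit value.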


That is not base64, it's base62. You can tell because it only has 62 symbols. To get base64 you have to add 2 symbols that you arbitrarily select from the master "table of symbols to add to base62 to get to base64 depending on what the platform is and what characters are restricted in it" [1]. For instance you might use `@`, except in an email. Or `/`, but not in an fs path or URL.

As for base92, those symbols might all be easy to enter on your keyboard, but on international layouts the process can be quite involved indeed.

I prefer base36 for this reason. Want a compact random string? Math.random().toString(36). Watch out to prefix it with a char for settings that disallow leading digits though! (variable identifiers, css class names, etc.)

[1] https://en.wikipedia.org/wiki/Base64#Variants_summary_table


Base62 is fantastic for URL-friendly encoding. I use GUIDs for primary keys in my web app, and encode them for frontend consumption using Base62. Looks much neater and doesn't cause issues like Base64 extra characters might.
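
A minimal Python sketch of that GUID-to-Base62 round trip. The function names and the fixed width of 22 are assumptions for illustration, not the commenter's actual code; the alphabet order (digits, upper-case, lower-case) follows the 62-character string quoted elsewhere in this thread.

```python
import string
import uuid

BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase
WIDTH = 22  # 62**22 >= 2**128, so 22 symbols always fit a 128-bit GUID

def guid_to_base62(guid: uuid.UUID) -> str:
    """Encode a UUID as a fixed-width, zero-padded Base62 string."""
    n = guid.int
    symbols = []
    for _ in range(WIDTH):
        n, remainder = divmod(n, 62)
        symbols.append(BASE62[remainder])
    return "".join(reversed(symbols))

def base62_to_guid(text: str) -> uuid.UUID:
    """Decode a Base62 string back to a UUID (ValueError on bad input)."""
    n = 0
    for ch in text:
        n = n * 62 + BASE62.index(ch)
    return uuid.UUID(int=n)

key = uuid.UUID("12345678-1234-5678-1234-567812345678")
token = guid_to_base62(key)
print(token, base62_to_guid(token) == key)
```

The fixed width keeps every token the same length, which is convenient in URLs but costs a few leading zeros for small values.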


>Base64: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

This is only 62 characters!


Suspect they mistyped `Base62` (since they seem to be favouring non powers of two)


Base64 uses / and +


Sometimes. Other times those chars are not allowed in the embedding context (paths, for instance), so you have to use '+' and ','. Or maybe '_' and '-'.


If you follow rfc4648, those are "base 64" but not "base64":

> This encoding [using '-' and '_'] may be referred to as "base64url"... Unless clarified otherwise, "base64" refers to the base 64 in the previous section ['+' and '/'].


> base36 textual contexts

Better IMO is base 32 with U (obscenity), 0/O (ambiguity), and I (ambiguity) removed.


Removing characters for obscenity is pointless (there are thousands of ways to evade this "filter"), English-centric and honestly a weird idea.

I've always heard that the reason is another ambiguity (u/v), which makes more sense to me.


Base64/Base32/ASCII is English-centric.

Might be weird to you personally, but there are literally government agencies to prevent obscenities.


What makes the letter U obscene?


You can make the word fuck with it. That upsets children on the internet.


I doubt that upsets any children on the internet; more likely it upsets some adults on behalf of children on the internet.


If that's what you're trying to avoid, it will be a lot more effective to remove F.


Might as well go for Base27 then. Strip out all of the vowels and you can't accidentally make naughty words any more.


That's Crockford Base32, not RFC Base32

https://en.m.wikipedia.org/wiki/Base32


Crockford is a bit different, and normalizes I/1/O/0 on parsing.


Do we really expect humans to read baseX encodings directly often enough to make ambiguity checks worthwhile?


Sometimes. Imagine if this is being used to generate something like a DOI or other catalog number for some data or physical artifact. As research scales up, the size of these identifiers also benefits from a more compact encoding.

These kinds of IDs might be printed in a research paper (perhaps in a figure caption or bibliography/reference entry). Then, someone might be reading this from a printed copy of the paper rather than a PDF with a link in it.

Or, researchers might be verbally referencing a particular item during some meeting. It might be recognizable among some peers actively working with the same artifacts, but might also need to be typed back into some search form to get back to online metadata etc.

Another place the same identifier might be is on a printed label for physical artifacts in an archive. Of course, you might also want something like a 2D barcode for scanning, but it is helpful to have something human readable.


Removing U just means your CD key begins with FCKGW


So.. crockford32 mentioned in the article?


Base32 encouraged me to develop my own Base32-encoder on .NET. I eventually added other encoding types over the years, leading to the library called SimpleBase. It's now being used by popular packages like Ipfs.Core, net-dns, and KubeOps.

https://github.com/ssg/SimpleBase


Just putting this here for alternate baseX: https://github.com/qntm/base2048


Radix: https://en.wikipedia.org/wiki/Radix

- "Golden ratio base is a non-integer positional numeral system" (2023) https://news.ycombinator.com/item?id=37969716

Number systems > Classification: https://en.wikipedia.org/wiki/Number#Classification ; N natural numbers, Z[±] integers, Q rationals, R reals, C complex numbers (with complex conjugate exponents), infinity or infinities; N ⊆ Z ⊆ Q ⊆ R ⊆ C

***

  i^4x ~= e^iπx
Also, perhaps this is a better representation for a continuum of reals:

   e^(x*yi*zπ)
But then there's no zero; only quantities approaching zero.

   e^(x*yi*zπ) * e^(a*bi*cπ)
But then there are still no negative numbers, so:

   sign * e^(x*yi*zπ) * e^(a*bi*cπ)
Where `sign` is in {-1,0,1}, or maybe just this would be the ultimate radix:

   sign * e^(x*yi*zπ)
But then represent infinity, or infinity_expr1 (because e.g. 1/x != 2/x except at x=0)


Which encoding eliminates symbols that can be confused, like O vs 0? Or I vs l?



Crockford variant of Base32 uses a different encoding alphabet and treats those similar characters as aliases.
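
A small Python sketch of how that aliasing works on the decoding side. The alphabet is Crockford's (digits plus upper-case letters without I, L, O, U), and the normalization maps O to 0 and I/L to 1, ignores case, and drops hyphens. The function names are made up for illustration, and check symbols are left out.

```python
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# Aliases applied before decoding: O -> 0, I and L -> 1, hyphens removed.
_ALIASES = str.maketrans({"O": "0", "I": "1", "L": "1", "-": None})

def crockford_normalize(text: str) -> str:
    """Upper-case the input and fold look-alike characters together."""
    return text.upper().translate(_ALIASES)

def crockford_decode_int(text: str) -> int:
    """Decode a Crockford Base32 string to an integer.

    Raises ValueError for symbols outside the alphabet (for example U).
    """
    value = 0
    for ch in crockford_normalize(text):
        value = (value << 5) | CROCKFORD.index(ch)
    return value

print(crockford_decode_int("1O"), crockford_decode_int("10"))  # 32 32
print(crockford_decode_int("il"), crockford_decode_int("11"))  # 33 33
```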


I’m a big fan of base58

+ almost as efficient as base64
+ no special characters
+ no padding characters


Unlike Base64 or Base32, Base58 has approximately O(N^2) complexity because it requires iterative division and multiplication operations on big integers. You can't encode a gigabyte of data with Base58 in a reasonable time, but you certainly can with Base64 or Base32.
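
For reference, a minimal Python sketch of the usual whole-input Base58 scheme, to show where the quadratic cost comes from. It assumes the Bitcoin alphabet and the common convention that each leading zero byte becomes a leading `1`; the function names are illustrative. The input is treated as one big integer, so every `divmod` touches the whole remaining number, and there are about as many iterations as output symbols.

```python
B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_encode(data: bytes) -> str:
    """Encode bytes as Base58 by repeated big-integer division."""
    n = int.from_bytes(data, "big")
    symbols = []
    while n:
        n, remainder = divmod(n, 58)  # cost grows with the size of n
        symbols.append(B58[remainder])
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + "".join(reversed(symbols))

def base58_decode(text: str) -> bytes:
    """Inverse of base58_encode (ValueError on unknown symbols)."""
    n = 0
    for ch in text:
        n = n * 58 + B58.index(ch)
    leading_ones = len(text) - len(text.lstrip("1"))
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b"\x00" * leading_ones + body

print(base58_encode(b"\x3a"))  # 58 decimal -> "21"
```

Base64 and Base32 avoid this because each output symbol depends only on a fixed-size group of input bits.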


I thought base58 runs on 8 byte blocks because 58^11 is slightly larger than 256^8. Then I checked the spec and this is actually not a standard requirement.


Haven't seen that, but how do you work with carry for numbers that are 256^8 < x < 58^11 ?


If we cut off at 8 byte blocks every number would be < 256^8. Encountering x >= 256^8 would simply be invalid.


In that case, there would be padding left in every encoded block. The size overhead would weaken the case for Base58 especially if you consider using it for arbitrarily long data.
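
A rough Python sketch of the block idea discussed above: each 8-byte block becomes exactly 11 Base58 symbols, and any 11-symbol group that decodes to 256^8 or more is rejected. This is a hypothetical scheme for illustration, not something the Base58 spec requires, and it only handles whole 8-byte blocks.

```python
B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

BLOCK_BYTES = 8
BLOCK_SYMBOLS = 11  # 58**10 < 256**8 < 58**11

def encode_block(block: bytes) -> str:
    """Encode exactly 8 bytes as exactly 11 Base58 symbols."""
    if len(block) != BLOCK_BYTES:
        raise ValueError("block must be exactly 8 bytes")
    n = int.from_bytes(block, "big")
    symbols = []
    for _ in range(BLOCK_SYMBOLS):
        n, remainder = divmod(n, 58)
        symbols.append(B58[remainder])
    return "".join(reversed(symbols))

def decode_block(text: str) -> bytes:
    """Decode 11 Base58 symbols; values >= 256**8 are invalid."""
    if len(text) != BLOCK_SYMBOLS:
        raise ValueError("block must be exactly 11 symbols")
    n = 0
    for ch in text:
        n = n * 58 + B58.index(ch)
    if n >= 256 ** BLOCK_BYTES:
        raise ValueError("value does not fit in 8 bytes")
    return n.to_bytes(BLOCK_BYTES, "big")

print(encode_block(bytes(8)))           # 11111111111
print(11 / 8, 4 / 3)                    # symbols per byte: block Base58 vs Base64
```

With fixed-size blocks the work is linear in the input, at a cost of 11/8 = 1.375 symbols per byte versus roughly 1.366 for whole-input Base58 and 1.333 for Base64.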


Seems like a compiler should be able to convert division to shifts and subtractions.

> u8 divmod 58 can be reduced to a u8->u16 multiply, a right shift, and three conditional subtractions; that's not great, but on a modern CPU it's a afterthought compared to the quadratic loop over the input size.

Same topic from 2018: https://news.ycombinator.com/item?id=18409344


I don't understand that comment. How do you handle carry?
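
One way to read the carry question, sketched in Python: in byte-wise long division the value divided at each step is `carry * 256 + byte` with `carry < 58`, so it is always below 58 * 256 and the division can be replaced by a multiply and a shift. The constant 1130 and the 16-bit shift are my own choice for this sketch (exact for all values in that range), not the sequence described in the quoted comment, and the function names are illustrative.

```python
def divmod58(v: int):
    """divmod(v, 58) via multiply and shift; valid for 0 <= v < 58 * 256."""
    quotient = (v * 1130) >> 16
    return quotient, v - quotient * 58

def base58_digits_bytewise(data: bytes) -> list:
    """Base58 digit values (most significant first) by byte-wise long division."""
    number = list(data)  # big-endian base-256 digits, overwritten by each quotient
    digits = []
    while any(number):
        carry = 0
        for i, byte in enumerate(number):
            # The remainder from the previous byte is carried into this one.
            number[i], carry = divmod58(carry * 256 + byte)
        digits.append(carry)  # final remainder is the next least significant digit
    return digits[::-1]

print(base58_digits_bytewise(b"\x01\x00"))  # 256 = 4 * 58 + 24 -> [4, 24]
```

Each pass is cheap per byte, but there is still one full pass over the number per output digit, which is the quadratic loop the quote refers to.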


Base64 doesn't need padding so that one's easy.

No special characters... I mean it's true, but there's not many places I'm worried about inability to mix in some - and _.

Base58 also avoids a couple confusable characters, but that only matters when copying by hand, and if I'm copying by hand I'd rather use base32.


I'm a big fan of not base anything encoding


Why? Use Vinci for everything?


I like plaintext



