The computer this runs on is Ben Eater's 6502 breadboard kit, which I highly recommend to anyone with the slightest interest in wires and NAND gates: https://eater.net/6502. After the hello world program was working, I wanted to try something more substantial, so here's the world's slowest implementation of BLAKE3 :) If you have Rust and Cargo installed, you can run `make emulate` to test out the program in your terminal.
Ha, this is neat, I haven't seen the permutation table presented this way before, good stuff.
permute:
; Rather than permuting the 32-bit words stored in the
; COMPRESS_MSG array, we permute the 8-bit pointers stored
; in the MPTRS array.
;
; The permutation is defined to be:
; 2, 6, 3, 10, 7, 0, 4, 13, 1, 11, 12, 5, 9, 14, 15, 8
;
; If we walk those assignments, we get two loops:
; 0<-2<-3<-10<-12<-9<-11<-5<-0
; 1<-6<-4<-7<-13<-14<-15<-8<-1
If you want to be on the far end of the speed/space trade off you could simply use a lookup table. That should speed things up considerably. You'd index it by byte (split the tables, save one shift left) and then recombine the results.
The fact this takes almost 18 minutes on a 6502 but only ~100ms on my phone is incredible. One of those things that makes you realise we have so much computing power it's hard to comprehend.
Given how the 6502, 8051, and Z80 cores seem to show up in a ton of ultra-low-cost products (running at many times their original clock speeds, and using bank switching to extend the address space into the megabytes range), some even networked with ability to do TLS, the fact that a 6502 can do BLAKE3 isn't so surprising, and IMHO really helps reinforce the notion that crypto == math.
I think the biggest difference between them is that BLAKE3 is still quite new. I agree with https://latacora.micro.blog/2018/04/03/cryptographic-right-a... that most cryptographic applications should make "boring" choices, and BLAKE3 is not yet boring. But assuming everything goes smoothly over the next 2-3 years, I'll start recommending BLAKE3 over BLAKE2 in essentially all cases.
If you're curious about the single-threaded performance differences between BLAKE3 and BLAKE2b, I might clarify that the red bar chart at the top of the BLAKE3 readme is actually a single-threaded measurement. Everything depends on your input size though, and we have some more detailed graphs in the paper.
Unlicense [0] is the "short and simple" counterpart to CC0. I like the idea of simple language licenses, too, but there are thought to be some issues with these [1].