ChatGPT says a desktop CPU uses 105Wh for every hour of maxed computations. I th...

SushiHippie · on June 18, 2024

FWIW getting 8 0s took me 30 seconds, 9 0s definitely less than an hour, but after that I only got lower 9 0s, and 10 0s do not seem to be in sight

quirino · on June 18, 2024

One hour later, it seems like you've got your 10th zero! :)

Luck plays a tiny part too, I'm testing sequentially and I got my 9th zero with only ~7e8 hashes, a whopping 100x earlier than expected.

SushiHippie · on June 18, 2024

Oh I didn't notice, I've written something so it automatically submits if it's a better value haha.

Yep, crazy how long it took, but currently I'm also running multiple 'variants'. All the same program except the nonce generation, one with numbers, one with 32 character long 'base64' and one with 64 character long 'base64'. I got inspired by seletskiy who instead of using numbers used the whole allowed character set for the nonce.

playingalong · on June 18, 2024

Why would the selection of characters matter? (Other than making sure you try distinct input strings)

SushiHippie · on June 18, 2024

It doesn't but I'm going sequential, so I needed different character sets for each instance I launched.

NAHWheatCracker · on June 18, 2024

I guess either my computer or my code is really bad! I only got a mid-8 0s with multiple hours.

SushiHippie · on June 18, 2024

How many hashes does your code go through per second?

I just tried your HN username, and it took me these times:

19 seconds for 7 0s - nonce: 293576344

28 seconds for 8 0s - nonce: 436316829

NAHWheatCracker · on June 18, 2024

I added some instrumentation and it was doing around 3.3 million hashes per second. Mine isn't sequential, it's generating random strings. Changing from a 64 byte string to a 26 byte string with (14 character nonce) brought it up to around 6 MH/s.

Given the nonces you provided, I guess you're going sequential and you're hashing 2-3 times faster than me. I just got a 7 0 hash in 70 seconds and an 8 0 hash in 164 seconds. You're right and I was overestimating how long I waited around last night.

SushiHippie · on June 18, 2024

After experimenting more with this, it seems like with random generation the bottle neck is the "input generation" and not the hashing itself, that's why I went with sequential.

I replaced my nonce generation with a static string and my hashrate went to the moon, so I invested more time in optimizing my input generation.

NAHWheatCracker · on June 18, 2024

I profiled my code and you're right that it was using a big chunk of CPU on generating the input. Approximately 40-50% of the CPU time was taken up by go's rand.IntN function.

I was able to optimize it so that I only call rand.Int63 twice per string generation, bringing the CPU time of that part down to 10-15%. I'm getting 11.3 MH/s now!

amenhotep · on June 18, 2024

There is really no point randomising at all, any cycles you spend on it are a total waste - one of the most important properties of a good hash algorithm is that a one bit change in the input should result in an output that's completely unrelated to the previous.

Your random number generator is pretty much equivalent to feeding a counter into SHA256, so using it as input is pretty much equivalent to hashing a counter twice. There's no point!

NAHWheatCracker · on June 18, 2024

Yea, I know. It makes me feel better this way, though.

NAHWheatCracker · on June 18, 2024

BTW, I changed it so it's iterative, getting 12.5 MH/s.

tgmatt · on June 18, 2024

Curious. I tried with a C program that ChatGPT spat out using 12 threads of my CPU and I get the 8 zeroes almost instantly. It calculates 1 billion in about 13 seconds, apparently.

Start time: 1718742386

1718742387 SHA-256 hash of "NAHWheatCracker/486005487" is: 00000000cbed8e14c5e52b0c9bef443f017564aa6870da37f85ec92dd01b544d

1718742394 SHA-256 hash of "NAHWheatCracker/993155254" is: 0000000031fadb4805db8036ec66d872bdd9f04a507e0bbfea9f60d1d00b86f2

1718742395 SHA-256 hash of "NAHWheatCracker/436316829" is: 00000000e20d059e8a7dc687b85bd6b33e891f7d4b27e98aaf0c991d0264066e

End time: 1718742403

EDIT: To be clear, I'm not submitting anything to the leaderboard as I appreciate the spirit of the challenge is writing your own code. I'm just having some fun.

EDIT 2: Added some better timing. Interestingly, I modified it to use 64 bit integers and it was dramatically slower. Like, orders of magnitude slower.

ifeja · on June 18, 2024

Please would you post the code

tgmatt · on June 19, 2024

Sure: https://pastebin.com/tFPadBVZ

Again, I didn't write it, but did add some timestamps.

quirino · on June 18, 2024

The top competitors are very likely using GPUs, I guess that would make this cost much much lower...

jonas-w · on June 18, 2024

At least I'm not (second place currently), I'm using a single threaded rust program that chatgpt wrote for me. I'm getting 10-15 million hashes per second.

But seletskiy is, as you can see in his nonce which tells us that he uses an RTX4090 and has 18 Giga Hashes per second. I'm really curious how he wrote that program, as I have a Rx 7900 XTX and would love to use it for this competition haha

seletskiy · on June 18, 2024

Hey. It's nothing very fancy. About 150 lines of C/CUDA code with no deps, including args parsing and logging.

The code runs at steady rate of 18.00-18.40 GH/s at cloud GPU. In fact it's not hashes-per-second, but actually messages-per-second checked.

It launches a 64⁶ kernels in a loop, where each launch checks first two bytes of the SHA of a message concatenated with unique 6-byte nonce per kernel + 6-byte incremental nonce for each launch. There is only one block, so SHA algorithm is heavily trimmed. Also, most of the message is hard-coded, so pre-calculated SHA state is used; it's 10 loops less than needed to encode whole block. Since we only 2 bytes of the hash to check, last two loops also unrolled by hand to exclude parts that we wouldn't need. All code also in big-endian since SHA is, so message hardcoded in big-endian as well.

Base64-encoding is pretty time-consuming, so I've optimized it a bit by reordering the alphabet to be in ASCII-ascending order. I've got to the point where single binary-op optimization can earn 100-500 MH/s speed-up, and I don't really know what else here is remaining.

I don't have RTX4090, so instead I just rented 4090 GPU to run code on. It's less than $0.3 per hour.

I've tried GPT-4 to get some directions for optimization, but every single proposal was useless or straight wrong.

I by no means a C/GPU programmer, so probably it can be optimized much more by someone who more knowledgeable of CUDA.

GPU's are ridiculously fast. It freaks me out that I can compute >18,000,000,000 non-trivial function calls per second.

Anyways, if you want to chat, my e-mail is in the profile.

jffry · on June 21, 2024

I think we've done something similar in our kernels, because I've likewise struggled to squeeze more than ~18GH/sec from a rented 4090. I think the design of SHA256 makes it hard to avoid many of the loop iterations. It's possible there are some fun GPU specific instructions to parallelize some of the adds or memory accesses to make the kernel go faster.

If you limit your variable portion to a base16 alphabet like A-P, radix encoding would just be `nibble + 0x41`. Sure you're encoding 4x as many values, but with full branch predictability. You'd have to experimentally determine if that performs any better.

You could also do something theoretically clever with bitmasking the 4 nibbles then adding a constant like 0x4141 or 0x00410041 or something (I'm too tired to think this through) and then you can encode twice as many per op assuming 32-bit bitwise ops are similar in speed to 16-bit bitwise ops.

Anyways this has been a cool challenge. You might also enjoy hunting for amulets - short poems whose sha256 hash contains a substring that's all 8: https://news.ycombinator.com/item?id=26960729

korbin · on June 21, 2024

SHA256 is designed as such that the maximum amount of data that can be contained within a single block is 440 bits (55 bytes.)

If you carefully organize the nonce at the end and use all 55 bytes, you can pre-hash the first ~20/64 rounds of state and the first several rounds of W generation and just base further iterations off of that static value (this is known as a "midstate optimization.")

> If you limit your variable portion to a base16 alphabet like A-P

The more nonce bits you decide to use, the less you can statically pre-hash.

In FPGA, I am using 64 deep, 8-bit-wide memories to do the alphabet expansion. I am guessing in CUDA you could something similar with `LOP3.LUT`.

jonas-w · on June 18, 2024

Thanks, I sent you an e-mail, you might need to check in spam folder, as gmail doesn't like my mail server.

wdfx · on June 18, 2024

I've also been spending today building a miner. Started in JS, then JS with worker threads (~2.2MH/s), then c++ with openssl (~3.7MH/s) and now attempting CUDA.

Also have been using ChatGPT to do the code translations.

I'm currently stuck in yak shaving, I cannot compile CUDA programs here yet as on fedora 40 I need an earlier gcc (13.2) which I am now also having to compile from source.

jonas-w · on June 18, 2024

I'm also yak shaving currently, as I'm using NixOS... which means spending more time looking for the correct packages and creating environments.

jonas-w · on June 18, 2024

FWIW I got it working with rust and opencl, most of it is written by chatgpt as I have no clue about opencl. GPU usage is only 50-60% and I get 100MH/s.

With hashcat and opencl I could get 12GH/s but I couldn't find a way to use hashcat for this use case.

wdfx · on June 18, 2024

I've parked CUDA for now, I don't understand it enough to care to fix the tooling.

Got an optimised C++ version with no deps averaging about ~24 MH/s on an i7-11800H.

I've got 9 zeros; if I get a result that ranks top 10 I think I'll submit and call it a day.

jonas-w · on June 18, 2024

Wow 24MH/s on an i7 with 8 cores sounds really good!

I don't know how I got it working, but I'm now at 3GH/s with my OpenCL implementation. I basically converted 90% of my rust logic to opencl and now my GPU is at 100% usage and I also needed to switch to a tty, as my window manager became unresponsive haha

I'm kind of glad about this HN post, as I had absolutely no clue about how sha256 and opencl worked before this challenge.

Thanks @quirino

quirino · on June 18, 2024

I'm glad you had some fun! This experiment went about as well as I could hope!

If anyone's curious, I'm getting 4.5MH/s single-threaded and 12.2MH/s multi-threaded on a slightly old i7 with 4 cores.

It's my own C++ implementation, which I've made about 20% faster than the fastest one I found online (Zig/stdlib, also tried Go/stdlib, C++/cgminer, rust/ring, C++/VanitySearch and Python/stdlib).

I think it might be faster just because I was able to skip some steps, since all inputs are short and of the same length.

I've just finished testing 10^12 inputs. I think I'll stop with 10 zeroes, which is very likely to happen in the next couple of days, according to my calculations. I might revisit it later to learn some GPU programming.

jonas-w · on June 18, 2024

Very cool!

But same, I also was able to skip some steps as I can make some assumptions about the input as its fixed in length and doesn't need padding.

I got lucky and found a hash with 12 zeroes in 60s with my OpenCL implementation, and now I got my second spot back.

GPUs are so crazy

nzentzis · on June 18, 2024

I'm using a janky CUDA miner that's getting around 6GH/s on a 3090. My code is kind of bad though since I hacked it together in a couple of hours last night, so I wouldn't be surprised if a proper/competent implementation got 2-3x the hashrate mine does.

jonas-w · on June 18, 2024

Mind sharing it? I have an AMD card, so its not that much of use for me, but still would be interested