64 bits ought to be enough for anybody (trailofbits.com)
164 points by beagle3 on Nov 28, 2019 | 57 comments



There have been a few OpenCL versus CUDA posts lately and I wanted to add my take on it after writing the GPU portion of sixtyfour, which was my first GPGPU application.

OpenCL needs more examples, tutorials, and documentation. I really wanted to use it because it is cross-platform, but I quickly gave up. The tutorials and documentation weren't there, and what I did find was confusing. For CUDA there is an immense body of examples, support, and documentation that makes creating your first CUDA program dead simple.

I went with CUDA.


> For CUDA there is an immense body of examples, support, and documentation that makes creating your first CUDA program dead simple.

Isn't it dead simple only if you have compatible hardware? OpenCL works everywhere, even in the worst case on the CPU through pocl.


I'm surprised this isn't more of an advantage. Sometimes you just need to test whether your code runs, and in my experience a CPU on my netbook is fine for that purpose. With CUDA I have to upload the code to a VM and test it there, which is both expensive and a bit awkward.


I originally wanted to do just that, but my laptop is a MacBook and OpenCL is deprecated there. With CUDA I could just ssh to an old box under my desk.


I guess you could just ssh with OpenCL too, no?


Yes he can, but why would he? It just doesn't make sense to do so.


I know, that was one of the reasons I wanted to use OpenCL. My time for this was limited, and I saw I could get a basic implementation working much faster in CUDA. All the major cloud providers offer (only) NVIDIA cards too, so it was not a hindrance for the end goal of running it in the cloud.


Different subsets of OpenCL with different extensions work everywhere.

If you want to use C++ then it definitely doesn't work everywhere. Precompiled kernels (SPIR-V) also do not work (anywhere?).

Writing a GPU C program that only gets compiled when you run your program is extra overhead on productivity.


Try to use OpenCL on Android.


I took a class one summer that used OpenCL. Ended up learning a lot. Maybe this will help someone: https://github.com/nixpulvis/gpu_prog_arch


Back when I started to explore GPGPU, OpenCL was all about C, while CUDA allowed me to play with more productive languages, so naturally I also went with CUDA.

Same applies to Vulkan versus other next-gen APIs.


Did you check out SYCL at all?


OpenCL needs template metaprogramming and C++ language features; right now it's too painful to do any serious optimization (like tuning work per element).


The source code is very hard to port to Windows: pthreads instead of OpenMP, Linux syscalls instead of std::chrono, proprietary compiler extensions for CPU SIMD for no reason, etc.
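
For the timing specifically, std::chrono is the portable drop-in. A minimal sketch (my own illustration, not the repo's actual code):

    #include <chrono>
    #include <cstdio>

    int main() {
        const auto start = std::chrono::steady_clock::now();
        // ... the work being timed goes here ...
        const auto stop = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        std::printf("elapsed: %.3f ms\n", ms);
        return 0;
    }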

That’s unfortunate, because it would be interesting to compare a virtualized V100 GPU with a real-hardware 1080Ti. On paper, the two are pretty close. If that’s true in practice, you can use https://vast.ai to reduce the cost by a factor of ~5, i.e. it would only cost about $350.


Concur on pthreads and Linux syscalls, but the intrinsics should be cross-platform.


They are, when used properly.

E.g. to create an SSE register with all 64-bit lanes set to the same value, use _mm_set1_epi64x; for different values, _mm_set_epi64x (don't forget to flip the order). No need to use GCC's proprietary __attribute__((aligned(16))).

Another portable option is C++ and the alignas keyword.
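
A minimal sketch of that portable style (plain C++ with SSE2 intrinsics; the values are arbitrary, purely for illustration):

    #include <emmintrin.h>  // SSE2 intrinsics, same on GCC/Clang/MSVC
    #include <cstdint>
    #include <cstdio>

    int main() {
        // Broadcast one 64-bit value into both lanes: no compiler-specific
        // __attribute__((aligned(16))) needed, the intrinsic handles it.
        __m128i same = _mm_set1_epi64x(0x0123456789abcdefLL);

        // Two different values: arguments go high lane first, low lane second.
        __m128i pair = _mm_set_epi64x(/*high*/ 2, /*low*/ 1);

        // When an aligned buffer really is needed, alignas is the portable spelling.
        alignas(16) uint64_t out[2];
        _mm_store_si128(reinterpret_cast<__m128i*>(out), pair);
        std::printf("low=%llu high=%llu\n",
                    (unsigned long long)out[0], (unsigned long long)out[1]);

        (void)same;
        return 0;
    }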


That website is kind of sketchy. The blog on it doesn't link anywhere. Is it yours?


Vast.ai is a little obscure but it's a real functioning service, and pretty nifty. I used it a while ago to rent an 8-GPU machine for way less than AWS would've cost: https://www.gwern.net/Faces#biggan


Not mine; I found it via DDG with a query like “rent 1080Ti GPU”, and there are more like it.

I don’t need their services, but I think the prices are reasonable.

Nvidia is way too greedy: they charge $5000 for the server equivalent of a $700 consumer GPU. The major difference between the two GPUs is legal, not technical.

Cloud providers like Google or Amazon have little choice but to pass the cost to users.


The server class gpus are generally different silicon than the consumer gpus. That's a technical difference.


> generally different silicon

1080Ti / P100:

Chip: GP102 / GP100

Cores: 3584 / 3584

Base clock, MHz: 1480 / 1328

Single precision TFlops: 10.6 / 10.6

TDP, Watts: 250 / 300

The only major difference between the chips is double-precision performance, probably irrelevant for the OP's task. And I'm not convinced the difference is in the silicon as opposed to firmware or drivers; games don't need doubles, so Nvidia could cripple the consumer model without consequences, just to differentiate the two products.


GP100 and GP102 are different sizes. They are certainly different silicon - the die area can't be changed in firmware or drivers.

From techpowerup:

GP100: 610 mm^2

GP102: 471 mm^2


Good point. You’re right, they are different chips. But still, the only differences are double-precision performance (that’s probably what those extra transistors do) and support for HBM2 memory as opposed to GDDR5X. Pretty sure Amazon/Google/MS would be happy to offer affordable cloud-based GP102s if they could.


You don't seem to be familiar with binning.


I am, but I think greed contributes much more to the price.

Nvidia also makes the Tesla P40. It has the same GP102 chip as the 1080Ti, same frequencies, identical specs. The price is around $5000. The card is not 100% the same, it has more VRAM, but still, that's an 8x price difference for almost identical products.

If Teslas really were better, Nvidia wouldn't need to put that ridiculous paragraph about data center usage in the driver EULA. That paragraph is the only reason we don't have affordable GPGPU in AWS.


I don't think I understand the problem... to do this search you need to have the target number, and if you already have the target number no search is required. What am I missing?


So the goal wasn’t to find a new and more expensive method for figuring out an answer when you can examine the oracle. The idea was to show that an “impossible” problem isn’t really that impossible given some thought about the algorithm and the computing power easily accessible today.

The impetus for doing this came from a real discussion where we were debating whether to do an analysis to identify some constants, or just brute-force the comparisons. At what point does the analysis become faster? One thing led to another, and the next thing I knew I was learning CUDA...


It’s a thought experiment to figure out how long it takes to brute force a 64-bit space. In reality you won’t have the number you’re searching for.
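
For a rough idea of what such a search looks like, here is a plain multithreaded CPU sketch (not the article's CUDA code; the target and the 32-bit demo range are arbitrary choices so it finishes quickly):

    #include <algorithm>
    #include <atomic>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        const uint64_t limit  = 1ull << 32;        // demo range; the real space is 2^64
        const uint64_t target = 0xdeadbeefull;     // arbitrary example target
        const unsigned nthreads =
            std::max(1u, std::thread::hardware_concurrency());

        std::atomic<bool> found{false};
        std::vector<std::thread> pool;

        for (unsigned t = 0; t < nthreads; ++t) {
            pool.emplace_back([&, t] {
                // Each thread strides through its share of the range.
                for (uint64_t x = t;
                     x < limit && !found.load(std::memory_order_relaxed);
                     x += nthreads) {
                    if (x == target) {             // the "work" is one comparison
                        found.store(true);
                        std::printf("thread %u found 0x%llx\n",
                                    t, (unsigned long long)x);
                    }
                }
            });
        }
        for (auto& th : pool) th.join();
        return found ? 0 : 1;
    }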


But in reality you'd need to do some complicated operations on each 64-bit item, right? When cracking a password, for example.

I also don't understand what type of real-world application is being simulated here. I do understand that it is a simplified example, but of what?

The takeaway is that it is feasible to perform a single comparison operation for each 64-bit bitstring. But we don't just do a single operation per item in the real world.


The complicated operation will generally be a constant factor though. Let’s say it takes 100 times longer per item; now it’ll cost you $100,000 instead of $1000.


Yeah, you’d probably add an order of magnitude to this for most practical instances, but it’s still not completely unreasonable.


There is an associated github repo with code for those who want to reproduce results (https://github.com/trailofbits/sixtyfour). Or if you’d like to contribute an ARM version :).


Don’t know enough about NEON to help, but do you even have hardware to run it on if someone did contribute an ARM version?


Yes. AWS has support for ARM instances, and I have an RPI3 :).


Up until recently I had considered even a 32-bit brute force to be infeasible (fun story: I learned this during a CTF, when I complained to a team member that I was stuck trying to find a way to not have to do that and they just came back to me with the brute-forced answer about an hour later…). This’ll teach me to not underestimate computers in the future :)


Advent of Code[1] often has problems that might seem just a bit too much to bruteforce, or not, depending on your background and the language at hand. Many of them are bruteforceable enough that I ended up setting up races: run a bruteforce in the background while I try to write a less naive solution. Kinda fun, but I also found out that many of the problems are a bit too easy to bruteforce (with C on a modern CPU), taking something like 10-15 minutes.

[1] https://adventofcode.com/


It really depends on the complexity of what you're doing. Since n cycles per 32-bit value works out to roughly n seconds on a single core (2^32 is about 4.3 billion iterations at a few GHz), brute forcing the fast invsqrt constant will be done while you take a few sips of coffee (see the sketch below).

But any kind of substantial operation like building data structures and spilling to main memory can trivially leave you with a program that takes weeks.
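
As a concrete example of the cheap end of that spectrum, a sketch of brute forcing a Quake-style rsqrt magic constant over the whole 32-bit space (single core; the sample points and error metric are arbitrary choices, and with the early reject it runs in on the order of a minute):

    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // One shift-and-subtract rsqrt guess with a candidate magic constant.
    static inline float fast_rsqrt(float x, uint32_t magic) {
        uint32_t i;
        std::memcpy(&i, &x, sizeof i);
        i = magic - (i >> 1);
        float y;
        std::memcpy(&y, &i, sizeof y);
        return y;
    }

    int main() {
        const float samples[] = {0.01f, 0.25f, 1.0f, 2.0f, 7.5f, 100.0f, 12345.0f};
        const int n = sizeof samples / sizeof samples[0];
        double exact[n];
        for (int k = 0; k < n; ++k) exact[k] = 1.0 / std::sqrt((double)samples[k]);

        uint32_t best_magic = 0;
        double best_err = 1e18;

        // Exhaustive scan of all 2^32 candidate constants.
        for (uint64_t m = 0; m <= 0xffffffffull; ++m) {
            double worst = 0.0;
            for (int k = 0; k < n; ++k) {
                const double approx = fast_rsqrt(samples[k], (uint32_t)m);
                if (!std::isfinite(approx) || approx <= 0.0) { worst = 1e18; break; }
                const double rel = std::fabs(approx - exact[k]) / exact[k];
                if (rel > worst) worst = rel;
                if (worst >= best_err) break;      // early reject
            }
            if (worst < best_err) { best_err = worst; best_magic = (uint32_t)m; }
        }
        std::printf("best magic: 0x%08x (max rel. error %.4f)\n",
                    best_magic, best_err);
        return 0;
    }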


In some CPU discussions I keep seeing people insist that Intel beats AMD... because of AVX-512.

But these tests show minimal difference between AVX2 and AVX-512. So what’s the big deal?


In current-generation Intel CPUs, the use of AVX-512 forces a reduction in clock speed, negating much of the theoretical benefit.

The clock speed reduction also affects other cores, which may not be executing AVX-512 instructions at that time. Code that is nearly 100% AVX-512 will get an overall speed boost, but mixed multi-threaded workloads can actually regress in performance. On virtualised or multi-user systems such as Citrix or some database engines this is a serious issue at the moment.

This will improve or go away entirely once they're on 10nm or some smaller process.

Secondly, the rarity of AVX-512 means that few applications take advantage of this instruction set, and those that do have not had anywhere near the same level of fine-tuning as the more common AVX2 code.

For example, SQL Server recently got vector instruction set support, called "Batch Mode Processing", but I can't find any references to indicate that it uses AVX512, and it probably doesn't.

Once AMD supports AVX-512, it trickles down to mainstream CPUs, and clock speeds are maintained, it'll have a significant advantage over AVX2.


The big deal with AVX-512 is the scatter-gather instructions, which make it possible to efficiently vectorize different classes of problems.

In this particular use case those did not really matter, but when you can use them they really help.
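
A sketch of what gather buys you, using AVX-512F intrinsics for a 16-wide table lookup (needs AVX-512 hardware and something like -mavx512f to compile; the table and indices are made up for illustration):

    #include <immintrin.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
        // A lookup table and 16 arbitrary indices into it.
        alignas(64) int32_t table[256];
        for (int i = 0; i < 256; ++i) table[i] = i * i;

        alignas(64) int32_t idx[16] = {3, 7, 11, 200, 1, 42, 99, 0,
                                       8, 16, 31, 64, 128, 255, 5, 9};

        // One gather instruction replaces a 16-iteration scalar lookup loop.
        __m512i vindex   = _mm512_load_si512(idx);
        __m512i gathered = _mm512_i32gather_epi32(vindex, table, 4 /* scale = sizeof(int32_t) */);

        alignas(64) int32_t out[16];
        _mm512_store_si512(out, gathered);
        for (int i = 0; i < 16; ++i) std::printf("%d ", out[i]);
        std::printf("\n");
        return 0;
    }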


Does anyone know why the big cloud vendors don’t seem to have any offerings that can beat a regular consumer machine in single thread performance? Surely there must be a market for it?


AFAIK server CPUs aren't much different from consumer offerings, and generally more cores means lower clock speeds, which in turn means lower single-threaded performance.

It's highly likely there's a market for it; however, I'm doubtful they'd be willing to spec lower-density, higher-performance machines for the few people who would actually take advantage of them. Considering high-core-count CPUs are becoming the new norm, rewriting or finding software that can take advantage of multiple cores/threads is certainly the way to go for maximum performance (where possible, of course).


My guess: there's not a _big_ market for it.

I half remember an article about stock trading companies overclocking servers and running them single-threaded at max clock speed. The aim was shaving a few more microseconds off their response time. They accepted the reduced lifetime of the hardware and just replaced it sooner. I got the impression it was a niche market though.


While using GPUs is one way to increase performance, with all of those machines at your disposal you can also use a simple MPI script to run the job in parallel across all of them and drastically bring down the total runtime. This is actually how much of the industry handles parallel processing at scale, since running on a single machine can only get you so far.
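
A minimal sketch of that pattern for the toy problem here: split the range across MPI ranks and reduce the result (the target value and the 32-bit demo range are arbitrary, and this is just an illustration, not the article's code):

    #include <mpi.h>
    #include <cstdint>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each rank scans a contiguous slice of the search space.
        const uint64_t total  = 1ull << 32;       // demo range; the real space is 2^64
        const uint64_t target = 0xcafebabeull;    // arbitrary example target
        const uint64_t chunk  = total / (uint64_t)size;
        const uint64_t begin  = chunk * (uint64_t)rank;
        const uint64_t end    = (rank == size - 1) ? total : begin + chunk;

        int found_local = 0;
        for (uint64_t x = begin; x < end; ++x) {
            if (x == target) { found_local = 1; break; }
        }

        // Combine the per-rank answers on rank 0.
        int found_global = 0;
        MPI_Reduce(&found_local, &found_global, 1, MPI_INT, MPI_LOR,
                   0, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("found: %s\n", found_global ? "yes" : "no");

        MPI_Finalize();
        return 0;
    }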


So even though GPU time is more expensive and GPUs need specialized programming, for this particular problem they can't be beaten with raw CPU power, either in absolute terms or in $/op.

Just putting a problem on a ton of cores is an absolutely valid strategy for problems that are a bad fit for GPUs.


If you run in parallel across many machines you will reduce the time, but the cost stays roughly the same.

The initial cost is what makes this insane, not the time, and the end result (GPUs) is undeniably the most cost-efficient method (in this case very likely by a factor of 50-100x).

A situation like this is not premature optimization. Developers sometimes need to understand that in the scheme of things, their time is not worth very much compared to the cost of the infrastructure required to run their code. Throwing the corporate credit card at optimization issues is far too common today in tech.


On the flip side, we spent almost $100,000 in developer time one summer in order to keep a handful of customers from having to upgrade hardware.

We could have gifted them all $5000 machines and spent less. Plus, when you remember that the point of spending on developers is to earn back many times that cost in sales, the opportunity cost of that 3 months of very senior development effort was massive.


I think it depends on what said developer is working on, and what kind of infrastructure is under discussion, no? Optimising our web app’s backend has value, so we do it, but only to a point, and it’s certainly not a major component of our workload. Our time is definitely worth more than we would save on our infrastructure, because our infrastructure is small to begin with.


No quantum supremacy comparisons?


The joke this time is that we switch to non-binary logic gates with qubits.

So we'll still laugh at this comment even though the addressable space turned out to be "good enough"; it was just an antiquated framework to begin with.


If my calculations are correct, that GPU is making 1,000,000,000,000 guesses per second.

The takeaway here is that cracking a 64-bit password only costs $1700 and is totally doable in 3 weeks.

If you need things to be really safe I guess you need to start lengthening those passwords.


> The takeaway here is that cracking a 64-bit password only costs $1700 and is totally doable in 3 weeks.

Not quite - you need to run the hash function that the password is stored with, and that might be a factor of 1,000,000+ slowdown relative to "are these numbers the same".


>If my calculations are correct, that GPU is making 1,000,000,000,000 guesses per second.

>The takeaway here is that cracking a 64-bit password only costs $1700 and is totally doable in 3 weeks.

...assuming that each "guess" only requires a comparison. If you factor in hashing, it's much slower. According to gpuhashcat benchmarks[1], a 1080 Ti cracks MD5 passwords at ~25 GH/s (1 GH/s = 1 billion guesses per second), phpass at ~6.9 MH/s, and bcrypt at ~13 kH/s.

[1] https://gist.github.com/epixoip/a83d38f412b4737e99bbef804a27...


Not exactly, it requires a lot more computational effort since it's not just guessing a number (salt, iterations and hash function are there to make this much harder). However, the example shows that seemingly impossible things are indeed not that far away (even for ordinary people). Computers are amazing.


If you need things to be really safe, you should enable 2FA.

That $1700 of computation is likely to be run on hacked infrastructure anyway, so it's definitely doable.


The $1700 estimate is probably off by a factor of a million.

$1.7 billion in computation


The amount of yak shaving needed to get a one-line loop running on CUDA is truly impressive. (shudder) But I bet an FPGA implementation would beat it, in every way.



