My Engineyard Contest Odyssey (dipert.org)
21 points by wooby on July 22, 2009 | 10 comments



Looking at the C code, here are a few suggestions:

- __builtin_popcount() and __builtin_popcountll() are great gcc builtins that perform bit population counts, which is exactly what you need for Hamming distances (after the XOR operation, of course) [1]. There's a small sketch of this below the references.

- avoid scanf(), which is costly. My approach was to generate a header file with the char * words[] array, and also an int words_len[] with, you guessed it, the pre-computed lengths. Then go the memcpy() route and you're in a much better position.

- speaking of memcpy, gcc provides one that should be optimal in most cases, but for really short words it can be worth writing your own. I came across a good article on it [2], but this is what I ended up using:

  #include <stddef.h>  /* for size_t */

  /* Simple byte-wise copy; good enough for very short words. */
  void
  c_memcpy (char * dst, const char * src, size_t len)
  {
    /* Hint that dst is about to be written (rw = 1) with high temporal locality (3). */
    __builtin_prefetch(dst, 1, 3);
    while (len--) {
      *dst++ = *src++;
    }
  }
[1] http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Other-Builtins.h...

[2] http://www.embedded.com/columns/technicalinsights/19205567
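
To make the popcount suggestion above concrete, here's a minimal sketch of scoring a digest against the target, assuming both 160-bit SHA-1 digests are stored as five 32-bit words (the function name is made up):

  #include <stdint.h>

  /* Hamming distance between two 160-bit SHA-1 digests, each stored as
     five 32-bit words: XOR, then count the differing bits. */
  static inline int
  hamming160 (const uint32_t a[5], const uint32_t b[5])
  {
    int dist = 0;
    for (int i = 0; i < 5; i++)
      dist += __builtin_popcount (a[i] ^ b[i]);
    return dist;
  }

On x86, building with -mpopcnt (or -msse4.2) lets gcc lower the builtin to a single popcnt instruction; otherwise it falls back to a small library routine.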


Thanks for the tips. gcc builtins are definitely the next level for me.

I'm curious about your scanf point, though. I'm using scanf/strdup before the check loop to load the words into memory. Are subsequent memcpy calls on a statically initialized char *dict[] array faster than on malloc'd memory for some reason?


Unfortunately, I have no empirical data to support the claim that hard-coding the payload has an advantage over dynamic allocation. As an aside, the Python dictionary (hash table, for you rubyists/perlfaces) works this way: it over-allocates to allow for growth unless you specify its maximum size (__slots__, I think?).

My hope and understanding is that the compiler can lay out and align static data optimally, whereas something like malloc() has to allocate on the heap as best it can, with the open possibility of other data needing to land on the heap as well.

The point I was driving at, and what I tried to do, was to hard-code as many known values as possible so the compiler could do its thing and be as efficient as possible with the given data. This includes the word list, the list of lengths, and the target hash -- none of which has to be done by hand; a higher-level script called before compilation can generate it all, along the lines of the sketch below.
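
For illustration, the generated header might look something like this (the file name, identifiers, and values are all made up; the real contents would be emitted by the pre-compilation script from the full dictionary and the actual target digest):

  /* words.h -- generated before compilation; do not edit by hand. */
  #include <stdint.h>

  static const char *const words[] = { "aardvark", "abacus", "zygote" };
  static const int words_len[] = { 8, 6, 6 };
  static const unsigned num_words = sizeof (words) / sizeof (words[0]);

  /* The target digest, baked in as five 32-bit words. */
  static const uint32_t target_hash[5] =
    { 0x6cac827bu, 0x01459655u, 0x7d851da3u, 0x0bd45e8au, 0x9c3d0594u };

Since everything is const and static, it typically ends up in the read-only data segment, and the compiler/linker -- not malloc() -- decides the layout.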

Good times.


That makes sense; I might test it sometime.


Great writeup, thanks! I'm confused by your conclusion, though: "I wish I would have identified Rackspace initially as the platform of choice."

From the results of the contest, I would have concluded that the platform of choice was CUDA. Do you think otherwise? Is there a reason you'd choose to use a flotilla of Rackspace servers for a problem like this rather than a graphics card? If you are going to be programming in C anyway, it seems like a much simpler approach.


If I were using just my own machine, or had spent a few grand to put together my own CUDA farm, CUDA would have been great.

I think using cloud resources was the way to go, though, because it's easy to search for keys concurrently and you can summon an effectively unlimited amount of computing power pretty cheaply. You can add as many cloud servers as you want to your cluster without worrying about locking or divvying up work, and the number of keys your cluster can check per second increases linearly with the number of machines.
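
To show why this scales linearly with zero coordination, here's a sketch of the general shape (not necessarily the exact setup used here): each server seeds its own random stream and searches independently, so there's no locking and no shared work queue. The toy_target and score() below are stand-ins for the real SHA-1-plus-Hamming-distance scoring.

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* Toy stand-in for the real work (SHA-1 the candidate, XOR against the
     target digest, popcount); a 64-bit target keeps the sketch self-contained. */
  static const uint64_t toy_target = 0x0123456789abcdefull;

  static int
  score (uint64_t candidate)
  {
    return __builtin_popcountll (candidate ^ toy_target);  /* Hamming distance */
  }

  int
  main (int argc, char **argv)
  {
    /* Each server gets its own ID (here, passed on the command line), so its
       random stream is independent of every other server's. */
    unsigned server_id = (argc > 1) ? (unsigned) atoi (argv[1]) : 0u;
    srand (server_id ^ (unsigned) time (NULL));

    int best = 64;                        /* worst possible distance on 64 bits */
    for (;;) {                            /* runs until the contest ends */
      uint64_t candidate = ((uint64_t) rand () << 32) ^ (uint64_t) rand ();
      int d = score (candidate);
      if (d < best) {
        best = d;
        printf ("server %u: new best distance %d\n", server_id, best);
      }
    }
  }

Nothing is shared between machines, so throughput grows with the machine count; the only thing worth collecting afterwards is each server's best score.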

The winning team, awesome guys with a cluster of 8 very powerful computers and a CUDA program, were able to check just over 400 million keys per second. If I had used 400 Rackspace cloud servers, it would have cost me about $100 to run them for the duration of the contest, and I would have been achieving similar speeds.


I guess I'm feeling that the CUDA solution pays for itself pretty quickly if you're going to use it for the long term. Personally, I was checking 200 million keys per second on a $300 Nvidia card and got down to a Hamming distance of 34 (with at least three 35s).

Presuming your costs are accurate, this would imply that the card would be a cheaper solution if run for more than a week. Considering the ~100x speed advantage of the card, it's rather surprising that their cloud solution is as affordable as it is.


Looks like the winning team was able to check quite a bit more than 400 million keys per second.

"With all the machines running we got up to a total of 3079431974 hashes/second" (via http://www.win.tue.nl/cccc/sha-1-challenge.html)


Uh oh - I missed a decimal place. 3 billion/sec would mean about 3,000 of the cheapest Rackspace cloud machines, which could have been operated for 18 hours for about $810 (3,000 machines x 18 hours x $0.015/hour). I'm not sure Rackspace lets you have that many, though.


I'd like to see CUDA cloud hosting. The contest's results show that it would be an awesome solution for heavy computational problems.



