[Condensed version of this narrative: the news.yc code, prior to the
the release of arc3, contains a remotely-exploitable vulnerability
permitting account theft. Anyone running a news installation who has
not yet upgraded to arc3 should do so.]
Hacker News login cookies are random eight-character strings, stored
server-side in a hash table mapping them to user names. I discovered
a few weeks ago that these strings were rather less random than they
were meant to be, and, through a delightful combination of exploits,
could be predicted, enabling an attacker to steal accounts.
Here's the rand-string function from arc.arc, version 2. It gets
called with n=8 to generate login cookies, and n=10 for the "fnids"
that get used all over the site as hash keys identifying closures.
(def rand-string (n)
(with (cap (fn () (+ 65 (rand 26)))
sm (fn () (+ 97 (rand 26)))
dig (fn () (+ 48 (rand 10))))
(coerce (map [coerce _ 'char]
(cons (rand-choice (cap) (sm))
(n-of (- n 1) (rand-choice (cap) (sm) (dig)))))
'string)))
The first thing you might notice about this function is that not all
characters are equally probable. Each digit has a 1/30 chance of
occuring, while each letter has a 1/78 chance. This alone is no big
deal: this distribution means that each character carries 5.826 bits
of entropy, versus the 5.954 that a uniform distribution would
provide. So for an eight-character string, this bug reduces the
effective keyspace by just over a factor of two -- not enough to have
any practical implications.
The 'rand' function is an arc primitive, bound directly to mzscheme's
'random':
; need to use a better seed
(xdef 'rand random)
The comment seen here is prescient, as we'll see.
This is the C function which implements mzscheme's 'random' function:
static long sch_int_rand(long n, Scheme_Random_State *rs)
{
double x, q, qn, xq;
/* generate result in {0..n-1} using the rejection method */
q = (double)( (unsigned long)(m1 / (double)n) );
qn = q * n;
do {
x = mrg32k3a(rs);
} while (x >= qn);
xq = x / q;
/* return result */
return (long)xq;
}
Where mrg32k3a() is:
static double mrg32k3a(Scheme_Random_State *s) { /*(double), in {0..m1-1}*/
double x10, x20, y;
long k10, k20;
/* component 1 */
x10 = a12*(s->x11) - a13n*(s->x12);
k10 = (long)(x10 / m1);
x10 -= k10 * m1;
if (x10 < 0.0)
x10 += m1;
s->x12 = s->x11;
s->x11 = s->x10;
s->x10 = x10;
/* component 2 */
x20 = a21*(s->x20) - a23n*(s->x22);
k20 = (long)(x20 / m2);
x20 -= k20 * m2;
if (x20 < 0.0)
x20 += m2;
s->x22 = s->x21;
s->x21 = s->x20;
s->x20 = x20;
/* combination of component */
y = x10 - x20;
if (y < 0.0)
y += m1;
return y;
}
This, obviously, is not a cryptographically strong PRNG. Is it possible
that we could break it, computing its internal state by seeing a few
consecutively-generated rand-strings? Probably: it looks as though it
could be represented as the solution to a manageable system of diophantine
equations. That, though, was more math than I felt like doing, so I went
looking for an easier approach.
Where does the RNG seed come from? Ah ha:
rs = scheme_make_random_state(scheme_get_milliseconds());
Where scheme_get_milliseconds is defined, after eliding some
preprocessor cruft, as:
long scheme_get_milliseconds(void)
{
struct timeb now;
ftime(&now);
return now.time * 1000 + now.millitm;
}
In other words, the random seed is merely the number of milliseconds
since epoch at the time the seed function was called.
The part of mzscheme that calls the seed function is a bit daunting:
it appears that in some cases, the PRNG state can be thread-local and
be initialized when the thread is spawned. However, instrumenting
sch_int_rand() with some debug output showed that in arc, the same
state vector gets used everywhere, and is initialized when the
mzscheme runtime starts up.
The millisecond at which news.yc last started is not an immediately
simple thing to determine, though it was at least easy to verify the
sanity of the system clock, thanks to an open NTP serevr:
dfranke@feanor:~$ sudo ntpdate -q news.ycombinator.com
server 174.132.225.106, stratum 2, offset 0.370866, delay 0.08228
17 May 01:45:13 ntpdate[27901]: adjust time server 174.132.225.106 offset 0.370866 sec
So for a start, I thought, perhaps I could determine the server's
start time to within a few seconds or minutes. A boring way to go
about this would be simply to monitor the server for downtime, and
record when it became accessible again. But impatience is one of the
three great programmer's virtues, and the best way to predict the future
is to create it, and so forth, so I decided on a more proactive
approach: crash it!
A couple months ago, PG left this comment after news.yc recovered from
some downtime:
HN was down today for around 2 hours. Sorry about that.
The News server currently crashes a couple times a day when it runs
out of memory. All the comments and stories no longer fit in the 2 GB
we can get on a 32 bit machine. We'd been planning to upgrade to a new
64 bit server. In the meantime it was arguably a rather slow form of
GC.
Unfortunately the process somehow got wedged in the middle of
segfaulting. We're not sure why and will probably never know. But that
meant the process that usually notices when News is wedged and
restarts it was unable to kill it.
(The server had since been upgraded, so these crashes are/were no longer
happening.)
I figured that the watchdog works by requesting a page and checking to
make sure it gets a response, and that if it doesn't get one, then it
assumes the server is wedged and restarts it.
Here's arc2's top-level request handler:
(= srvthreads* nil threadlimit* 50 threadlife* 30)
; Could auto-throttle ips, e.g. if one has more than x% of recent requests.
(= requests* 0 requests/ip* (table) throttle-ips* (table) throttle-time* 30)
(def handle-request (s (o life threadlife*))
(if (len< (pull dead srvthreads*) threadlimit*)
(let (i o ip) (socket-accept s)
(++ requests*)
(= (requests/ip* ip) (+ 1 (or (requests/ip* ip) 0)))
(let th (thread
(if (throttle-ips* ip) (sleep (rand throttle-time*)))
(handle-request-thread i o ip))
(push th srvthreads*)
(thread (sleep life)
(unless (dead th) (prn "srv thread took too long"))
(break-thread th)
(close i o))))
(sleep .2)))
So, there's a limit of 50 concurrent threads, and threads are killed
after 30 seconds if they haven't already terminated. So if I were to
hold open 50 concurrent connections, and the watchdog were to run during
the following 30 seconds, then the server ought to restart.
The watchdog code has not been released, so rather than soil my hat
color by DoSing the production server, I decided to continue hacking
on my local install on the assumption that I had the ability to
determine the server's start time to within one minute.
So, a one-minute interval is 60,000 possible PRNG seeds. If I kept
polling to see when the server came back up after the watchdog killed
it, then let's very conservatively assume that I could be among the
first 50 people to issue an HTTP request. Each page that comes back
from the server typically contains 2-3 fnids, so the reply I got would
contain some from among first few hundred to be generated, and thus
from among the first few thousand iterations of of the PRNG.
This leaves determination of the PRNG seed comfortably within the
reach of brute force: run the PRNG for 10,000 iterations for each of
the 60,000 possible seeds, and see which one produces the fnids I saw
in response to my request. I wrote a program that does just this:
http://dfranke.us/hacknews.c
So now I was able to determine PRNG seeds, but I couldn't conclude my
adventure quite yet. Since logging into news.yc is an uncommon
operation compared to simply browsing around, only a tiny fraction of
rand-strings that the server generates correspond to login cookies.
Furthermore, since fnids and login cookies have different lengths, and
since the PRNG gets called for a few other purposes at unpredictable
times, every individual PRNG iteration begins a candidate login
cookie. That's 40 or more false candidates produced for every page
view.
Nonetheless, online brute force would still be manageable. If each
page view produces an average of 50 candidates, and one in every
thousand page views is a login (this might be slightly optimistic),
that's 50,000 attempts necessary in order to find a working login. HN
gets about 500,000 hits on a busy day, so this could be done in a day
or two while likely staying under the radar.
A marginally more efficient approach would be a bit of social engineering:
1. Request a page. Find a generated fnid from the page source and
look it up in our candidate list. Call this A.
2.
ERC> /join #startups
<dfranke> Hey guys, I haven't been able to log in to news.yc
since the server restarted a little while ago. Anyone
else having problems?
<jrandomsucker> dfranke: Works for me.
<dfranke> Hmm, weird. I'll just try again later I guess.
3. Request another page, note the fnid, find it in the candidate
list. Call this B.
Step 4: Test the cookies that fall between A and B.
If this conversation takes one minute, then this reduces the search to
about 17,500 attempts -- less than a day's worth at a modest rate of
querying -- and possibly picks up multiple accounts in the process.
Epilogue:
I sent PG a draft of this post. RTM and I wrote a better implementation
of rand-string which reads from /dev/urandom and obeys a proper uniform
distribution. This new version appears in arc3:
(def rand-string (n)
(let c "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
(with (nc 62 s (newstring n) i 0)
(w/infile str "/dev/urandom"
(while (< i n)
(let x (readb str)
(unless (> x 247)
(= (s i) (c (mod x nc)))
(++ i)))))
s)))
PG removed the 50-thread concurrency limit and replaced it with a
per-IP rate limiter, so the DoS attack described here should no longer
work.
Like Pandora, he is so curious, he has to check this out.
Like a rat in a maze, he keeps going looking for the clear path.
Like Sherlock Holmes, he applies logic to determine the next step.
Like General Sherman, he keeps marching, building tools along the way as he needs them.
Like William Randolph Hearst, he defines the landscape. ("You provide the pictures, I'll provide the war.") "so I decided on a more proactive approach: crash it!" (hilarious)
And like any parent, he didn't quit until his baby walked.
Thank you, Daniel. I sure hope you've found a way to channel that talent in your day job.