Hacker News
How CloudFlare extracts a signal from 10 trillion log lines each month [video] (thedotpost.com)
123 points by jgrahamc on June 25, 2015 | 24 comments



Good god, 4 million log requests per second. 400 TB a day (compressed) if they stored the logs.

I recently set up a fun project using a RethinkDB cluster and Node.js Express middleware[1] that logs all requests to RethinkDB as JSON. I did some load testing and sustained 1,400 writes per second, and was quite happy thinking this would scale up much larger. However, not 4 million writes per second large. :-)

[1] https://github.com/commando/express-rethinkdb-logger
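For anyone wanting to check the numbers, the 4 million per second figure follows from the headline's 10 trillion lines a month. A rough sketch, assuming a 30-day month and decimal terabytes (both are my assumptions, not figures from the talk):

```python
# Back-of-the-envelope check of the figures quoted in this thread.
LINES_PER_MONTH = 10e12            # "10 trillion log lines each month" (headline)
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30                # assumption: a 30-day month

lines_per_second = LINES_PER_MONTH / (DAYS_PER_MONTH * SECONDS_PER_DAY)
lines_per_day = LINES_PER_MONTH / DAYS_PER_MONTH

COMPRESSED_BYTES_PER_DAY = 400e12  # "400 TB a day (compressed)", decimal TB assumed
bytes_per_line = COMPRESSED_BYTES_PER_DAY / lines_per_day

print(f"{lines_per_second / 1e6:.2f} million lines per second")
print(f"{bytes_per_line:.0f} compressed bytes per line")
```

That works out to a bit under 4 million lines per second, and the 400 TB figure implies roughly 1,200 compressed bytes per line. The per-line size is only what the quoted numbers imply, not a measured value.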


It would be a better comparison if we knew across how many nodes they're sustaining that. Maybe they just have 4,000 nodes doing 1k TPS each.


I don't really think the word "just" belongs in that sentence ;)


The talk discusses the machine count at the 20-minute mark. 40 machines for the log queue, 5 for the psql, and 100+ for the consumers.
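To put those machine counts next to the 4 million per second figure, here is a rough per-machine estimate. It assumes every log line passes through each tier and that load is spread evenly, neither of which is stated in the thread, and it treats "100+" consumers as exactly 100, so the consumer figure is an upper bound:

```python
REQUESTS_PER_SECOND = 4_000_000   # figure quoted earlier in the thread

# Machine counts from the comment above; "100+" consumers is taken as 100.
MACHINES = {"log queue": 40, "consumers": 100}

def per_machine_rate(total_rate, machine_count):
    """Average rate per machine, assuming a perfectly even spread."""
    return total_rate / machine_count

for tier, count in MACHINES.items():
    rate = per_machine_rate(REQUESTS_PER_SECOND, count)
    print(f"{tier}: ~{rate:,.0f} lines/s per machine")
```

Under those assumptions that is about 100,000 lines per second per queue machine and at most about 40,000 per consumer.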


I mean, writing logs to a DB is an inherently slow way to store logs.


Interesting how NGINX + Lua is becoming more and more widely used in mission-critical applications with huge amounts of traffic. Since the introduction of LuaJIT, performance has been outstanding, and many companies like Netflix, Alibaba, Cloudflare, Kong [0], and Airbnb all run on a customized nginx with Lua modules, doing amazing things from security to API management.

[0] http://github.com/mashape/kong


I predicted about a year ago that nginx/LuaJIT (OpenResty) is the sleeping giant of web development. I've seen more and more companies start using it, and I wouldn't be surprised if people start talking about it as a Node.js alternative in the not so distant future.


The good thing about NGINX is that it's perceived as language agnostic. So no matter what language your codebase is written in, you can always put nginx in front of it, while Node tools are only good if you use them within a Node stack (see StrongLoop). As far as I know, nginx is powering 150M websites/apps.


HAProxy will be using Lua in future releases for similar reasons. Interesting trend!


And Lua is taking off for exactly these reasons, such as the ability to extend Nginx or HAProxy. Plus, it is simple to use, easy to learn, and highly efficient, without the need to touch C/C++.


What is the deal with the recent "thedotpost.com" envelope for perfectly valid YouTube URLs? I want to flag this post just for that.

https://www.youtube.com/watch?v=LA-gNoxSLCE


jgrahamc is the speaker; the talk was at a conference run by the same company as thedotpost.com.

I'm not really seeing the issue. Complaining about how someone chooses to link to their own content seems silly tbh.


My apologies, I just noticed a flood of these previously unrecognized domains which had a lot of "web chrome" around a central YouTube video. I shouldn't have included the language about flagging it, as that made my comment more negative than I intended. I am slowly learning to pare down my HN comments to avoid that editorializing.


In this particular case, the website logo is the same logo on the speaker's podium and on the wall behind the speaker :)


No need to apologize, I was just explaining in case people did actually start flagging it because of your comment.

I'm a bit of an asshole by nature so there isn't a need to take my diction seriously.

Best of luck to you with your goal :)


To determine whether something is really just "blog spam", I always look at whether the page adds anything that would be lost on YouTube. In this case there's the description, a link to slides, and information on the speaker. So this one gets a pass.


That is 4M requests per second. To put this in perspective, the established Akamai runs at 25M/s (http://www.akamai.com/html/technology/real-time-web-metrics....), which puts Cloudflare at roughly 16% of Akamai's request rate and growing quickly.


https://github.com/cloudflare/jgc-talks/blob/master/dotScale...

I like that GitHub extracts the slides for me, but it would be better if it could extract just the plain text.


The Nginx + LuaJIT approach is new to me. I went looking and found their blog post about it [1]. That post mentions log aggregation, but it sounds from this talk like they're doing dynamic routing via the Lua code. Is the idea to accept requests for entirely separate sites via one or more front-end Nginx hosts, and then perform a sort of NAT or virtual routing to different client properties based on the request headers?

At any rate, very interesting stuff, and more to read up on now.

[1]: https://blog.cloudflare.com/pushing-nginx-to-its-limit-with-...


In the world of lossy counting, this [1] could be something to look at if you wanted to answer the question "how many requests/sec towards jgc.org at noontime, two months ago?"

[1] https://github.com/dgryski/hokusai
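As I understand it, Hokusai builds on count-min sketches aggregated over time. A minimal single-sketch illustration of the lossy-counting idea follows; the class name, width, depth, and hashing scheme are my own choices for the example and are not taken from the linked repository:

```python
import hashlib

class CountMinSketch:
    """Approximate per-key counters in fixed memory (illustrative only)."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # One independent-ish hash per row, derived by salting BLAKE2b.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, key, count=1):
        for row, col in self._indexes(key):
            self.rows[row][col] += count

    def estimate(self, key):
        # Collisions only ever inflate a cell, so the minimum across rows
        # is the tightest estimate and never undercounts.
        return min(self.rows[row][col] for row, col in self._indexes(key))

sketch = CountMinSketch()
for _ in range(1000):
    sketch.add("jgc.org")
sketch.add("example.com", 5)
print(sketch.estimate("jgc.org"))  # at least 1000; may overcount on collisions
```

Answering the "two months ago" part would mean keeping one such sketch per time bucket and coarsening old buckets, which this example does not attempt.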


Why not HyperLogLog?


Presumably because there's no point replying to a video talk about HyperLogLog with a comment saying "you might want to look into HyperLogLog".


No love for Elasticsearch :/


Are you complaining about the obvious ES JSON in one of the slides not getting a mention or just the lack of mention at all?

If the latter, it's most likely due to the findings (possibly even independently verified) from Jepsen's analysis of ES, which was revisited in a talk at that same conference: https://news.ycombinator.com/item?id=9778291



