Proxygen, Facebook's C++ HTTP Framework (facebook.com)
324 points by mikeevans on Nov 5, 2014 | 89 comments


Hey there, I work on Proxygen at Facebook. I'm happy to answer any questions you have about the project.


This question is a bit naive, but outside of Facebook, can you think of what kind of application this is well suited for?


Well, besides the fun of hacking around and building little HTTP servers, I could see this being useful if you want to save money by running fewer instances of your HTTP service. For instance, if you have a widely deployed Python web service that isn't scaling well, you could rewrite it in C++ with very little boilerplate using Proxygen's HTTPServer.
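
To give a rough idea of the shape of that code, here's a minimal sketch modeled on the echo example that ships with the library. I'm writing it from memory, so exact class names and signatures may differ slightly between versions:

    // Minimal "hello" service sketch using Proxygen's HTTPServer.
    // Written from memory of the echo sample; signatures may vary by version.
    #include <vector>
    #include <folly/SocketAddress.h>
    #include <proxygen/httpserver/HTTPServer.h>
    #include <proxygen/httpserver/RequestHandler.h>
    #include <proxygen/httpserver/RequestHandlerFactory.h>
    #include <proxygen/httpserver/ResponseBuilder.h>

    using namespace proxygen;

    // Handles a single request; Proxygen drives these callbacks.
    class HelloHandler : public RequestHandler {
     public:
      void onRequest(std::unique_ptr<HTTPMessage>) noexcept override {}
      void onBody(std::unique_ptr<folly::IOBuf>) noexcept override {}
      void onEOM() noexcept override {
        ResponseBuilder(downstream_)
            .status(200, "OK")
            .body("hello\n")
            .sendWithEOM();
      }
      void onUpgrade(UpgradeProtocol) noexcept override {}
      void requestComplete() noexcept override { delete this; }
      void onError(ProxygenError) noexcept override { delete this; }
    };

    // Creates one handler per incoming request.
    class HelloHandlerFactory : public RequestHandlerFactory {
     public:
      void onServerStart(folly::EventBase*) noexcept override {}
      void onServerStop() noexcept override {}
      RequestHandler* onRequest(RequestHandler*, HTTPMessage*) noexcept override {
        return new HelloHandler();
      }
    };

    int main() {
      HTTPServerOptions options;
      options.threads = 4;  // worker threads, typically one per core
      options.handlerFactories =
          RequestHandlerChain().addThen<HelloHandlerFactory>().build();

      std::vector<HTTPServer::IPConfig> IPs = {
          {folly::SocketAddress("0.0.0.0", 8080, true),
           HTTPServer::Protocol::HTTP}};

      HTTPServer server(std::move(options));
      server.bind(IPs);
      server.start();  // blocks until the server is stopped
      return 0;
    }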

It's early stages for the proxygen open source project. Maybe further down the road we'll provide off-the-shelf binaries, but we think the library is already interesting enough to warrant a release.


Is that the inception of the project? I am more curious about the lineage. How was the decision made to go this route instead of throwing more instances at it?


The blog post goes into more detail, but Proxygen started with an effort to write an L7 reverse proxy that could deeply integrate with FB internal services. We pulled a lot of the non-FB-specific stuff out into this open source release. Before that, we used hardware load balancers for this role, which was expensive.


By expensive, I mean not just capital costs but also the costs of operating them - they weren't as reliably configurable, health-checkable, and instrumentable as we'd want, and Proxygen (and a later L4 load balancer) were.

Also the previous load balancers had constraints we weren't willing to accept - they required special connectivity to our networks, we could not use particular combinations of options, and we had to rely on vendors to solve problems that most of their customers were not encountering and/or able to detect.


Yeah, we are hitting the same wall right now, and we've been going down the same route using HAProxy/Chef. Our plan is to put an API in front of it and treat it similarly to an ELB. Are you still using hardware load balancers for SSL termination?


The machines running Proxygen do the TLS termination.


That's interesting. I had been working on something similar years ago. I eventually open sourced it, but discontinued work on it: Http://github.com/baus/swithflow

I'm surprised more systems don't take this approach of doing more work in L7 proxies.


Can you tell us the motivation for building this instead of contributing to and using existing solutions such as Apache Traffic Server, squid, nginx, or haproxy?


I am not a huge fan of NIH, but more often than not it is simply easier in the long run to roll something in-house that does the job. Although the debt that piles up tends to dwarf the original choice, it is often still a better idea given employee churn: it's much easier to keep in-house code within the standards of your normal coding conventions.

I don't work at FB, but near the bottom of the comments you can see that they build on existing C++ libraries that have been tried and tested. We do the [same thing][1] with smaller services, simply because it's easier in the SDLC process to move libraries that are already in-house.

[1]: https://github.com/bloomberg/bde/tree/master/groups/bsl


I'll have a look at the blog for more -- but was this something that couldn't be solved with haproxy and/or trafficserver?


haproxy only got TLS support in 2012, whereas work on Proxygen began well before that. In addition, haproxy does not support SPDY natively (although it can forward it to something that does).

Apache Traffic Server only got SPDY support in a release about 45 days ago according to their release notes.

There are many other relatively basic features besides those, and similar considerations exist for the other projects that existed back then (and even now).


He asked the reason for `not contributing` to those projects, not for not `using` them.


Any kind of internal service that listens for and serves an API. For example, you could put a bunch of monitoring agents on all your machines; those agents gather information about your machines and report back to a service built with Proxygen. Your service would use Proxygen directly instead of hiding behind an nginx or Apache server and running via some kind of FCGI setup. The agents themselves could use the Proxygen client to talk to the service built on the Proxygen server pieces.


What are the security implications of running native code on a public facing service? How many RCEs has Proxygen had? Has it been audited for security? What is the testing procedure?


In general, native code is not the same as unsafe code (that's why I'm personally excited about Rust. Native + safe = awesome). In C++'s case, though, it is true that a memory error could be used to exploit the process. Careful use of safe memory abstractions is our main safeguard against this. I can't comment on RCEs, but Proxygen has been externally and internally audited for security. We have many unit tests in the project, and internally we also have more comprehensive integration tests and a fuzzer.


I currently use nginx for all my projects, which for me mostly consist of (1) serving static content and (2) proxying to gunicorn. My setups are usually not overly complicated: basic redirects for stuff like www/non-www or SSL/non-SSL. Installing SSL certs can be a pain sometimes, but for the most part I use a lot of default settings and have never needed to go in depth with tweaking them.

Does somebody like me have a reason to check out Proxygen?


Maybe not today. We haven't open sourced a configurable proxy yet, so you wouldn't be able to do your redirects and some of these other features without writing C++ code.

If you're interested in how HTTP frameworks are designed and implemented, I'd definitely suggest checking it out though. This project is initially going to be more interesting to people integrating with HTTP quite directly.


Question for you and the general C++ community. I am a Java developer mostly but it looks like I'll be transitioning to a few C++ projects in the near future. Any good resources you can recommend for learning modern C++? Particularly anything you've used to get developers on your team similarly up to speed.

I've cruised through the Proxygen code base and there are definitely some head scratchers.


This SO thread has a good overview of the books out there:

http://stackoverflow.com/questions/388242/the-definitive-c-b...

I'd recommend "C++ Primer" first, "Effective C++" second, and "The C++ Programming Language" as a reference. Other must haves are "Modern C++ Design", the other Scott Meyers books, and the Herb Sutter books.


Hey, could you recommend any books or tutorials for C++1y/C++14? I learned C++ in my CS courses, but I've been using Java/Python/JavaScript these days. I'm thinking of coming back to C++ because C++14 looks cool.


Start with Bjarne's new book, "Tour of C++"; it shows how to make proper use of C++11 without those unsafe C influences.

C++14 is mostly fixing what was left out in C++11, so the book is already a good starting point.

C++14 is definitely cool. I use JVM/.NET at work nowadays, but I'm an old C++ dog (since 1993), so I've always used the language on a few side projects.

C++14 kills quite a few complaints I had about the language. Now if we could just get proper modules.


You should definitely pick up Bjarne's latest book, "Tour of C++".

It shows how to make proper use of C++11 without any bad C influences.


Hey! So Proxygen was originally a reverse-proxy load balancer. Is that still how Facebook utilizes it now? If not, what is its current role? Are there any plans for integrating this with Hiphop/PHP in any way?


Yup, we still use Proxygen (the library) in our reverse proxy. Maybe some day we'll be able to open source the reverse proxy too, but it's pretty deeply integrated with internal FB code right now so that's tricky.

We already use the Proxygen HTTP code for the webserver part of HHVM internally. We hope to release that webserver part too (in the HHVM project).


WebSockets were mentioned in the blog post. Has Proxygen been deployed with WebSockets at Facebook scale? How much support for WebSockets is there in the open-sourced Proxygen?


I would love to see Proxygen integrated as a HackLang extension. Especially the HTTP parser. As far as I know, the PHP world is sorely lacking a robust HTTP parser.


In response to my comment here, I did some more experimentation with pecl_http. Turns out version 2.1.4 looks relatively stable. Just be sure not to make the same mistake I made of referring to the documentation on the php.net website. Instead one should look here for version 2.x documentation:

http://devel-m6w6.rhcloud.com/mdref/http/


The blog post mentions that HHVM uses parts of Proxygen.


Sorry; I had read through most of it but was scanning for PHP or Hiphop. Honest mistake, I swear :).


Hi! Looks interesting!

I have a question -- in particular, the blog post mentions that the framework "includes both server and client code", and my question is about the second part :-)

I'm wondering how it compares to the other C++ HTTP client solutions: is it closer to higher-level libraries like cpp-netlib, Casablanca, or POCO -- or is it lower-level, comparable to Asio (or Boost.Asio)?

In your view, what are the main relative advantages/disadvantages (namely in the scenario mentioned in the blog post, i.e., integration into existing applications)?


I'm curious to know why there is a file named PATENTS in the repo (alongside the LICENSE file). What kind of legal protections would this "additional grant" give you?


This is similar to the Apache License, which allows developers to use the project with the confidence that we grant a license to any patents that may affect the project. This is a grant we use for all of our projects and is not anything specific to Proxygen.


I am sorry to point this out, but the patent license of Proxygen does not look similar to that of the Apache License, for two reasons:

- the license is terminated when one files a claim against _any_ of Facebook's software or services (IIRC the Apache License is terminated only when a claim is filed against the software itself)

- the license also terminates when you claim that "any right in any patent claim of Facebook is invalid or unenforceable"

The second clause seems very aggressive (or pro-patent) to me, which makes me feel sorry for the developers of Proxygen, since IMO such a clause would harm the acceptance of the software outside Facebook.

It would be great if you reconsider the patent license.

Disclaimer: I am a developer of H2O, an open-source HTTP/1 and HTTP/2 library, so there is obviously a conflict of interest here. But I wanted to leave a comment anyway since, honestly, I feel sorry if my friends at Facebook need to go with this kind of license.


That's a good point, though; that's not exactly the kind of thing people expect in an open source license.

Then everyone goes and complains about GPL vs. BSD... but this is waaaay worse.


I would love to see a comparison between Proxygen and another server (ideally nginx or golang server). The numbers are impressive, but the client is on the same box, it is a big box, and it is a simple and short response. So I'm not sure if I should be wowed or not. From reading the post, Proxygen was written as a server that would be well-integrated into Facebook tools. I don't really use Facebook tools, so I'm not sure if Proxygen would be right for me?


I think https://news.ycombinator.com/item?id=8563766 is a good answer to use cases.


Maybe I am being dense, but why is Proxygen a better solution here?


It's a library that you can integrate into your application rather than passing requests via an intermediary (so performance would improve). It's not necessarily a better solution unless you're optimizing for performance. It probably isn't if you're going for ease of development and maintainability.


How tested is the websocket support? Does facebook use this for TLS termination?


If you need a really simple embeddable C++ webserver that supports websockets, can I (as the primary developer) suggest SeaSocks: https://github.com/mattgodbolt/seasocks


Websocket support isn't out yet unfortunately. It's something we hope to get to soon.

Our reverse proxy uses proxygen and does TLS termination too, yes.


Hey, I am a bit late, but I wonder how you generally handle Unicode in C++, since the language itself does not have much support for it. That's what always makes me wary of writing any kind of server in C++.



Could you comment on the verisimilitude of the benchmark parameters? I would expect in Facebook's production environment the number of active sockets would be hundreds or thousands of times greater than 400.


You're right we see many more connections in production. We didn't have enough time to run exhaustive benchmarks for all the different possible combinations of active sockets, requests per connection, etc. As you might suspect, 400 was a sweet spot for our performance numbers at 1 core. Overall RPS didn't dip too much when increasing the number of connections. The table we included is just to give a rough idea of perf.


Are there any plans for SPDY and HTTP/2?


From the first paragraph:

In addition to HTTP/1.1, Proxygen (rhymes with "oxygen") supports SPDY/3 and SPDY/3.1. We are also iterating and developing support for HTTP/2.


We already do support SPDY/3 and SPDY/3.1, and we're working on HTTP/2 currently.


I really like your idea of the 4-part abstraction. Still… Your library really is not "that" easy to use.

I'm currently working on my own library using libuv, http-parser, nghttp2, and wslay, which is very similar in its use to node.js. As you might guess, an echo server is therefore only about 15 lines of code, yet about as performant as your framework. The downside is that it's not as flexible due to the missing "4-part abstraction" (really… an excellent idea).

That's why your release somehow saddens me: when I release my framework to the public, it might be pretty good for cross-platform apps etc. compared to others, but it will never be as popular as yours. Heck… I don't even have 10 Twitter followers.


Care to show us what you've got instead of just saying how much better it is than Proxygen?

Also Facebook is a group of people so saying "your" doesn't really sound right.


I'm sorry… English is not my native tongue. But I'm learning fast. :)

In no way did I intend to say that my framework is better overall, but I do think it's better suited for simple things, like apps.

In fact, I think I will integrate something like their "four-part abstraction", because I really think this is a great idea.


is it open source?


dcsommer (author of proxygen) gave a great talk about this at the last Sourcegraph open-source meetup. Here's the video: https://www.youtube.com/watch?v=-yxQIRl6Qic


I'm flattered, but I'm just an author. Proxygen is the work of about a dozen people over 4 years at Facebook.


"You will need at least 2 GiB of memory to compile proxygen and its dependencies." What?


fbthrift needs 2GiB of memory to compile (https://github.com/facebook/fbthrift) and proxygen has a dependency on fbthrift.


Off-topic: I have a problem using fbthrift and the cpp2 server together. I find that there are two IDL-to-C++ compilers that I need to use - the old Thrift compiler written in C++ and the new one written in Python. Code generated by both of these compilers is used in the tests! Is there any tutorial or article with step-by-step instructions?


That raises the question of what in fbthrift needs so much memory... is it mainly due to heavy use of C++ features like template metaprogramming?


And then we go back to minimalism, Lua, ... always in this circle.


That is not unreasonable for modern software. You don't need 2GB of RAM to use it.


Completely reasonable for modern software of this size/complexity.


I wouldn't like to try compiling my work projects on anything less than 16 GB of RAM, never mind 2 GB.


We used to do nginx + gunicorn for our REST services, but it was not responding well beyond a point (for a given EC2 instance). We replaced that with nginx + Lua (the OpenResty module) and saw almost a 10x improvement in response times. Would it make sense for us to invest in something like this and hope to see a significant performance gain? Lowering response times further is not a big deal, but being able to get those same response times on a lower-priced instance would definitely help. We have no real C++ skills on the team, but we could learn or hire.


What are the units in the table? The top row looks like the number of workers, but the large numbers are unitless.


It's requests per second (averaged over a 60 second test run).


Looks to me like Facebook's answer to golang?

Building simple, standalone HTTP services with good performance seems to me to be what those two projects (Proxygen and golang) are really about.

Now the question is how much faster using C++ is, and how much safer and quicker to write golang is...


I think Facebook is a lot into D, which sounds like a more appropriate, for lack of a better word, "replacement" for golang.


There are some D advocates, users, and enthusiasts (I fall into the last category) at Facebook, and a slowly increasing amount of D code, but the vast majority of infrastructure projects are done in C++, and most people are still choosing it for new projects.


I thought so as well, but then why release a C++ lib that's supposedly core to your infra?


Go is a nice language, but so is C++14, especially if you like the expressiveness it provides.

Plus, there are still lots of scenarios that Go tooling doesn't support yet. It will eventually, but that doesn't help if you are starting a project today.


I'm guessing that at facebook scale a GC would be expensive.


Compared to Google scale, which can handle it just fine? :)))


Nope, golang can't handle Google either; it would be too expensive for them too.



Google's a huge company and its products have a wide variety of requirements. For context, dl.google.com is basically a download server: its goal is to get static files from Google's servers to your client as quickly as possible, with some minimal business logic. Something like this is basically ideal for Go, because it is relatively stateless and focuses on shuffling bytes from one IO interface to another. Another ideal use-case was the project that Matt Welch blogged about moving to Go.

When most laypeople think about "Google", they probably mean the search engine. I cannot imagine the serving system of Google - the part that actually retrieves the results and ranks them - ever being written in anything but C++. The scale of data that it operates on is just too large, the complexity of the code too high, and the CPU budget per request too small.

Now, there are other binaries in the search serving stack that I think should ideally be written in Go. I said as much when I was at Google, though I doubt it'd ever happen simply because of inertia. But that's probably not what people are thinking of when they say "Google-scale".

Source: I worked in search for 5+ years.


By "the part that actually retrieves the results and ranks them" you mean the nodes of a cluster, that run the RPCs?

I guess the "hotspot" would be the code that has to merge the top results from the different nodes and actually deliver the top-rated 10 items to the user?


That some Google services were rewritten in golang doesn't really say much. In a video, Rob Pike himself said that lots of services at Google can't and shouldn't be rewritten in Go because of the GC. The video: http://channel9.msdn.com/Events/Lang-NEXT/Lang-NEXT-2014/Pan... - at 31:20, Andrei starts talking about the problem of the GC at Facebook's scale; you can start watching there.


Excellent, I've been toying around with my own and looking at LibUV. I think the time is right for something like this. I want to maintain state on everything that connects to me.


Looks interesting. Does anyone have a comparison of the plethora of C++ HTTP frameworks out there, such as Civetweb etc.?


Facebook oughta hire some better scripters; the deps.sh is of terribly low quality. I didn't get more than a couple of lines in before I stumbled upon this (which tells me the author has no clue :) ):

'start_dir=`pwd`; trap "cd $start_dir" EXIT;'...

Needless to say, the script can be dangerous: if a directory change fails, for instance, there are no checks, and sudo make uninstall is run anyway in a different directory than the intended one.


Bash isn't my expertise and I put this together pretty quickly. Please send pull requests! Forgive my ignorance, but what's the danger of the cd'ing in the EXIT trap? Also, I did set -e, so there's no problem of running "sudo make uninstall" from the wrong directory, afaict.


My last message sounded a bit harsh, sorry about that. Anyway, some things to consider:

1. You don't need bash; use /bin/sh instead to be more compatible with other shells (I don't have bash, and neither do a lot of other systems after the latest Shellshock incident). There's really no need to limit it to bash (bash is one of many shells, but it's very commonly mistaken for "shell script").

2. The script is executed in a subshell, so the directory your script is in when exiting is irrelevant; it doesn't affect the caller at all. Try it by creating a new script that just does 'cd a_dir_that_exists' and running it from a terminal. :)

3. set -e makes the program stop in case of _unhandled_ errors, yes, so you're right: the example I gave is indeed wrong, and it would stop on the failed cd attempt.

Instead of using '|| true' (to deliberately ignore errors), the standard way to do it is '|| :' (which doesn't fork the true binary). However, I would really recommend taking care and handling possible errors.


Skimming through the article, it seems to me that this server spawns a thread per connection, is that correct?


We use a very different model, actually. Since spawning OS threads is expensive, we opted for the popular nonblocking I/O approach. Each worker thread (usually one per CPU core) is given connections in a round-robin fashion from the listening socket. The worker thread then runs an event loop, processing events on its accepted sockets.
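
Roughly, the shape of that model is something like the sketch below (plain POSIX sockets and epoll rather than our actual folly-based code, just to illustrate the idea):

    // Conceptual sketch of the model described above, not Proxygen's code:
    // one acceptor hands connections to N worker threads round-robin, and
    // each worker multiplexes its sockets with a nonblocking event loop.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <thread>
    #include <vector>

    struct Worker {
      int epfd = epoll_create1(0);

      void add(int fd) {
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.fd = fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);  // safe from the acceptor thread
      }

      void loop() {
        epoll_event events[64];
        for (;;) {
          int n = epoll_wait(epfd, events, 64, -1);
          for (int i = 0; i < n; ++i) {
            // A real server would parse HTTP here and write a response;
            // this sketch just drains the socket and closes on EOF/error.
            char buf[4096];
            if (read(events[i].data.fd, buf, sizeof(buf)) <= 0) {
              close(events[i].data.fd);
            }
          }
        }
      }
    };

    int main() {
      int listenFd = socket(AF_INET, SOCK_STREAM, 0);
      sockaddr_in addr{};
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = INADDR_ANY;
      addr.sin_port = htons(8080);
      bind(listenFd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
      listen(listenFd, 128);

      unsigned nWorkers = std::thread::hardware_concurrency();  // ~1 per core
      if (nWorkers == 0) nWorkers = 4;
      std::vector<Worker> workers(nWorkers);
      std::vector<std::thread> threads;
      for (auto& w : workers) {
        threads.emplace_back([&w] { w.loop(); });
      }

      size_t next = 0;
      for (;;) {
        int fd = accept(listenFd, nullptr, nullptr);
        if (fd >= 0) {
          workers[next++ % nWorkers].add(fd);  // round-robin dispatch
        }
      }
    }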


What kind of programming technique did you use to implement the handling of the protocols? Did you implement them as finite-state machines, or did you use coroutines, or some other technique?

Do you think that C++ is a well suited language for this kind of processing? Is it possible to say, now this project is in a mature state, that other languages (e.g. Rust) could have helped make your implementation simpler?


Hey, I'm a Software Engineer on Proxygen as well. Proxygen relies heavily on folly's buffer management abstractions such as IOBuf (https://github.com/facebook/folly/blob/master/folly/io/IOBuf...) and Cursor (https://github.com/facebook/folly/blob/master/folly/io/Curso...). The protocol parsing implementation uses folly::io::Cursor to safely read byte sequences across non-contiguous buffers. Errors during parsing are wrapped up in Result types (https://github.com/facebook/proxygen/blob/master/proxygen/li...), which take inspiration from Rust. Such constructs simplify our implementation to a reasonable extent while still being low-level enough to extract performance.
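
As a toy illustration of why Cursor helps (written from memory rather than taken from Proxygen, so exact folly signatures may vary by version):

    // Toy example: chained (non-contiguous) IOBufs walked with a Cursor.
    #include <folly/io/Cursor.h>
    #include <folly/io/IOBuf.h>
    #include <iostream>

    int main() {
      // Bytes often arrive from the socket in separate buffers;
      // IOBufs can be chained together without copying.
      auto head = folly::IOBuf::copyBuffer("GET /index.ht");
      head->prependChain(folly::IOBuf::copyBuffer("ml HTTP/1.1\r\n"));

      // Cursor reads transparently across the chain boundary, with bounds
      // checking, so the parser never cares where one buffer ends.
      folly::io::Cursor cursor(head.get());
      std::string method = cursor.readFixedString(3);  // "GET"
      cursor.skip(1);                                  // the space
      std::cout << "method=" << method
                << ", bytes left=" << cursor.totalLength() << "\n";
      return 0;
    }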


Note: as Google discovered a few years ago, round-robining the connections is probably less than ideal.



