Well besides the fun of hacking around and building little HTTP servers, I could see this being useful if you want to save money by running fewer instances of your HTTP service. For instance, if you have a widely deployed Python webservice that isn't scaling well, you could rewrite it in C++ with very little boilerplate using proxygen's httpserver.
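To make "very little boilerplate" concrete, here's roughly what a hello-world server looks like. This is a hypothetical sketch loosely modeled on the echo sample we ship in the repo - treat the names and signatures as approximate and check the repo for the real API:

    // Hypothetical sketch modeled on proxygen's echo sample; names and
    // signatures are approximate, not the authoritative API.
    #include <folly/io/IOBuf.h>
    #include <proxygen/httpserver/HTTPServer.h>
    #include <proxygen/httpserver/RequestHandler.h>
    #include <proxygen/httpserver/RequestHandlerFactory.h>
    #include <proxygen/httpserver/ResponseBuilder.h>
    #include <proxygen/lib/http/HTTPMessage.h>

    using namespace proxygen;

    class HelloHandler : public RequestHandler {
     public:
      void onRequest(std::unique_ptr<HTTPMessage>) noexcept override {}
      void onBody(std::unique_ptr<folly::IOBuf>) noexcept override {}
      void onEOM() noexcept override {
        // Respond once the full request has arrived.
        ResponseBuilder(downstream_)
            .status(200, "OK")
            .body("hello\n")
            .sendWithEOM();
      }
      void onUpgrade(UpgradeProtocol) noexcept override {}
      void requestComplete() noexcept override { delete this; }
      void onError(ProxygenError) noexcept override { delete this; }
    };

    class HelloHandlerFactory : public RequestHandlerFactory {
     public:
      void onServerStart(folly::EventBase*) noexcept override {}
      void onServerStop() noexcept override {}
      RequestHandler* onRequest(RequestHandler*, HTTPMessage*) noexcept override {
        return new HelloHandler();
      }
    };

    int main() {
      HTTPServerOptions options;
      options.threads = 4;  // worker event-loop threads
      options.handlerFactories =
          RequestHandlerChain().addThen<HelloHandlerFactory>().build();

      HTTPServer server(std::move(options));
      server.bind({{folly::SocketAddress("0.0.0.0", 8080, true),
                    HTTPServer::Protocol::HTTP}});
      server.start();  // blocks, serving requests
      return 0;
    }

The handler/factory split is where the work happens: the factory's onRequest runs on an IO thread for each request and hands back a handler that receives the parsed request callbacks.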
It's early stages for the proxygen open source project. Maybe further down the road we'll provide off-the-shelf binaries, but we think the library is already interesting enough to warrant a release.
Is that the inception of the project? I am more curious about the lineage. How was the decision made to go this route instead of throwing more instances at it?
The blog post goes into more detail, but Proxygen started with an effort to write an L7 reverse proxy that could integrate deeply with internal FB services. We pulled a lot of the non-FB-specific pieces out into this open source release. Before that, we used hardware load balancers for this role, which was expensive.
By expensive, I mean not just capital costs but also the cost of operating them: they weren't as reliably configurable, health-checkable, and instrumentable as we wanted, whereas Proxygen (and a later L4 load balancer) were.
Also, the previous load balancers had constraints we weren't willing to accept: they required special connectivity to our networks, we could not use particular combinations of options, and we had to rely on vendors to solve problems that most of their customers were not encountering and/or able to detect.
Yeah, we are hitting the same wall right now and we've been going down the same route using HAProxy/Chef. Our plan is to put an API in front of it and treat it similar to an ELB. Are you still using hardware load balancers for SSL termination?
That's interesting. I had been working on something similar years ago. I eventually open sourced it, but discontinued work on it: http://github.com/baus/switchflow
I'm surprised more systems don't take this approach of doing more work in L7 proxies.
I am not a huge fan of NIH, but more often than not it is simply easier in the long run to roll something in-house that does the job. Even though the technical debt that piles up can dwarf the original choice, it is often still a better idea given employee churn: it's much easier to keep in-house code within the standards of your normal coding conventions.
I don't work at FB, but near the bottom of the comments you can see that they build on existing C++ libraries that have been tried and tested. We do the [same thing][1] with smaller services, simply because it's easier in the SDLC process to use libraries that are already in-house.
HAProxy only got TLS support in 2012, whereas work on Proxygen began well before that. In addition, HAProxy does not support SPDY natively (although it can forward it to something that does).
Apache Traffic Server only got SPDY support in a release about 45 days ago, according to its release notes.
Many other relatively basic features were missing besides those, and similar considerations apply to the other projects that existed back then (and even now).
Any kind of internal service that listens on a port and exposes an API. For example, you could put a bunch of monitoring agents on all your machines; those agents gather information about the machines and report back to a service built on Proxygen. Your service would use Proxygen directly instead of hiding behind an nginx or Apache server via some kind of FCGI setup, and the agents themselves could use the Proxygen client pieces to talk to the service's Proxygen server pieces.
What are the security implications of running native code on a public-facing service? How many RCEs has Proxygen had? Has it been audited for security? What is the testing procedure?
In general, native code is not the same as unsafe code (that's why I'm personally excited about Rust: native + safe = awesome). In C++'s case, though, it is true that a memory error could be used to exploit the process. Careful use of safe memory abstractions is our main safeguard against this. I can't comment on RCEs, but Proxygen has been both externally and internally audited for security. We have many unit tests in the project, and internally we also have more comprehensive integration tests and a fuzzer.
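To illustrate what I mean by safe memory abstractions, here's a generic (non-Proxygen) example of the kind of pattern we try to avoid versus the kind we prefer:

    // Generic illustration, not code from Proxygen.
    #include <cstddef>
    #include <cstring>
    #include <string>
    #include <vector>

    // Risky: manual ownership and an unchecked length invite exactly the
    // memory errors that make native code exploitable.
    char* copyBodyUnsafe(const char* src, std::size_t len) {
      char* buf = new char[len];  // who frees this, and when?
      std::memcpy(buf, src, len);
      return buf;
    }

    // Safer: the container owns its memory and knows its size, ownership
    // transfers are explicit moves, and .at() bounds-checks access.
    std::vector<char> copyBodySafe(const std::string& src) {
      return std::vector<char>(src.begin(), src.end());
    }

None of this makes C++ memory-safe the way Rust is, but consistently using owning containers and smart pointers instead of raw new/delete removes whole classes of bugs.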
I currently use nginx for all my projects, which for me mostly consist of (1) serving static content and (2) proxying to gunicorn. My setups are usually not overly complicated: basic redirects for things like www/non-www or SSL/non-SSL. Installing SSL certs can be a pain sometimes, but for the most part I use a lot of default settings and have never needed to go in-depth with tweaking.
Does somebody like me have a reason to check out Proxygen?
Maybe not today. We haven't open sourced a configurable proxy yet, so you wouldn't be able to do your redirects and some of these other features without writing C++ code.
If you're interested in how HTTP frameworks are designed and implemented, I'd definitely suggest checking it out though. This project is initially going to be more interesting to people integrating with HTTP quite directly.
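To give a sense of the C++ involved: your www/SSL redirect case would be a small handler along these lines. This is a hypothetical sketch modeled on the repo's echo sample (exact signatures may differ, and example.com is a placeholder):

    // Hypothetical sketch of a redirect handler, modeled on the echo
    // sample; example.com is a placeholder host.
    #include <folly/io/IOBuf.h>
    #include <proxygen/httpserver/RequestHandler.h>
    #include <proxygen/httpserver/ResponseBuilder.h>
    #include <proxygen/lib/http/HTTPMessage.h>

    class RedirectHandler : public proxygen::RequestHandler {
     public:
      void onRequest(
          std::unique_ptr<proxygen::HTTPMessage> msg) noexcept override {
        // 301 every request to the https:// version of the same URL.
        proxygen::ResponseBuilder(downstream_)
            .status(301, "Moved Permanently")
            .header("Location", "https://www.example.com" + msg->getURL())
            .sendWithEOM();
      }
      void onBody(std::unique_ptr<folly::IOBuf>) noexcept override {}
      void onEOM() noexcept override {}
      void onUpgrade(proxygen::UpgradeProtocol) noexcept override {}
      void requestComplete() noexcept override { delete this; }
      void onError(proxygen::ProxygenError) noexcept override { delete this; }
    };

That's not much code, but it's a different world from a one-line nginx rewrite rule, which is why I'd say "maybe not today" for your use case.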
Question for you and the general C++ community. I am a Java developer mostly but it looks like I'll be transitioning to a few C++ projects in the near future. Any good resources you can recommend for learning modern C++? Particularly anything you've used to get developers on your team similarly up to speed.
I've cruised through the Proxygen code base and there are definitely some head scratchers.
I'd recommend "C++ Primer" first, "Effective C++" second, and "The C++ Programming Language" as a reference. Other must haves are "Modern C++ Design", the other Scott Meyers books, and the Herb Sutter books.
Hey, could you recommend any books or tutorials for C++1y/C++14? I learned C++ in my CS courses, but I've been using Java/Python/JavaScript these days. I'm thinking of coming back to C++ because C++14 looks cool.
Hey! So Proxygen was originally a reverse-proxy load balancer. Is that still how Facebook uses it now? If not, what is its current role? Are there any plans for integrating this with HipHop/PHP in any way?
Yup, we still use Proxygen (the library) in our reverse proxy. Maybe some day we'll be able to open source the reverse proxy too, but it's pretty deeply integrated with internal FB code right now so that's tricky.
We already use the Proxygen HTTP code for the webserver part of HHVM internally. We hope to release that webserver part too (in the HHVM project).
Websockets were mentioned in the blog post. Has Proxygen been deployed with websockets at Facebook scale? How much support for websockets is there in the open-sourced Proxygen?
I would love to see Proxygen integrated as a HackLang extension. Especially the HTTP parser. As far as I know, the PHP world is sorely lacking a robust HTTP parser.
In response to my comment here, I did some more experimentation with pecl_http. It turns out version 2.1.4 looks relatively stable. Just be sure not to make the same mistake I made of referring to the documentation on the php.net website. Instead, one should look here for the version 2.x documentation:
I have a question -- in particular, the blog post mentions that the framework "includes both server and client code", and my question is about the second part :-)
I'm wondering, how does it compare to the other C++ HTTP client solutions: Is it closer to higher-level libraries like cpp-netlib, Casablanca, or POCO -- or more on a lower-level / comparable to Asio (or Boost.Asio)?
In your view, what are the main relative advantages/disadvantages (namely in the scenario mentioned in the blog post, i.e., integration into existing applications)?
I'm curious to know why there is a file named PATENTS in the repo (alongside the LICENSE file). What kind of legal protections would this "additional grant" give you?
This is similar to the Apache License, which allows developers to use the project with the confidence that we grant a license to any patents that may affect the project. This is a grant we use for all of our projects and is not anything specific to Proxygen.
I am sorry to point this out, but the patent license of Proxygen does not look similar to that of the Apache License, for two reasons:
- the license is terminated when one files a claim against _any_ of Facebook's software or services (IIRC the Apache License is terminated only when the claim is filed against the software itself)
- the license also terminates when you claim that "any right in any patent claim of Facebook is invalid or unenforceable"
The second clause seems very aggressive (or pro-patent) to me, which makes me feel sorry for the developers of Proxygen, since IMO such a clause will harm the acceptance of the software outside Facebook.
It would be great if you reconsider the patent license.
Disclaimer: I am a developer of H2O, an open-source HTTP/1 and HTTP/2 library, so there is obviously a conflict of interest here. But I wanted to leave a comment anyway since, honestly, I feel sorry that my friends at Facebook need to go with this kind of license.
I would love to see a comparison between Proxygen and another server (ideally nginx or a Go server). The numbers are impressive, but the client is on the same box, it is a big box, and it is a simple, short response, so I'm not sure whether I should be wowed or not. From reading the post, Proxygen was written as a server that would be well integrated into Facebook tools. I don't really use Facebook tools, so I'm not sure if Proxygen would be right for me.
It's a library that you can integrate into your application rather than passing requests via an intermediary (so performance would improve). It's not necessarily a better solution unless you're optimizing for performance. It probably isn't if you're going for ease of development and maintainability.
If you need a really simple embeddable C++ webserver that supports websockets, can I (as its primary developer) suggest SeaSocks: https://github.com/mattgodbolt/seasocks
Hey, I am a bit late, but I wonder how you generally handle Unicode in C++, since the language itself does not have much support for it. That's what always makes me wary of writing any kind of server in C++.
Could you comment on the verisimilitude of the benchmark parameters? I would expect in Facebook's production environment the number of active sockets would be hundreds or thousands of times greater than 400.
You're right, we see many more connections in production. We didn't have enough time to run exhaustive benchmarks for all the possible combinations of active sockets, requests per connection, etc. As you might suspect, 400 was a sweet spot for our performance numbers at 1 core; overall RPS didn't dip too much as the number of connections increased. The table we included is just to give a rough idea of the perf.
I really like your idea of the 4-part abstraction. Still… Your library really is not "that" easy to use.
I'm currently working on my own library using libuv, http-parser, nghttp2, and wslay, which is very similar in its use to node.js. As you might guess, an echo server is therefore only about 15 lines of code, yet about as performant as your framework. The downside is that it's not as flexible, due to the missing "4-part abstraction" (really… an excellent idea).
That's why your release somehow saddens me: when I release my framework to the public, it might be pretty good for cross-platform apps etc. compared to others, but it will never be as popular as yours. Heck… I don't even have 10 Twitter followers.
Off-topic: I have a problem using fbthrift and the cpp2 server together. I find that there are two IDL-to-C++ compilers that I need to use - the old Thrift compiler written in C++ and the new one written in Python - and code generated by both of these compilers is used in the tests! Is there any tutorial or article with step-by-step instructions?
We used to do nginx + gunicorn for our REST services, and it was not responding well beyond a point (for a given EC2 instance). We replaced that with nginx + Lua (the OpenResty module) and saw almost a 10x improvement in response times. Would it make sense for us to invest in something like this and hope to see a significant performance gain? Lowering response times further is not a big deal, but being able to get those same response times on a lower-priced instance would definitely help. We have no real C++ skills on the team, but we could learn or hire.
There are some D advocates, users, and enthusiasts (I fall into the last category) at Facebook, and a slowly increasing amount of D code, but the vast majority of infrastructure projects are done in C++, and most people are still choosing it for new projects.
Go is a nice language, but so is C++14, especially if you like the expressiveness it provides.
Plus, there are still lots of scenarios that Go tooling doesn't support. It will eventually, but that doesn't help if you are starting a project today.
Google's a huge company and its products have a wide variety of requirements. For context, dl.google.com is basically a download server: its goal is to get static files from Google's servers to your client as quickly as possible, with some minimal business logic. Something like this is basically ideal for Go, because it is relatively stateless and focuses on shuffling bytes from one IO interface to another. Another ideal use case was the project that Matt Welsh blogged about moving to Go.
When most laypeople think about "Google", they probably mean the search engine. I cannot imagine the serving system of Google - the part that actually retrieves the results and ranks them - ever being written in anything but C++. The scale of data that it operates on is just too large, the complexity of the code too high, and the CPU budget per request too small.
Now, there are other binaries in the search serving stack that I think should ideally be written in Go. I said as much when I was at Google, though I doubt it'd ever happen simply because of inertia. But that's probably not what people are thinking of when they say "Google-scale".
By "the part that actually retrieves the results and ranks them", do you mean the nodes of a cluster that serve the RPCs?
I guess the "hotspot" would be the code that has to merge the top results from the different nodes and actually deliver the top-rated 10 items to the user?
That some Google services were rewritten in Go doesn't really say much. In a video, Rob Pike himself said that lots of services at Google can't and shouldn't be rewritten in Go because of the GC: http://channel9.msdn.com/Events/Lang-NEXT/Lang-NEXT-2014/Pan... - at 31:20, Andrei starts talking about the problem of the GC at Facebook's scale; you can start watching there.
Excellent - I've been toying around with my own and looking at libuv. I think the time is right for something like this. I want to maintain state on everything that connects to me.
Facebook oughta hire some better scripters; the deps.sh is of terribly low quality. I didn't get more than a couple of lines in before I stumbled upon this (which tells me the author has no clue :):
    start_dir=`pwd`; trap "cd $start_dir" EXIT; ...
Needless to say, the script can be dangerous: if a directory change fails, for instance, there are no checks, but sudo make uninstall is run anyway - in a different directory than the one intended.
Bash isn't my expertise and I put this together pretty quickly. Please send pull requests! Forgive my ignorance, but what's the danger of the cd'ing in the EXIT trap? Also, I did set -e, so there's no problem of running "sudo make uninstall" from the wrong directory, afaict.
Last message sounded a bit harsh, sorry about that. Anyway, some things to consider:
1. You don't need bash; use /bin/sh instead to be more compatible with other shells (I don't have bash, and neither do a lot of other systems after the latest Shellshock incident). There's really no need to limit the script to bash (bash is one of many shells, but it is very commonly mistaken for "shell script").
2. The script is executed in its own shell process, so the directory your script is in when exiting is irrelevant - it doesn't affect the caller at all. Try it by creating a new script that just does 'cd a_dir_that_exists' and running it from a terminal. :)
3. set -e makes the program stop in case of _unhandled_ errors, yes, so you're right: the example I gave is indeed wrong, and the script would stop on the failed cd attempt.
Instead of using '|| true' (to deliberately ignore errors), the standard way to do it is '|| :' (':' is a shell builtin guaranteed by POSIX, so it never forks an external true binary). However, I would really recommend taking care to handle possible errors.
We use a very different model, actually. Since spawning an OS thread per connection is expensive, we opted for the popular nonblocking-IO approach. Each worker thread (usually one per CPU core) is given connections in round-robin fashion from the listening socket, and runs an event loop processing events on its accepted sockets.
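For the shape of it, here's a hypothetical stripped-down illustration of that model using raw epoll. Proxygen itself builds on folly's event base, so this is only the flavor, with error handling, keep-alive, and real HTTP parsing left out:

    // Hypothetical illustration of one-event-loop-per-core with round-robin
    // connection handoff; not Proxygen code.
    #include <netinet/in.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstddef>
    #include <thread>
    #include <vector>

    static void workerLoop(int epfd) {
      epoll_event events[64];
      char buf[4096];
      for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; ++i) {
          int fd = events[i].data.fd;
          // Drain whatever arrived; a real server would parse HTTP here.
          if (read(fd, buf, sizeof(buf)) <= 0) { close(fd); continue; }
          const char resp[] = "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi";
          write(fd, resp, sizeof(resp) - 1);
          close(fd);  // no keep-alive in this sketch
        }
      }
    }

    int main() {
      int listenFd = socket(AF_INET, SOCK_STREAM, 0);
      int one = 1;
      setsockopt(listenFd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
      sockaddr_in addr{};
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port = htons(8080);
      bind(listenFd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
      listen(listenFd, 128);

      // One event loop (epoll instance + thread) per core.
      unsigned nWorkers = std::thread::hardware_concurrency();
      if (nWorkers == 0) nWorkers = 4;
      std::vector<int> epfds;
      std::vector<std::thread> workers;
      for (unsigned i = 0; i < nWorkers; ++i) {
        epfds.push_back(epoll_create1(0));
        workers.emplace_back(workerLoop, epfds.back());
      }

      // Acceptor: hand new connections to the worker loops in round robin.
      for (std::size_t next = 0;; next = (next + 1) % epfds.size()) {
        int conn = accept(listenFd, nullptr, nullptr);
        if (conn < 0) continue;
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.fd = conn;
        epoll_ctl(epfds[next], EPOLL_CTL_ADD, conn, &ev);
      }
    }

The key property is that once a connection lands on a worker, all of its events are processed on that worker's loop, so there's no per-request locking.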
What kind of programming technique did you use to implement the handling of the protocols? Did you implement them as finite-state machines, or did you use coroutines, or some other technique?
Do you think that C++ is a well-suited language for this kind of processing? Now that the project is in a mature state, is it possible to say whether other languages (e.g. Rust) could have helped make your implementation simpler?