Scaling a startup from 0 to 40 hits per second in 3 days (markmaunder.com)
45 points by mmaunder on Aug 22, 2007 | 24 comments



I'm curious about your previous setup. You said mod_perl2, MySQL and Apache2 weren't cutting it for you, but I've scaled to that level fine with mod_perl2, PostgreSQL and Apache2. The key was database connection pooling via Apache::DBI, all the Perl modules cached in a startup.pl via mod_perl2, keep-alive and host lookups (and a few other things) completely off in Apache2, and all queries using indexes, with the indexes entirely in memory via Postgres.
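
In case it's useful, the startup.pl side of that looks roughly like this (a minimal sketch; the DSN, credentials and module names are placeholders):

    # startup.pl -- loaded once by the Apache parent via
    #   PerlRequire /path/to/startup.pl
    use strict;
    use warnings;

    # Apache::DBI must be loaded before DBI so it can
    # transparently cache one connection per child process
    use Apache::DBI ();
    use DBI ();

    # preload application modules here so their compiled code
    # is shared between children instead of copied per child
    # use My::App ();

    # open the cached connection as each child starts up
    Apache::DBI->connect_on_init(
        'dbi:Pg:dbname=myapp', 'myuser', 'mypass',
        { AutoCommit => 1, RaiseError => 1 },
    );

    1;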

Early on, like in your situation, I went to the file system at first because I couldn't "make it work," but eventually I went back to Postgres after I figured out its scalability details. If you have a lot of little files and you hit them a lot, that will probably become your bottleneck eventually. Did you figure out the bottleneck in your DB setup, or has there just not been enough time yet? Just curious.


Ah, a fellow mod_perl hacker. :)

I'm using Apache::DBI, caching everything via startup.pl on load; KeepAlive is on but with a very small timeout, and host lookups are off.

Also, I'm using the worker MPM in Apache2, just FYI. I've found it to be really memory-efficient.
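
Roughly, the relevant httpd.conf directives (illustrative values, not my exact config):

    # worker MPM: a few processes, many threads, less RAM per connection
    <IfModule mpm_worker_module>
        StartServers         2
        MaxClients         150
        ThreadsPerChild     25
    </IfModule>

    KeepAlive           On
    KeepAliveTimeout    2     # keep the timeout very small
    HostnameLookups     Off   # no reverse DNS per request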

I've been using MySQL with Apache::DBI for years and it's usually brilliant - I ran WorkZoo.com, a high-traffic job search engine, with a combination of MySQL and a full-text API.

With feedjit I'm basically storing weblogs. I either have to dump them into a single table and query that - which is what I was doing, and the high query rate with mixed reads and writes was a problem - or have lots of individual tables, which isn't feasible beyond about 500 tables with MySQL. So small files work best for me.
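
Each hit just appends a line to a small per-URL file, something like this (a simplified sketch; the hashed directory layout and log format here are made up for illustration):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    sub log_hit {
        my ($url, $referrer) = @_;
        # hash the URL into a two-level directory tree so no
        # single directory accumulates millions of files
        my $h   = md5_hex($url);
        my $dir = "/var/feedjit/" . substr($h, 0, 2);
        mkdir $dir unless -d $dir;
        open my $fh, '>>', "$dir/$h" or die "open $dir/$h: $!";
        print {$fh} join("\t", time(), $referrer), "\n";
        close $fh;
    }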


We're talking one file per unique domain or per unique url?

Also, I take it from your comment that the bottleneck was in MySQL doing the writes. I assume the read side is indexed appropriately so MySQL finds the right part on the disk almost instantly. Do you think it is a locking issue then, e.g. table lock vs row lock? (Forgive me, I haven't used MySQL in a while.)


One file per URL. The problem with MySQL is pretty much what you've described. I have (had) an index on a table that gets read a lot by the application. It's amazingly fast - MyISAM tables really rock for fast reads on indexes. But therein lies the problem, because the table also gets written to a lot. Every time it gets written to, MySQL needs to lock the table and update the index.

You can improve things a bit by using INSERT DELAYED. When you use that, MySQL doesn't guarantee that it'll insert the row immediately, but the query returns immediately (it doesn't block), and MySQL queues up rows and inserts them in bulk when it gets a chance. The non-blocking behavior and bulk inserts that INSERT DELAYED gives you speed things up, but only to a point, because you're still constantly updating an index on a table that's getting a lot of reads.
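
From mod_perl it's a one-liner (a sketch; the table and column names are made up):

    # returns as soon as mysqld queues the row; the actual write
    # to the MyISAM table happens later, in batches
    $dbh->do(
        'INSERT DELAYED INTO hits (url, referrer, ts) VALUES (?, ?, ?)',
        undef, $url, $referrer, time(),
    );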

Mark.


Congrats. It seems like the main bottleneck of web apps is the on-disk database. I'd say that if your app is small enough you shouldn't even start with a database, but that would be premature optimization.


Thanks. A while back I was playing with a ramdisk on Linux and syncing the data to non-volatile storage. That worked quite well. I can't use that in this case because there are lots of tiny files and they occupy too much space for a ramdisk. But looking at the output of vmstat, it looks like there isn't much disk I/O, so I think the Linux filesystem cache is working quite well.


I'm curious how you reached so many Japanese bloggers. Were there some big Japanese blogs that covered you at launch?


We got covered by http://www.100shiki.com/ and then got installed by two very high-traffic Japanese bloggers, and it went viral from there.

Mark.


That's impressive. What's your hardware?


Not much. :) AMD Athlon(tm) XP 2100+ with 1 gig of RAM and a single SATA drive.

I'm about to upgrade though - mostly because I need more RAM to support more concurrent connections.


Is that just for Feedjit or is there other stuff on there too?


There are 5 other websites on the server. They're low traffic, but some run PHP, which means I've had to compile mod_php into the server, so each thread takes up a lot more RAM.

I have a new server ordered which comes online today, and that's just for feedjit, so I can compile a feedjit-only Apache and it'll be able to handle many more connections.

Mark.


I noticed that if you come to a site using Feedjit via Google, your search term is included in the referring URL. Does this present any kind of privacy issue? I remember AOL got in a lot of trouble for releasing search queries from their users.
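
For example, clicking a Google result sends the destination site a request along these lines (hypothetical values):

    GET /some-post/ HTTP/1.1
    Host: example-blog.com
    Referer: http://www.google.com/search?q=some+search+term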


I've been thinking a bit about this. I have a couple of blogs of my own, so I always think about feedjit in that context. Am I happy with my readers seeing what search terms are sending people to my site, and do I think my readers will get mad because people can see what they're searching for?

feedjit only shows the most recent 10 referrers and clicks, so I don't think there's anything there that'll give a 'competitor' some sort of strategic advantage. Besides, they can just Google around and find out what I'm showing up for in the SERPs.

As far as privacy goes, as long as I'm not personally identifying people while showing what search terms they're using, I think there aren't any privacy issues. I see my own search terms showing up with my location, 'Seattle, WA', and no one knows who actually searched for that term.

I haven't really applied my mind to this as much as I should, but those are my initial thoughts, and I'd love to hear if anyone feels differently.

Mark.


I guess I'm just generally confused as to why AOL got in trouble. But IIRC each person had a unique ID, so you could put their searches together, and I guess some of those people had searched for their own names. Since you're not putting multiple searches from the same person together, it should be alright.

On the other hand, why should anyone expect their searches to be private at all? If I wanted to share searches coming to my website and even group them by IP address, what's stopping me? What if I said I would do this in my privacy policy?

BTW, I've been thinking of a Feedjit-type thing, but for search keywords instead of location. That's what got me wondering about these privacy issues. I hope that idea wasn't in your future plans ...


Isn't the HTTP referrer already available to any site you visit from a search?


Would you mind expanding a bit on keep-alive settings? In what cases would you want to set it to a low value?

Question 2. As a web developer, should I be worried that I don't know as much as you about scaling a web app? What will I do when my web app makes it big?


KeepAlive dedicates a process to one client until the KeepAliveTimeout expires. If you have 150 MaxClients and your KeepAliveTimeout is 5 sec, and you get 150 clients in one second, you can't serve anyone else for up to 5 more seconds. Ditching KeepAlive gets rid of this problem.

So basically you want to lower it (or turn it off) if you have a lot of different clients coming at the same time. On the other hand, if you have a few clients and they each make a lot of requests in a short amount of time, KeepAlive can save some time and resources.
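
Concretely, it's the choice between these two httpd.conf setups (illustrative values):

    # many distinct clients, one or two requests each:
    KeepAlive Off

    # fewer clients, each fetching a page plus lots of assets:
    # KeepAlive On
    # KeepAliveTimeout 2
    # MaxKeepAliveRequests 100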

In answer to your other question, I wouldn't worry too much about this stuff until you need to. When that happens, there are a lot of good resources on the Web, and you can post here and I'm sure people will help you. If not, contact me, and I'll do the best I can :)

Here are some further explanations about KeepAlive: http://virtualthreads.blogspot.com/2006/01/tuning-apache-par... and http://www.perlcode.org/tutorials/apache/tuning.html


Just a quick comment. I had KeepAlive turned on for my web server, but set to a low timeout value of 2 seconds. The server is getting a lot of traffic this morning and was approaching MaxClients. I had a few reports of pages or images not loading, which suggested that the server was occasionally out of connections. I just turned off KeepAlive completely and wow! I've basically doubled (or more) the amount of traffic I can handle.

The downside is that browsers need to open a new connection for every page component they load. But things will load more reliably now and I can handle more traffic.

I have a new server ordered which comes online today with lots more memory. When I deploy that, I'll probably try turning KeepAlive back on to make loading faster for my users. Then when things get hairy again I'll turn it off. Come to think of it, it'd be nice if I could do that based on server load, without restarting Apache.
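
Actually, a graceful restart re-reads the config without dropping in-flight requests, so a monitoring script could probably toggle it based on load (a sketch, not something I've set up):

    # after flipping KeepAlive in httpd.conf:
    apachectl graceful   # reloads config; active requests finish normally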

Mark.


I should have definitely posted these questions in his blog's comments. I don't think he's coming back :-(


Sorry, I've been on the road - I'm busy driving from Denver to Seattle (currently in a hotel in Bozeman). I've replied to a bunch of questions and will check back again later today.

Mark.


congrats on the great launch!


I suppose I will sound like a troll, but why not consider PostgreSQL?


I think at this stage both DBs will probably do a fine job. PostgreSQL has been getting faster and MySQL has more features. I just know MySQL. I know where it keeps its data files, how to recover from a crash, and I know the my.cnf config file backwards and how to tune it, so it's really just down to what I'm comfortable with.

Mark.



