A million-dollar engineering problem (segment.com)
570 points by gwintrob on March 16, 2017 | 252 comments



A friend of mine was annoyed that a small service he liked was shutting down.

He contacted the developer who said that they were shutting it down because the server costs were higher than the money they were making.

They were spending 5k a month on AWS crap and claimed it was impossible to get any lower.

He helped them consolidate everything onto a single rented dedicated server costing 400 a month. Now the service is profitable, and will stay up.

It runs way faster on the single server. It has also required less maintenance since the move.

This kind of shit is everywhere. At this point simply not using AWS is a competitive advantage.


> At this point simply not using AWS is a competitive advantage.

Respectfully, I'm going to disagree. I consult full time on AWS cost optimization / reduction / understanding.

If you blindly run things on AWS without an understanding of the costing model, that'll work for a time. As you scale, you start to realize "oh my god it runs on money."

There are myriad ways around that, but a blanket "never use AWS" isn't going to address it constructively.


> I consult full time on AWS cost optimization / reduction / understanding.

Curious, have you ever consulted a client not to use AWS? As in "your cost optimization is going to be not using AWS and using $x instead"?


Absolutely. Great example: container workloads without high data volumes. Slap those into GCP if you can-- or arbitrage between different cloud providers / regions / on-prem if you have them.

That said, clients who come to me already have a significant investment with AWS. Getting rid of that is a high bar in terms of engineering effort.

That said, I've no financial incentive to recommend one provider over another. My clients pay me, the vendors do not.


Makes total sense. Thank you for responding!

We've pushed clients to competitors' products a few times. I was young and thought that was silly. However, those clients spread the word about us, and we didn't even have much of a marketing department; it was all word of mouth. Gaining their trust was gold and a really good long-term investment.


Curious about this as well. Otherwise, it's like a car salesman saying you shouldn't buy a bike, rather, understand mpg before buying a car.


Sure-- but I'm usually not retained until the environment is built out. The client's starting point is relevant to what comes next.


I would guess the network egress costs, specifically, might justify a blanket statement like...

"If you have content where you can't put a cache in front of it, and it drives significant bandwidth use, then AWS is not a great option"


This. I really feel like egress traffic costs are the most overlooked issue with popular cloud services like AWS or Google Cloud. Charging 9 or 12 cents for a mere GB of regular traffic is just insane, keeping in mind that even most CDNs are cheaper.

Data-driven applications on AWS are already expensive because of the high traffic costs, but storing/serving media on S3 single-handedly makes your bill explode. For everyone who feels like the statements above are exaggerations, just check out their actual prices (https://aws.amazon.com/ec2/pricing/on-demand/) in comparison with a regular CDN (https://www.stackpath.com/pricing/).


A quick and dirty comparison would be 3TB out in a month. That's what's typically included at popular VPS services for a decent sized VPS that's $20/month. On an EC2 instance, just the bandwidth for that month would be $269.91.
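The $269.91 figure can be reproduced with a few lines of arithmetic. This is only a sketch: the flat $0.09/GB rate, the 1 GB free monthly allowance, and counting 3 TB as 3,000 GB are assumptions chosen because they match the number above, not values quoted from a price list.

```python
# Back-of-the-envelope egress cost. The rate and free allowance are
# assumptions picked to reproduce the $269.91 figure, not official prices.
RATE_CENTS_PER_GB = 9    # assumed flat rate: $0.09 per GB out
FREE_GB_PER_MONTH = 1    # assumed free monthly allowance


def egress_cost_cents(gb_out: int) -> int:
    """Return the monthly egress bill in cents for gb_out gigabytes."""
    billable_gb = max(0, gb_out - FREE_GB_PER_MONTH)
    return billable_gb * RATE_CENTS_PER_GB


if __name__ == "__main__":
    cents = egress_cost_cents(3000)  # 3 TB counted as 3,000 GB
    print(f"${cents / 100:.2f}")
```

Working in integer cents avoids floating-point rounding surprises when comparing against a quoted bill.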


> A blanket "never use AWS" isn't going to address it constructively.

But that's not what the commenter was saying, of course.

More like (paraphrasing) "before you go full-hog on AWS, do some simple math first." Which an amazingly high percentage of people neglect to do, these days.


> before you go full-hog on AWS, do some simple math first

I had a coworker who was turned off of AWS by clicking the "Dedicated IP" checkbox when deploying a CloudFront distribution with SSL.

One month and $600[0] later he swears up and down that AWS is the devil.

You can spend an amazing amount of money on AWS by accident and/or without need. Hourly pricing is a huge footgun. Best to run everything through the "Simple" Monthly Calculator: https://calculator.s3.amazonaws.com/index.html

0: https://aws.amazon.com/cloudfront/pricing/


Weeks of work can save hours of meetings.


interesting thing to say, but did you plan for it to mean something in this context? :)


Nothing special - just the whole "Yeah, saw your email† ... look, I'm so busy, did you get the site up and running again? I'll get back to you on this after my vacation" management style which seems to drive a great many "decisions" that are made in this realm. And on which AWS, implicitly, banks a great deal of its business model.

† About the monthly you-know-what bill now well into 5 figures, and climbing.


> At this point simply not using AWS is a competitive advantage.


I read this as suggesting that a simple arbitrary rule to not use AWS is likely to save both money and also the time that would otherwise be spent tuning your use of AWS to cost less money.

Your time is valuable too. It could be spent developing something novel instead of messing around with server configuration.


I am curious about your thoughts on the following. It seems like there is a trough of benefit to AWS. In the mid-size world, especially where computing needs are relatively stable/predictable, you are much better off leasing dedicated servers from the usual suspects. Below that size, the linear scaling allows small scrappy startups to really scale/burst with their needs. And at large scale, but below Google/Facebook et al scale, it's cheaper than building your own datacenter yourself. At least this has been my casual observation. And fwiw, the startup I have just launched fits into this trough perfectly. So hopefully I am not just biased. I love the cloud for the spot market. But not for our base load. Spot is the nat gas of the server world, and OVH is the coal.


I agree with a lot of what you're saying-- but that trough is a lot wider than most people think.

At the single instance very small scale, it simply doesn't matter all that much. "Our business failed because our infrastructure bill was $400 instead of $150 every month" isn't that common of a story.

Base load can be effectively migrated to AWS, depending upon how it's built / structured. It's hard to lose money on infrastructure by converting worker servers to spot instances or lambda functions, as a for-instance-- but it's VERY easy to lose months of engineering effort in doing so.

Remember as well: done right you're not just buying services from the cloud, but flexibility as well. Resizing on the fly, bursting, duplicating your infrastructure for a test-- these things add value.


Resizing, bursting, and duplicating on the fly aren't exclusive to AWS, though.

A lot of teams rent space from places like OVH or Hetzner because the raw power is so much cheaper, and they already have the talent and tools to scale and manage a computing fleet (people like to think it's harder than it actually is).


Are bursting and resizing an issue in the trough, though? GP's premise is that these companies have platform and load stability. There can be marginal gains in having a dynamic architecture, sure, but when it also costs "months of engineering," I don't see where the savings would be compelling, or even guaranteed.


There are companies "in the trough" who are contently spending tens of millions a year on AWS. It's hard to give generic advice for all situations-- as with so much else, context matters heavily.

Those marginal gains could potentially pay for a dozen engineers...


My point was based on how dynamic the architectures are (or aren't) of those in the trough, not their level of spending.


Ah, my apologies!

It depends. I have a toy service that exists entirely as an autoscaling group. That's pretty dynamic!

Counterpoint: Mainframes.


A toy service is not what I'm thinking of as being in "the trough." The trough is where scaling and bursting aren't generally necessary.


How do you approach a horrifying AWS bill? Can you recommend any reading? I’m going to need to optimize a tiny fleet on GCE pretty soon.


It gets into the specifics of "what makes it horrifying" pretty quickly. It usually boils down to "nobody's responsible for cleaning things up," a misunderstanding of the billing model, and by the time anyone cares to fix it the billing report is multiple gigabytes.

Step 1 is always "figure out what you've got, and why it is this way."


So if the best way to avoid the mess is accountability for keeping things clean, how do you do that from the start? Would something as simple as tagging resources to specific teams/departments be a good place to start? I'm helping with setting up a completely new account/environment and I know once the buzzword fever passes and the annual costs are reviewed there will be frantic requests to cut down the bill. Getting ahead of that inevitability with a plan and some preparation will save us a headache.


Propagate tags to secondary resources, like EBS volumes. Separate out production from your dev environment-- preferably with a second AWS account with consolidated billing. Buy in to something like Terraform or CloudFormation for building the non-instance parts of your stack.

As you grow, start adding project tags to things. Automatically tag resources with the IAM user who creates them. Have an "untagged resources report" that goes out on a schedule.

Build with an eye towards "when someone freaks out in a year about the bill, what questions will they have, and how can I best position Future Me to answer them?"
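The "untagged resources report" idea can be sketched in a few lines. This is an illustration only: the required tag keys and the resource IDs are hypothetical, and the inventory is a plain list of dicts rather than the output of any real cloud API.

```python
from typing import Dict, List

# Hypothetical tag policy; pick keys that match your own billing questions.
REQUIRED_TAGS = ("team", "project", "created_by")


def untagged_report(resources: List[Dict]) -> Dict[str, List[str]]:
    """Map each resource id to the required tags it is missing.

    `resources` is a plain list of dicts such as
    {"id": "vol-1", "tags": {"team": "web"}}. In practice it would be
    built from your provider's inventory data.
    """
    report = {}
    for res in resources:
        tags = res.get("tags") or {}
        missing = [key for key in REQUIRED_TAGS if not tags.get(key)]
        if missing:
            report[res["id"]] = missing
    return report


# Hypothetical inventory: one fully tagged instance, one partially
# tagged volume, one instance with no tags at all.
inventory = [
    {"id": "i-1", "tags": {"team": "web", "project": "shop", "created_by": "alice"}},
    {"id": "vol-1", "tags": {"team": "web"}},
    {"id": "i-2"},
]
report = untagged_report(inventory)
```

Running something like this on a schedule and mailing out the result is one way to keep the cleanup question answerable before the bill becomes a surprise.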


Running / monthly cloud cost and comparison to other options (e.g. run on a rented dedicated machine / co-location) should already have been looked at when you do even the most basic business plan.

Most of the time, the best approach is to create solutions that can scale out independently of the HW / infrastructure they run on. With frameworks like K8s / Docker / etc. this is now relatively easy.

With that, for example, you can start off with a few small cloud machines / Docker containers. When you get continuous load, migrate some of them to 1-2 dedicated machines. Same with peak load: just spin up more images.

All that of course depends on what kind of application you are providing. If you are running a data-intensive app you might start directly with a dedicated machine.

Easy in, expensive out. One cost with cloud services many forget is the cost to get out again. Have a look at the AWS / Azure price-lists.


I used to do data-center and virtualization consulting. We always designed our enterprise clients' systems to handle the base load in-house and burst to public cloud. Not rocket science. We even automatically live-migrated load from, and shut down, onsite hypervisors at night when the system had surplus capacity...

I think the real issue is that people assume all workloads are public cloud workloads. The bigger and less dynamic your workload the less that is true.


This is a great approach. A company I worked at was running mostly on dedicated hardware, and due to the pain of moving jobs around and getting new hardware, we focused hard on optimizing our code to run within our existing capacity whenever we ran into problems. After a couple years of customer growth, I'm sure we would have been running on at least 10x as many machines if we had never done that. We then set up cloud spot instances for when we just needed extra short-term capacity.


> I'm sure we would have been running on at least 10x as many machines

Seeing the opposite trend.

Because your own machines are so expensive and annoying to manage, you only get 500GB-memory, 40-core servers. You don't optimize and you run stuff randomly on whatever seems available.

Whereas in the cloud, you put VMs per role, with appropriate sizing. And when someone asks for 5 machines with 16 cores, you can be like, WTF are you running with all that power???


I would believe that, but we were hitting the limits of our huge servers. There's only so much QPS you can throw at a single box. These servers were also "pets" as opposed to "cattle", and a big part of the prep for cloud was treating them as cattle.


anon account for obvious reasons.

Came here to share my story, which is very close to what you describe.

tl;dr: Moved from AWS to colo in desperate attempt to make balance sheet more attractive. It worked.

Before I took over the engineering bits at this plateauing/failing startup, they were all in on AWS. Monthly costs were running at around £3k. Which is not much, but since the revenue was tiny, the cost of hosting was "considerable". After a monumental effort on my part to try to make things more efficient and streamlined, I eventually ran out of motivation (but that's another story), so I expressed my wish to leave to the founder, and he finally agreed to my years-old suggestion of going lean-and-mean in one last attempt to make the company more attractive for one last investment round (E by then). Make it or break it, as the saying goes.

On that same week I came across a HPE "Buy One Get One Free" offer (similar to this one https://www.serversdirect.co.uk/pdf/BOGOF-Gen9-Servers.pdf). Then I found a guy who was re-selling colo space at an under-used datacentre in London. Since I (sysadmin/DevOps/InfoSec/Backend engineer, Jack-of-all-trades, master of none) was the person who would end up managing the new metal, I picked a location within an hour's drive from home.

And so it happened.

2x 1U HPE ProLiant servers with decent CPU and 32G of RAM. And 6x Samsung 850 SSDs off Amazon. Total bill for the hardware (BOGOF refund included): £2600 ish.

Post migration, monthly hosting costs: £50 for the colo; £50-ish AWS S3+CloudFront (10M assets would increase the SSD storage costs too much); £40 in taxi fares when I visit the DC once a month for kernel updates.

And just like that, we broke even the very next month after the last "big" AWS bill. Our product wasn't all that exciting, but 2 months later we found a buyer who was happy to snatch this "non loss making" operation.

3 months after the acquisition, they had already migrated the whole thing back to AWS and the HPE ProLiant servers were gathering dust in their office.


Of course, because how do you scale up when you're building everything yourself? Suddenly you have to hire a whole bunch of hardware guys, storage guys, db guys etc, etc.


Why do you need a DB guy if you host it on your server or AWS's? You have a DB either way. You can have a guy if you want or not... Physical location means nothing.


DB guy to tune the instance the database runs on + figure out your streaming backups etc. It's not a full time job, but it's a couple of days a week; and completely taken care of if you use RDS


I think your estimate is off by a factor of about 40. RDS is great at what it does, but it doesn't do that much. Getting the stuff RDS gives you for "free" might take a few days to set up, but after that, your automated backups are going to take just as much of your time as the ones RDS gives you, i.e., none.

If RDS saves you 30 minutes a week I'd be surprised.


A couple of days a week? That's ridiculous, once you have established procedures the only thing left is regular OS maintenance tasks which are the same as the rest of your servers.


Indeed. RDS saves like 1 day setup. No ongoing difference.


A lot of people are in denial that there isn't some magic efficiency to cloud services that other datacenters don't have. The primary cost savings on cloud VMs come from overprovisioning. The more abstracted away the service is from the hardware, the more they can overprovision without customers noticing.

The 50%+ profit margins have to be coming from somewhere. AWS is not made of magic, it's made from largely the same PC parts you buy on Newegg.


This is just factually incorrect. AWS hard partitions all instance types, except T2 (which are overcommitted, and clearly advertised as such: "burstable").

So, if you provision an x1.32xlarge with 128 vCPU and 1.92TB of RAM, you get a single, dedicated host with that much CPU and memory. Nobody else gets it - it's dedicated to you 100%.

The profit AWS is making is purely due to datacenter efficiency and being able to automate their operations at scale.


>The profit AWS is making is purely due to datacenter efficiency and being able to automate their operations at scale.

That's what they want you to think, but a lot of it comes in the form of ripping you off for bandwidth to actually get data out of EC2.


I've been with AWS for about 10 years now on a number of projects. Bandwidth pricing has always been my major gripe. Almost everything else I consider fairly priced for what I get and how hands-off I get to be.


The instant you wrote "128vCPU" you undermined your own argument.

What is a vCPU?


It's clearly defined in the docs. The machine also has a physical core count that is yours.


There is nothing stopping Amazon from having a machine that is double that config, and renting you half of it, however.

As soon as 32GB DIMMs drop enough in price, I would expect them to do exactly that. You can configure a Dell R930 with E7 Haswell CPUs online, if you want to double-check it. Not sure if the R930 can go to 8 CPUs (which the E7 Haswell CPUs support).


But as soon as they double the size of the machine there is no reason for them to not have an instance type that is the size of the machine minus the host layer.


> The 50%+ profit margins

Charge more than what it costs you. That's how to make money.


Well, to be honest, it's because it sounds cool, it's in the news, and so how can such proposals be wrong? Seriously, the number of times I heard "the cloud" bandied about by people who have continuously failed or were "thwarted" by people who knew better is getting silly. I am sure it happens in other places, but too many think it makes them look smart to suggest it.

If anything, it's a diversion from fixing what is broken, or even admitting to it, within some organizations.


The problem is not AWS. The problem is poorly architected systems and software. We see a similar issue with the "we switched from PHP to Haskell on .NET in Azure and improved performance 3000x" -- the issue wasn't the fault of the language, it was the way it was used.

You put bald tires that are over-inflated on a Porsche and you're gonna have a shitty experience. Sure, bare-metal is gonna have performance benefits over AWS and at times might be cheaper -- but that isn't the only thing an eng organization is looking for.

tl;dr – you can build systems on AWS that perform well and are reasonably priced.


> You put bald tires that are over-inflated on a Porsche and you're gonna have a shitty experience.

This is off topic, but bald tires have better traction than grooved ones in dry conditions. (But the slightest bit of rain and they will not hold.)

Also tire inflation does not have a strong correlation with traction (there is some, but it's not large), so it's not so obvious that over-inflated means a bad experience.


Bald tires also mean that the tire compound has degraded. So it is going to have worse traction. Slick tires on the other hand are new without any grooves.


If you have a more or less fixed workload then AWS ends up being a lot more expensive. If you have a highly variable workload and you have a bunch of servers sitting idle most of the time then it becomes more competitive. AWS lets a small startup go from one server to 10K servers without breaking a sweat, at least in theory, but it's not for everyone. At the same time not everyone wants to build DCs with thousands of servers in different geographies all on their own.

But yeah, AWS generally is expensive, which is part of why it's making so much money.

It definitely makes business sense to try to avoid locking yourself in to AWS but at the same time it's a good place to ramp up or test your business ideas without needing to invest a lot of money in servers, real estate etc.

[You can probably substitute any other cloud provider for AWS here]


The variable workload has been a little less important with reserved instances. They used to have a "light/heavy utilization" pricing structure but now it's basically cheaper to reserve the instance than it is to scale it up and down, unless you need the extra capacity less than 60% of the time.


That's going between 1 and 0 servers for an instance. Real scaling needs are met in a setup going from 3-50+ servers throughout the week.


I think you misunderstand me. Doing 100 servers for 16 hours and scaling down to 30 servers for 8 hours a day is more expensive than running 100 reserved instances for 24 hours a day. (100 reserved instances == average of 60 on demand instances).
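The comparison above can be checked with simple instance-hour arithmetic. A sketch, assuming a reserved instance costs 60% of the on-demand price for the same hours (the ratio implied by "100 reserved instances == average of 60 on demand instances"); actual discounts vary by instance type and term.

```python
# Costs are expressed in "on-demand instance-hours" per day.
# The 60% ratio is an assumption taken from the comment, not an AWS quote.
RESERVED_COST_PERCENT = 60


def on_demand_daily(schedule):
    """schedule: list of (instances, hours) pairs covering one day."""
    return sum(instances * hours for instances, hours in schedule)


def reserved_daily(instances):
    """Reserved instances are paid for all 24 hours, at the discounted rate."""
    return instances * 24 * RESERVED_COST_PERCENT / 100


# 100 servers for 16 hours, then 30 servers for 8 hours, all on demand.
scaled = on_demand_daily([(100, 16), (30, 8)])
# 100 reserved servers left running around the clock.
reserved = reserved_daily(100)

print(f"scaled on-demand: {scaled} instance-hours/day")
print(f"always-on reserved: {reserved:.0f} instance-hours/day")
```

Under these assumptions the scaled fleet averages about 77 on-demand instances, which is more than the 60-instance equivalent of the always-on reserved fleet.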


The company I work for is backed by a bunch of data in a MySQL database, scripts to aggregate and process more data into that database, and simple internal webapp to both manage inventory, track tasks and run lots of reporting on the aggregated data.

We made do for quite a while on a single colocated database with 32 GB of RAM and a few fast SSDs (for the time, which was about 6 years ago), and a few multi-TB archival drives, but disk space is a concern since our compressed InnoDB tables are well over 100GB now. I set up a Google Cloud SQL instance (the largest possible at the time I believe, 104GB RAM, and enough disks to hit peak IOPS) a few months back in preparation for migrating the data store to it, and found it wholly inadequate for our needs prior to even getting it going.

First, the second generation instances don't support being a replication slave to a non Cloud SQL instance, making migration a pain. Second, just running through the replication logs on a client and piping it to a connection to the Cloud SQL instance showed that the Cloud SQL instance was just barely able to keep up with the insert and update traffic from the replication log, even while handling no other queries. Catching up to the master would take over a week, even though the backup was from less than 48 hours prior.

We signed up for an actual hardware server from Rackspace at about 2/3 the cost, but with 128 GB of RAM (which I can request expanded), a nice hardware RAID 10 of fast SSD drives, and tons of logical CPUs. It's appropriately faster compared to the original database server, to the degree you would expect from the stats quoted. It fits our needs, because really I just don't want to manage the hardware anymore (thus the decision not to deploy more hardware in the local datacenter we are colocated at), and that's covered.

On demand spin up/down of resources and per-minute billing is nice, and I've used it to good effect for non-database resources before, so I understand the versatility of it and how it can really enable some new types of usage and am a fan of that aspect, but I just wasn't prepared for how substandard the DB offerings are if you actually have any sort of load. At this point, the only good thing I have to say about the higher end database offerings is that they were quick to destroy so I didn't get billed anymore. :/


May I make a suggestion: Try Amazon Aurora. You'll get 500,000 read operations per second, and 100,000 writes per second. It supports up to 64TB of total capacity, and 244GB RAM on your primary master. You can also have up to 15 read replicas with a similar footprint. You should seriously try it. It's amazing.


Maybe on the next upgrade iteration. As it is now, the server is performing well, and it can scale to 1.5 TB of RAM, so that aspect has quite a bit of room.


Quick question. Which version of MySQL were you running? 5.6 is single thread on the slave and falls behind more easily. 5.7 gives you multiple threads and does a better job of keeping up.

Disclosure: I work on Google Cloud and Cloud SQL


Whichever was the highest version offered in the second generation Google Cloud SQL offering as of November/December 2016. If 5.7 was available, I would have chosen that. I don't specifically recall though.


You shouldn't be on AWS in the first place if everything you do can fit on a single server.

Use the right tool for the job.


I disagree. Depends on the service's business model obviously, but there are services where the convenience will more than offset the extra costs.

Once the company grows big enough, yes, then it might make sense to go off AWS (or not - see: Netflix).


Remember that Netflix runs their core product (content distribution) internally on their own CDN. All of the supporting technologies (billing, content discovery, etc.) are on AWS, but the core product is not.


This was a pretty recent change. Netflix has classically relied on third-party CDN providers. They launched the streaming service in 2007, and it took them 9 years to move 100% of their traffic to their own CDN infrastructure. That said, they still have one of the largest AWS footprints.


The core products are the shows, not the tech.


Without the tech, no one sees the core products.


Without the shows, the tech has nothing to show. Who's gonna pay for nothing?


Those with the shows didn't build the tech, but Netflix did.


And now Netflix is even building shows.


I've always found it interesting that Netflix has not at least tried to go off AWS, if only because Amazon has a competing service with Amazon Video. I'm sure it's not an easy problem to solve, but Dropbox has gone off AWS as well; it seems like they would be much better served with their own "Video Cloud" with specialized hardware for streaming/processing videos.


Netflix realizes that the value presented by AWS far outweighs the costs. If you feel differently, you should probably ask yourself what information you are missing, rather than just dismissing Netflix and their decision. It's clear that Netflix knows more about operating applications with tens of millions of simultaneous global users than most companies...


>If you feel differently, you should probably ask yourself what information you are missing

That's something that should be applied almost universally. :)


I don't think Netflix and Amazon really compete yet. Very few people see the two services as either/or. They're more allies against the cable companies at the moment.


Netflix doesn't use AWS for streaming video.

https://openconnect.netflix.com/en/


You make a good point that Netflix has a pretty sophisticated system for delivering video (not just served from AWS CloudFront). However, IIUC they still use AWS for preparing video files for their CDN.


Dropbox is an example where their core business is very data-intensive and AWS costs can be considerable.

I'm thinking of services, say the automated robo-built sandwiches, or the Easy 401 service, or the luxury shoes startup (all real YC-accepted companies), where the core business proposition isn't necessarily dependent on having a huge infrastructure.


Netflix doesn't pay retail prices for AWS.


I have services that do not fit on one server and still I don't need AWS. Distributing load/dividing services on x-xx servers is not rocket science, especially with tools we have available at the moment.


How does one begin to learn about these things? Minimizing cost of running services sounds super interesting but as a student I've never had to deal with it and am basically starting with 0 knowledge.


I worked through the labs at https://pdos.csail.mit.edu/6.824/ for fun. It's more along the lines of "How can we write a distributed fault-tolerant database?" but you might like it anyway.

Lab 4 is a beast with more lives than <insert metaphor>. The moment you think you've finally written your distributed system correctly, the unit tests will prove your service fails during XYZ network partition topologies. It's very worthwhile to be forced to think about issues like that and to design distributed systems for correctness.

But to address your question more directly, it's generally just a matter of scaling your service as much as possible on a single server. The server has finite resources (CPU, Memory, network, disk) so you either know how your system consumes those resources (because e.g. you wrote the service, and you know it uses O(n) memory w.r.t. the workload) or you graph your usage over time and try to predict when you'll exhaust your available resources. At that point you can usually think of some straightforward optimizations, which keeps everything on a single server. But eventually you might run out of resources with no obvious path to optimize it, so what then?
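The "graph your usage and predict when you'll run out" step can be sketched as a straight-line fit. This is a minimal illustration that assumes usage grows roughly linearly; the sample numbers are made up, and real workloads often need something less naive.

```python
def days_until_exhausted(samples, capacity):
    """Estimate days left before usage reaches capacity.

    samples: list of (day, usage) pairs. A least-squares line is fitted
    to them. Returns the number of days after the last sample at which
    the line crosses `capacity`, or None if the trend is flat or falling
    (or there are not enough distinct days to fit a line).
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    if sxx == 0:
        return None
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    last_day = max(x for x, _ in samples)
    return max(0.0, day_full - last_day)


# Made-up example: memory use in GB on days 0..3, on a 32 GB machine.
remaining = days_until_exhausted([(0, 10), (1, 12), (2, 14), (3, 16)], 32)
```

If the answer comes back as weeks rather than months, that is the cue to start looking for the straightforward optimizations.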

It depends entirely on your service, but typically you can just use off-the-shelf software to scale to multiple servers. For example you could set up three servers, each running Redis, and have Redis keep a list of "work to be done." Then your central process just farms out the workload to each of the three servers round-robin style, for example.
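The round-robin farming-out described above can be sketched as follows. To keep the example self-contained, three in-memory queues stand in for the three Redis lists; the function name and job values are hypothetical.

```python
from collections import deque
from itertools import cycle


def dispatch_round_robin(jobs, queues):
    """Append each job to the next worker queue in turn."""
    if not queues:
        raise ValueError("need at least one worker queue")
    targets = cycle(queues)
    for job in jobs:
        next(targets).append(job)


# Three in-memory queues stand in for the "work to be done" lists
# that would live on the three Redis servers.
worker_queues = [deque() for _ in range(3)]
dispatch_round_robin(range(7), worker_queues)

for index, queue in enumerate(worker_queues):
    print(f"worker {index}: {list(queue)}")
```

With real Redis, the `append` would become a list push such as `LPUSH` against each server, and each worker would take jobs off its own list with a blocking pop such as `BRPOP`.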

But at that point your service becomes a lot more brittle, e.g. you'll need to set up a failover solution so that your cluster can survive partitions and outages. (Redis uses Sentinel for that.) So it's worth keeping everything on a single server for as long as possible, if you can work out the optimizations to do so, since it's so much simpler with only one server. (HN is still running on a single core and serving 350k daily uniques, I believe, which shows just how effective it can be to keep your architecture as simple as possible.)


Wrt the first link, I'm going to be taking a distributed systems course next semester that (hopefully) covers the same material. Nice to know what I learn at school is somewhat applicable.

My question is about learning design ideas like having Redis keep a list of "work to be done" and such. Using modern tools to combat 'modern' problems. Is it just something you figure out after knowing the fundamentals (i.e. when you can identify a problem you know what the solution needs to be)?


Yeah, pretty much. I've never actually done it, but I know I could do it if I needed to.

If it sounds mysterious, think of it this way: Imagine you were thrown into a room with a computer, internet, and endless food, and the only way out of the room was to solve this problem. I bet you'd figure it out within a week or two, or a couple months max. (If only to go have a shower.)

One thing to watch out for: Solving modern problems can be pretty unsatisfying. Before you experience what life is like at a modern company, you tend to think of them as functional, orderly, and planned. Real companies are almost universally dysfunctional, disorderly, and haphazard. It's very rare that a plan is conceived, followed, and deployed without growing some warts and throwing in some hacks to get it done.

So I think you should enjoy this time, where you're free to think of thought-problems like "What's the most correct and extensible way to solve this problem?" instead of being forced to solve them on a time crunch.


Thanks for the link to 6.824! It looks amazing and I can't wait to start working through the labs.


A million dollars buys a shitload of fast 1U dual socket xeon servers each with 512GB of RAM.


-looks at this datacenter-

Well, actually it doesn't.

RAM and Xeons are expensive.


So how many does it buy?


I would estimate the cost for three years like this:

- 1U single socket Supermicro Xeon E5v4-2630 10core+1GTX1070 GPU = 1500EUR

- operation costs (power+cooling) over 3 years = 1500EUR

- cost of 2 hardware admins 160k/year

- renting local rack space 15EUR/U

This equals about 145 servers for 3 years for us for 1 Million. The setup is highly optimized for _one_ set of data crunching tasks, not DB. We do not need much ram, or network but require a specific CPU/GPU ratio.
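The estimate above can be reproduced roughly as follows. Two readings are assumptions rather than anything stated in the comment: that the 160k/year covers both admins combined, and that the 15 EUR/U rack rent is per month with one rack unit per server.

```python
BUDGET_EUR = 1_000_000
YEARS = 3

ADMIN_COST_PER_YEAR = 160_000      # assumed: both admins combined
HARDWARE_PER_SERVER = 1_500        # 1U server with GPU
POWER_COOLING_PER_SERVER = 1_500   # over the full 3 years
RACK_RENT_PER_U_MONTH = 15         # assumed: monthly, 1U per server


def servers_affordable(budget=BUDGET_EUR):
    """How many servers fit in the budget over YEARS years."""
    fixed = ADMIN_COST_PER_YEAR * YEARS
    per_server = (
        HARDWARE_PER_SERVER
        + POWER_COOLING_PER_SERVER
        + RACK_RENT_PER_U_MONTH * 12 * YEARS
    )
    return max(0, (budget - fixed) // per_server)


print(servers_affordable())
```

Under these assumptions the staffing line takes 480k of the million, and each server costs 3,540 EUR over three years, which lands close to the "about 145" quoted.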


>cost of 2 hardware admins 160k/year

This is by far the most expensive item in your estimate, and I'm sure that you don't suddenly need to hire another pair of admins when you buy a single Xeon blade.

Once you factor this fact in your estimates, you'll realise that the unit cost of each server is, unlike your estimate, negligible.


Let's say 1 Rack of 42 * 1U servers.

That's about the upfront costs of servers ONLY.

Ignoring cooling, all personnel costs, electricity, networking gear, network storage and future maintenance.


Approximately 4?


There are companies that will lease you a dedicated server with 40 cores and 256gb RAM for $499/month.

Figure a 10% quantity discount and you are looking at at least 200 of those for about $1 million per year.


Just because a company is ripping you off, it doesn't mean that's representative of the real market.

For 500€ a month, it's possible to buy your own 40 core blade with 256gb of RAM. In fact, it's possible to buy a pair of those servers.


Hetzner will rent you 24 cores (48 'cloud' cores), 1024GB RAM, and 4TB SSD for less than $499/month.


> simply not using AWS is a competitive advantage.

The way you wrote your comment about using a dedicated server implies that there are no other viable cloud providers besides AWS. That is the problem, people do not know or do not acknowledge the existence of a number of other cloud/VPS providers that, in most cases, would work perfectly fine for their startup, and be a massively better value.

Such as Digital Ocean, Linode, or many small relatively less well-known providers that are perfectly fine like UltraVPS etc.

Sure, some few people do actually need DynamoDB or some other ultra-scaling AWS service, but people who just need VPSs can get a much better deal elsewhere.


Totally. If the growth rate is reasonable (<50% y/y) and you haven't really grown beyond what can be done on a single machine, using AWS or GCE or Azure is a ludicrous decision, short of a "serverless" style architecture. It's pretty easy to get sucked into the "best practices" and overengineer something that costs a lot of money.


Exactly this. I am using dedicated servers everywhere; I would easily pay 10-20x more for the same if I used AWS instead.


Well, it seems obvious that if you want scalable, redundant, no downtime infrastructure, it will cost more than non-redundant, non-scalable infrastructure.

While it was obviously a good trade off for this company, it seems silly to then extrapolate that all companies would benefit by getting rid of redundancy and scalability.


IMHO, AWS is fabulous when your needs are a fast moving target. When you have something stable and have time to optimize for performance and cost, dedicated servers are better (cheaper and/or faster). There is no silver bullet.


I guess you could go even lower by not renting, but buying a server instead.


Not necessarily true. We run a decent amount of gear with OVH, and because of their scale and bulk discounts from OEMs, our TCO is actually lower than it otherwise would be. I'm sure this isn't true at much larger scale, but I would bet it's true for most people.


You are absolutely correct, and I have seen this sort of thing time and again... "cloud" doesn't mean that the server is magically faster - it usually means "a great deal slower" because it is a shared resource.


So, how about backup, availability and a potential additional employee to make sure the server runs fine?

If they can indeed fit their footprint in one single server, then it should be feasible to just host the server at someone's home with enough upload speed, no?


However they miss the easiest fix: Calling their account rep at AWS and cutting a deal.

AWS loves startups that could end up being huge customers so they're willing to slash bills upfront to help you get to growth stage; not only will they assign you an account rep but they'll have a rep whose job it is to build a good relationship with your VC. Speak to your account rep. Have your VC speak to their Amazon rep. Push them to cut your prices.

I've known startups which have cut their monthly bill by six figures going this route. AWS doesn't want your startup to fail and on the margin these resources cost them close to zero so they're willing to deal, make use of it.


That was definitely part of the effort; it just doesn't make for terribly compelling technical content. We have a healthy, active relationship with our AWS team. They worked with us to put in place minimum commits for several products, and examined both EDP as well as bulk RI purchase discounts.


I'd say especially in your case you could be pushing for additional discounts beyond what's available to everyone else.

You're the ideal case for this, not only will you be a large customer in your own right if you succeed but you have a multiplier effect in that your own clients become heavier users of AWS because of your product encouraging users to move large amounts of data from Salesforce, etc. into AWS.

I don't know what level the rep you're dealing with is, but if you're not doing so already I'd strongly encourage you to build higher level relationships with AWS.


Some of the minimum commits aren't exactly standard. I'd talk more about them but they're covered by non-disclosure. I see what you mean by the multiplier effect increasing our bargaining position — totally makes sense.


Gross. AWS is becoming the Cisco of cloud computing. Get people hooked when they are low revenue and twist the screws in later when they get big.


In what way is this gross?

They're giving people what is essentially a loan, which they are uniquely able to give, to help them start a business, in the hopes that this will be an even bigger business. This is almost the definition of a win-win.

Customers are free to leave at any time or work with other providers, but they choose AWS because they give them more value.


It's not a loan. They don't require any repayment. It's a discount.

>This is almost the definition of a win-win.

Not long-term for the company that gets lured into this trap. Just like with Cisco/Oracle, companies will waste millions of dollars once they are cornered into a situation like this because the switching costs are so high even though there are plenty of other alternatives.


Wouldn't it irritate AWS if startups left once they got big, despite receiving special care services from AWS? How do they deal with that?


> How do they deal with that?

By pricing it in to the discounts in the first place. So smaller discounts. And by making sure their offerings are priced competitively to comparable products.


It's open market, they can do nothing.


It's customer obsession, one of the leadership principles of Amazon.


Why not both? Get the best deal and turn off unused servers.


The Dynamo incident highlights an important lesson when using consistently hashed distributed data stores: make sure the actual distribution of hash keys mirrors the expected distribution. (though to their credit, someone writing an automated test using a hard-coded key was beyond their control).

Incidents like this are generally why rate limits exist, which they don't currently have [0], but perhaps they'll consider putting a burst limiter in place to dissuade automated tests but not organic human load spikes.

Unfortunately there doesn't seem to be an easy way to fix the per-user ID write bottleneck, short of adding a rate limit to the API – which would push backpressure from Dynamo to the Segment API consumer. Round-robin partitioning of values would fix the write bottleneck, but has heavy read costs because you have to query all partitions. They undoubtedly performed such analysis and found that it didn't fit their desired tradeoffs :)
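
For what it's worth, the round-robin idea (often called write sharding) can be sketched roughly like this. It uses a plain in-memory dict as a stand-in for the store; `ShardedStore`, `NUM_SHARDS`, and the `user_id#shard` key format are made up for illustration and are not Segment's code or the DynamoDB API.

```python
import itertools
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical; more shards spread writes further but cost more reads


class ShardedStore:
    """Toy stand-in for a partitioned key-value store (not a real DynamoDB client)."""

    def __init__(self, num_shards=NUM_SHARDS):
        self.num_shards = num_shards
        self.partitions = defaultdict(list)  # physical key -> list of values
        self._next = itertools.count()

    def write(self, user_id, value):
        # Round-robin suffix: consecutive writes for one user land on different keys.
        shard = next(self._next) % self.num_shards
        self.partitions[f"{user_id}#{shard}"].append(value)

    def read(self, user_id):
        # The read cost: every shard has to be queried and the results merged.
        values = []
        for shard in range(self.num_shards):
            values.extend(self.partitions.get(f"{user_id}#{shard}", []))
        return values
```

Writes for a single hot user ID now spread over `NUM_SHARDS` physical keys, but each read fans out to all of them and no longer returns values in arrival order, which is the tradeoff described above.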

Great post, very informative. Thanks for sharing! Also, love the slight irony of loading AWS log data into an AWS product (Redshift) to find cost centers.

[0]: https://segment.com/docs/sources/server/http/#rate-limits


Thanks for the warm feedback!

We currently have an internal project underway to detect hot keys in our pipeline and collapse back-to-back writes before they're written out to DynamoDB.
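
As a rough illustration only (this is not Segment's actual pipeline), collapsing back-to-back writes to the same key within a batch could look something like the sketch below. The function names and the `(key, value)` tuple shape are hypothetical, and it assumes last-write-wins is acceptable for the data in question.

```python
def collapse_writes(events):
    """Collapse consecutive writes to the same key, keeping the last value of each run.

    `events` is an iterable of (key, value) pairs in arrival order.
    Assumes last-write-wins semantics, which may not hold for every workload.
    """
    collapsed = []
    for key, value in events:
        if collapsed and collapsed[-1][0] == key:
            collapsed[-1] = (key, value)  # overwrite the previous write to this key
        else:
            collapsed.append((key, value))
    return collapsed


def hot_keys(events, threshold):
    """Return the keys that appear at least `threshold` times in the batch."""
    counts = {}
    for key, _ in events:
        counts[key] = counts.get(key, 0) + 1
    return {key for key, n in counts.items() if n >= threshold}
```

Because this runs on batches after ingestion, the API can keep accepting everything and the deduplication happens asynchronously before the writes reach the database.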

It's difficult to apply throttling on these conditions synchronously within the ingestion API (i.e. return a 429 based on too many writes to one key) because of the flexibility of the product: that workload is perfectly acceptable for some downstream partners. It also gives me pause from a reliability perspective. We try to keep our ingestion endpoints as simple as possible to avoid outages, which for our product means data loss.


Ah, gotcha. Yeah, it makes sense to avoid synchronously turning away data as that does defeat the point of the product. And the cost of false positives (rejecting legitimate traffic) is high because the moment when a client is receiving lots of data is when it's most important for them to store it.

If you don't mind answering: for your warehouse offering, do you pull data from some services (e.g. Zendesk, SFDC), have them push it to you (which is what I interpreted your "downstream partners" comment to mean – though perhaps those are "upstream partners"), or a mix of both?


For downstream partners I mean data flows from user -> Segment -> partner. Event data is ingested through our API & SDKs, and this is fed to partners' APIs. For Cloud Sources, generally data is pulled from partners using their public APIs at an interval and pushed to customer warehouses in batches. In a few special cases partners push data to our APIs.


> the per-user ID write bottleneck

The basic approach is to have fewer but more powerful machines. It helps to handle the targeted bursts.

The advanced is to have a layer of queuing before the ingestion, where you can do magic with distribution rules, rate limiting and dropping peak traffic.
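
One common way to do the rate-limiting part of such a layer is a token bucket per key. The sketch below is a generic illustration, not anything from the article; `TokenBucket`, `KeyedLimiter`, and their parameters are names made up here.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, never exceeding the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class KeyedLimiter:
    """One bucket per key, so a single hot key cannot starve the others."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, key, now=None):
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)
```

In a queuing layer, a message that is not allowed does not have to be dropped; it can be left on the queue and retried later, which smooths the burst instead of losing data.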


Yep, you're absolutely right. In multitenant SaaS apps with extremely uneven distribution of traffic, it's pretty common for large customers to get their own dedicated DB servers.

> The advanced is to have a layer of queuing before the ingestion, where you can do magic with distribution rules, rate limiting and dropping peak traffic.

And batch loading – don't forget batching!


I didn't mean to have special databases, that is another level of optimization.

I mean to have bigger servers for everything. For instance, a farm of 4x 10-core servers is likely to process data more consistently than 10x 4-core servers.


I work in a start up, we own all our own hardware, and it is HELL.

We are forced to pay extremely large sums of money to upgrade our infrastructure as any purchase requires a redundant piece as well.

For example we have used 90% of our SAN's storage, our IO is suffering and now we're looking at purchasing two $10k SANs to upgrade. In the meantime, we have probably spent over 10k worth of development time to compress, clean up, and remove any data we don't need. We will not get the weeks of development effort back, we will continue to spend more money on hardware, and we will continue to suffer delayed projects because of poor choices.

If we had AWS, we could have written an entire sprint's worth of work (also ~$14k worth of IT time) and moved on with our lives.

Clearly, some companies let the budget keep growing, but for most companies AWS flexibility is worth its weight in gold.


Why is it always colo vs cloud in these arguments? What about renting everything monthly from OVH, Rackspace, SoftLayer etc? Many of them can provision dedicated servers within minutes, too.


Also some combination of cloud and private servers. If I was starting up a Netflix competitor for instance, the site could probably run fine from owned hardware, but the content itself would be better off being cloud based.


Ironically, Netflix does the exact opposite. Also, you definitely don't want to have the content on the cloud - the data transfer pricing is where they make significant profit. Both AWS and GCE are ~$100/TB outgoing.


AWS provides a SAN in the form of EBS.

Colocation provides a SAN in the form of calling up a datacenter storage company and having some consultants install you a SAN.

What are you going to do to get similar functionality on OVH, Rackspace, or Softlayer?


I have the opposite experience. We have a quite storage-heavy service (video delivery, multiple versions of each), and we recently had to upgrade our SAN to a) a newer system and b) more capacity.

As part of this process, we looked into AWS and similar products, and with a naive 1:1 move it would be 3 times more expensive than getting a hardware deal with ramp-up (make a deal for X, buy 20% * X upfront, then buy the rest as needed). Implementing some better scaling up and down might have decreased the cost to only a 2x multiplier.

Having your own hardware is not a walk in the park, and you need at least one experienced sysadmin to help with server setup, but I feel that for a steady company the cost benefit alone is worth it. We still use a CDN on top of our setup to alleviate pressure on our servers, but otherwise everything runs on our own hardware.


>For example we have used 90% of our SAN's storage, our IO is suffering and now we're looking at purchasing two $10k SANs to upgrade. In the meantime, we have probably spent over 10k worth of development time to compress, clean up, and remove any data we don't need.

I suspect you haven't priced out AWS storage (and the bandwidth to actually access the data with frequency). If the cost of a SAN was hurting your company and you had to look to reduce data usage, AWS will absolutely wipe you out.


And how much would AWS cost for your total storage, per month? Because if that's $5k then it sounds like you still picked the cheaper route...


Exactly. Neither path is easy to walk; it always comes down to trade-offs. What do you value?


I've been joking with friends that my next job will be AWS efficiency guru. I've somewhat optimized our own use, but I think I could use similar, simple rules to get 20% out of a 500k / month budget.

Give me what I save you in 2 months and I'll have a good business :)


Go do it!

I used that exact same model in Conversion Rate Optimization - get your conversion rate up, give me 30% of what we improve.

And built that into a 20+ person digital agency billing millions of dollars a year before being bought out.

Exactly how I did that, and you can too:

(1) Wrote topical, detail rich posts similar to the parent here about problems I was solving in CRO for a handful of customers, never disclosing confidential customer info.

(2) Marketed those posts strategically. EG I wrote one about "Which trust symbol gives you the highest return on conversion rate." and then literally just bought Google Adwords of people searching for that question! StackOverflow and other forums also are great ways to market by answering questions (free + put your details in contact info) or running ad campaigns specifically on those topics ($5k+).

(3) Turned the best performing / most viewed posts into "pitches" for speaking gigs at materially similar conferences, most were accepted and I became an "authority".

Every post / conference / etc had a little "Want us to fix it for you? Full service, performance fee model." banner or mention.

Work poured in after that and we were lucky enough to be very choosy.

If you can SAVE large enterprises money and are willing to do it on a performance basis you've got a business.


When you say "trust symbol" do you mean the "verisign" logos and similar? If so - do you have real data to prove these make a difference in conversion rate? I have clients that come to me parroting the same tips and I imagine they are all reading the same nearly identical blog posts out there making this claim.


This sounds super interesting! Where can we learn more about your story? Do you have an email or Twitter to contact you?


That's awesome, do you have more information about your acquisition?


This was several years ago before we had the "growth hackers" lexicon.

One of our customers bought the entire company to get a hold of the core team + essentially continued the "30%" deal as a long term incentive as convertible equity.

Was a good run, and reinforces my point that if you can bring measurable / substantial value to large enterprise companies amazing things can happen.

Large enterprises have fun accounting terms like "capitalizing an acquisition", eg they don't buy you out of cash flow. And can even carry debt on the purchase / incentive programs etc that not only make you more valuable to them, but create incentives for them to buy smaller companies.

Happy to answer any more specific questions.


What would you say is the most important, specific skill required in conversion rate optimization?

Is it copywriting? Is it UI/UX? Is it analytics?

Or would it be fairer to say CRO is holistic, requiring a multitude of skills?


Great question, by the end of the process we were experts @ holistic level.

But started with very basic web design skills at first.

From the start key "skill" for the pitch / and our clients was we had the BANDWIDTH to focus on implementing testing as a regular / rigorous process for them.

EG most - even big cos - wind up with their good web dev resources completely tied up with 100 other things.

And 10 departments competing for their attention.

I came in and said we can do this, full service and you just have to approve the tests and give us a dev environment.

But started with just basic web design skills / willingness to read the multitude of good resources / books / blogs / thought leaders on CRO and apply their ideas.

EG - Common CRO theme - "try different, high-contrast button colors on your call to action"

Ok, I can do that.

- Take existing page

- set up optimizely or VWO (at the time we used home brew or GoogleWO!)

- make some really great buttons (or outsource on elance for $10)

- get client buy in

- set up the test

- ensure production readiness / testing

- go live

- provide nice reporting format for client weekly that lets them stay involved and see results, and have confidence in your ability to execute.

- prepare next test while first one is running, and remember that a huge % of your best ideas will fail, be agnostic to results but be statistically honest and educate clients under same process / instruction

Simple example, but you can ramp the complexity up from there.

Like "better to have this "sign up now" button go to another page w/form or pop a modal window w/form?"

and on down the list, the CRO blogs / experts / etc have 1000 ideas.

And over time got better about understanding what did / didn't work across clients.

So my win rate crept up from say 20% of tests significant win to 50%+ out of gate.

And every person on my team I hired because we wanted to run a test that I / we couldn't implement ourselves :)

What's your specific interest / skill set that you're trying to adapt over?


I have an analytics background and have been applying for multiple CRO jobs. It's very frustrating the breadth of skills that people want the in-house CRO guy to have (e.g. I've had interview questions on JS frameworks, very advanced stats, non-trivial SQL, CMS systems, Photoshop, SEO, paid search, Salesforce ...) especially when I know a lot of whole agencies couldn't tick all those boxes.

Would love to start freelancing this stuff but it seems like it needs so much personal branding


I actually do this as a full time thing; I started a consultancy to fix horrifying AWS bills.

Something I've learned is that flat fee pricing makes the most sense-- while tempting, the other models are off-putting.

Hourly is a great way to starve to death, and "percentage of savings" grows difficult to quantify. "Okay, you just recommended the following reserved instance purchases. Is this really the best for us, or does it boost the number you're taking a percentage of?"

It's very easy to end up misaligned with your clients as you go down that path...


Put a maximum amount to the fee. Let's say 100k for the month I am here.


If you're volunteering as a test case, I'm game! :-)


Nope, I'm the consultant advising you how to charge more. Actually $500 per day + 10% of the savings with an upper limit of 100k is an easier sell.

My bills are so optimized you wouldn't make a penny :p


I'm pretty much that in my current job only with a salary. Here's why: I can make all the recommendations I want but change has to be driven by the will of the higher ups, often as high as C-level folks (CIO/CTO). So you pay me to make recommendations, not to actually save you the money, because the second part is largely out of my control.

Having said that, all the cost savings initiatives I've spearheaded are on my resume and LinkedIn profile and I take great satisfaction in optimizing those environments to save the client money.


The simplest one might be to convince a company to reserve 3 years' worth of AWS resources and pay upfront. I am in this situation right now, and it's a tough pill to swallow.

I decided that all of my personal projects will be GCE. It is much more cost efficient already and Google will soon allow me to commit to future usage and pay my commitment as I go (Right now AWS forces you to pay upfront to get the same discount (~50%))


With a couple of exceptions, 3 year RIs are a poor move.

You're locking in pricing, and opting out of both newer instance classes and future price reductions during that time period.

Generally, they're only useful for "that database we WILL NEVER MOVE," or if you're writing portions of your cloud spend as CapEx and want to amortize depreciation.


I agree on AWS it's a hard move. On GCE, however, you would be buying CPU and memory units, not machines, which to me is much more appealing. Even if the price drops 50% every year, I would break even.
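
A quick sanity check of that break-even claim, as a sketch. It assumes a ~50% committed-use discount (the figure mentioned upthread), a 3-year term, and that the on-demand price halves at the start of each new year; real pricing is more granular than this.

```python
def committed_cost(base_price_per_year, discount=0.5, years=3):
    """Total cost when the discounted price is locked in for the whole term."""
    return base_price_per_year * (1 - discount) * years


def on_demand_cost(base_price_per_year, yearly_drop=0.5, years=3):
    """Total cost when paying a list price that falls by `yearly_drop` each year."""
    return sum(base_price_per_year * (1 - yearly_drop) ** y for y in range(years))


if __name__ == "__main__":
    base = 100.0  # arbitrary units per year at today's on-demand price
    print("committed:", committed_cost(base))
    print("on demand:", on_demand_cost(base))
```

With these inputs the commitment costs 1.5 years of today's price in total, while on-demand costs 1 + 0.5 + 0.25 = 1.75 years' worth, so the commitment comes out slightly ahead even with prices halving every year.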


I don't think that's a bad proposition at all. If I were a business person running on AWS, I'd do it.


Or charge 20% of what you save them over the next year. This way you're charging more overall (especially if their costs are growing). Also your revenue will be more recurring rather than a one time thing. And by the time the 12 months is up, maybe they'll need your service again. :P


Also as a SaaS founder running on AWS, I would totally do this once our AWS bill is in the 4-5 figures.


It's almost like clockwork. Companies start wondering around $10K a month; they start doing something about it at $50K a month. I can almost set my watch by it.

This turns into a fun parlor trick when I can estimate a client's bill based upon the story they tell me!


We got pinged by our CEO to reduce our AWS bill which was $8k at the time. After a bit of work, we got it down by a little over a grand. One lunch, he said "guys, what are you doing about that bill?"... "What, we got it down by over a grand!"... "Yeah, but the exchange rate has gone the other way..."

Gotta love the Australian dollar. The Australian economy is solid - about to set a world record for longest continuing period without recession, including the GFC years - but the AUD swings around like a mad animal.


The savings are too miserable to pay a consultant.

Remember that he has to charge at least $1000 a day. He's more expensive than your entire bill.


Right. Surprisingly, the comment you're responding to resembles some of my clients. This is a business problem.

In this case it's not "save us a few grand" that they're asking for-- the real ask is "help us analyze and forecast our spend as we continue to grow." Identify the knobs and dials that impact infrastructure costs, devise a costing model that states "each additional user costs $X to service," identify what makes sense to arbitrage between AWS and other providers / on-prem...

If I'm being honest, "lower the bill" is where it starts, but not nearly where it ends. :-)


Agreed, consulting is more than just lowering the bill once.

That doesn't change that they don't have the funds to pay the fees though.


2 months? That's cheap. Make it a year.


Like many others, I'm interested, do you have a preferred way to contact you?


Hi, drop me a line at tom@wearewizards.io :)


If I have to give that to you it's not "savings"... :D


but you do save money every month after that!


Remember LowerMyBills.com. You will LowerMySaasBills.com


I've always been taken aback by how much AWS servers cost. Maybe I'm just too cheap but you can go to very reputable hosting companies and get things at fractions of the price.

For example, if I need 16GB of memory and 4 cores here are my options:

    * AWS (t2.xlarge) $137.62
    * OVH (EG-16)     $79
    * So You Start/OVH (SYS-IP-1S) $42
    * Kimsufi/OVH (KS-3B) $29
    * Hetzner (EX41) €46.41 [Lowest cost model is 32GB of ram]
    * Joe's Data Center (No model name) $45 [Dual-socket L5420, which is 8 cores]

That's a crazy price difference and, unlike what I understand about AWS, you don't need to pay for bandwidth with most of these companies, or they have some crazy high limit. IIRC it's common for 20TB to be the standard (FREE) bandwidth limit. You're also on real hardware, not a VM.

Unfortunately not all of these are perfect systems. There are issues with them. Some have slower CPUs, some have slower disks, some are in other countries but you can just pick the ones that suit your need for whatever you have to run. You can also use VMs that are far cheaper than these dedicated systems. If your workload is under 50% duty cycle on a CPU and under 4GB of memory you don't need a dedicated server. Buy a VM from one of these companies:

    * RamNode (8GB SVZS) $28 [Fully SSD storage, very fast network]
    * OVH (VPS CLOUD RAM 2) $22.39 [12GB(!) of ram and 2 vCores]
    * Linode (Linode 4GB) $20

These are all VPSs but they will get the job done and are cheap enough that you can build something great on a budget.

This is just from a few minutes of remembering when I had to buy my own hardware. I'm sure this isn't an exhaustive list. You can usually find information on the Web Hosting Talk forums and on a site called LowEndBox. I don't have links on hand but they're worth a read.


Most of those providers you mentioned aren't as reliable and scalable as AWS, Google Cloud, Azure, etc. That isn't an apples to apples comparison.

I would not want to host my business on So You Start, Kimsufi, Hetzner, and especially not Joe's Data Center. I have personally used Joes DC and they have had numerous outages in the past. Hetzner is known for terminating you for any sort of "DOS" like traffic, including high PPS. To add to that, those providers (except maybe OVH?) don't let you scale up and down with automated APIs.


Reliability-wise, do you actually use these services? In my experience that's a complete myth.

One of my clients on a dedicated server has never gone down. The site is blazingly fast, barely touches 5% CPU and pages have sub 50ms response times. Deploys take 10 seconds or so, I could make it faster but it's not really worth the cost/benefit.

My client on Azure, with a significantly lower visitor count, pays 3-10 times as much, the site is sluggish, takes a long time to spin up after deploys, hangs occasionally, we've caught the whole site being offline a couple of times then it mysteriously starts working again with nothing in the logs, had a deploy to one site take down other sites and on top of that there's a 3-5ms delay between the database and website which causes all sorts of performance problems when a page makes too many DB requests.

They've had to "scale up" to premium database in the past to handle loads I know a much cheaper dedicated server would have handled without even thinking about it.

On top of that, Azure's management portal is super slow, regularly fails to execute commands and is incredibly frustrating to navigate and use.

The claimed machines you get on cloud aren't anything like as performant as supposedly similar machines on dedicated.

I'll admit I've generally found AWS to be significantly better than Azure, but still very expensive.

And AWS went down a couple of weeks ago too.


> on top of that there's a 3-5ms delay between the database and website which causes all sorts of performance problems

Yeah, that's something I don't get either. With all of that IPC going over Amazon's network for all of their services, how much time are programs wasting sending and waiting for messages from other Amazon services?

I never really thought about that. If someone could measure this it would be interesting.


I can't comment on Azure, but most of the complaints about networking issues in AWS went away when they redesigned their networks around an SDN model with VPC. We see consistent <1ms RTTs within region, and <300µs within availability zone (datacenter).


I've got a reply from Google Compute before after complaining about how bad Azure was, saying they'd had similar problems but solved the latency between servers and databases. It basically just sounds like the Azure guys have severely mucked up their infrastructure.

We have a test page that runs 100 "SELECT 1" SQL queries ON THE SAME CONNECTION between an Azure website and an Azure database and it takes a whopping 250ms to complete at best. That's literally all it does. And their infrastructure is so bad that it'll vary from 250ms to 600ms in the space of 20 seconds in peak periods. That's damning, cloud infrastructures are there to part you from your cash, nothing more.

It's ridiculously bad, in my opinion no professional should EVER recommend Azure. My client is enamoured with telling people the system runs on Azure, with its "secure" network, nothing I've told him makes a difference, but he'd have a much better service if he moved it off Azure.

Because we develop with Entity Framework it's quite easy to muck up and forget an include or two and then suddenly the queries spawn a hundred or two hundred basic SELECT queries to populate an `Order.OrderItem` or something as trivial. On a standard dedicated server setup or even your worst 5 year-old crappy dev laptop that's a ms or 2 extra, but on Azure it's performance death.
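
To put rough numbers on that, here is a small sketch using the figures from the test page above (100 queries in 250ms at best, 600ms at peak). It assumes the extra lazy-load queries run one after another on a single connection, so each one pays a full round trip; the function names are made up for illustration.

```python
def per_query_ms(total_ms, queries):
    """Average round-trip time per query, from a batch measurement."""
    return total_ms / queries


def page_overhead_ms(extra_queries, round_trip_ms):
    """Latency added by lazy-loading: one sequential round trip per extra query."""
    return extra_queries * round_trip_ms


if __name__ == "__main__":
    best = per_query_ms(250, 100)    # from the 250ms best case
    worst = per_query_ms(600, 100)   # from the 600ms peak case
    for extra in (100, 200):
        print(extra, "extra queries:",
              page_overhead_ms(extra, best), "to",
              page_overhead_ms(extra, worst), "ms")
```

So a forgotten include that triggers 200 extra queries adds somewhere between half a second and over a second per page at those round-trip times, versus a couple of milliseconds when the database is on the same box.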

There are advantages to using Azure, I admit, the easy deploy from GitHub for example, but that's more because setting up deploying from a repo to IIS is such a bad workflow at the moment and the IIS management too. They don't want to make it easy. With other clients I've set up moderately complicated deployment scripts, and once the initial work is done, it's much better than Azure, you run a bat, boom, deploy much faster than Azure manages. 10-20 seconds without even pre-compiling the pages, Azure will take 5 or 10 minutes to finally get round to doing it and woe betide you doing it at peak times as it will use up your memory allocation and simply hang for 20 minutes while you frantically try and restart it while the Azure management portal has a meltdown (yes, we unfortunately deployed a serious bug, fixed it, tried to re-deploy while under heavy load, cue website down for a ridiculous amount of time while their management portal threw a ton of weird and inscrutable errors before we finally managed to restart it).


Yikes. I've only really had true production experience with AWS. I guess all clouds aren't made the same.


I have used AWS at a couple work places and I personally use Google App Engine. My sites on App Engine have less than 5 minutes of downtime per year as measured by Pingdom. Deploys to App Engine take a few seconds (though it can depend on your app).

You can choose to run your DB servers on top of generic compute instances in Azure or any other cloud provider. You aren't stuck using their database services.

> The claimed machines you get on cloud aren't anything like as performant as supposedly similar machines on dedicated.

This is true, I don't think anyone will dispute that.


> One of my clients on a dedicated server has never gone down. The site is blazingly fast, barely touches 5% CPU and pages have sub 50ms response times. Deploys take 10 seconds or so, I could make it faster but it's not really worth the cost/benefit.

One of your clients... where? Without even giving a location, this doesn't even qualify as 'anecdata'.


I think that:

1) Most problems can be, but shouldn't necessarily be, solved by adding more servers.

2) SYS & Kimsufi are just OVH resellers. Same datacenters just the last generation of hardware under a different brand.

3) Hetzner is great if you want to serve up content from the EU. Great ping times all across there. (Your friends over the pond will thank you)

4) OVH is a huge, and very reliable, hosting company. I don't think there's anything wrong with them. They also do supply such an auto-scaling API.

5) You don't necessarily need to run mission-critical systems on a traditional hosting solution to see big savings. Move your high-compute/high-bandwidth programs onto a dedicated server and save big. This is really great for batch-analytics-type systems where, after a day of operation, you dump a backup file and want to pull some data out of it for the morning. It's great for dev systems and thousands of other non-mission-critical systems.


AWS is so much more than servers. If you just want a server to run some stuff on, you're right, it's faster and cheaper to buy a dedicated server from a reseller.

With Amazon you have so many options when it comes to storage, compute power, load balancing... They have a huge portfolio of products and you can get everything in one location from one vendor.


"With Amazon you have so many options when it comes to storage, compute power, load balancing..."

It used to be different, but today their most important offerings have open source equivalents. They are essentially off the shelf, because for your own projects there is no need to care about abuse or noisy neighbors. Most importantly, though, it's going to be cheaper and you still will be able to use cloud where it shines.


> It used to be different, but today their most important offerings have open source equivalents.

Seriously? What about S3, RDS, VPC, IAM? The engineering $$$ (in time) to set up and maintain open source equivalents of these would not be cheap.


AWS isn't cheap either.


Great writeup. The "user_id" one really hit home for me. @ Userify (ssh key management for on-prem and cloud) we currently have hotspots where some companies integrate the Userify shim and put 'change_me' (or similar) in their API ID fields. Apparently, they don't always update it before putting it into production... so we get lots and lots of "Change Me" attempted logins! It's not just one company, but dozens.

Fortunately, we cache everything (including failures) with Redis, so the actual cost is tiny at most, but if you are not caching failures as well as successes, this can result in unexpected and really hard to track down cost spikes. (disclaimer: AWS cert SA, AWS partner)

Segment's trick of detecting and recording throttling, and using that as a template for "bad keys" (which presumably are manually validated as well), seems like a great idea too, but I'd suggest first caching even failed calls on logins if possible, as that probably would have avoided the need to ever hit Dynamo.
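A minimal sketch of what caching failures alongside successes could look like. This uses a plain in-process dict with a TTL instead of Redis, and the class name, `backend_lookup` callable, and TTL value are all hypothetical.

```python
import time

class LoginCache:
    """Tiny TTL cache that remembers failed lookups as well as successful ones."""

    def __init__(self, backend_lookup, ttl_seconds=60.0, clock=time.monotonic):
        self._lookup = backend_lookup   # the expensive call, e.g. a database read
        self._ttl = ttl_seconds
        self._clock = clock             # injectable so expiry is easy to test
        self._entries = {}              # api_id -> (expires_at, result)
        self.backend_calls = 0

    def check(self, api_id):
        now = self._clock()
        entry = self._entries.get(api_id)
        if entry is not None and entry[0] > now:
            return entry[1]             # cache hit, whether success or failure
        self.backend_calls += 1
        result = self._lookup(api_id)   # None stands for "unknown key"
        self._entries[api_id] = (now + self._ttl, result)
        return result
```

With this shape, a thousand logins with the same bad ID inside one TTL window cost a single backend call; the trade-off is that a newly created key can look invalid for up to one TTL if it was probed before it existed.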

PS the name 'project benjamin' for the cost cutting efforts.. pure genius.


Reading through this, I would change "Dynamo is Amazon’s hosted version of Cassandra" to "DynamoDB is Amazon's hosted key-value store, similar to Cassandra". The former (to me) sounds like you're saying they vend a managed Cassandra.


You are totally right. This was an oversight that we didn't catch during our review process. Thanks so much for the tip, we've updated the post.


I wish I could invest in Segment. Lot of smart people over there.


These last two engineering blog posts[1] were great reads and they're doing good open source work. Not looking for a job but would love to meet the team if they host an event!

[1] https://segment.com/blog/ui-testing-with-nightmare/


DISCLAIMER: It can be viewed as an advertisement for my project

A few months back I started a company for that exact purpose: helping you decrease your AWS bill. https://wizardly.eu/

The focus is on reservations, unused resources, and tracking cost evolution on a day-to-day basis.

I don't get that much traction (I'm not a marketer and it's my first SaaS project, so I still have a lot to learn in that area). I'm looking for more user feedback; if you're willing to invest a few minutes in my project, that would be really helpful. It's also completely free for now.

As the article says, there is no silver bullet, especially if you need infrastructure changes. But it's possible to get a worthwhile cut by using all the AWS pricing options.

Making teams aware of and responsible for their costs also makes a big difference. Often, the people creating and managing infrastructure have no view of what it is costing their company.


Have you guys considered going bare metal or a hybrid approach? With such immense spending (even when saving the $1m/yr) it would probably be a lot cheaper.


It could be if our workload was relatively stable and there were spare engineering cycles to undertake a migration and all that this entailed. Neither of these is the case.

Much of what allowed us to implement these savings quickly with a small team was the flexibility afforded by cloud infrastructure. Poor decisions are easy to reverse, but in a bare metal world you better be damn sure what you're doing, which slows down the decision-making process and seriously complicates experimentation. The number of people who know how to build out datacenters at the scale of thousands of machines is vanishingly small.

We'd also need to replace IaaS services like ECS, ELB, ElastiCache, RDS, and DynamoDB. There are certainly off-the-shelf replacements, but we'd need to build-out the expertise within our teams to operate these systems. We're talking roughly a dozen or so engineers working full time for many, many months to get these systems in place from scratch, on top of the even larger effort to design and build out datacenters. I'd much rather plow those cycles into efforts like expanding to multiple regions and improving reliability of internal services. That's a much better return on investment for our customers.

Right now we're in the sweet spot for the cloud. We're way too big to run on trivial amounts of hardware, growing at a rate that makes it difficult to stay ahead of demand in a datacenter-centric world, and too small to justify investing in a scalable hardware build-out.


> It could be if our workload was relatively stable

+1. If your entire environment is steady-state, it may be worth considering a migration to bare metal.

That said, you can expect:

* Worse uptime

* A fair bit of retraining / hiring different skillsets

* A loss of engineering focus as they work on the migration instead of feature expansion

It can make sense economically, but that discussion goes well past "the monthly AWS bill."


Try all the same on Google Cloud, you'll save half the money compared to AWS, probably more.

Disclaimer: I am NOT affiliated with Google in any way.


I would bet you 1000 USD that it would be eye-opening to buy a 64gb RAM, 8 core CPU server, put 4 SSDs in it in RAID10 config, and test your entire stack on just that box using Docker, KVM, whatever. Just running it beside your desk and having your other devs beat on it.

You might have to stub out some things using e.g. Cassandra or another open source alternative; just use a Docker or VM image without the production redundancy.

You did admit that you have no idea how much performance is being dropped on the floor due to not using dedicated hardware. So why not test it?

My belief: you would find you are paying AWS far more than they are actually worth.


I've operated on-prem and colo physical data centers. I've also been part of larger companies where someone else deploys baremetal on my behalf. I've been running infrastructure on top of AWS off-and-on since 2008. I've done a live migration of one of the largest Internet sites from AWS to baremetal in data centers. I feel pretty good about keeping our infrastructure in AWS for the foreseeable future.


I apologize; I didn't read that part of your bio on the site.

Your post seemed to indicate that you hadn't done any calculations for the bare metal server with your own vms scenario. But clearly you have, if only mentally.


No worries! I'm not super into self-promotion, but I do think my background is relevant in this case. Either way, the decisions that go into a strategy like this involve a mountain of variables.


This article makes a good case for having at least one early team member who knows how to build out physical infrastructure at scale, even if you initially decide on AWS / GCE / Azure:

http://firstround.com/review/the-three-infrastructure-mistak...


This might sound a bit extreme, but contrary to most FRR pieces this is garbage advice. Startups should focus first on getting their product-market fit before trying to future-proof themselves from a future that doesn't yet exist.


Register your startup to Amazon or Google, you'll get 100k of free credits.

Keep an eye on your bill, because you don't want to run just anything. But take advantage of having no hassle for your first years.


I was curious about that myself. What's the best way to model the costs of the maintenance overhead that comes with bare metal against the savings from managing bare-metal servers through a service (like Packet), co-locating your own hardware, or running your own datacenter?

My gut feeling is that you have to get to an extraordinary size to realize any meaningful savings, but that's primarily based on Dropbox's migration off AWS (https://www.wired.com/2016/03/epic-story-dropboxs-exodus-ama...).


It's rather simple, you count the days you are wasting on handling the low level hardware.

The entry cost is £500 per day or $1000 for an engineer.

5 hours on the phone to find the hardware and agree on the order with DELL + an afternoon with customer support because they shipped the servers without hard drives + your project is delayed by an entire week because you don't have the resources => 1 day + 1 day + half a week.

These things would have been 5 minutes on a cloud.


It depends upon your constraints.

As a for-instance, I was asked this exact question a year or so ago about an on-prem object store instead of S3.

The break-even price to go multi-region was ~15PB on paper; that included DC space, hardware, software (build vs buy was another factored discussion), and the staff to run it.

That assessment was delivered with the caveat that their uptime was not going to approach S3's by any stretch of the imagination-- and their infrastructure outages weren't going to sync with "most of the internet's."

It's a complex topic, and there are many hidden costs...


That's a big giant "it depends."

It depends on what your service looks like: CPU intensive, Memory Intensive, Storage intensive? (In reality some unique mix).

You probably won't see a huge savings year one, as you'll be spinning up a lot of new things and have a fairly large CapEx expenditure. Now if your growth pattern is steady/predictable then you should be able to plan out your hardware buys or do a hybrid solution to handle traffic bursts.

One of the nice things about running your own hardware is that some costs are easier to control. Don't need new hardware? Then you don't need to spend on new hardware, for example.

You also have much more control over your environment so you are able to really optimize your code, and infrastructure so that you don't need to scale as large system wise.

But, back to the question of how to model it: you just gotta dig in, make some educated guesses about performance, test, and repeat.


Addendum missing from the article: Segment charges per request, so these users who run performance tests with analytics enabled should be very profitable.

Business rule => You gotta optimize for your cash cows. ;)


Actually, our pricing is no longer based on API calls-- https://segment.com/pricing.


$10 / 1000 MTUs

"Monthly Tracked User, or the sum of logged in plus anonymous users that access your product in a given month. "

P.S. I understand that you have volume discounts.


> After a three months of focused work, we managed to cut our AWS bill by over one million dollars annually.

That's great, but now I'm wondering how many engineers were paid for three months of full time work to save $1 million/year on AWS, and how much it costs the company to pay those engineers for 3 months.

It's also interesting to wonder about how much growth your startup could have managed if it had had three additional months of work from those engineers.


If you assume USD 250k per engineer, they would need to have had more than 16 engineers working on this for the 3 months to spend more than a million on salary. So, I'm guessing they probably came out ahead on that...?
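The arithmetic behind that estimate, as a quick sketch (the function name is made up; the 250k figure is the assumption stated above):

```python
def breakeven_engineers(annual_cost_per_engineer, months, budget):
    """Number of engineers whose pay over `months` adds up to `budget`."""
    cost_per_engineer = annual_cost_per_engineer * months / 12
    return budget / cost_per_engineer

# 250k/year for 3 months is 62.5k per engineer, and 1M / 62.5k = 16.
print(breakeven_engineers(250_000, 3, 1_000_000))  # prints 16.0
```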


>The ALB allows ECS to set ports dynamically for individual containers, and then pack as many containers as can fit onto a given instance.

Maybe I have a different view, but I've been avoiding running the same service on the same instance. If I'm running multiple services, I'm doing it for HA, and if an instance goes down, you could lose all your services. Why would you run the same service on the same instance?


Really interesting read. We've been in a similar mindset, and it really comes down to understanding what's driving your costs and being creative about ways to minimize them. As others on here mentioned, bandwidth and data transfer are a huge cost, so if there's a way to make your caching more aggressive it's a big and easy win. Beyond that, the low-hanging fruit is in using reserved instances: there are a few options there, but they are all significantly cheaper than paying the on-demand price.

Shameless plug: I built a small visualization for the AWS detail log to help understand the cost drivers. All it does is load the data and visualize it across a few dimensions, but it might come in handy for others - https://github.com/dangoldin/aws-billing-details-analysis


Umm... i take issue with the comment "distribution is essentially free". nonononnononooooo. Other than the rare case of virality, I'd like to know in what software industry distribution is essentially free - i'm happy to be wrong about this, let me know.

the rest of the article was fine.


A friend of mine developed something to do just that: understand AWS/Google/Azure costs and compare prices (and save big).

Shameless plug: https://trackit.io/


Probably not the ideal response, sorry, but I only got as far as the homepage - really slow, jerky scrolling, scroll-to-anchors miss the mark, and the social media links don't work. Then, mouse wheel/trackpad scrolling stopped working. :/


The website is probably hosted on AWS :) As for the social media links, they all work for me...


Had an interesting experience when I went to the pricing page for Segment - Had a little sales guy pop up in the lower right corner asking "Would (my enterprise employer name) like to know if you qualify for a volume discount?"

I only recently started working (and web browsing) from an enterprise network, so maybe this has been done before, but it's my first time seeing it and I thought it was really interesting.

Edit: To be clear, I'm not intrigued at the little guy himself, I'm intrigued at how personalized it was.


Built it using Drift, Clearbit and Madkudu.

Challenge: there's a race condition because the enrichment & scoring takes a little while, but the chat widget checks the drift database at the same time.

Source: I'm part of the team who built this @Segment


I don't get it right now but it's possible that is Intercom (chat) with data from Clearbit (https://clearbit.com/reveal).


Close! It's Drift (drift.com), not Intercom.


Thanks David.

You're right it's Drift. I'm the CEO. Love working with Segment, Mixpanel, Clearbit and other great companies to deliver a personalized experience on their sites like in this example.


We just published an article on how Segment doubled their sales using Drift. Here's all the details on how it works: https://blog.drift.com/segment-case-study/


Did the box have a "Powered by x" statement anywhere? Would be curious to know if it's Intercom or something else.


It's Drift (www.drift.com). I'm the CEO. We work closely with Segment and others including Clearbit to personalize the experience.


Don't see anything like that. Graphically it's seamlessly integrated into the website.

I went back and clicked on it this time and it then says "Our scoring models shows (company name) could be a good fit for our business plan. Want to schedule time to learn more?"


Is this the same thing as segment.io?


Yes; they changed their name to Segment at the Series A.


A great write up. Really wish that they had given some indication about how much the original bill was. Sounds very effective but 1m of 100m isn't that much!


Given that the game Elite [1] used to run on a Sinclair with a Z80 processor and 48kB of memory, any complaints about the costs nowadays sound ridiculous. If our software infrastructure was coded by the Elite programmers, we would be able to run it at 0.01% of the cost.

[1] https://en.wikipedia.org/wiki/Elite_(video_game)


> It was written in machine code using assembly language, giving much care to maximum compactness of code.

From the link. It seems unlikely that people would/could build applications in the modern world from scratch in assembler. Start with reimplementing the entire HTTP parsing engine and go from there? No thanks.


Idk how common the setup in the article is, but seems they could easily test for heatmap singularities such as the one that happened and alert for that.


(just alert on any keys with heat > mean + x * standard deviation or something stupid like that)
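Roughly like this, as a sketch. The function name, the per-key request counts, and the default `x` are all made up for illustration, and this is not what Segment actually runs.

```python
from statistics import mean, pstdev

def hot_keys(heat_by_key, x=3.0):
    """Return keys whose heat exceeds mean + x * standard deviation."""
    values = list(heat_by_key.values())
    if len(values) < 2:
        return []  # not enough data to call anything an outlier
    threshold = mean(values) + x * pstdev(values)
    return sorted(k for k, v in heat_by_key.items() if v > threshold)

# 99 well-behaved keys and one key that everybody forgot to change.
heat = {f"key-{i}": 10 for i in range(99)}
heat["change_me"] = 10_000
print(hot_keys(heat))  # prints ['change_me']
```

One caveat: with only a handful of keys a single hot key drags the mean and deviation up with it, so a fixed multiplier can miss it; a median-based threshold would be more robust there.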


Actually if you have 3d features, could use edge detection algorithm like Kirsch after fuzzing the colors a bit.


> In some cases, we’re currently packing 100-200 containers per instance.

I have a very basic "Docker 101" question here.. Can someone explain how you might get to >100 containers on a given instance? That seems to imply that there are multiple containers of the same microservice running on one machine. Why would you pack duplicate containers like that instead of using one larger container?


> Why would you pack duplicate containers like that instead of using one larger container?

Microservices that are scaling horizontally to handle concurrent requests, etc.

Depending on how seriously your architects take the "micro" in microservice, you may easily get to 100+ unique microservices/processes each running in their own Docker container.


That's not what's going on here (100-200 unique microservices). Aside from the fact that that's insane, they specifically state they want to "pack" multiple containers of the same service on one machine:

"You cannot run two containers of the same service on a single host because they will collide and attempt to listen on the same port. (no bin packing)"

So, what's the sense in doing that instead of one bigger container?
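For reference, the "bin packing" being quoted can be sketched as a first-fit placement by memory. This is only an illustration of the idea, not how the ECS scheduler is implemented, and the sizes are made-up numbers.

```python
def first_fit(containers_mb, instance_mb):
    """Place each container on the first instance with enough free memory."""
    free = []        # remaining memory per instance, in MB
    placement = []   # index of the instance chosen for each container
    for need in containers_mb:
        if need > instance_mb:
            raise ValueError("container larger than an instance")
        for idx, remaining in enumerate(free):
            if remaining >= need:
                free[idx] -= need
                placement.append(idx)
                break
        else:
            # No existing instance has room, so start a new one.
            free.append(instance_mb - need)
            placement.append(len(free) - 1)
    return placement, len(free)

# 150 copies of a 128 MB container on 16 GB instances.
placement, instances = first_fit([128] * 150, 16_384)
print(instances)  # prints 2
```

With a fixed host port, each of those 150 containers would need its own instance; with dynamic ports they fit on two.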


They're reimplementing the actor model using Docker containers. Let it crash.


The goal is just having a compute pool where you don't have to care at all where things are actually running.

It definitely depends on the use case, but I could imagine more, smaller units allowing for more granular (and therefore a bit more cost-effective) scaling.


But isn't that actually less granular than one container that can size itself optimally? I.e. it could use as few or as many of the resources available on the machine as it needs. And it would eliminate any redundancies among the containers of the same type.


If that's the case you would hit separate containers fighting over resource usage though, no? Seems better to give them a specific block they can use all of.


> blocking any new discovered badly behaved keys. Eventually we reduced capacity by 4x.

I wonder how it felt when they discovered how much money they'd burned on such a dumb little bug. I'd feel like throwing up.


If only they'd spent a couple more dollars on hosting for https://segment.com so it didn't return HTTP 503 errors...


I don't know if LBO funds are doing this already but I think it's most likely going to become a trend in the future :

- buy a SaaS company which hosts all its infra on the cloud

- move everything to bare metal

- double the company's margin

- profit


Dollar dollar bills y'all


Is it just me or is segment.com down?


how does a bar chart with no y-axis label show any trend? scale could be $100-101/mo


Can you lower our bill now? :)


My one step solution to bill lowering involves "blackmailing your TAM." Have you tried that? ;-)


Great article. I wish/hope/expect everybody to take this type of work/attention as seriously as they have. I am as much of a penny pincher as you can get, so our team built a database adapter on AWS that can handle 100M+ records for $10/day, demo here: https://www.youtube.com/watch?v=x_WqBuEA7s8 .

I'm surprised more people aren't price/performance sensitive. My goal is to get a system up to 10K table inserts/second on a free Heroku instance; we're already doing 1750 table inserts/second on low-end hardware across a federated system. But price tuning is one of the most important things when engineering for high load, as well.

Glad to see articles like this. :)


For us the ECS autoscaling feature is way too naive. We need to do our own controller on some more responsive metrics. The customization support is pretty bad.


CPU + memory autoscaling has worked pretty well for most services at Segment. We did have to build a custom queue-based autoscaler that feeds into ECS for our integration stack, but the limitations of the ECS autoscaling rules make it a bit clunky.


Would you mind saying more about what you had to build, and what you would have liked to see?

One thing I'd like to see in the future is scaling actions that supported multipliers, as well as metric math support (scale service to request_count/acceptable_requests_per_service).


One thing we experienced at Segment was the fact that we needed to quickly handle a surge in volume but couldn't overload partners. Essentially we wanted something that scaled up quickly at first but was pretty conservative after that.

We settled on constant increases/decreases using queue depth thresholds, but ideally ECS would support feeding in multiple metrics and doing some basic math to figure out how much we're currently processing, how long it would take to flush the queue, and how many more containers we need to drain it in a timely fashion.
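The "basic math" could look something like the sketch below. The function, its parameters, and the example rates are hypothetical; the hard cap stands in for not overloading partners.

```python
import math

def containers_needed(queue_depth, arrival_rate, per_container_rate,
                      drain_seconds, max_containers):
    """Containers required to absorb new arrivals and flush the backlog
    within `drain_seconds`, clamped to the range [1, max_containers]."""
    # Throughput needed: keep up with arrivals plus work off the backlog.
    required_rate = arrival_rate + queue_depth / drain_seconds
    wanted = math.ceil(required_rate / per_container_rate)
    return max(1, min(wanted, max_containers))

# 60k queued messages, 500 msg/s arriving, 50 msg/s per container,
# drain within 5 minutes, never more than 100 containers.
print(containers_needed(60_000, 500, 50, 300, 100))  # prints 14
```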


Have you looked into kubernetes?


A bunch of people at Segment have looked into it. We manage our infrastructure with Terraform and operationally IMO Kubernetes introduces too much complexity compared with ECS and Terraform.

ECS is pretty dead simple (just run an agent on the host) and while it doesn't offer nearly the same feature set, it's really good operationally.

Some people here are tinkering with it for some non-core services.


It is my impression that AWS makes its bank on I/O, while everyone makes buying decisions on CPU/RAM


Clickbait title.


Here's an amazing way to optimize your AWS bill: don't use it. Compared to dedicated you overpaid millions, then you spent non-trivial developer time/money to get the number down, but you are still overpaying. At 5K/month AWS might make sense (although that's debatable); at your level of spend it's a really bad idea. At this level an ops team of 2 (1 on-site, 1 remote in a different time zone) would give you way more flexibility and a very low bill.



