Along with this, vertical scaling is severely underrated. You can do a lot, possibly everything your company will ever need, with vertical scaling. That applies to 99% of companies, or even more.
Edit:
Since people are confused, here is how StackOverflow handles all of its web operations: https://stackexchange.com/performance (current load; it will surprise you). If SO can run with this, so can your 0.33 req/minute app which is mostly doomed to failure. I am only half joking.
Every time you go to SO, it hits one of these 9 web servers, and all data on SO sits on those 2 massive SQL servers. That's pretty amazing.
I want to be clear though: horizontal scaling has a place in companies big enough to have a team of corporate lawyers, and in many, many other scenarios for ETL and backend microservices.
Several years ago, I was chatting with another engineer from a close competitor. He told me about how they'd set up a system to run hundreds of data processing jobs a day over a dozen machines, using docker, load balancing, a bunch of AWS stuff. I knew these jobs very well, they were basically identical for any company in the space.
He then mentioned that he'd noticed that somehow my employer had been processing thousands of jobs, much faster than his, and asked how many machines we were using.
I didn't have the heart to tell him we were running everything manually on my two-year-old macbook air.
F- me. I love this. This is a really important message.
It's like Jonathan Blow asks: why does Photoshop take longer to load today than it did in the 90s, despite the (insane) advances in hardware?
I believe it's due to a bunch of things, but over-complicating the entire process is one of the big issues. If only people (developers/engineers) would sit back and realise just how much computing power they have available to them, and then realise that by keeping things simple and efficient they could build blazing-fast solutions overnight.
I cringe thinking about the wasted opportunities out there.
I see the point you're trying to make, however the feature set (and its complexity), plus the size of the average graphic that a high-end professional works with, have maybe grown by 300-500% since the 90s. In fact I'll tell you what: I'll give you a growth of 10,000% in file sizes and feature complexity since the 90s...
... computational power has grown by ~259,900% since the 90s.
The point being made is this: Photoshop does one job and has one focus (as it should), yet it has gotten slower at doing that one job, not faster. Optimising the code AND the incredible hardware now in the consumer market should see Photoshop loading in milliseconds, in my opinion.
>The point being made is this: Photoshop does one job and has one focus (as it should), yet it has gotten slower at doing that one job, not faster.
Has it though? Without measurements this is just idle talk.
And I used Photoshop in the 90s and I use it occasionally today. I remember having 90s-sized web pictures (say, 1024x768) and waiting tens of seconds for a filter to be applied - which I get instantly today with 24MP and more...
And if we're into idle talk I've always found Photoshop faster in large images and projects than competitors, including "lightweight" ones.
It's hella more optimized than them.
In any case, applying some image filter (which takes, say, 20 seconds vs 1 minute with them) just calls some optimized C++ code (perhaps with some asm thrown in) that does just that.
The rest of the "bloat" (in the UI, feature count, etc) has absolutely no bearing as to whether a filter or an operation (like blend, crop, etc) runs fast or not. At worst, it makes going around in the UI to select the operations you want slower.
And in many cases the code implementing a basic operation, filter, etc, hasn't even been changed since 2000 or so (if anything, it was optimized further, taken to use the GPU, etc).
I recall my dad requiring overnight sessions to have Photoshop render a particular filter on his Pentium 166MHz. That could easily take upwards of an hour, and a factor 10 more for the final edit. He'd be working on one photo for a week.
To me it feels as though, for the last decade and a half, computational power has not grown vertically. Instead Intel and AMD have grown computational power horizontally (i.e. adding more cores). I'm looking at the difference the M1 has made to compute performance as a sign x86 strayed.
It has also grown substantially vertically: single-core speeds keep going up (about 10x from a decade and a half ago), even as core count increases. (and M1 is not substantially faster than the top x86 cores, the remarkable thing is how power efficient it is at those speeds).
Clock speed != single-threaded performance. Clock speeds plateaued a long time ago; single-threaded performance is still improving exponentially (by being able to execute multiple instructions in an instruction stream in parallel, as well as executing the same instructions in fewer clock cycles), though the exponent approximately halved around 2004 (if the trend had continued we would be at about a 100-500x improvement by now).
Hard to say it's still "exponential"...what do you think the current constant doubling period is now?
Here's the single thread raw data from that repo. If you take into account clock speed increase (which, as you agree, have plateaued) we're looking at maybe a 2x increase in instructions per clock for conventional int (not vectorized) workloads.
Is there even another 2x IPC increase possible? At any time scale?
Back in the day I was using Intel Celeron 333 MHz and then AMD Duron 800 MHz.
I did not know how to use Winamp playlists because Winamp was "an instant" app for me: I just clicked on a song and it played within milliseconds. That was my flow of using Winamp for years. This did not change between the Celeron and the Duron; the thing was instant on both.
Then Winamp 3 came out and I had to use playlists, because a song, once clicked, took a good second or two to start playing. Winamp 5 from 2018 still starts slower than my beloved 2.73* did 20 years ago - on a Celeron 333 and a 5400 RPM HDD with 256 MB of RAM. I think even the good old Winamp 2.x is not as fast as it was on Windows 98/XP.
Something went wrong.
* not sure if it was 2.73, but I think so
Note: I realise Winamp 3 was crappy as hell, but still...
This is why I chose and stuck with coolplayer at the time (before I converted to Linux): no install, so light, so fast, and it had everything I needed. I loved it when I could find such an elegant and versatile app. I don't need to get "more" every 6-12 months.
I learned that every time you gain something, you also lose something without realizing it, because you take it for granted.
Are you me? Perhaps a little bit later on and a different set of signifiers (foobar2000/xp/celeron 1.7) but the same idea. Things were so much snappier back then than on my previously-SOTA MBPR 2019. Sigh.
I was at a graphic design tradeshow back in the mid 90's and there was a guy there demonstrating this Photoshop alternative called Live Picture or Live Photos or something like that. And he had a somewhat large, at the time, print image on the screen, probably 16mb or so, and was zooming in and out and resizing the window and it was redrawing almost instantly.
This was AMAZING.
Photoshop at the time would take many many seconds to zoom in/out.
One person in the group asked, "Yeah, but how much memory is in that machine?"
The guy hemmed and hawed a bit and finally said "It's got a bit, but not a lot, just 24mb."
"Yeah, well that explains it." Nobody had 24mb of RAM at that time. Our "big" machine had 16mb.
Live Picture was the first to use what I think are called image proxies, where you can have an arbitrarily large image but only work with a screen-sized version of it. Once you have applied all the changes and click save, it will then grind through the full-size image if needed.
A feature that Photoshop has since added, but it appeared in Live Picture first.
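The proxy idea is simple enough to sketch; here's a rough Python/Pillow illustration (file names and the particular edits are made up), where edits give instant feedback on a screen-sized copy and are only replayed against the full-resolution image on save:

    from PIL import Image, ImageFilter

    # Edit a small preview interactively; replay the same operations on the
    # full-resolution image only when the user saves.
    def make_proxy(path, max_side=1024):
        full = Image.open(path)
        proxy = full.copy()
        proxy.thumbnail((max_side, max_side))  # cheap, screen-sized copy
        return full, proxy

    # Record edits as a list of callables instead of applying them to the
    # huge original right away.
    edits = [
        lambda im: im.filter(ImageFilter.GaussianBlur(radius=2)),
        lambda im: im.rotate(90, expand=True),
    ]

    full, proxy = make_proxy("photo.tif")
    for op in edits:
        proxy = op(proxy)      # instant feedback on the small proxy
    # User clicks "save": only now grind through the full image.
    for op in edits:
        full = op(full)
    full.save("photo_out.tif")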
This might be a long shot, but did the demo include zooming in on a picture of a skull to show a message written on one of the teeth? If so, I've been trying to find a video of it for years.
Perhaps, but the size of photos shouldn't affect load times when starting the app (and in my opinion, neither should most features, but that depends on your architecture I suppose).
Yeah Jonathan Blow isn't exactly a luminary in computer science. I once read him going on a meltdown over Linux ports because "programming is hard". This is the kind of mind Apple enables, i.e. "why isn't this easy?"
The entire point of computers is to make things easy.
The Apple question "why isn't this easy" is missing from 95% of UX in modern software. Stop acting as if software devs are the only users of software. And even then: computers should do work and not produce work or mental overhead.
> The Apple question "why isn't this easy" is missing from 95% of UX in modern software.
I switched to a MacBook at work a few months ago, and it's been an epic of frustration and Googling: e.g.,
1. I set the keyboard layout to UK PC, but it kept switching back to default. (I was accidentally mashing a keyboard shortcut designed to do just that. I've since disabled it.)
2. Clicking on a link or other web-page element will, occasionally and apparently randomly, take me back a page or two in my history rather than opening the link or activating the element. (At least in Chrome: I've not used another browser on macOS yet.)
3. Command-tabbing to a minimised application does not, as one would naively expect, automatically unminimise it. Instead I'm left staring at an apparently unchanged screen.
4. If I open Finder in a given folder, there is, as best I can tell, no easy way to navigate to that folder's parent.
Now arguably #1 was the fault of my own ignorance (though some kind of obvious feedback as to what was happening and why would have been nice), and #2 may be down to Google rather than Apple.
But #3 and #4 are plain bad design, bordering on user-hostile.
So far I'm not seeing that Apple's reputation for superior UX is justified, at least not in laptops.
Window switching in OS X is so unintuitive it drives me MAD when you are remoting into a Mac.
Finder is just god awful and does all it can to obscure your actual filesystem location but the go to folder (cmd+g?) can get you where you need to go.
The Apple reputation was absolutely justified back in the OS9 era, and the early iPhone as well. However both OS X and iOS7 and beyond were huge steps backwards in usability.
At this point I think Apple still deserves their reputation for superior UX, however that's a result of how epically bad Google, MS, and Facebook are at UX, not Apple doing a great job like they used to.
for 4. ...as best I can tell, no easy way to navigate to that folder's parent
You can add a button to the toolbar in Finder (customize it from options) that when dropped down will show the complete path to the current folder as a list. You can use that to move up the tree.
Clearly you've never tried it, because it's certainly not 5 minutes, it's optimized for selling you bullshit Apple services and it's buggy as hell, with no feedback to the user why everything is broken and why you're having to re-authenticate five times in a row.
And good luck if you're setting up a family account for several devices with different iOS versions. You're gonna really need it.
>Clearly you've never tried it, because it's certainly not 5 minutes, it's optimized for selling you bullshit Apple services and it's buggy as hell
I've tried it tons of times, have over 20 iOS/macOS devices over the years, and for some perverse reason, on macOS/OS X I like to install every major update on a clean disk too (and then re-import my data. It's an old Windows 95/XP-era reflex), so I do it at least once every year for my main driver (plus different new iOS devices).
And the whole "optimized for selling you bullshit Apple services" is a couple of screens you can skip with one click -- and you might want to legitimately use too.
Honestly, literally millions of people do this every year, and for most of them, it's like 10 minutes, plus the time waiting for the iCloud restore. Even my dad was able to set up his new iPad, and he's as technophobic as it gets.
Watching the video that coldtea posted, no, it is not. Ubuntu and most of its derivatives have very easy installers, and take a fraction of the time. The video didn't even include the time it would take to read and understand all the terms and conditions!
I don't know what you mean by this- care to elaborate? I have several fully working computers running Ubuntu derivatives without having to do anything after the install.
I currently have two laptops, one a Lenovo from work with Kubuntu and the other cheap Asus with KDE Neon. Both required no additional work to be fully working after install.
> This is the kind of mind Apple enables, i.e. "why isn't this easy?"
I dunno, personally that's why I've used Apple products for the past decade, and I think it's also maybe part of why they have a 2T market cap, and are the most liquid publicly traded stock in the world?
Where's your evidence for that? The argument that it "treats users as dumb" and that doing so is "profitable" is oft trotted out, but I never see any substantiation for it. Plenty of companies do that. What's so special about Apple, then? I mean, it's gotta be something.
You gotta be careful about these arguments. They often have a slippery slope to a superiority complex (of "leet" *NIX users over the unwashed "proles") hiding deep within.
Needlessly complex or powerful user interfaces aren't necessarily good. They were quite commonplace before Apple. Apple understood the value of minimalism, of cutting away interaction noise until there's nothing left to subtract. Aesthetically speaking, this approach has a long, storied history with respect to mechanical design. It's successful because it works.
What Apple understood really acutely and mastered is human interface design. They perfected making human-centric interfaces that look and feel like fluid prosthetic extensions of one's body, rather than a computational interface where power is achieved by anchoring one's tasks around the machine for maximal efficiency. Briefly, they understood intuition. Are you arguing that intuition is somehow worse than mastering arcana, simply because you've done the latter?
Now, I'm not going to say that one is better than the other. I love my command-line vim workflow dearly, and you'll have to pry my keyboard out of my cold dead hands. But there's definitely the idea of "right tool for the right job" that you might be sweeping by here. Remember, simplicity is as much a feature of the cherished *NIX tools you probably know and love; it's where they derive their power. Be careful of surface-level dismissals (visual interfaces versus textual) that come from tasting it in a different flavor. You might miss the forest for the trees!
It's easy to take a stab at someone online, behind a keyboard, but I'd suggest you show us all your work and we'll judge your future opinions based on it.
By no metric am I comparatively as successful as the guy, but I am still able to disagree with his point that Linux is held back by its tools. The fact that he did not want to, or did not have the time to, learn Linux's tooling doesn't mean anything in particular except that he's either very busy or very lazy. Any interview I read with him, he's just ranting and crying over this or that "too complex" matter.
If he does not want to deal with the complexity of modern computers, he should design and build board games.
One is that there’s a minimum performance that people will tolerate. Beyond that you get quickly diminishing user satisfaction returns when trying to optimize. The difference between 30 seconds and 10 seconds in app startup time isn’t going to make anyone choose or not choose Photoshop. People who use PS a lot probably keep it open all day and everyone else doesn’t care enough about the 20 seconds.
The second problem is that complexity scales super-linearly with respect to feature growth, because each feature interacts with every other feature. This means that the difficulty of optimizing startup times gets harder as the application grows in complexity. No single engineer or team of engineers could fix the problem at this point; it would have to be a mandate from up high, which would be a silly mandate since the returns would likely be very small.
The problem is: if everything is simple enough, how do you set your goals this year? Complication creates lots of jobs and waste. It doesn't starve us, but it starves others somewhere in the world, or in the future when the resources are all gone.
I see where you're coming from with this, but this is getting into the realm of social economics and thus politics.
To solve the problem you're describing we need to be better at protecting all members of society not in (high paying) jobs, such as universal basic income and, I don't know, actually caring about one another.
But I do see your point, and it's an interesting one to raise.
Most orgs (and most devs) feel developer productivity should come first. They're not willing to (and in a lot of cases, not able to) optimize the apps they write. When things get hard (usually about 2 years in) devs just move on to the next job.
> If people (developers/engineers) would only sit back...*
It is a matter of incentives. At many companies, developers are rewarded for shipping, and not for quality, efficiency, supportability, documentation, etc.* This is generally expected of a technology framework still in the profitability growth stage; once we reach a more income-oriented stage, those other factors will enter incentives to protect the income.
> I think you can build something complex quickly and well at the same time.
One definitely can. It's a real crapshoot whether the average developer can. For what it's worth, I consider myself a below-average developer. There is no way I could grind l33tcode for months and even then land a FAANG interview. I can't code a red-black tree to save my life unless I had a textbook in front of me. Code I build takes an enormous amount of time to deliver, and So. Much. Searching. You get the picture. I'm a reasonably good sysadmin/consultant/sales-engineer; all other roles I put ludicrous amounts of effort into, to become relevant. Good happenstance I enjoy the challenge.
For the time being however, there is such enormous demand for any talent that I always find myself in situations where my below-average skills are treated as a scarcity. Like a near-100-headcount testing organization in a tech-oriented business with an explicit leadership mandate to automate with developer-written integration code from that organization... and two developers, both with even worse skills than mine. When a developer balks at writing a single regular expression to insert a single character at the front of an input string, that's nearly the definition of turning one wrench for ten years; while I'm slow and very-not-brilliant, I'm a smart enough bear to look up how to do it on different OSes or languages and implement it within the hour.
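(For the record, the task in question really is a one-liner; something like this, with the exact character and the use of a regex at all being purely illustrative:)

    import re

    # Insert a single character (here "X", purely illustrative) at the front
    # of an input string; the regex is overkill, plain concatenation works too.
    def prefix(s: str, ch: str = "X") -> str:
        return re.sub(r"^", ch, s, count=1)   # equivalent to: ch + s

    assert prefix("12345") == "X12345"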
This is not unusual in our industry. That's why FizzBuzz exists. That's just to clear the bar of someone who knows the difference between a hash and a linked list.
To clear the bar of "something complex quickly and well at the same time" though, I've found it insufficient to clear only the technical hurdle and obtain consistent results. The developer has to care about all the stakeholders. Being able to put themselves into the shoes of the future developers maintaining the codebase, future operators who manage first line support, future managers who seek summarized information about the state and history of the platform, future users who apply business applications to the platform, future support engineers feeding support results back into developers, and so on. That expansive, empathetic orientation to balance trade-offs and nuances is either incentivized internally, or staffed at great expense externally with lots of project coordination (though really, you simply kick the can upstairs to Someone With Taste Who Cares).
I'd sure as hell like to know alternatives that are repeatable, consistently-performing, and sustainable though. Closest I can think of is long-term apprenticeship-style career progression, with a re-dedicated emphasis upon staffing out highly-compensated technical writers, because I strongly suspect as an industry we're missing well-written story communication to tame the complexity monster; but that's a rant for another thread.
Reminded me of Fabrice Bellard's Pi digits record[1]
The previous Pi computation record of about 2577 billion decimal digits was published by Daisuke Takahashi on August 17th 2009. The main computation lasted 29 hours and used 640 nodes of a T2K Open Supercomputer (Appro Xtreme-X3 Server). Each node contains 4 Opteron Quad Core CPUs at 2.3 GHz, giving a peak processing power of 94.2 Tflops (trillion floating point operations per second).
My computation used a single Core i7 Quad Core CPU at 2.93 GHz giving a peak processing power of 46.9 Gflops. So the supercomputer is about 2000 times faster than my computer. However, my computation lasted 116 days, which is 96 times slower than the supercomputer for about the same number of digits. So my computation is roughly 20 times more efficient.
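The arithmetic behind that 20x figure is easy to check from the numbers quoted above:

    # Numbers as quoted above.
    supercomputer_flops = 94.2e12     # 640-node T2K peak
    desktop_flops       = 46.9e9      # single Core i7 peak
    speed_ratio = supercomputer_flops / desktop_flops      # ~2000x more raw power

    supercomputer_days = 29 / 24      # 29-hour run
    desktop_days       = 116
    time_ratio = desktop_days / supercomputer_days         # ~96x slower wall clock

    print(round(speed_ratio), round(time_ratio), round(speed_ratio / time_ratio))
    # -> roughly 2009, 96, 21: about 20x more efficient per FLOP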
I recently did some data processing on a single (albeit beefy) node that someone had been using a cluster for. In a day I composed and ran ETL that had taken them weeks on their infrastructure (they were actually still in the process of fixing it).
No you cannot, you cannot infinitely scale SQLite, you can’t load 100 G of data into a single SQLite file in any meaningful amount of time. Then try creating an index on it and cry.
I have tried this, I literally wanted to create a simple web app that is powered by the cheapest solution possible, but it had to serve from a database that cannot be smaller than 150GB. SQLite failed. Even Postgres by itself was very hard! In the end I now launch redshift for a couple days, process all the data, then pipe it to Postgres running on a lightsail vps via dblink. Haven’t found a better solution.
My rule of thumb is that a single processor core can handle about 100MB/s, if using the right software (and using the software right). For simple tasks, this can be 200+ MB/s; if there is a lot of random access (both against memory and against storage), one can assume about 10k-100k IOPS per core.
For a 32-core processor, that means it can process a data set of 100GB on the order of 30 seconds. For some types of tasks it can be slower, and if the processing is either light or something that lets you leverage specialized hardware (such as a GPU), it can be much faster. But if you start to take hours to process a dataset of this size (and you are not doing some kind of heavy math), you may want to look at your software stack before starting to scale out. Not only to save on hardware resources, but also because it may require less of your time to optimize a single node than to manage a cluster.
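Spelled out for the 32-core example (these are just the rule-of-thumb figures above, not a benchmark):

    per_core_bandwidth = 100e6     # ~100 MB/s per core with decent software
    cores = 32
    dataset_bytes = 100e9          # 100 GB

    seconds = dataset_bytes / (per_core_bandwidth * cores)
    print(seconds)                 # ~31 seconds, i.e. "on the order of 30 seconds"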
This is a great rule of thumb which helps build a kind of intuition around performance I always try to have my engineers contextualizing. The "lazy and good" way (which has worked I'd say at least 9/10 times in my career when I run into these problems) is to find a way to reduce data cardinality ahead of intense computation. It's 100% for the reason you describe in your last sentence -- it doesn't just save on hardware resources, but it potentially precludes any timespace complexity bottlenecks from becoming your pain point.
>No you cannot, you cannot infinitely scale SQLite, you can’t load 100 G of data into a single SQLite file in any meaningful amount of time. Then try creating an index on it and cry.
Yes, you can. Without indexes to slow you down (you can create them afterwards), it isn't even much different than any other DB, if not faster.
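A minimal sketch of the "load first, index later" approach with Python's built-in sqlite3 (table name, schema, and data source are made up; the pragmas trade crash-safety for import speed, so only use them for a rebuildable bulk load):

    import sqlite3

    conn = sqlite3.connect("big.db")
    conn.execute("PRAGMA journal_mode = OFF")   # no rollback journal during the load
    conn.execute("PRAGMA synchronous = OFF")    # don't fsync every transaction
    conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, payload TEXT)")

    def rows():
        # stand-in for the real data source (CSV reader, API dump, ...)
        for i in range(1_000_000):
            yield (i, f"payload-{i}")

    with conn:
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows())

    # Pay for the index once, after the data is in.
    with conn:
        conn.execute("CREATE INDEX idx_events_id ON events(id)")
    conn.close()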
>Even Postgres by itself was very hard!
Probably depends on your setup. I've worked with multi-TB sized Postgres single databases (heck, we had 100GB in a single table without partitions). Then again the machine had TB sized RAM.
> but it had to serve from a database that cannot be smaller than 150GB. SQLite failed. Even Postgres by itself was very hard!
The PostgreSQL database for a CMS project I work on weighs about 250GB (all assets are binary in the database), and we have no problem at all serving a boatload of requests (with the replicated database and the serving CMS running on each live server, with 8GB of RAM).
To me, it smells like you lacked some indices or were running on an RPi?
It sounds like the OP is trying to provision and load 150GB in a reasonably fast manner. Once loaded, presumably any of the usual suspects will be fast enough. It's the up-front loading cost which is the problem.
Anyway, I’m curious what kind of data the op is trying to process.
I am trying to load and serve the Microsoft Academic Graph to produce author profile pages for all academic authors! Microsoft and Google already do this but IMO they leave a lot to be desired.
But this means there are a hundred million entities, publishing 3x that number of papers, plus a bunch of associated metadata. On Redshift I can get all of this loaded in minutes and it takes like 100G, but Postgres loads are pathetic comparatively.
And I have no intention of spending more than 30 bucks a month! So hard problem for sure! Suggestions welcome!
How many rows are we talking about? In the end once I started using dblink to load via redshift after some preprocessing the loads were reasonable, and indexing too. But I’m looking at full data refreshes every two weeks and a tight budget (30 bucks a month) so am constrained on solutions. Suggestions welcome!
I'm trying to run a Postgres instance on a basic VPS with a single vCPU and 8GB of RAM! And I'll need to erase and reload all 150GB every two weeks..
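For what it's worth, a bulk refresh like that usually comes down to COPY into an unindexed table and building the indexes afterwards, which is far faster than row-by-row INSERTs; a rough psycopg2 sketch (connection string, file and table names are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=mag user=app")
    with conn, conn.cursor() as cur:
        cur.execute("TRUNCATE papers")                       # full refresh every two weeks
        with open("papers.tsv") as f:
            cur.copy_expert("COPY papers FROM STDIN WITH (FORMAT text)", f)
        # Build indexes only after the load, same idea as with SQLite above.
        cur.execute("CREATE INDEX IF NOT EXISTS idx_papers_author ON papers(author_id)")
    conn.close()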
Had a similar problem recently. Ended up creating a custom system using a file-based index (append to files named by the first 5 char of the SHA1 of the key)
Took 10 hours to parse my Terabyte. Uploaded it to Azure Blob storage, now I can query my 10B rows in 50ms for ~10^-7$.
It's hard to evolve, but 10x faster and cheaper than other solutions.
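That scheme fits in a few lines of Python (paths and record format are made up; with 16^5, about a million, bucket files, 10B rows works out to roughly 10k rows per bucket, which is quick to scan):

    import hashlib, os

    DATA_DIR = "buckets"

    def bucket_path(key: str) -> str:
        # File named by the first 5 hex chars of the SHA1 of the key.
        h = hashlib.sha1(key.encode()).hexdigest()[:5]
        return os.path.join(DATA_DIR, h)

    def put(key: str, value: str) -> None:
        os.makedirs(DATA_DIR, exist_ok=True)
        with open(bucket_path(key), "a") as f:
            f.write(f"{key}\t{value}\n")

    def get(key: str):
        # Scan only the small bucket file this key hashes to.
        try:
            with open(bucket_path(key)) as f:
                for line in f:
                    k, _, v = line.rstrip("\n").partition("\t")
                    if k == key:
                        return v
        except FileNotFoundError:
            pass
        return None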
My original plan was to do a similar S3 idea, but I forgot about its charge per 1000 GETs and PUTs and had a 700 dollar bill I had to bargain with them to waive! Does Azure's model not have that expense?
Curious if you tried this on an EC2 instance in AWS? The IOPS for EBS volume are notoriously low, and possibly why a lot of self-hosted DB instances feel very slow vs similarly priced AWS services.
Personal anecdote, but moving to a dedicated server from EC2 increased the max throughput by a factor of 80 for us.
You can use instances with locally attached SSDs. Then you're responsible for their reliability, so you're not getting all the "cloud" benefits. I used them to provision our own CI cluster with RAID-0 btrfs running PostgreSQL. Only the provisioning and CI scripts were backed up.
Got burned there for sure! Speed is one thing, but the cost is outrageous for IO-heavy apps! Anyway, I moved to Lightsail, which paradoxically doesn't have IO costs, so while IO is slow at least the cost is predictable!
You can skip Hadoop and go from SQLite to something like S3 + Presto that scales to extremely high volumes with low latency and better than linear financial scaling.
I've had similar experiences. Sometimes we'll have a dataset with tens of thousands of records and it will give rise to the belief that it's a problem that requires a highly scalable solution, because "tens of thousands" is more than a human can hold in their head. In reality, if the records are just a few columns of data, the whole set can be serialized to a single file and consumed in one gulp into a single object in memory on commodity hardware, no sweat. Then process it with a for loop. Very few enterprises actually have big big data.
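Concretely, "one gulp and a for loop" is about this much code (file and column names are made up):

    import csv

    # Read the whole dataset into memory; tens of thousands of rows is nothing.
    with open("records.csv", newline="") as f:
        records = list(csv.DictReader(f))

    total = 0
    for row in records:
        total += float(row["amount"])   # whatever "processing" means for you
    print(len(records), total)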
My solution started out as a 10-line Python script where I would manually clean the data we received, then process it.
CEO: "Will this scale?"
Me: "No, absolutely not, at some point we'll need to hire someone who knows what they're doing."
As time passed and we got more data, I significantly improved the data cleaning portions so that most of it was automated, and the parts that weren't automated would be brought up as suggestions I could quickly handle. I learned the very basics of performance and why `eval` is bad, set up my script so I didn't need to hard-code the number of files to process each day, started storing data on a network drive and then eventually a db...
I still don't know what I'm doing, but by the time I left it took maybe 5 minutes of manual data cleaning to handle thousands of jobs a day, and then the remainder could be done on a single machine.
I'm aware of a couple of companies who behave like that - "well, we could increase our user base by an order of magnitude at any point here, so better spring for the order-of-magnitude more expensive database, just in case we need it."
It's not just about scaling databases, some people are simply unable to assess reasonable limits on any system. A few years ago certain Scandinavian publisher decided to replace their standard industry tools by a single "Digital Experience Platform" that was expected to do everything. After a couple of years they understood it's a stupid idea and gave up. Then later someone in the management thought that since they already spent some millions of euros they should continue anyway. This behemoth is so slow and buggy the end users work at 1/4th speed but everyone is afraid to say anything as the ones who did have been fired. The current PM is sending weekly success messages. It's hilarious. And all because someone once had a fantasy of having one huge system that does everything.
I've noticed business people have a different idea of what "big data" means than tech guys. The business guys think it means a lot of data, like the records of a million people - which is a lot of data, but not by the tech-guy definition, which tends to be data too large to process on a single machine.
Those come out at something like 1GB and 10TB which are obviously rather different.
Unfortunately, this kind of behavior will be rewarded by the job market, because he's now got a bunch more tech buzzwords on his resume than you. Call it the Resume Industrial Complex: engineers build systems with as many bells and whistles as possible, because they want to learn all the hot new tech stacks so they can show off their extensive "skills" to potential employers.
My favorite part of conducting design interviews is when a candidate has pulled some complex distributed system out of their ass, and I ask them what the actual throughput/memory usage looks like.
On that day, most probably nothing with regards to this task.
Then, later, probably someone would check out the scripts from a shared repo. Then, read an outdated README, try it out, swear a bit, check for correctness with someone dependent on the results, and finally learn how to do the task.
There are a lot of business processes that can tolerate days or weeks of delay in case of such a tragic (and hopefully improbable) event. The trick is to know which of them can't.
> There are a lot of business processes that can tolerate days or weeks of delay in case of such a tragic (and hopefully improbable) event. The trick is to know which of them can't.
This is really true, BUT those kinds of problems are OK - nobody cares - until somebody starts caring, and then all of a sudden it is urgent (exactly because they went undetected for weeks/months due to their periodicity).
E.g. we have a less-than-1% chance per year that a given person leaves us on bad terms or suffers a bad accident or illness. If it really happens, it will cost us X in delays and extra work. Lowering the probability of this risk to Y% would cost us Z (money, delay, etc.).
If you do this math, you can tell if it's a good idea to optimize here, or if you have more pressing issues.
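With made-up numbers, the comparison looks like this:

    p_loss    = 0.01      # chance per year the key person is suddenly unavailable
    cost_X    = 20_000    # cost of the delays and extra work if it happens
    cost_Z    = 5_000     # yearly cost of mitigation (docs, automation, cross-training)
    reduced_p = 0.002     # residual probability after mitigation

    expected_without = p_loss * cost_X                 # 200 per year
    expected_with    = reduced_p * cost_X + cost_Z     # 5,040 per year
    print(expected_without, expected_with)
    # With these (made-up) numbers, the mitigation isn't worth it on expectation.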
In my experience, this sort of one-man job gets automated, or at least well described and checked, out of fear of mistakes and/or employee fraud rather than "downtime".
I wasn't even using a db at the time, it was 100% pandas. We did eventually set up more infrastructure, when I left the data was loaded into the company's SQL Server db, then pulled into pandas, then uploaded back into a different table.
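That kind of pipeline is only a few lines of pandas; a rough sketch, with placeholder connection string, table names and cleaning steps:

    import pandas as pd
    from sqlalchemy import create_engine

    # Pull from SQL Server into pandas, transform, write back to another table.
    engine = create_engine(
        "mssql+pyodbc://user:pass@server/db?driver=ODBC+Driver+17+for+SQL+Server"
    )

    raw = pd.read_sql("SELECT * FROM raw_jobs", engine)
    cleaned = raw.dropna(subset=["job_id"]).drop_duplicates("job_id")
    cleaned.to_sql("cleaned_jobs", engine, if_exists="replace", index=False)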
It's true – at that point, if I had disappeared without providing any transition help, the company would have been in trouble for a few days. But that goes for any employee – we were only 7 people at the time!
Eventually I built out some more infrastructure to run the jobs automatically on a dedicated machine, but last I checked everything still runs on one instance.
SO is always impressive - love that their redis servers with 256GB RAM peak at 2% CPU load :)
SO is also my go-to argument when some smart "architect" proposes redundant Kubernetes cluster instances for some company-local project. People seem to have lost the feeling for what is needed to serve a couple of thousand concurrent users (for the company-internal usage I specialize in, you'll hardly ever get more users). Everyone thinks they are Google or Netflix. Meanwhile, SO runs on 1-2 racks with a number of servers that would not even justify Kubernetes, or even Docker.
SO really isn't a great example, they have considerations most companies don't - Windows and SQL Server licensing. When shit like that is involved, scale out rarely seems like a better choice.
It's not only about the amount of users, it's also a matter of availability. Even the most stupid low-use barely-does-anything internal apps at my company get deployed either to two machines or a Nomad cluster for redundancy ( across two DCs). Design for failure and all that. Failure is unlikely, but it's trivial to setup at least active-passive redundancy just in case, it will make failures much easier.
The "1-2 racks" perspective is great too, really makes you think the old XKCD joke [1] about tripping over the power cable might not be that far wrong. ;-)
> SO is also my go-to argument when some smart "architect" proposes redundant Kubernetes cluster instances for some company-local project.
Technically you don't need Kubernetes, yes. But: There are advantages that Kubernetes gives you even for a small shop:
- assuming you have decent shared storage, it's a matter of about 30 minutes to replace a completely failed machine - plug the server in, install a bare-bones Ubuntu, kubeadm join, done. If you use Puppet and netboot install, you can go even faster (source: been there, done that). And the best thing: assuming well-written health checks, users won't even notice you just had a node fail, as k8s will take care of rescheduling.
- no need to wrangle with systemd unit files (or, worse, classic init.d scripts) for your application. For most scenarios you will either find Docker-embedded healthchecks somewhere or you can easily write your own, so that Kubernetes can automatically restart whatever becomes unhealthy.
- no "hidden undocumented state" like wonky manual customizations somewhere in /etc that can mess up disaster recovery / horizontal scale, as everything relevant is included in either the Kubernetes spec or the Docker images. Side effect: this also massively reduces the ops load during upgrades, as all there is on a typical k8s node should be the base OS and Docker (or, in newest k8s versions, not even that anymore)
- it's easy to set up new development instances in a CI/CD environment
- generally, it's easier to get stuff done in corporate environments: just spin up a container on your cluster and that's it, no wrestling with finance and three levels of sign-off to get approval for a VM or, worse, bare metal.
I won't deny that there are issues though, especially if you're selfhosting:
- you will end up with issues with basic network tasks very quickly during setup; MetalLB is a nightmare to set up, but smooth once you have it running. Most stuff is made with the assumption of every machine being in a fully Internet-reachable cluster (coughs in certbot); once you diverge from that (e.g. because corp requires dedicated "load balancer" nodes that only direct traffic from outside to inside, with "application" nodes not directly Internet-reachable) you're on your own.
- most likely you'll end up with one or two sandwich layers of load balancing (k8s ingress for one, and if you have it an external LB/WAF), which makes stuff like XFF headers ... interesting to say the least
- same if you're running anything with UDP, e.g. RTMP streaming
- the various networking layers are extremely hard to debug as most of k8s networking (no matter the overlay you use) is a boatload of iptables black magic. Even if you have a decade of experience...
Your arguments are true, but you did not consider the complexity that you have now introduced into a small-shop operation. You will need Kubernetes knowledge and experienced engineers on that matter. I would argue that the SO setup with 9 web servers, 2x2 DB servers and 2 Redis servers could easily be administered with 20-year-old knowledge about networks and Linux/Windows itself.
And I also argue that a lack of experience fiddling with redundant Kubernetes is a more likely source of downtime than hardware failure on a setup that keeps things simple.
> You will need Kubernetes knowledge and experienced engineers on that matter.
For a small shop you'll need one person knowing that stuff, or you bring in an external consultant for setting up and maintaining the cluster, or you move to some cloud provider (k8s is basically a commodity that everyone and their dog offers, not just the big 3!) so you don't have to worry about that at all.
And a cluster for basic stuff is not even that expensive if you do want to run your own. Three worker machines and one (or, if you want HA, two) NAS systems... half a rack and you're set.
The benefit you have is your engineers will waste a lot less time setting up, maintaining and tearing down development and QA environments.
As for the SO setup: the day-to-day maintenance of them should be fairly simple - but AFAIK they had to do a lot of development effort to get the cluster to that efficiency, including writing their own "tag DB".
Ah yes, I’ll make my critical infrastructure totally dependent on some outside consultant who may or may not be around when I really need him. That sounds like a great strategy. /s
SO is a great counter example to many over complicated setups, but they have a few important details going for them.
> Every time you go to SO, it hits one of these 9 web servers
This isn't strictly true. Most SO traffic is logged out, most doesn't require strictly consistent data, most can be cached at the CDN. This means most page views should never reach their servers.
This is obviously a great design! Caching at the CDN is brilliant. But there are a lot of services that can't be built like this.
Are you an SO dev? I had thought I read about the use of CDNs and/or Varnish or something like that for rendered pages for logged out users? I don't want to correct you on your own architecture if you are!
We went all-in on vertical scaling with our product. We went so far as to decide on SQLite, because we never planned to have a separate database server (or any separate host, for that matter). 6 years later that assumption has still held very strong and yielded incredible benefits.
The slowest production environment we run in today is still barely touched by our application during the heaviest parts of the day. We use libraries and tools capable of pushing millions of requests per second, but we typically only demand tens to hundreds throughout the day.
Admitting your scale fits on a single host means you can leverage benefits that virtually no one else is even paying attention to anymore. These benefits can put entire sectors of our industry out of business if more developers were to focus on them.
Our technology choices for the backend are incredibly straightforward. The tricky bits are principally .NET Core and SQLite. One new technology we really like is Blazor, because their server-side mode of operation fits perfectly with our "everything on 1 server" grain, and obviates the need for additional front-end dependencies or APIs.
Our backup strategy is to periodically snapshot the entire host volume via relevant hypervisor tools. We have negotiated RPOs with all of our customers that allow for a small amount of data loss intraday (I.e. w/ 15 minute snapshot intervals, we might lose up to 15 minutes of live business state). There are other mitigating business processes we have put into place which bridge enough of this gap for it to be tolerable for all of our customers.
In the industry we work in, as long as your RTO/RPO is superior to the system of record you interface with, you are never the sore thumb sticking out of the tech pile.
In our 6-7 years of operating in this manner, we still have not had to restore a single environment from snapshot. We have tested it several times though.
You will probably find that VM snapshot+restore is a ridiculously easy and reliable way to provide backups if you put all of your eggs into one basket.
>> You will probably find that VM snapshot+restore is a ridiculously easy and reliable way to provide backups if you put all of your eggs into one basket.
Yep, this is something we rely on whenever we perform risky upgrades or migrations. Just snapshot the entire thing and restore it if something goes wrong, and it's both fast and virtually risk-free.
I’m not the OP but I’m the author of an open source tool called Litestream[1] that does streaming replication of SQLite databases to AWS S3. I’ve found it to be a good, cheap way of keeping your data safe.
I am definitely interested in a streaming backup solution. Right now, our application state is scattered across many independent SQLite databases and files.
We would probably have to look at a rewrite under a unified database schema to leverage something like this (at least for the business state we care about). Streaming replication implies serialization of total business state in my head, and this has some implications for performance.
Also, for us, backup to the cloud is a complete non-starter. We would have to have our customers set up a second machine within the same network (not necessarily same building) to receive these backups due to the sensitive nature of the data.
What I really want to do is keep all the same services & schemas we have today, but build another layer on top so that we can have business services directly aware of replication concerns. For instance, I might want to block on some targeted replication activity rather than let it complete asynchronously. Then, instead of a primary/backup, we can just have 4-5 application nodes operating as a cluster with some sort of scheme copying important entities between nodes as required. We already moved to GUIDs for a lot of identity due to configuration import/export problems, so that problem is solved already. There are very few areas of our application that actually require consensus (if we had multiple participants in the same environment), so this is a compelling path to explore.
You can stream back ups of multiple database files with Litestream. Right now you have to explicitly name them in the Litestream configuration file but in the future it will support using a glob or file pattern to pick up multiple files automatically.
As for cloud backup, that's just one replica type. It's usually the most common so I just state that. Litestream also supports file-based backups so you could do a streaming backup to an NFS mount instead. There's an HTTP replica type coming in v0.4.0 that's mainly for live read replication (e.g. distribute your query load out to multiple servers) but it could also be used as a backup method.
As for synchronous replication, that's something that's on the roadmap but I don't have an exact timeline. It'll probably be v0.5.0. The idea is that you can wait to confirm that data is replicated before returning a confirmation to the client.
We have a Slack[1] as well as a bunch of docs on the site[2] and an active GitHub project page. I do office hours[3] every Friday too if you want to chat over zoom.
I really like what I am seeing so far. What is the rundown on how synchronous replication would be realized? Feels like I would have to add something to my application for this to work, unless we are talking about modified versions of SQLite or some other process hooking approach.
Litestream maintains a WAL position so it would need to expose the current local WAL position & the highest replicated WAL position via some kind of shared memory—probably just a file similar to SQLite's "-shm" file. The application can check the current position when a transaction starts and then it can block until the transaction has been replicated. That's the basic idea from a high level.
Does your application run on your own servers, your customers' servers, or some of each? I gather from your comments that you deploy your application into multiple production environments, presumably one per customer.
Vertical scaling maybe works forever for the 99% of companies that are CRUD apps running a basic website. As soon as you add any kind of 2D or 3D processing like image, video, etc. you pretty much have to have horizontal scaling at some point.
The sad truth is that your company probably won't be successful (statistically). You pretty much never have to consider horizontal scaling until you have a few hundred thousand DAU.
You don't need to scale your application horizontally even with media processing; you just need to distribute that chunk of the work, which is a lot easier (no state).
> Like someone else said, distributing work across multiple machines is a form of horizontal scaling.
Sure, but it is the easy kind, when it comes to images or videos. Lambda, for example, can handle a huge amount of image processing for pennies per month and there is none of the additional machine baggage that comes with traditional horizontal scaling.
It really depends. A streaming video service that does any kind of reprocessing of the data would probably be better off with horizontal scaling.
I imagine it's still super simple to have one core app that handles most of the logic and then a job-queue system that runs these high-load jobs on worker machines.
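On a single box that pattern is just a worker pool; a minimal sketch (process_job stands in for the real encode/resize work, and a real multi-machine setup would put a queue like SQS or Redis in front of the workers instead):

    from multiprocessing import Pool

    def process_job(job_id: int) -> str:
        # imagine ffmpeg / image resizing / etc. here
        return f"job {job_id} done"

    if __name__ == "__main__":
        jobs = range(100)
        with Pool(processes=8) as pool:     # scale the workers, not the core app
            for result in pool.imap_unordered(process_job, jobs):
                print(result)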
Definitely. There is certainly a place for Horizontal scaling. Just wanted to highlight how underrated vertical scaling is and a good engineer would evaluate these scaling options with prudence and perspicacity, not cult behavior so often observed in software engineering circles.
I think this is somehow related to how business-minded people think too. I went to a course where people learn to pitch their ideas to get funding, but the basics of business simply did not exist much among the technical people.
One simple example (which I suspect most businesses do) is that you do all the work either manually yourself or on your laptop, while advertising yourself as a resource-rich service. Only when you truly cannot handle the demand do you "scale up" and turn your business into a "real" business. And there are plenty of tricks like this (as legal as possible).
> Every time you go to SO, it hits one of these 9 web servers, and all data on SO sits on those 2 massive SQL servers. That's pretty amazing.
I don't find it amazing at all. Functionality-wise, StackOverflow is a very simple Web application. Moreover, SO's range of 300-500 requests per second is not a mind-blowing load. Even in 2014, a powerful enough single physical server (running a Java application) was able to handle 1M requests per second[1]. A bit later, in 2017, similar performance has been demonstrated on a single AWS EC2 instance, using Python (and a blazingly-fast HTTP-focused micro-framework Japronto), which is typically not considered a high-performance option for Web applications[2].
The amazing thing is that the leadership allows it to be simple.
This is such a great competitive advantage.
Compare this to a leadership that thinks you absolutely must use Akamai for your 50 req/secs webserver. You end up with tons of complexity for no reason.
Fair enough. Though not too surprising still, considering the original leadership of the company, one of whom (Joel Spolsky) is still on the board of directors. Having said that, the board's 5:4 VC-to-non-VC ratio looks pretty scary to me. But this is a different story ...
SO is a bit more complicated than returning a single character in a response. You can achieve high throughput with just about anything these days if you aren't doing any "work" on the server. 300-500 reqs/second is impressive for a web site/application with real-world traffic.
The thing is, 99% of companies could run like SO if their software were like SO's.
But if you are confronted with a very large 15+ year old monolith that requires multiple big instances to even handle medium load, then you're not going to get this fixed easily.
It's very possible that you come to the conclusion that it is too complex to refactor for better vertical scaling. When your demand increases, then you simply buy another machine every now and then and spin up another instance of your monolith.
> if you are confronted with a very large 15+ year old monolith that requires multiple big instances to even handle medium load, then you're not going to get this fixed easily
Last 15+ year old monolith I touched needed multiple machines to run because it was constrained by the database due to an insane homegrown ORM and poorly managed database schemas (and this is a common theme, I find.)
Tuning the SQL, rejigging things like session management, etc., would have made it go a lot quicker on a lot fewer machines but management were insistent that it had to be redone as node microservices under k8s.
I totally agree with your main point and SO is kind of the perfect example. At the same time it is kind of the worst example because for one, to the best of my knowledge, their architecture is pretty much an outlier, and for another it is what it is for non-technical historical reasons.
As far as I remember they started that way because they were on a Microsoft stack, and Microsoft's licensing policies were (are?) pretty much prohibitive for scaling out. It is an interesting question whether they would design their system the same way if they had the opportunity to start from scratch.
Yes, but Stack Overflow is now mostly a graveyard of old closed questions, easily cached. I am only half joking.
Most startup ideas today are a lot more interactive, so an SO model with two DBs would probably not serve them well. Horizontal scaling is not only for ETL, and I am uncertain why you say that it needs many lawyers.
Genuine question, how is 9 web servers vertical scaling? And also, peak CPU usage of 12% means this is about 10x oversized for what is needed. Isn't it much better to only scale up when actually needed, mostly in terms of cost?
Stack Overflow's use case has the benefit of being able to sit behind a Content Delivery Network (CDN) with a massive amount of infrastructure at the edge, offloading much of the computational and database demands. This reduces the requirements of their systems dramatically. Given their experience in the segment, it's plausible to expect they understand how to optimize their user experience to balance out the hardware demands and costs as well.
Not the OP, but yes, getting more powerful machines to run your program is what "vertical scaling" means (as opposed to running multiple copies of your program on similar-sized machines aka "horizontal scaling" ).
A ‘single’ big box with multiple terabytes of RAM can probably outperform many ‘horizontally scaled’ solutions. It all depend on the workload, but I feel that sometimes it’s more about being ’hip’ than being practical.
There's always a limit on how big you can go, and a smaller limit on how big you should go, but either way it's pretty big. I wouldn't go past dual Intel Xeon, because 4P gets crazy expensive; I haven't been involved in systems work on Epyc, 1P might be a sensible limit, but maybe 2P makes sense for some uses.
Get a more powerful single machine (in contrast to multiple machines).
However I wonder if multi-socket Xeons count as vertical or horizontal. I never understood how programmable those machines are...
It might apply to 99% who have specific requirements, but the vast majority of internet companies need more. Deployments, N+1 redundancy, HA etc... are all valuable, even if some resources are going to waste.
None of those things are mutually exclusive with vertical scaling?
Having two identical servers for redundancy doesn't mean you are scaling horizontally (assuming each can handle the load individually; nothing better than discovering that assumption was incorrect during an outage).