How Shopify reduced storefront response times with a rewrite (shopify.com)
350 points by vaillancourtmax on Aug 20, 2020 | 73 comments



Some of the listed optimizations were:

> We carefully vet what we eager-load depending on the type of request and we optimize towards reducing instances of N+1 queries.

> Reducing Memory Allocations

> Implementing Efficient Caching Layers

All of those steps seem like pretty standard ways of optimizing a Rails application. I wish the article had made it clearer why they decided to pursue such a complex route (the whole custom Lua/nginx routing and two applications instead of a monolith).
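
(For reference, that first one usually just means reaching for ActiveRecord's `includes`; a minimal sketch with hypothetical models:)

    # N+1: one query for the products, then one more query per product
    Product.where(shop_id: shop.id).each do |product|
      product.variants.map(&:price)   # fires a SELECT on every iteration
    end

    # Eager loaded: two queries total, however many products there are
    Product.where(shop_id: shop.id).includes(:variants).each do |product|
      product.variants.map(&:price)   # already in memory, no extra queries
    end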

Shopify surely has tons of Rails experts and I assume they pondered a lot before going for this unusual rewrite, so of course they have their reasons, but I really didn't understand (from the article) what they accomplished here that they couldn't have done in the Rails monolith.

You don't need to ditch Rails if you just don't want to use ActiveRecord.


(contributor here)

The project does still use code from Rails. Some parts of ActiveSupport in particular are really not worth rewriting: they work fine and already have a lot of investment behind them.

The MVC part of Rails is not used for this project, because the storefront of Shopify works in a very different way than a CRUD app, and doesn’t benefit nearly as much. Custom code is a lot smaller and easier to understand and optimize. Outside of storefront, Shopify still benefits a lot from Rails MVC.

I’ll also add that storefront serves a majority of requests made to Shopify but it’s a surprisingly tiny fraction of the actual code.


Out of curiosity, why continue to implement it in Ruby? If milliseconds matter, as you mention, interpreted languages will always be slower.


They have a very good thing going. Perhaps there is no great reason to bite off so much at one time. They can take their time and do that later if it makes enough sense. I would expect it would require a very substantial effort to rebuild their platform in a different language.

If you're at 75/100 of where you want to be on performance, it can be easy to lose immense amounts of time chasing a 95/100 ideal when you can far more easily get to 90/100 by making, e.g., straightforward caching improvements to what you already have, without rewriting all of your code.

Good enough is almost always underrated in tech. People destroy opportunity, time, money, and entire businesses chasing what supposedly lies beyond good enough.

John Carmack has a good example of this in his Joe Rogan interview [1], in how id Software burned six years on Rage, making incorrect (in hindsight) choices that involved trying to do too much. He regrets his old standard line and approach that it'll be done when it's done. He wishes they had made compromises instead and shipped Rage several years earlier. That's a pretty classic storyline in all of tech: taking on far too much when 85% good enough would most likely have worked just as well.

[1] https://youtu.be/udlMSe5-zP8?t=8630


very good point

very good example


>I’ll also add that storefront serves a majority of requests made to Shopify but it’s a surprisingly tiny fraction of the actual code.

I am guessing that small piece of code will be the target for TruffleRuby.


> because the storefront of Shopify works in a very different way than a CRUD app

Any interesting/successful patterns you can share, or resources on said patterns?


“Not a CRUD app” isn’t a design decision, it’s just that storefront is almost entirely read-only, and the views are merchant-provided Liquid code. Most of a shop's data can be accessed on any page; data dependencies are in large part defined by the view, not the controller.


Oh I know. As a Rails developer of 10 years, I'm always interested in what patterns get developed when straying away from the original guts of Rails.


Shopify's storefront is based around a Liquid renderer instance. If you look up how objects are added to the Liquid context, that is pretty similar to the overall pattern (or at least it was back when I worked there, hi pushrax :)


Yep, the main idea is to set up the liquid interpreter with the right variables/methods and the right liquid templates, and evaluate the result. There’s a lot of code that runs around that, but the path-specific code is quite small.
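
Roughly this shape, using the open source liquid gem (a minimal sketch, not the actual storefront code; `ProductDrop` and the data are made up):

    require 'liquid'

    # Drops expose only whitelisted methods to merchant-authored templates
    class ProductDrop < Liquid::Drop
      def initialize(product)
        @product = product
      end

      def title
        @product[:title]
      end

      def price
        @product[:price]
      end
    end

    template = Liquid::Template.parse('<h1>{{ product.title }}</h1> {{ product.price }}')
    template.render('product' => ProductDrop.new({ title: 'Blue Tee', price: '19.99' }))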

Hello :)


If it’s Rails, can you just expose the Rails cache object in Liquid? Give control to merchants. That would yield bigger speed improvements.


Someone replied but deleted right when I was posting this answer, so I'm replying to myself:

What I didn't understand was why the listed performance optimizations couldn't be implemented in the monolith itself, and instead required the development of a new application, which is still Ruby.

In a production env, the request reaches the Rails controller pretty fast.

I know for a fact that the view layer (.html.erb) can be a little slow if you compare it to, say, just a `render json:`, but if you're still going to be sending fully-rendered HTML pages over the wire, the listed optimizations (caching, query optimization and memory allocation) could all be implemented in Rails itself to a huge extent, and that's what I'd love to know more about.
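
For example, plain low-level caching keyed on the record version already covers a lot of that in stock Rails (a sketch with hypothetical models, not anything Shopify-specific):

    # Cache the expensive presenter work; the key changes when the product does,
    # so stale entries age out instead of needing explicit invalidation.
    def storefront_payload(product_id)
      product = Product.includes(:variants).find(product_id)
      Rails.cache.fetch([product.cache_key_with_version, 'storefront_payload']) do
        {
          title:    product.title,
          variants: product.variants.map { |v| { sku: v.sku, price: v.price } }
        }
      end
    end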


They talk about reducing memory allocations. My guess is the rest of the app is very large and they’re benefiting from not sharing memory and GC with that.

Of course, everything you said is true for a small-to-medium sized Rails application.

They likely could have explored a separate Rails app to meet this goal, but then they have to maintain the dependency tree and security risks twice. And if Rails core later refactors away any optimizations they've made, they have to keep maintaining and re-integrating those.

There’s definitely some wiggle room and a judgement call here but their custom implementation has merit.


Don't forget that a Shopify store is 100% customizable by merchants using Liquid (Turing complete, not that you should try). There is no .html.erb layer. Think of Storefront Renderer as a Liquid interpreter using optimized presenters for the business models.


Liquid is designed so template authors don't have to be trusted. That's great and I wish it were more common.

Here's an example of a disclaimer that should be attached to most templating languages: https://mozilla.github.io/nunjucks/templating.html


Seconded! I've had to do a lot of weird stuff in django to get around this for our user templates.


I didn't especially care for the technical details; what I like about this article is that the first thing they mention is the project's success criteria (hopefully defined at the very beginning, before any implementation). Then, on top of that, they created an automated tool to verify those criteria objectively.

This is a great approach and unfortunately I don't think many (most?) software projects start out like that.

Not defining conditions of victory and scope creep are possibly the biggest risks in software projects.


It's not only software.

1) What is the goal? What defines success?

2) What are the KPIs? How are we going to measure them?

These are baseline questions to any endeavor of substance. Yet, they are rarely defined.


It’s also important to remember that not everything worth doing or every “success” state you set can have KPIs defined (either actually impossible or the science may not be there yet).


To clarify, I was using KPI in the abstract. That is, how do I/we define success? What does it look like? How will we know whether we are succeeding or not?


Shopify has traditionally been an example people have pointed to for scaling a monolith with a large growth factor in all areas: team size, features, user base size, general "scale" of the company.

Does anyone on here, who has worked on this project or internally at Shopify, feel that this project was successful? Do you think this is the first, of a long and gradual process, where Shopify will rewrite itself into a microservice architecture? It seems like the mentality behind this project shares a lot of commonly claimed benefits of microservices.

> Over the years, we realized that the “storefront” part of Shopify is quite different from the other parts of the monolith

Different goals that need to be solved with different architectural approaches.

> storefront requests progressively became slower to compute as we saw more storefront traffic on the platform. This performance decline led to a direct impact on our merchant storefronts’ performance, where time-to-first-byte metrics from Shopify servers slowly crept up as time went on

Noisy neighbors.

> We learned a lot during the process of rewriting this critical piece of software. The strong foundations of this new implementation make it possible to deploy it around the world, closer to buyers everywhere, to reduce network latency involved in cross-continental networking, and we continue to explore ways to make it even faster while providing the best developer experience possible to set us up for the future.

Smaller deployable units; you don't have to deploy all of shopify at edge, you only need to deploy the component that benefits from running at edge.


At Shopify, we build into the monolith unless there’s a strong reason to build it as a new service.

It makes more sense for us to extract things than to make everything a microservice.

Storefront makes sense to be on its own service, so we are making it so.


The performance-related bits:

- Handcrafted SQL.

- Reduce memory usage, e.g. use mutable maps.

- Aggressive caching with layers of caches: a DB result cache, an app-level object cache, and an HTTP cache. Some DB queries are partitioned and each partitioned result is cached in a key-value store (rough sketch below).
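
The layering roughly amounts to something like this (a sketch with hypothetical helpers, not their actual code):

    # Request-local object cache -> shared key-value cache -> database
    def cached_collection(shop_id, handle)
      @local ||= {}
      @local[[shop_id, handle]] ||=
        Rails.cache.fetch(['collection', shop_id, handle], expires_in: 5.minutes) do
          load_collection_from_db(shop_id, handle)   # hypothetical DB query
        end
    end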


I’m aware that Ruby/Rails isn’t that quick, but it seems mind-boggling that an 800ms server response time is considered "tolerated" and 200ms "satisfying". I’ve never used Ruby in production, so maybe my reference point is off and this is more impressive than I’m giving it credit for.


I'm not sure this has anything to do with Ruby; they're talking about user experience: what's perceptible to humans and what causes frustration. Also, in most apps the DB and frontend take way more time than the Rails stack.


For page reloads, anything below 300ms is fine.


But you should also account for up to 100-200ms network latency (especially with mobile networks) plus some rendering time. A 200ms server response time can already lead to a perceived 500ms loading time.


>https://stackexchange.com/performance

Page Rendered in 12.2ms - 18.3ms

Giving plenty of room for Network Latency.


This is very interesting. N+1 queries and lazy loading have been a very common problem that profilers can spot, but eager loading also has a cartesian product problem: if you have an entity with 6 of one sub-item and 100 of another, you'll end up getting 600 rows to construct a single object / view model.
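
Concretely, with a join-based eager load (ActiveRecord's `eager_load` here; a JOIN fetch in NHibernate/Hibernate behaves the same way; models are hypothetical):

    # Single query with LEFT OUTER JOINs across both associations:
    Order.eager_load(:line_items, :notes).find(order_id)
    # An order with 6 line_items and 100 notes comes back as 6 * 100 = 600 rows,
    # which the ORM then de-duplicates in memory to build one object graph.
    # `preload`/`includes` avoids the explosion by issuing one extra SELECT per
    # association instead (3 small queries, no cartesian product).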

I have recently been playing with RavenDB (from my all-time favorite engineer turned CEO). It approaches most of these as an indexing problem in the database, where the view models are calculated offline as part of the indexing pipeline. It approaches the problem from a very pragmatic angle: its goal is to be a database that is very application-centric.

Still to be seen if we will end up adopting it, but it'll be interesting to play with.

Disclaimer: I am a former NHibernate contributor, and have been very intimate with AR features and other pitfalls.


Didn't NHibernate have the cartesian product problem solved in a neat way by having various fetch strategies?

You could specify to eagerly load some collections and have NHibernate issue additional select statements to load the children, producing a maximum of 2-3 queries (depending on the eager-loading depth) but avoiding both the N+1 problem and the cartesian row explosion problem.


Yes, that's the common method, but you still end up issuing multiple network calls. The problem with issuing select statements to load the children is that you have to wait for the first (root) query to finish before you can issue the others, which adds to the network latency (usually low, but it depends). It's still not as good as having materialized view models on the server, where you can issue a single query to get everything you need. The disadvantage is the storage cost, though.


I went and looked at the docs to refresh my memory - there was also a subquery fetch strategy where you didn't have to wait for the root entity to load, but that comes at the expense of searching through data twice - which might or might not be worth it, depending on how complicated the query is.

I do wish relational databases (PostgreSQL and SQL Server specifically, since I work with those) had better support for automatically updated real-time materialized views.

Anyway, thanks for working on NHibernate - I miss some of its configurability and advanced capabilities.


> I do wish relational databases (PostgreSQL and SQL Server specifically, since I work with those) had better support for automatically updated real-time materialized views.

I've been keeping an eye on these folks: https://materialize.io/


Automatically updated materialized views are something I really want too.

Take a look at RavenDB; it might be a good thing to try on a next smallish project that you can then carry over to a later one :)

Postgres has the nice advantage of supporting JSON, so in theory you could have embedded documents as materialized views and whatnot, but it's hard to make it play nicely with ORMs.


Naive question: the "storefront" piece seems like it's a static page. Why does it need SSR? Even so, it could be SSR'ed to static _once_ (kind of how NextJS does this from 9.3+), then have it served by CDN/edge. I'm probably missing something here.


Throwing opinions around here, but after working a bit with Shopify themes, there might be some reasons to stick with SSR rather than aggressive caching. First, the storefront can be dynamic depending on visitor region/login/logout. Second, Shopify has most of the logic on the backend, even having non-JS HTML nodes for ordering/add to cart. Third, I don't think the visit distribution of the stores makes caching economically viable (the top 20% of stores probably don't account for 60%+ of server load).


They mentioned caching full HTML responses so I'm guessing that's what they're doing.


Is the new implementation still Rails?


That’s also my question after reading this post. When trying to shave off milliseconds by going for a full rewrite, moving away from Ruby seems like an obvious decision... at least intuitively.


Obvious how?

Are you going to restructure literally thousands of employees and their teams, staffed with Rubyists and organized around your current setup?

Will you re-hire and/or re-train everyone?

That doesn't seem so obvious... At the scale of a team like Shopify, refactoring to a different language is probably a non-starter.


Yeah. Consider that BigCos end up writing transpilers and new runtimes for their target platforms before rewriting the application, which would entail discarding the decades of built-in bugfixes and application logic as well as reconstructing the organization around a different platform -- HipHop for PHP, Grumpy, etc. A language change is no small thing in any company of appreciable size.


If you have thousands of Rubyists then you surely have hundreds who also know other languages? Seems to make sense to use a fast language for the small performance-sensitive part of your codebase.


Seems also that since Ruby is not going to be taught as part of people's normal formal education in programming, you can expect Rubyists to be on average more capable of... learning new things.

So yes, "re-train". Give everyone a book on the new language, maybe pay for some online courses from pluralsight or wherever, cancel meetings for a week. You can learn a lot faster than in a school environment when you've got paid 8 hour days to put into a single subject + coworkers to chat with.

Besides, it's not like they get to avoid learning new things anyway, even if you restrict it to the Ruby ecosystem. In the JS world (which I'm sure they all know too, as one tends to when working on web sites, even if you're mostly back-end), as new revisions of the language come out, people have to keep up with the syntax and changing idioms.

"For some reason, programmers love to learn new stuff, as long as it's not syntax." -- Steve Yegge


"Faster" languages often have big advantages in small benchmarks which get a lot smaller or even reverse once you're looking at whole application performance.

Mandelbrot (from CLBG): Ruby 246s, NodeJS 8s, Java 4s.

Web (fortunes from the TechEmpower benchmarks): Ruby + Roda + Sequel 51k rps, NodeJS + Express 46k rps, Java + Dropwizard 62k rps.


You're comparing Ruby to other options that are still slow:

Java (vertx-postgres) 347k rps, Go (fasthttp) 320k rps, Rust (actix-postgres) 607k rps.


Right but I'm doing that because those are frameworks in other languages which offer a comparable developer experience.

fasthttp isn't even a web framework. It's not surprising that using a raw HTTP library is dramatically faster than using a full framework and ORM but it's also not a sustainable way to build complex web applications with 1000+ developers.


You don't need to have 1000 developers working on the small performance sensitive part of your application though. Split it out into its own application, and then have a small dedicated team.

I can't speak to fasthttp as I haven't used Go much, but actix-web in Rust is a full framework (not as full as something like Rails, but certainly more than mature enough to be used for production projects).


I built and maintained a critical production web app using Iron for 3 years. Keeping anything like the performance advantage you see in simple benchmarks in a real app is a big challenge.


Well sure, that's why it only makes sense if you actually need the performance. But if you do need the performance, then implementing it in a language that is designed to enable those optimisations can make a lot more sense than trying to hack around the runtime in a slower language.


Sentry (otherwise a Python application) built their Symbolicator service in Rust because it was a better fit for the domain. Probably also because Armin Ronacher has become a fan of the language and simply wanted to [1]. Now, Sentry is like 100 employees or something, so it's obviously a way more agile organization than Shopify at 10x the size, but having more limited resources is also a reason to avoid spreading yourself too thin.

1: https://www.softwaresessions.com/episodes/rust-in-production...


Their monolith was written in Rails, so Ruby alone was not the source of the slow performance. In fact, the solution had more to do with cloning the database in order to isolate reads and writes, so it's not even a programming-language problem at all.


No way they're still doing monoliths? Is there a blog post on that?


Ruby is more than fast enough for the web


No, it's still Ruby but built directly on top of Rack.
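
For anyone unfamiliar, "directly on top of Rack" means roughly this shape (a minimal sketch, obviously not their code; `render_liquid_for` is made up):

    # config.ru -- a bare Rack app: no Rails router, controllers or view layer
    class Storefront
      def call(env)
        request = Rack::Request.new(env)
        body    = render_liquid_for(request.path)   # hypothetical renderer
        [200, { 'Content-Type' => 'text/html' }, [body]]
      end
    end

    run Storefront.new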


How do you know?


Because he apparently works at Shopify?

https://twitter.com/swalkinshaw


At least it's still Ruby. They wrote about how they had to write non-idiomatic Ruby code to get better performance.


I'm assuming the details of exactly what the new implementation is have been deliberately withheld for some future post where they talk specifics (especially if it's something exciting like Rust/Elixir/Go). This keeps the focus of this post on the approach to migration, using the old implementation as a reference in order to burn down the list of divergences, etc.


It's still Ruby :).


>An example of these foundations is the decision to design the new implementation on top of an active-active replication setup. As a result, the new implementation always reads from dedicated read replicas, improving performance and reducing load on the primary writers.

Could someone please explain how the ‘as a result’ follows from the active-active replication setup?


Based on the comment from pushrax, it looks like this is just circular async replication between the old writer and the new writer. For some reason, the old implementation had to send both read and write traffic to the old writer, while the new implementation can do a proper read-write split by reading from dedicated read replicas hanging off the new writer (again, via async replication).

Due to the power-law distribution of traffic, e-commerce generally benefits a lot from things like caching and a read-write split. Reading between the lines, it feels like Shopify may not yet have sufficient experience in dealing with async replication, and all the potential issues caused by replication lag. Fun times ahead.
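
In plain Rails terms the read/write split looks roughly like this (Rails 6 multi-database API; the new renderer is custom, so this is only an analogy):

    class ApplicationRecord < ActiveRecord::Base
      self.abstract_class = true
      connects_to database: { writing: :primary, reading: :primary_replica }
    end

    # Storefront-style read path: pin the whole block to a replica
    ActiveRecord::Base.connected_to(role: :reading) do
      Product.find_by(handle: 'blue-tee')   # hits the replica, never the writer
    end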


Unfortunately they are still highly dependent on other APIs.

When San Diego Comic-Con went live on funko.com (Shopify), the website was fine, but the checkout was bottlenecked by the API calls to shipping providers. Many were never able to check out and Funko had to issue an apology.

It's unfortunate that no matter how much you improve your own product, you may still be dependent upon others.


I'm interested to know more about this. I've used about five different e-commerce solutions and they all make API calls to shipping providers. What was different here?


The amount of traffic was too high. Unsure if they were being throttled or if they used a task queue that went bad. Who knows.

https://comicbook.com/irl/news/funko-pop-comic-con-2020-excl...


I wish the article detailed the performance issues with the old implementation, and why those issues necessitated a rewrite (other than "strong primitives" and "difficult to retrofit").


I'd be interested to know if setting Service Level Objectives was considered as an alternative to using Apdex? It's nice to be able to calculate an error budget out of your SLO and use that to determine whether changes were impacting the customer experience or not. Well, so the theory goes anyway. Actually doing it in practice is a whole different story ;)


Can anyone add to that article data on what users saw in terms of response time and perceived response time, and what users are seeing after the improvements?

We had evaluated spotify for one of our projects, and aesthetically it is really good. However, time-wise their store takes forever to do stuff.

This was a couple of years back, so hopefully things are much better now

Basically, the article covers how much better THE TEAM doing the coding feels

What is the effect on the users using the stores?


spotify?


The bit I found interesting in this is how they compare and verify that two web pages rendered by different methods "match".

I wonder how you would do that? You can't just hash the HTML. Do you take screenshots and compare?


Most commenters are focused on the optimizations made, but I actually think the custom routing and verification mechanism is the interesting bit.

That kind of a tool could be handy in lots of scenarios (comparing the same service written in two different languages or with different dependencies, etc).

But how does their verifier mechanism deal with changes in the production database between responses? If the response of the legacy service comes first and the response of the new service comes after, couldn't the data in the database change between the two responses (the request being the same), and thus result in the responses not passing verification when they otherwise should have? How do they maneuver around that issue?

Great write-up by the way! I really liked it :)


Differing inputs causing verification failures is indeed an issue. In addition to data access races, replication latency also causes this. The legacy service always reads from the primary MySQL instances per shard, but the new service always reads from replicas for scalability and geo distribution.

One slightly helpful mitigation we have in place relies on a data versioning system meant for cache invalidation. The version is incremented after data changes (with debouncing). To reduce false negatives, we throw out verification requests where the two systems saw different data versions. It's far from perfect, but it's been effective enough.
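
A rough sketch of that idea (hypothetical names, nothing like the real implementation):

    def verify(request)
      old_version, old_body = render_with_legacy(request)   # hypothetical
      new_version, new_body = render_with_new(request)      # hypothetical

      # Data changed (or a replica lagged) between the two renders, so the
      # comparison would be meaningless; skip instead of logging a false negative.
      return :skipped unless old_version == new_version

      normalize(old_body) == normalize(new_body) ? :match : :mismatch
    end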


tldr: rewrote the backend focusing on speed

Which is good. At Reddit they would have tried to rewrite everything in ReasonML and then tried to prove at the end that it is now faster.




