Applying SRE Principles to CI/CD (buildkite.com)
127 points by mooreds on Aug 30, 2023 | 65 comments



> Back when I was a junior developer, there was a smoke test in our pipeline that never passed. I recall asking, “Why is this test failing?” The Senior Developer I was pairing with answered, “Ohhh, that one, yeah it hardly ever passes.” From that moment on, every time I saw a CI failure, I wondered: “Is this a flaky test, or a genuine failure?”

This is a really key insight. It erodes trust in the entire test suite and will lead to false negatives. If I couldn't get the time budget to fix the test, I'd delete it. I think a flaky test is worse than nothing.


"Normalisation of Deviance" is a concept that will change the way you look at the world once you learn to recognise it. It's made famous by Richard Feynman's report about the Challenger disaster, where he said that NASA management had started accepting recurring mission-critical failures as normal issues and ignored them.

My favourite one is: Pick a server or a piece of enterprise software and go take a look at its logs. If it's doing anything interesting at all, it'll be full of errors. There's a decent chance that those errors are being ignored by everyone responsible for the system, because they're "the usual errors".

I've seen this go as far as cluster nodes crashing multiple times per day and rebooting over and over, causing mass fail-over events of services. That was written up as "the system is usually this slow", in the sense of "there is nothing we can do about it."

It's not slow! It's broken!


Oof, yes. I used to be an SRE at Google, with oncall responsibility for dozens of servers maintained by a dozen or so dev teams.

Trying to track down issues with requests that crossed or interacted with 10-15 services, when _all_ those services had logs full of 'normal' errors (that the devs had learned to ignore) was...pretty brutal. I don't know how many hours I wasted chasing red herrings while debugging ongoing prod issues.


We're using AWS X-Ray for this purpose, i.e. a service always passes on and logs the X-Ray identifier generated at first entry into the system. Pretty helpful. And yes, there should be consistent log handling / monitoring. Depending on the service, we differentiate between the error log level (= expected user errors) and the critical error level (which makes our monitoring go red).
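
For anyone who hasn't set this up, a minimal sketch of the correlation-ID pattern (not the actual X-Ray SDK; the header name, logging setup, and downstream URL here are assumptions):

    import logging
    import uuid

    import requests  # any HTTP client works; requests is assumed here

    TRACE_HEADER = "X-Amzn-Trace-Id"  # the ALB/X-Ray trace header; any correlation header works

    def handle_request(incoming_headers: dict) -> dict:
        # Reuse the identifier generated at first entry into the system,
        # or mint one if this service happens to be the entry point.
        trace_id = incoming_headers.get(TRACE_HEADER, uuid.uuid4().hex)

        # Attach the ID to every log line (the log formatter is assumed to
        # include %(trace_id)s) so errors can be tied back to one request.
        log = logging.LoggerAdapter(logging.getLogger("svc"), {"trace_id": trace_id})
        log.info("handling request")

        # Pass the same ID on to downstream services.
        resp = requests.get("https://downstream.internal/api",   # hypothetical downstream
                            headers={TRACE_HEADER: trace_id}, timeout=5)
        if resp.status_code >= 500:
            log.critical("downstream failure")   # the "monitor goes red" level
        elif resp.status_code >= 400:
            log.error("expected user error")     # logged, but nobody gets paged
        return {"trace_id": trace_id, "status": resp.status_code}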


It often isn't as simple as using a correlation identifier and looking at logs across the service infrastructure. If you have a misconfiguration or hardware issue, it may well be intermittent and only visible as an error logged before or after the request. The response then has incorrect data inside a properly formatted envelope.


I guess that's one of the advantages of serverless - by definition there can be no unrelated error in state beyond the request (because there is none), except for the infrastructure definition itself. But a misconfig there always shows up as an error when the particular resource is called - at least I haven't seen anything else yet.


That's assuming your "serverless" runtime is actually the problem.


You don't even have to go as far from your desk as a remote server to see this happening, or open a log file.

The whole concept of addressing issues on your computer by rebooting it is 'normalization of deviance', and yet IT people in support will rant and rave about how it's the fault of users for not rebooting their systems whenever they get complaints of performance problems or instability from users with high uptimes— as if it's not the IT department itself which has loaded that user's computer to the gills with software that's full of memory leaks, litters the disk with files, etc.


I agree with what you're saying, but this is a bad example:

> Pick a server or a piece of enterprise software and go take a look at its logs. If it's doing anything interesting at all, it'll be full of errors.

It's true, but IME those "errors" are mostly worth ignoring. Developers, in general, are really bad at logging, and so most logs are full of useless noise. Doubly so for most "enterprise software".

The trouble is context. Eg: "malformed email address" is indeed an error that prevents the email process from sending a message, so it's common that someone will put in a log.Error() call for that. In many cases though, that's just a user problem. The system operator isn't going to and in fact can't address it. "Email server unreachable" on the other hand is definitely an error the operator should care about.

I still haven't actually done it yet, but someday I want to rename that call to log.PageEntireDevTeamAt3AM() and see what happens to log quality..


> The trouble is context. Eg: "malformed email address" is indeed an error that prevents the email process from sending a message

I'm sure you didn't quite mean it as literally as I'm going to take it, and I'm sorry for that. Any process that gets as far as attempting to send an email to something that isn't a valid e-mail address is, however, an issue that should not be ignored, in my opinion.

If your e-mail sending process can't expect valid input, then it should validate its input and not raise an error. Of course, this is caused by saving invalid e-mail addresses as e-mail addresses in the first place, which in itself shows that you're in trouble, because it means you have to validate everything everywhere since you can't trust anything. And so on. I'm obviously not disagreeing with your premise. It's easy to imagine why it would happen, and also why it would in fact end up in the "error.log", but it's really not an ignorable issue. Or it can be, and it likely is in a lot of places, but that's exactly the GP's point, isn't it? That a culture which allows that will eventually cause the spaceship to crash.

I think we as a society are far too cool with IT errors in general. I recently went to an appointment where they had some digital parking system where you’d enter your license plate. Only the system was down and the receptionist was like “don’t worry, when the system is down they can’t hand out tickets”. Which is all well and good unless you’re damaged by working in digitalisation and can’t help but do the mental math on just how much money that is costing the parking service. It’s not just the system that’s down, it’s also the entire fleet of parking patrol people who have to sit around and wait for it to get to work. It’s the support phones being hammered and so on. And we just collectively shrug it off because that’s just how IT works “teehee”. I realise this example is probably not the best, considering it’s parking services, but it’s like that everywhere isn’t it?


Attempting to send an email is one of the better ways to see if it's actually valid ;)

Last time I tried to order pizza online for pickup, the website required my email address (I guess cash isn't enough payment and they need an ad destination), but I physically couldn't give them my money because the site had one of those broken email regexes.


I disagree about extensive validating of email addresses. This is why: https://davidcel.is/articles/stop-validating-email-addresses...


The article you link ends by agreeing with what I said, so I'm not exactly sure what to make of it. If your service fails because it's trying to create and send an email to an invalid email address, then you have an issue. That is not to say that you need excessive validation, but in most email libraries I've ever used or built you're going to get runtime errors if you can't provide something that looks like x@x.x, which is what you want to avoid.

I guess it’s because I’m using the wrong words? English isn’t my first language, but what I mean isn’t that the email actually needs to work just that it needs to have something that is an email format.


> Developers, in general, are really bad at logging, and so most logs are full of useless noise.

Well, most logging systems do have different log priority levels.

https://manpages.debian.org/bookworm/manpages-dev/syslog.3.e...

LOG_CRIT and LOG_ALERT are two separate levels of "this is a real problem that needs to be addressed immediately", over just the LOG_ERR "I wasn't expecting that" or LOG_WARNING "Huh, that looks sus".

Most log viewers can filter by severity, but also, the logging systems can be set to only actually output logs of a certain severity. e.g. with setlogmask(3)

https://manpages.debian.org/bookworm/manpages-dev/setlogmask...

If you can get devs to log with the right severities, ideally based on some kind of "what action needs to be taken in response to this log message" metric, logs can be a lot more useful. (Most log messages should probably be tagged as LOG_WARNING or LOG_NOTICE, and should probably not even be emitted by default in prod.)
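
The mechanics are simple enough; here's a minimal sketch using Python's wrapper around syslog(3) (picking the right severities is the hard part, not the API):

    import syslog

    # Open a log handle; LOG_PID adds the process id to each message.
    syslog.openlog("myservice", syslog.LOG_PID, syslog.LOG_DAEMON)

    # Only emit LOG_WARNING and more severe messages in prod;
    # LOG_NOTICE / LOG_INFO / LOG_DEBUG are masked out.
    syslog.setlogmask(syslog.LOG_UPTO(syslog.LOG_WARNING))

    syslog.syslog(syslog.LOG_DEBUG, "cache miss for key=abc")          # dropped by the mask
    syslog.syslog(syslog.LOG_WARNING, "retrying flaky upstream call")  # "huh, that looks sus"
    syslog.syslog(syslog.LOG_ERR, "email server unreachable")          # operator should look
    syslog.syslog(syslog.LOG_CRIT, "database unreachable, giving up")  # page someone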

> someday I want to rename that call to log.PageEntireDevTeamAt3AM()

Yup, that's what LOG_CRIT and above is for :-)


In my experience, the problem usually is that severity is context-sensitive. For example, an external service temporarily returning a few HTTP 500s might not be a significant problem (you should basically expect all webservices to do so occasionally), whereas it consistently returning them over a longer duration can definitely be a problem.
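
One way to encode that context sensitivity is to alert on the rate over a window rather than on individual errors; a rough sketch (the window and threshold are made-up numbers):

    import time
    from collections import deque

    class ErrorRateAlert:
        """Fire only when failures are sustained, not on a single HTTP 500."""

        def __init__(self, window_seconds=300, threshold=20):
            self.window = window_seconds
            self.threshold = threshold
            self.failures = deque()

        def record_failure(self) -> bool:
            now = time.time()
            self.failures.append(now)
            # Drop failures that fell outside the sliding window.
            while self.failures and self.failures[0] < now - self.window:
                self.failures.popleft()
            # Only a sustained burst of failures is worth waking anyone up for.
            return len(self.failures) >= self.threshold

    alert = ErrorRateAlert()
    # if alert.record_failure(): page_oncall()  # hypothetical pager hook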


That is exactly what the previous commenter meant - developers are bad at setting the correct severity for logs.

This becomes an even bigger problem in huge organizations where each team has its own rules, so consistency vanishes.


> I still haven't actually done it yet, but someday I want to rename that call to log.PageEntireDevTeamAt3AM() and see what happens to log quality..

The second best thing (after adding metrics collection) we did as a dev team was forcing our way into the on-call rotation for our application. Instead of grumpy sysops telling us how bad our application was (because they had to get up in the night to restart services and whatnot) but not giving us any clues to go on to fix the problems, we could now do triage as the issues were occurring and actually fix them, with a mandate from our manager, because those on-call hours were coming out of our budget. We went from multiple on-call issues a week to me gladly taking weeks of on-call rotation at a time, because I knew nothing bad was gonna happen. Unless netops did a patch round on their equipment, which they always seem to forget to tell us about.


  I want to rename that call to log.PageEntireDevTeamAt3AM() and see what
  happens to log quality
I managed to page the entire management team after hours at megacorp. After spending ~7 months tasked with relying on some consistently flakey services, I'd opened a P0 issue on a development environment. At the time I tried to be as contrite as possible, but in hindsight, what a colossal configuration error. My manager swore up and down he never caught flak for it, but he also knew I had one foot out the door.


> Developers, in general, are really bad at logging

That's not the problem. I'll regularly see errors such as:

    Connection to "http://maliciouscommandandcontrol.ru" failed. Retrying...
Just... noise, right? Best ignore it. The users haven't complained and my boss said I have other priorities right now...


> my boss said I have other priorities right now

Way to bury the lede...


Horrors from the enterprise: a few weeks ago a solution architect forced me to roll back a fix (a basic null check) that they "couldn't test" because it's not a "real world" scenario (testers creating incorrect data would crash the business process for everyone)...


Your system could also retry the flaky tests. If a test still fails after 3 or 5 runs, it's almost certainly a genuine defect.
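
For what that looks like in practice, here's a hedged, framework-agnostic sketch (plugins like pytest-rerunfailures do essentially the same thing; fetch_status below is a stand-in for whatever the test exercises):

    import functools
    import time

    def retry_flaky(max_runs=3, delay_seconds=1):
        """Re-run a test up to max_runs times; only a failure on every run counts as a defect."""
        def decorator(test_fn):
            @functools.wraps(test_fn)
            def wrapper(*args, **kwargs):
                last_error = None
                for _ in range(max_runs):
                    try:
                        return test_fn(*args, **kwargs)
                    except Exception as exc:  # includes AssertionError
                        last_error = exc
                        time.sleep(delay_seconds)  # give transient conditions a chance to clear
                raise last_error  # failed every run: almost certainly a genuine defect
            return wrapper
        return decorator

    @retry_flaky(max_runs=3)
    def test_eventually_consistent_endpoint():
        assert fetch_status() == "ready"  # hypothetical call under test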


This is the power of GitHub Actions, where each workflow is one YAML file.

If you have flaky tests, you can isolate them in their own workflow and deal with them in isolation from the rest of your CI process.

It does wonders for this. The idea of a monolithic CI job seems backward to me now.


Skipping your flaky tests should be a religion. There's nothing else I feel as strongly about regarding CI optimization. If a test is flaky, it gets immediately skipped. Even if you're working on a fix, it stays skipped until you solve it, if you ever do. Most of your CI problems can start to be solved by applying this simple rule.

How do you know if it's flaky? You keep a count, and any time a test fails/recovers 3 times it gets skipped, even if there are weeks between failures. You can make it more complex for little gain, but I've found this system makes teams actually prioritize fixing important tests; mostly, though, it has proven that many "important tests to keep even if they are flaky" never were actually important, or end up getting re-written in different ways later on.
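
A sketch of that bookkeeping, assuming test results are recorded somewhere queryable (the record shape and threshold below are made up):

    from collections import defaultdict

    FLIP_LIMIT = 3  # a test that flips between fail and pass this many times gets skipped

    def tests_to_skip(result_history: list[dict]) -> set[str]:
        """result_history: chronological records like {"test": "name", "passed": True}."""
        flips = defaultdict(int)
        last_outcome = {}
        for record in result_history:
            name, passed = record["test"], record["passed"]
            if name in last_outcome and last_outcome[name] != passed:
                flips[name] += 1  # a fail->pass or pass->fail transition, even weeks apart
            last_outcome[name] = passed
        return {name for name, count in flips.items() if count >= FLIP_LIMIT}

    # The resulting skip list can be fed to the test runner so the skip happens
    # automatically, without the owning team being able to block it.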


I feel the same; I can't understand why people value having a flaky test more than not having it.

In both cases the feature is not validated (the test failure is ignored), but not having the test is transparent (the feature is visibly not being validated), improves pipeline speed, and reduces the noise.

Maybe people just like the lie that the feature is being validated... I don't really understand.


This is indeed a religion, because in my experience people tend to feel strongly holding very different positions. You can already see it in this thread.

I think quantifying and prioritizing is key, like you wrote. Respected engineering organizations like Google and GitHub all came to the same place. Flakiness is often unevenly distributed, so find & tackle ones that are the worst. Don't try to eliminate the flakiness because that's not economically viable.

I'm trying to put my money where my mouth is... we'll see how it goes.


> You keep a count and any time a test fails/recovers 3 times it gets skipped, even if there's weeks between failures.

Congratulations, you now have an untested feature.

Flaky tests should be fixed, of course. But just deleting them isn't fixing them, and it leaves you worse off, not better.

May I ask how you feel about TODO comments?


If you re-read my comment you'll see I already addressed this. Not having a test is better than having a flaky test.

The team should feel the pain of having their shit test skipped without them being able to stop it and it's up to them to either fix it and bring it back or cope.


> Not having a test is better than having a flaky test.

What I'm saying is that you are completely wrong about that. Sorry if I wasn't clear.


OK, that's clearer; this is fine, then we know we disagree (also why I called it a bit of a religion). Have a great day!

PS: if the test was flaky, you already had an untested feature


Or just fix them? A test shouldn't ever, ever, ever be flaky. It can happen, because we all make assumptions that can be wrong or forget about non-deterministic behaviours, but when it does, the flaky test should be fixed immediately, with the highest priority.


That is out of your control across many teams. Your only hope is having hard rules on what to do with flakiness, so that teams aren't able to spend 3 months telling you "please don't skip it because we'll fix it in our next sprint".


Fix the test or remove it.


I've often been tempted by the "delete your flaky tests" extremist sect of that religion :)


If you have so many flaky tests, and their flakiness is so intractable, that you actually need to come up with SLOs for handling the flakiness to negotiate being allowed to address the flakiness in the test base, then quite frankly, you should be looking for another job, one where you can actually go ahead and just fix these things and get back to shipping value to customers, instead of one where you play bureaucratic games with management that cares neither about craftsmanship nor about getting value out the door as quickly as possible.


This is a bunch of wishful thinking...

Some software projects, let's call them "integration projects", use third-party software they can do nothing about. And it just doesn't work well. But you have to use it in testing, because... well, you are the integrator. The users have already accepted the fact that the software you are integrating doesn't work well, so it's "all good", except that it makes it very hard to distinguish between failures that need to be addressed and those that don't.

Just to give you one example of this situation: JupyterLab is an absolute pile of garbage in terms of how it's programmed. For example, the side-bar of the interface quite often doesn't load properly, and you need to click the "reload" button a few times to get it to show up. Suppose now you are the integrator who provides some features that are supposed to be exposed to the user through the JupyterLab interface. Well, what can you do? -- Yes. Nothing. Just suck it up. You can tune the threshold for how many times you will retry reloading the interface, but you absolutely have to have a threshold, because sometimes the interface will never load (for some other reason), and you will stall the test pipeline if you don't let the test fail.

In general, the larger the SUT, the more "foreign" components it has, the harder it is to predict the test behavior, and the more flaky the tests are.

But this isn't the only source of test flakiness. Hardware is another, especially for embedded software that has to be tested on hardware the software company has limited access to (think something like a Smart TV, where the TV vendor provides some means of accessing the OS running on the TV set, but deliberately limits that access to keep the SW company away from the proprietary bits installed by the vendor). So, sometimes things will fail, and you won't know why and won't be able to find out (as in, if you tried to break into the vendor's part of the software, they'd sue you).


> For example, the side-bar of the interface quite often doesn't load properly, and you need to click the "reload" button a few times to get it to show up... Well, what can you do? -- Yes. Nothing.

So... go debug the JupyterLab code. Or open a bug with upstream. Or talk to their support. Or rant on Twitter to try and mobilize upstream to fix their shit. There's not nothing to be done, that's defeatist nonsense. In the meantime, in your codebase, mock the stuff you can't control and move on. That's why mocks exist, so that your test suite can be fast and deterministic even when speed and determinism are otherwise systematically difficult to achieve.

> hardware

Hardware is a totally different ballgame. And not one where you're trying to come up with SLOs for flakiness in a CI system. You don't continuously deploy hardware. You look over the test results, make a manual decision what needs to be retested, and just retest that before deploying. In dev, hopefully you have some kind of emulated device to develop against, in which case, the point is moot.


> So... go debug the JupyterLab code. Or open a bug with upstream.

Users want this particular version of JupyterLab. Not the one in the future which might get fixed. In the case of JupyterLab, I can maybe try to patch it and promise users that "it's the same but with the patch", and maybe they'll take it, but maybe not, because their IT vetted only version X, and my X+patch is not exactly X.

But, it could as well be closed-source third-party software (for example, we help users to install PBSPro, EG and LSF workload managers). It's illegal to debug them or to patch them...

> Hardware is a totally different ballgame. And not one where you're trying to come up with SLOs for flakiness in a CI system. You don't continuously deploy hardware.

So much nonsense here...

* CI isn't about deployment.

* Yes, when I worked for a hardware company, we continuously tested integration between hardware components.

* Why wouldn't the concept of SLOs apply here? You say it without giving any explanation.


It sounds like your integration architecture is a mash-up. It is indeed quite difficult to handle external systems that are so tightly coupled to your own.

For standard server-to-server integrations, where the coupling is limited to a (hopefully, eventually) well-documented API surface, it's much more straightforward to replace an external service with an internal mock.

For CI purposes, we eventually develop a more-or-less complete internal version of every external service, quirks and all. It's fun to see their bugfixes show up in our git history as changes to the behavioral mocks.
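
For concreteness, a tiny hedged example of what one of those behavioral mocks can look like, here for an imaginary payments API (the names and quirks are invented):

    class FakePaymentsAPI:
        """In-process stand-in for a third-party payments service, quirks and all."""

        def __init__(self):
            self.charges = {}

        def create_charge(self, charge_id: str, amount_cents: int) -> dict:
            # Provider quirk we reproduce: re-sending the same id is treated
            # as an idempotent retry and returns the original charge.
            if charge_id in self.charges:
                return self.charges[charge_id]
            if amount_cents <= 0:
                return {"error": "invalid_amount"}  # provider returns 200 + error body (!)
            charge = {"id": charge_id, "amount": amount_cents, "status": "pending"}
            self.charges[charge_id] = charge
            return charge

    # Tests run against FakePaymentsAPI; when the real provider fixes a bug,
    # the corresponding change lands here too, which is why their fixes show
    # up in our git history.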

We don't put much emphasis on automating integration testing. We've found that, with all the vagaries of flaky external services, it's usually necessary to have a human in the loop somewhere for integration tests.


> more-or-less complete internal version of every external service

Imagine what an integrator who works with the public clouds would have to do.

Now imagine that integration with three different public clouds is only a small fraction of what our software offers...


Our clients integrate with all manner of public and private APIs and this is how we build every one of them. The flakier the provider, the more the gambit pays off.

You don't have to build the whole API, just the parts you use. And you don't have to implement the real service at all. It amounts to just a bit more overhead the first time you use a new API, but that's dwarfed by the amount of time your team spends learning how the thing "really works" anyway.


Every flakey test is a production failure that really happens to some users sometimes. How often is just a matter of scale.

A 1-in-10,000 failure can be a daily annoyance even with just 1,000 daily active users who each take 10 actions in your app. At "internet scale", a 1-in-10k error frustrates a user every few seconds.
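
The back-of-the-envelope arithmetic, for anyone who wants to plug in their own numbers:

    failure_rate = 1 / 10_000      # probability a single action hits the flaky path
    daily_users = 1_000
    actions_per_user = 10

    failures_per_day = failure_rate * daily_users * actions_per_user
    print(failures_per_day)        # 1.0 -> roughly one annoyed user per day

    # At "internet scale", say 100 million actions per day:
    print(failure_rate * 100_000_000 / 86_400)  # ~0.12 failures per second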

If your tests are so flaky that you need SLOs … your poor users …


> Every flakey test is a production failure that really happens to some users sometimes

No. Tests can be flakey when the code they're testing is not. Not every flakey test is a customer problem.

Training developers to ignore flakey tests also trains them to ignore real failures. Potentially many of them. Any of those might be real customer problems.


This! Flaky tests become a customer problem when developers learn to ignore them! You gotta weed them out.


Depending on the problem domain, a lot of times the flakiness is in the way the test is written and not the code under test. You have to judge the relative cost of tracking down and solving that.


> Every flakey test is a production failure that really happens to some users sometimes. How often is just a matter of scale.

Not really; the one I've seen most often is a shared resource failing - for example, a GitLab Runner not getting a new VM for the DB, so tests fail, etc.


Also, dirty test harness state not being cleaned up between tests/suites, and threading issues in the test code


I don't think the point of an SLO is that flakiness is out of control. The point of framing it as an SLO is the realization that neither extreme is good. Flakiness can't be allowed to get out of control, so some effort must be spent to contain it, but it's also unnecessarily perfectionist, and thus a waste of precious engineering bandwidth, to eliminate it completely. The whole point is to avoid "bureaucratic games", as you call it.

My theory is that the lack of an easy mechanism to measure flakiness is stalling progress. If the overall flakiness can be measured, and the top offending tests identified, then I think it becomes a no-brainer to spend effort curtailing them when the flakiness gets too high, but otherwise to exclude flaky tests from, say, the PR merge gate.
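
A sketch of what that measurement could look like, assuming per-test results are collected from CI builds (the record shape is an assumption):

    from collections import Counter

    def flakiness_report(runs: list[dict]) -> list[tuple[str, float]]:
        """runs: records like {"test": "name", "passed": bool}, across many CI builds.
        A test's flakiness here is simply its failure rate; tests that always pass
        or always fail score as consistent, flaky ones sit in between."""
        failures = Counter()
        totals = Counter()
        for run in runs:
            totals[run["test"]] += 1
            if not run["passed"]:
                failures[run["test"]] += 1
        rates = {t: failures[t] / totals[t] for t in totals}
        # "Flaky" = fails sometimes but not always.
        flaky = {t: r for t, r in rates.items() if 0.0 < r < 1.0}
        # Top offenders first, so effort goes where it pays off most.
        return sorted(flaky.items(), key=lambda kv: kv[1], reverse=True)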


I'm just getting tired of these blog posts with memes scattered throughout. It's so pre-2020. Oh haha, yet another Sean Bean 'one does not simply' meme <eyeroll>. Just give me the content and leave the snark and attempts at humor behind. Or maybe it's just me and I'm getting jaded in my old age.


Whenever I see those blog posts I immediately close the tab.

I've probably been in the "corporate" world too long, but I don't have time for that kind of nonsense.


This stuff was tired in 2016. Inexcusable now.


So I got into a discussion with someone about why we need to test post-deployment. My argument: because your environments are different, you want to eliminate a failure point even if you've tested at build time. You can make environments as similar to each other as possible, but you still want to catch the failures of a bad configuration, schema, or integration. I was once at a place where someone deployed something and it just produced malformed data, because a schema wasn't applied to the workers' transformation process. You know what would have caught that? Post-deploy testing. How do you do this in an automated pipeline? You automate the test.
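
A post-deploy check doesn't have to be elaborate; even a hedged sketch like this, run as the last pipeline step, would have caught the malformed-data case (the endpoint and schema are hypothetical):

    import sys
    import requests  # assumed to be available in the pipeline image

    DEPLOYED_URL = "https://app.example.com/api/widgets/latest"  # hypothetical endpoint
    REQUIRED_FIELDS = {"id", "name", "created_at"}               # hypothetical schema

    def post_deploy_smoke_test() -> int:
        resp = requests.get(DEPLOYED_URL, timeout=10)
        if resp.status_code != 200:
            print(f"post-deploy check failed: HTTP {resp.status_code}")
            return 1
        record = resp.json()
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            # This is exactly the "schema wasn't applied" failure mode:
            # the deploy succeeded, but what it serves is malformed.
            print(f"post-deploy check failed: missing fields {missing}")
            return 1
        print("post-deploy check passed")
        return 0

    if __name__ == "__main__":
        sys.exit(post_deploy_smoke_test())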


At the risk of tooting my own horn too much, this is exactly why I started my company https://checklyhq.com

We approach it a bit differently: we blur the lines between E2E testing and production monitoring. You run an E2E test and promote it to a monitor that runs around the clock.

It's quite powerful. Just an E2E test that logs into your production environment after deploy and then every 10 minutes will catch a ton of catastrophic bugs.

You can also trigger them in CI or right after production deployment.

Big fat disclaimer: I'm a founder and CTO.


I played with this concept a while back when I noticed that the E2E tests I was writing closely matched the monitoring scripts I would write afterwards. At one point I just took the E2E test suite, pointed it at production, and added some provisions for test accounts. Then I just needed to output the E2E test results into the metrics database, and we had some additional monitoring. It's a kind of monitoring-driven development, and as with TDD it's great for validating your tests as well.
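
The glue for that can be tiny; here's a hedged sketch of running an existing E2E suite against production and pushing a pass/fail counter into a StatsD-style metrics pipeline (the invocation, endpoint, and metric names are assumptions about the environment):

    import socket
    import subprocess

    STATSD_ADDR = ("statsd.internal", 8125)  # hypothetical StatsD endpoint

    def run_e2e_and_report(metric_prefix: str = "e2e.prod_login") -> int:
        # Run the existing E2E suite against production with a test account.
        result = subprocess.run(
            ["pytest", "tests/e2e", "--base-url", "https://app.example.com"],  # hypothetical invocation
            capture_output=True,
        )
        ok = result.returncode == 0
        # StatsD line protocol: "<name>:<value>|c" for counters.
        metric = f"{metric_prefix}.{'success' if ok else 'failure'}:1|c".encode()
        socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(metric, STATSD_ADDR)
        return result.returncode

    # Run this on a schedule (cron, CI, or a scheduler) and alert on the failure counter.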


This falls in line with my current worldview as well.

Many classes of automated test ("regression", "smoke", "e2e", etc.) are really the same test; the difference is just values like the application location or maybe the expected data. To your point, IIRC New Relic's synthetic monitoring was using Selenium under the hood. If you take this approach and you're tagging tests, you can have the [CI/CD system] use the tags to run specific sets of them at specific steps or times, with the desired inputs.
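
With pytest-style markers, for instance, the same test can serve as a smoke test, a regression test, or a synthetic monitor just by changing the tag selection and the target URL (a sketch; the markers and URLs are arbitrary, and markers would be registered in pytest.ini to silence warnings):

    import os

    import pytest
    import requests

    BASE_URL = os.environ.get("TARGET_URL", "https://staging.example.com")  # hypothetical targets

    @pytest.mark.smoke
    @pytest.mark.monitoring
    def test_homepage_renders():
        resp = requests.get(BASE_URL, timeout=10)
        assert resp.status_code == 200

    @pytest.mark.regression
    def test_search_returns_results():
        resp = requests.get(f"{BASE_URL}/search", params={"q": "widgets"}, timeout=10)
        assert resp.status_code == 200
        assert resp.json()["results"]  # hypothetical response shape

    # CI step after deploy:         pytest -m smoke       (TARGET_URL=https://prod.example.com)
    # Scheduled monitor every 10m:  pytest -m monitoring
    # Full nightly run:             pytest -m regression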

But, some of that requires a coherent strategy for test data, which seems to be a common pain point for orgs.

[Edit: To be extra clear, not all types of tests fit this. You obviously don't need to run unit tests against your prod app. And, writing/managing things like API tests in a tool like Cypress isn't necessarily a good idea.]


Sounds in concept exactly like what we are doing, but then "as a service". We let you manage it from your code base (or the UI if you want) and take care of the metrics ingestion, dashboarding, and alerting.


There are a lot of tools that do this, especially Jenkins. As long as you have deployment webhooks on both Jenkins and the CI/CD pipeline, it works flawlessly.


It amazes me that there's a drive to make deploys as seamless as possible to avoid customer impact, but nobody wants to invest in systems integration testing in order to make sure that what they're actually serving is indeed correct. Most of the time, the "acceptance test" is verifying that it renders and the basic behaviors of the page operate as expected. It is insulting to the people and teams that actually spend the time to do what is right.


If there are enough tests, then flakiness is unavoidable. The key is to separate "flaky test" from "actual failure" so that you can react quickly to actual failures and ignore flakiness, or at least not address it as urgently.

For 99% of software projects this isn't even necessary, however. If the number of tests is reasonable (say, the suite runs in less than a few hours), then you can probably do better with "fix all tests immediately, including addressing flakiness".

But for a large project that, e.g., runs several hundred thousand tests across a matrix of different platforms, you need to adopt the "we'll never reach 100%, but we can't dip below 99%" mindset. You need to analyze whether a failure in a PR validation build is actually reasonable to expect from the area that was changed, and if it isn't, you need to merge despite the validation not passing. Otherwise it'll never be merged. Modularization helps somewhat, but the small tests that stay within modules aren't the ones that cause the problems.


If I understand the article correctly, the author offers SLOs as a way to pressure management into allocating resources to fix CI problems. This would work under the assumption that management has resources to spare, or would be able to divert resources from other departments towards fixing CI problems.

And sometimes this will probably work. But I can easily imagine a situation where I, as CI personnel, come to my manager and tell her that we've burned past our 87% SLO, and get an immediate response that, starting today, our SLO is 77%.

In my experience, QA (and therefore CI testing) is the first chunk of the overall development budget to be cut if any cutting is to take place. Very few companies bet on the quality of their software as a sales driver. Most would probably fire the whole QA department and throw away all the tests if times got tough, rather than allocate more resources to software quality when hitting some percentage of test failures.


As someone who's worked in various sysadmin/DevOps/SRE roles over the years, there's one passage I'd like to address:

> SRE principles to the rescue

I approve this message

> Regardless of whether you can literally deploy on a Friday

no

> “Can I deploy on a Friday afternoon?” is an awesome way to

no

> We should all be able to say yes when asked the question,

    $ yes no
> if we can’t, we have some work to do to restore trust.

I've seen things you people wouldn't believe. I've seen deployments on a Friday when everyone had weekend plans. I've seen CI git failures with 502 bad gateway. All these moments will be lost in time. Like tears in rain. Time for the pub.


I feel the same way every time this "you should be able to deploy on a Friday" topic comes up. These are people's lives. Yes - ideally, a Friday deployment should be safe. But if it isn't? You just ruined a small slice of a lot of people's lives. And for what? Stubbornness.


Off topic but I liked the presentation template a lot.


This is obvious. So of course most people will never do it.


Does the author mean Applying Site Reliability Engineering Principles to Continuous Integration / Continuous Delivery?



