Relatedly, just listened to one of the oldest Engines of Our Ingenuity episodes: 32 -- Wright and Langley.
> Curtiss went to work, strengthening the structure, adding controls, reshaping it aerodynamically, relocating the center of gravity -- in short, making it airworthy. In 1914 he flew it for 150 feet, and then he went back and replaced the old motor as well. On the basis of Curtiss's reconstruction, the Smithsonian honored Langley for having built the first successful flying machine.[...] In 1942 the Secretary of the Smithsonian, Charles Abbot, finally authorized publication of an article that clearly showed the Langley reconstruction was rigged.
> Recent advances in text-to-speech (TTS) synthesis, such as Tacotron and WaveRNN, have made it possible to construct a fully neural network based TTS system, by coupling the two components together. [...] However, the high computational cost of the system and issues with robustness have limited their usage in real-world speech synthesis applications and products. In this paper, we present key modeling improvements and optimization strategies that enable deploying these models, not only on GPU servers, but also on mobile devices
Plus Estonia in particular is only 200km from St Petersburg, and 800km from Moscow. They are all but guaranteed to succumb to Russian expansion if it is allowed to continue unchecked.
No offense, but this is wildly underselling the goals of on-call SRE. LLMs are extremely crappy at causal analysis, or even just at mitigation techniques for services that haven't been widely discussed on stackoverflow.com (i.e. your service).
> Creating an on-call process to manually inspect errors in test suites is more valuable than improving the project to be more reliable, as you can directly measure the amount of tests that failed on a weekly basis. It is measurable and presentable to the upper management.
You can also measure requests that failed on a weekly basis, and I do. In fact, I added a dashboard panel to do exactly that today for a service (10 years old!!) on a new team I just reorg'd into. I did this because I was annoyed to discover that the first (internal) customer-reported outage of the day could have been fixed by the east coast half of the team hours before the west coast QA team logged in for the day, but they were unaware anything was wrong. This is a trivial PromQL query to implement, and yet it didn't exist until today.
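For concreteness, the panel boils down to something like this. A minimal sketch only -- the Prometheus address, metric name, and labels (http_requests_total, job, code) are placeholders, not from any real service:

```python
# Weekly failed-request ratio via the Prometheus HTTP query API.
# Metric/label names below are assumptions; use whatever your service exports.
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"  # hypothetical address

# 5xx responses over all responses, across a 7-day window.
QUERY = (
    'sum(increase(http_requests_total{job="my-service",code=~"5.."}[7d]))'
    " / "
    'sum(increase(http_requests_total{job="my-service"}[7d]))'
)

def weekly_failure_ratio() -> float:
    """Fetch the current value of the weekly failed-request ratio."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # An empty result usually means the metric or labels don't exist as spelled.
    return float(result[0]["value"][1]) if result else float("nan")

if __name__ == "__main__":
    print(f"failed requests over the last 7 days: {weekly_failure_ratio():.4%}")
```

That query string is the entire "hard part"; the rest is wiring it into a dashboard and an alert.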
The problem isn't visibility but risk -- what if you make reliability fixes but the data gets worse? This is not hypothetical; a YouTube engineer documented a similar tale[1]. You can also imagine all kinds of fixes that sound good on paper but produce paradoxical outcomes (e.g. adding retries causes a metastable failure state[2]). And heck, what if you make no changes and the numbers decline all on their own? Are you going to scuttle this quarter's project work (and promotion fodder!) just to bring this KPI back to normal? Of course, all numbers, even the test suite pass rate, come with the risk of missing targets, so the incentive is to commit to reporting as few of them as possible.
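To make the retry point concrete, here's a toy back-of-envelope model (mine, not from the linked paper) of how a retry policy quietly multiplies offered load once things start failing:

```python
# Toy model: every failed attempt is retried, up to max_attempts total tries.
def offered_load(base_rps: float, failure_rate: float, max_attempts: int) -> float:
    """Expected requests/sec hitting the backend under a naive retry policy."""
    p = failure_rate
    # Expected tries per request is a truncated geometric series: 1 + p + p^2 + ...
    expected_tries = sum(p ** k for k in range(max_attempts))
    return base_rps * expected_tries

for p in (0.01, 0.2, 0.5, 0.9):
    print(f"failure rate {p:.0%}: backend sees {offered_load(1000, p, 3):.0f} rps")

# At a 1% failure rate, 3 attempts cost almost nothing (~1010 rps).
# At a 90% failure rate, the same policy throws ~2710 rps at a backend that is
# already drowning -- the feedback loop that keeps a metastable failure alive.
```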
> tools automate the mundane tasks of an on-call engineer: searching for issues related to a customer report, tracking related software (or hardware) crashes, verifying if the current issue that arose during an on-call is a regression or a known bug and so on.
I have a coworker trying to use LLMs for ticket triage, but there's a huge GIGO risk here. Very few people correctly fill in ticket metadata, and even among the more diligent set there will be disagreement. Try an experiment: pick 10 random tickets, and route copies to two of your most diligent ticket workers. Then see how closely their metadata agrees. Is it P1 or P3? Is the bug reported against the puppet repo or the LB repo? Is a config change feature work, bug fix, or testing? Do they dupe known issues, and if so, to the same ticket, or do they just close it as a NTBF known issue? If these two can't agree on basics, then your fine-tuning is essentially just additional entropy. Worse, you can't even really measure quality without this messy dataset, and the correct answers should change over time as the software and network architecture evolve.
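If you want to put a number on it, the whole experiment fits in a few lines. A sketch only -- the ticket IDs, fields, and labels here are invented:

```python
# Per-field agreement between two annotators labeling the same tickets.
def agreement(a: dict, b: dict) -> dict:
    """Fraction of tickets where both annotators chose the same value, per field."""
    hits, totals = {}, {}
    for ticket in a.keys() & b.keys():
        for field, value in a[ticket].items():
            totals[field] = totals.get(field, 0) + 1
            hits[field] = hits.get(field, 0) + (value == b[ticket].get(field))
    return {field: hits[field] / totals[field] for field in totals}

annotator_a = {
    "TKT-101": {"priority": "P1", "component": "lb", "type": "bug"},
    "TKT-102": {"priority": "P3", "component": "puppet", "type": "config"},
    # ... the other 8 tickets
}
annotator_b = {
    "TKT-101": {"priority": "P2", "component": "lb", "type": "bug"},
    "TKT-102": {"priority": "P3", "component": "lb", "type": "feature"},
    # ... the other 8 tickets
}

print(agreement(annotator_a, annotator_b))
# If "priority" only agrees half the time, fine-tuning on historical priority
# labels mostly teaches the model that disagreement.
```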
It probably says "the DOJ really is gonna force us to sell Chrome."