AI-powered conversion from Enzyme to React Testing Library (slack.engineering)
178 points by GavCo 9 months ago | 100 comments



The Slack engineering blog[0] is more pragmatic, and shows more about how the approaches were actually combined.

This is basically our whole business at grit.io and we also take a hybrid approach. We've learned a fair amount from building our own tooling and delivering thousands of customer migrations.

1. Pure AI is likely to be inconsistent in surprising ways, and it's hard to iterate quickly. Especially on a large codebase, you can't keep interactively re-applying the full transform.

2. A significant reason syntactic tools (like jscodeshift) fall down is just that most codemod scripts are pretty verbose and hard to iterate on. We ended up open sourcing our own codemod engine[1] which has its own warts, but the declarative model makes handling exception cases much faster. (There's a small sketch of this contrast after the list.)

3. No matter what you do, you need to have an interactive feedback loop. We do two levels of iteration/feedback: (a) automatically run tests and verify/edit transformations based on their output, (b) present candidate files for approval / feedback and actually integrate feedback provided back into your transformation engine.
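
To make the verbosity point in (2) concrete, here's a hypothetical jscodeshift transform that only renames console.log calls to logger.info. Nothing below is from the article; it's just a sketch of the contrast with a declarative pattern.

    // Hypothetical jscodeshift codemod: console.log(...) -> logger.info(...)
    module.exports = function transformer(file, api) {
      const j = api.jscodeshift;
      return j(file.source)
        .find(j.CallExpression, {
          callee: {
            type: 'MemberExpression',
            object: { name: 'console' },
            property: { name: 'log' },
          },
        })
        .forEach((path) => {
          // Mutate each matched call expression in place.
          path.node.callee.object.name = 'logger';
          path.node.callee.property.name = 'info';
        })
        .toSource();
    };

The roughly equivalent GritQL pattern is closer to a one-liner: `console.log($msg)` => `logger.info($msg)`.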

[0] https://slack.engineering/balancing-old-tricks-with-new-feat...

[1] https://github.com/getgrit/gritql


I think you copy-pasted the wrong URL in your first link.

Should be https://slack.engineering/balancing-old-tricks-with-new-feat...


We changed the URL to that from https://www.infoq.com/news/2024/06/slack-automatic-test-conv... . Thanks!


The actual efficiency claim (which is also likely incorrect) is inverted from the original article, "We examined the conversion rates of approximately 2,300 individual test cases spread out within 338 files. Among these, approximately 500 test cases were successfully converted, executed, and passed. This highlights how effective AI can be, leading to a significant saving of 22% of developer time."

Reading that leads me to believe that 22% of the conversions succeeded and someone at Slack is making up numbers about developer time.


> 500 test cases were successfully converted, executed, and passed.

Wonder what "successfully converted" means? A converted test executing and passing doesn't tell you whether it's still testing the same thing as before.


That suggests a test suite to test the test suite, which again would suggest another test suite to test the test suite testing test suite...

In the end, it is test suites all the way down.


There actually is a way to test your test suite: mutation tests [1]. Basically, you change the covered codebase (invert an if statement, change a variable, etc.) and expect the tests to then fail. If the tests actually survive the mutations, they might not be good enough.

[1] https://en.wikipedia.org/wiki/Mutation_testing
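
A tiny Jest-flavored sketch of the idea (hypothetical function and test; in the JS world, Stryker is one tool that automates this):

    // Hypothetical production code.
    function isAdult(age) {
      return age >= 18;
    }

    // A mutation tool rewrites an operator, e.g. `>=` -> `>`, and re-runs the suite.
    // This test survives that mutant because it never checks the boundary, which
    // flags the suite as too weak; expect(isAdult(18)).toBe(true) would kill it.
    test('isAdult', () => {
      expect(isAdult(21)).toBe(true);
      expect(isAdult(10)).toBe(false);
    });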


You could evaluate the quality of your test suite with something other than a test suite.


such as?


You can run mutation tests that intentionally seed errors into the codebase (flipping Boolean and to or, for example) and rerun the tests.

A good test suite will catch these errors. Something that is effectively a noop to get to green will not.


A developer who knows the code and will own the consequences can review and merge it - easy. Just not sure why the LLM needed to get involved in the first place.


Well, you need someone to write the code, and someone (else) to review the code.

In this situation you would replace the author with the LLM, but leave the reviewer as human.

It's not as pointless as you make it out to be. You still save one human.


Presumably because they've got 500 test cases?


Also given that there are huge differences in complexity between tests, how do we know that the successful 22% are not just trivial one-liner tests?

Thinking about the test suite in my current project, there is a clear Pareto distribution, with the majority of tests being simple or almost trivial.


[flagged]


> You don’t need unit tests if you have integration tests.

Which is why, as per Jim Coplien, most unit testing is waste.

But converting one type of unit tests into another is a perfect showcase for AI-generated code. They could have even kept just the prompts in the source and regenerate the tests on every run, were it not for inaccuracy, temperature, and the high cost of running.


Yes, the 80% claim comes from taking 9 tests converted by both the conversion tool and by humans and comparing the quality: “80% of the content within these files was accurately converted, while the remaining 20% required manual intervention.” Not sure what to make of it, since they claim only 16% of files get fully converted.


Was it the 20% of the code that requires 80% of the time to write?


I guess they believe the files that didn't get fully converted got like 76% converted on average?
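
(Sanity check of that guess, assuming 16% of files were fully converted and the overall figure is 80%: 0.16 × 100% + 0.84 × x = 80%, which gives x ≈ 76%.)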


Someone will still have to review and validate all the tests, which may take more time than rewriting the code.


If you're reviewing the tests (and not just the results) after conversion by automation, wouldn't someone else also review the tests converted by a person rewriting them manually?


This is something people seemingly don't grasp about LLMs. If you want "Human alignment", then you will need humans in the loop either way.


For a test that is already passing, the validation is already there. The human is reviewing the PR, but reviewing a diff is much less time-intensive when you can presume it basically works, and you are just assessing whether the logic makes sense relative to the prior code.


Yep. Honestly, I don't know when I last read an InfoQ article all the way through. Too painful.


> approximately 500 test cases were successfully converted, executed, and passed.

How many of these passing tests are still actually testing anything? Do they test the tests?


> saving considerable developer time of at least 22% of 10,000 hours

I wonder how much time or money it would take to just update Enzyme to support react 18? (fork, or, god forbid, by supporting development of the actual project).

Nah, let's play with LLMs instead, and retask all the frontend teams in the company to rewriting unit tests to a new framework we won't support either.

I guess when you're swimming in pools of money there's no need to do reasonable things.


Slack's blog post links to "Enzyme is dead. Now what?"[1], in which Wojciech Maj said, "A couple of tireless evenings later, @wojtekmaj/enzyme-adapter-react-17 was born." Now, he was building on someone else's pull request, and he also said that adapting to React 18 would have required a huge rework. Still, I'm thinking that @slack/enzyme-adapter-react-18 just might have taken less than 10,000 hours.

Then again, the idea of a testing framework that is so tightly coupled that it breaks with every new version is foreign to me, so I probably don't know what I'm talking about.

[1] https://dev.to/wojtekmaj/enzyme-is-dead-now-what-ekl


>> by supporting development of the actual project

You mean good engineering.

>> play with LLMs instead

You mean good on the resume.

>> I guess when you're swimming in pools of money there's no need to do reasonable things.

We don't. That ship sailed when FB/Google/AMZN started "giving back" to the community... They leaked out all their greatest examples of Conway's Law and we lapped it up like water at a desert oasis.

The thing is, these technologies have massive downsides if you aren't Google/FB/Amazon... But we're out here busy singing their praises and making sure jr devs are all pre-trained out of college for the FA(not this one)ANG lifestyle.

Think about how much react being public saves Facebook on onboarding a new dev.


When I read 'React Testing Library' I thought they had added an official testing library to the React project, which would have been fantastic and a worthwhile migration target for sure. Sad that it's just another third-party one, which also might one day stop supporting newer React versions.


This community-built React 18 adapter actually works pretty well in my experience. Some failures, but it worked for multiple thousands of test files for my use case. https://www.npmjs.com/package/@cfaester/enzyme-adapter-react...

That said, making the 19 adapter is a whole new task, and I think these tests should be converted to RTL eventually, so the approach described in the blog post is still valuable.


Enzyme is kind of dead, so it would mean picking up sponsorship and maintainership (indefinitely) rather than a one-off project to convert to the official testing library for the ecosystem.


Well, it might get a lot less dead with a small fraction of the resources spent on this project.

> indefinitely

You’ll note they’ve switched to another open source framework which has the same potential to fail without support/resources. They’ve kicked the can down the road, but are now accruing the technical debt that led to this effort exactly the same as before. Since that technical debt will inevitably turn into a real expenditure of resources, they are stuck with expenses indefinitely, however they do it. Though I think it’s pretty obvious that one way is a lot cheaper and less disruptive to the business than the other.

(BTW, if they were concerned with indefinite expenses, you might also wonder why they want to continue to build their product on the shifting sands that are react, or pursue a unit testing strategy that is so tightly coupled to specific versions of their UI framework. These are “fuck-the-cost” decisions, both short term and long term.)


In fact, enzyme didn't support the previous version of React either, except for the grace of some random guy who wrote a driver to make it work. Airbnb, who built and maintained enzyme, abandoned it. There's (afaik) no way to add React 18 support without major changes to the enzyme core. So not only is this a problem that will plague them indefinitely (that is, dealing with their test framework not supporting a recent version) if they don't switch, it also means adopting a project that they don't own and didn't start, just to avoid a one-time cost of rewriting some tests.

> Since that technical debt will inevitably turn into a real expenditure of resources, they are stuck with expenses indefinitely, however they do it.

I simply can't see how becoming the maintainer of a testing framework to rewrite it to add support for the last two versions of the library it no longer works with is a comparable investment to the ongoing cost of maintaining your own unit tests. That's like if Docker became abandoned and didn't update to support modern kernels so you decided it's better to become the maintainer of Docker instead of switching to Podman.


It’s a unit test framework though, not a suite of containerization software and services. Maintained mostly by one person for years.


"It was just maintained by one person" has no bearing on the cost of maintaining.

> It’s a unit test framework

It's not a unit test framework. It reaches into the internals of React to make it behave as though it's running in a real environment. It requires intense knowledge of how React works under the hood, and the design requires it to be compatible with lots of old versions of React as well as the latest version.

Honestly I'm not sure why you are so dismissive of the incredible amount of effort that's gone into making it work at all, and how much effort it would take to make it work for the latest version of React.


react-testing-library isn't the "official testing library" for React, it isn't made by the React team, and testing library provides testing libraries for other frameworks.

It's just a change from an outdated, unmaintained testing library to a more 'modern', well-maintained library. There are also some philosophical differences in the testing approach.


Monster Energy is the official energy drink of NASCAR but that doesn't mean NASCAR manufactures energy drinks. As best as I can tell, RTL is the only testing framework mentioned in the React docs, so that's pretty "official"


The conversion is between two testing libraries for React. Not to be too cynical (this sort of work seems to me like a pretty good niche for LLMs), but I don’t think I’d be that far off of 80% with just vim macros…


I think you're significantly underestimating the complexity of automatic transforms. It's not like they didn't try writing codemods first, and vim macros aren't more powerful than codemods.


You really think you could achieve an 80% success rate with just syntactic transformations, while the article says they only reached a 45% success rate with fine-grained AST transformations?

I am no vim hater, but allow me to cast a large, fat doubt on your comment!


Fair enough :) It was very much an exaggeration. But I do wonder how far "dumb" text editing would go in this scenario. And, more importantly, whether it wouldn't be faster overall than writing a tool that still requires humans to go through its output and clean/fix it up.


You might be underestimating vim ;)

Key point is that vim macros are interactive. You don’t just write a script that runs autonomously, you say “ok, for the next transformation do this macro. Oh wait, except for that, in the next 500 lines do this other thing.” You write the macro, then the next macro, adjust on the fly.


From the article:

> Our initiative began with a monumental task of converting more than 15,000 Enzyme test cases, which translated to more than 10,000 potential engineering hours

That's a lot of editing.


Out of curiosity, can you drop into an edit session during the macro? It's been some time since I last used vim, so I don't recall, but in emacs you can record a macro along the lines of "do A, do B, drop to an edit session letting the user do whatever, do C, do D". Is that possible with vim macros?


I don't think so since you need to leave edit mode to terminate the macro.


Just use -- calculating… -- 2 macros.


That sounds interesting! Would you mind sharing some links to the articles or videos that focus on this possibility?


This Vimcast (http://vimcasts.org/episodes/converting-markdown-to-structur...) recording is an example of a quite complex macro for converting (a specific file's) markdown to HTML. At the beginning of the video you see that they save the macro to the "a" register. You can record macros of similar complexity to each of the other letters of the alphabet, to get some idea of the maximum complexity (though I tend to stick to about 3 or less in a single session).


Not to mention the possible savings if you just don't switch to whatever latest testing framework your resume-driven developers want. 100% time savings!


Enzyme is abandoned and doesn’t work on newer versions of React. Many teams are doing this conversion for their React apps.


Gee, if "many teams" want to spend their time migrating their unit-test framework and unit tests because their frontend framework hit version 18 I suppose that's their prerogative.

Far be it from me to applaud Teams, but it seems Slack's lunch is being eaten by people who are busy building things on the corpse of Skype, not churning through churn incarnate.


If Enzyme was at all popular (which it sounds like it was) I'm surprised no one from the community has taken over maintenance.


I agree. Once I had to write a Groovy conf out of Java library constructors and setters, and vim macros were really good for that.


For people unfamiliar with Enzyme and RTL, this was the basic problem:

Each test made assertions about a rendered DOM from a given React component.

Enzyme’s API allowed you to query a snippet of rendered DOM using a traditional selector, e.g. get the text of the DOM node with id=“foo”. RTL’s API requires you to say something like “get the text of the second header element”, and prevents you from using selectors.
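
Roughly, for a hypothetical <Header /> component (illustration only, assuming a JSX-enabled Jest setup and @testing-library/jest-dom for the matcher):

    import { mount } from 'enzyme';
    import { render, screen } from '@testing-library/react';
    import Header from './Header'; // hypothetical component

    // Enzyme: query the rendered tree with a CSS-style selector.
    test('shows the title (Enzyme)', () => {
      const wrapper = mount(<Header />);
      expect(wrapper.find('#title').text()).toBe('Dashboard');
    });

    // RTL: query the way a user or screen reader would, with no selectors.
    test('shows the title (RTL)', () => {
      render(<Header />);
      expect(screen.getByRole('heading', { name: 'Dashboard' })).toBeInTheDocument();
    });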

To do the transformation successfully you have to run the tests, first to render each snippet, then have some system for taking those rendered snippets and the Enzyme code that queries them and converting the Enzyme code to roughly-equivalent RTL calls.

That’s what the LLM was tasked with here.


If that's the entire issue, couldn't someone just add support for selectors to RTL or something?


"Just add support for selectors" in a library whose whole philosophy is built around "you test the app like the user would" (via WAI-ARIA roles [1] and text visible to screen readers).

Of course they could’ve forked the lib but that’s definitely not a ”just” decision to commit to.

[1] https://developer.mozilla.org/en-US/docs/Web/Accessibility/A...


It's a 2024 webdev summary, nothing can be added:

New React version made the lib obsolete, we used LLM to fix it (1/5 success rate)


A lib was heavily relying on React internals for testing, rather than just on components' public API. That this approach was going to be unsustainable was already obvious around 2020. The question is, after you've invested a lot of work in a bad practice, how to move to a better practice with the least amount of pain. Another, more philosophical, question is how a bad practice gains so much traction in the developer community.
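
For example, a common Enzyme pattern asserted on state and called instance methods directly, which only works because Enzyme reaches into the component instance rather than sticking to its rendered output (hypothetical Counter class component, sketch only):

    import { shallow } from 'enzyme';
    import Counter from './Counter'; // hypothetical class component

    test('increments', () => {
      const wrapper = shallow(<Counter />);
      expect(wrapper.state('count')).toBe(0); // asserts on internal state
      wrapper.instance().increment();         // calls an internal method directly
      expect(wrapper.state('count')).toBe(1);
    });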


Sounds like a nightmare to be involved with anything that is written in react and requires 15,000 unit tests.


web guis are the worst event/interaction model


I’m working on a similar project (DepsHub) where LLMs are used to make major library updates as smooth as possible. While it doesn’t work in 100% cases, it really helps to minimize all the noise while keeping your project up to date. I’m not surprised Slack decided to go this way as well.


It feels to me that there may be even more potential in flipping this idea around: human coders write tests to exact specifications, then an LLM-using coding system evolves code until it passes the tests.
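
A minimal sketch of that loop, assuming a hypothetical proposeCandidate() call to an LLM and Jest run via a child process (one possible setup, not a real system):

    const { execSync } = require('child_process');
    const fs = require('fs');

    async function evolve(specTestFile, targetFile, maxAttempts = 10) {
      let feedback = '';
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        // Hypothetical: ask the model for a new implementation, given the
        // human-written spec tests and the previous failure output.
        const candidate = await proposeCandidate(fs.readFileSync(specTestFile, 'utf8'), feedback);
        fs.writeFileSync(targetFile, candidate);
        try {
          execSync(`npx jest ${specTestFile}`, { stdio: 'pipe' });
          return candidate; // tests pass: accept this generation
        } catch (err) {
          feedback = String(err.stdout || err.message); // feed the failures back in
        }
      }
      throw new Error('no candidate passed the human-written spec tests');
    }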


People who are terminally capitalist have been salivating over this idea basically since this hype curve first started.

Someone made a solid joke about it as far back as 2016: https://www.commitstrip.com/2016/08/25/a-very-comprehensive-...


Well yeah, TDD. Many companies already work this way, and the LLMs are ok at generating the test-passing code.

From my experience though, it's better at writing tests (from natural language)


My concern with having the LLMs write tests is that it's hard to be convinced that they've written the right tests. Coupling human TDD with a genetic algorithm of some sort that uses LLMs to generate candidate populations of solutions, one could be assured that once a solution gets far enough through the tests [assuming one ever does], it is guaranteed to have the correct behavior (as far as "correct" has been defined in the tests).


yes, definitely it's a concern.

the idea with llm tests first is tests should be extremely easy to read. of course ideally so should production code, but it's not always possible. if a test is extremely complicated, it could be a code smell or a sign that it should be broken up.

this way it's very easy to verify the llm's output (weird typos or imports would be caught by intellisense anyway)


Seems like a reasonable approach. I wonder if it took less time than it would have taken to build some rule-based codemod script that operates on the AST, but I assume it did.


If you read the source article[0], they tried a rule-based approach first and the complexity exploded.

[0] https://slack.engineering/balancing-old-tricks-with-new-feat...


The rules-based pass made the job of the LLM easier, so it was a worthwhile part of the project.


It also takes less context potentially - allowing a more junior engineer or somebody who doesn't know much about the language/library to implement the change.


We did this for our codebase (several hundred tests) manually, two or three years ago (the problems were already apparent with React 17). It helped that we never used Enzyme's shallow renderer, because that type of testing was already falling out of favor by late 2010s.

The next frontier is ditching jest and jsdom in favor of testing in a real browser. But I am not sure the path for getting there is clear yet in the community.


Another proof that this probabilistic, stochastic approach works on the prediction/token level, but not on the semantic level, where it needs a discreet system. This essentially reminds me of a RAG setup and is similar in nature.

Perhaps reiterating my previous sentiment that such applications of LLMs together with discreet structures bring/hide much more value than chatbots, which will soon be considered mere console UIs.


You probably mean discrete, not discreet?


Sssh! Keep it under your hat.


indeed thanks


Slightly tangential, but one of the largest problems I’ve had working with React Testing Library is a huge number of tests that pass when they should fail. This might be because of me and my team misusing it, but regularly a test will be written, seem like it’s testing something, and pass; yet if you flip the condition or break the component, it doesn’t fail as expected. I’d really worry that any mass automated, or honestly manual, method for test conversion would result in a large percentage of tests which seem to be of value but actually just pass without testing anything.
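
One common way this happens (a guess, not from the thread): a negative assertion against text that was never rendered in the first place passes vacuously.

    // Hypothetical RTL test that can never fail; assumes @testing-library/jest-dom.
    import { render, screen } from '@testing-library/react';
    import Banner from './Banner'; // hypothetical component under test

    test('does not show an error banner', () => {
      render(<Banner />);
      // Typo in the queried text ("Eror"): queryByText returns null for any
      // string that isn't rendered, so this negative assertion stays green even
      // when the real error banner is on screen or the component is broken.
      expect(screen.queryByText('Eror while loading')).not.toBeInTheDocument();
    });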


Can someone elaborate if the term “AST” is used correctly in the article?

I’ve been playing with a mutation-injection framework for my master’s thesis for some time. I had to use LibCST to preserve syntax information which is usually lost during AST serialization/deserialization (like whitespace, indentation and so on). I thought that the difference between abstract and concrete trees is that a CST is guaranteed not to lose any information, so it can be used for specific tasks where ASTs are useless. So, did they actually use a CST-based approach?


Usually, the AST can be converted to code, then formatted using a specific formatter.

I'm sure Slack has a particular code formatter they use.

Most of the time when working with an AST you don't think about whitespace, except when writing out the result.
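
A rough sketch of that round trip with Babel (not necessarily what Slack used): the original spacing is re-synthesized on output, while comments survive.

    const { parse } = require('@babel/parser');
    const generate = require('@babel/generator').default;

    const source = 'const   x=1;   // oddly spaced';
    const { code } = generate(parse(source)); // whitespace is re-created, comments kept
    console.log(code); // roughly: const x = 1; // oddly spaced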


In the real world, things don't fit neatly into the boxes established by textbook definitions. The borders between various concepts are fuzzy, and in production implementations, for practical purposes like performance, code simplicity, better error reporting and the exact use case, different stages of parsing and the parsed representations can be expanded, narrowed or skipped.

A lot of the time you will have a syntax tree. It might have preserved some of the concrete syntax details, like subexpression ranges for error reporting and IDE functionality or somewhat optional nodes, but at the same time it might also contain information obtained from semantic analysis (which from a computer science perspective isn't even a concern of the parser), and it might not even be a tree. And all of that is potentially produced in a single pass. Is it a concrete syntax tree, is it an abstract syntax tree, is it a thing after the AST? Just because a data structure didn't throw away all concrete syntax details doesn't mean it contains all of them. From my experience, in such situations it's more likely to be called an AST, as it's closer to that than to a concrete syntax tree.

It also depends on how you define the language that you are interested in (even if it's the same language). The purpose of parsing isn't necessarily always code compilation. Formal grammars and parsers can be used for all kinds of things, like defining a file format or text markup, and also for tasks like manipulating source code and checking it for errors. A typical example of details not included in an AST is parentheses. But that doesn't mean an AST can never contain a node for parentheses. If they have meaning for the task you are trying to achieve, nothing prevents you from assigning them a node within the AST. For example, both Clang and GCC in some situations will give a warning depending on the presence of parentheses, even though they are meaningless based on C++ syntax. If you define comments as part of the language then they can be kept and manipulated within the AST.

A CST doesn't really guarantee that you won't lose any information. Parsers don't operate on bytes; they operate on abstract symbols, which might directly correspond to bytes in the file, but not always. Again, in real-world systems what you actually have is 2-3 languages stacked on top of each other. Which of the CSTs are you talking about? Preserving the CST for one stage doesn't mean no information was lost in previous steps. The C++ standard defines ~6 steps before the main C++ language parsing, many of which can be considered separate formal languages with their own CST/AST.

1) Text decoding: bytes -> text. While most text encodings are trivial byte->character substitutions, variable-length encodings like UTF-8 can be described as (simple) context-free grammars. I don't think any programming language toolchain does Unicode normalization at this stage, but in theory you could have a programming language which does that.

1.5) Trigraph substitution

2) text -> preprocessing tokens

3) preprocessing tokens -> preprocessing AST

4) as a result of executing preprocessing directives you obtain a new sequence of tokens

4.5) string literal merging

5) main parsing

In practice some of these steps might be merged and not executed as separate stages; there are also a few more transformations I didn't mention.

Stuff like this makes real-world source-to-source transformations messy, as the later grammars are operating on symbols which only exist in intermediate steps and don't always have a simple 1:1 mapping to the input file.

And in some cases you might have a custom algorithm doing a transformation which doesn't fit the model of context-free grammars at all, so whatever it did isn't part of any formal CST for the language (language in terms of formal grammars, not a programming language). Python is a good example of this. Its indentation-based scopes can't be handled by context-free grammars; it relies on a magic tokenizer which generates "indent" and "dedent" tokens, so if you follow formal definitions, the CST of the main Python language doesn't contain exact information about the original indentation. The fact that you can get it from LibCST is stretching the definition of a CST / changing the language it is parsing. At that point, once you add all the extra information, are you really building a CST, or are you making an AST for a language where every character is significant, because you redefined which parts of the program are important?

With all that said, I wouldn't be surprised if the thing Slack did was using something closer to an AST (with some additional syntax details preserved) than a CST (with additional analysis done). If you are not building a general-purpose tool for making small adjustments to an arbitrary existing codebase (while otherwise preserving the original code), it's not necessary to preserve every tiny syntax detail as long as comments are preserved. I would expect them to be using a standardized code formatter anyway, so loss of insignificant whitespace shouldn't be a major concern, and the diff will likely touch almost every line of code.

Whether "AST" or a "CST" is useless for specific task is in many situations less about "AST" vs "CST" but more about design choices of specific programming language, parser implementation and pushing things beyond the borders of strict formal definitions.


Pretty misleading summary, given that LLMs played only a tiny part in the effort and probably took more time to integrate than they saved in what is otherwise a pretty standard conversion pipeline. I'm sure it's heavily in the Slack engineers' interest to go along with the AI story to please the Salesforce bosses who have mandated that AI must be used in every task, but don't fall for the spin here and think this will actually save you time on a similar effort.


Saving 22% of 15,000 tests is 3,300 tests.

While 22% sounds low, saving yourself the effort to rewrite 3,300 tests is a good achievement.


hypothetically yes, but not if you also have to manually rewrite them to compare results



Just to shamelessly plug one of my old projects, I did something like this at a German industrial engineering firm - they wanted us to rewrite a huge base of old tests written in TCL into C#.

It was supposed to take 6 months for 12 people.

Using an AST parser I wrote a program in two weeks that converted like half the tests flawlessly, with about another third needing minor massaging, and the rest having to be done by hand (I could've done better by handling more corner cases, but I kinda gave up once I hit diminishing returns).

Although it helped a bunch that most tests were brain dead simple.

Reaction was mixed - the newly appointed manager was kinda fuming that his first project's glory was stolen from him by an Assi, and the guys under him missed out on half a year of leisurely work.

I left a month after that, but what I heard is that they decided to pretend that my solution didn't exist on the management level, and the devs just ended up manually copy-pasting the output of my tool, doing a day's planned work in 20 minutes, with the whole thing taking 6 months as planned.


Misleading title. Maybe try this one?

"Slack uses ASTs to convert test code from Enzyme to React with 22% success rate"

This article is a poor summary of the actual post on Slack's engineering blog, which it at least links to [0].

[0] https://slack.engineering/balancing-old-tricks-with-new-feat...

[updated]


The Slack blog is for engineers. It's PR to hire in talent.

The INFOQ article is for your C types. It's the spin on what buzz words they should be using with their peers and to make a splash in the market.

NFT, Crypto, Cloud, Microservices, SaaS, Podcasts (the first time, when there wasn't video), Web 2.0, the Ad (DoubleClick, pre-Google) market, the Dot Com Bubble...

I'm sure I missed a few hype cycles in there.

Both articles show how deep we are in the stink of this one.

I'm tired of swimming in the bullshit; it keeps getting in my mouth.


Thus the conclusion should be to heavily scrutinize future infoq.com articles, and perhaps future articles by the same author, Eran Stiller.

We shouldn’t detach responsibility from the publisher and author.


Except the article turns out to be accurate (see https://news.ycombinator.com/item?id=40728179)

So I guess the publisher and author should get credit. I'll leave others to discuss the misleading comment...


Hah. A lot of tech folks like to trash journalists (it's fine, I get it, there's legitimate reasons) ... but then misread source content that the journalist interpreted better/correctly.


Actually the Infoq article is more correct than this comment!

The comment: ""Slack uses ASTs to convert test code from Enzyme to React with 22% success rate""

To quote[1], this 22% comes from this part:

> We examined the conversion rates of approximately 2,300 individual test cases spread out within 338 files. Among these, approximately 500 test cases were successfully converted, executed, and passed. This highlights how effective AI can be, leading to a significant saving of 22% of developer time. It’s important to note that this 22% time saving represents only the documented cases where the test case passed.

So that 22% rate is 22% saving of developer time, measured on a sample. No reasonable reading of that makes it a "22% success rate".

Over the whole set of tests:

> This strategic pivot, and the integration of both AST and AI technologies, helped us achieve the remarkable 80% conversion success rate, based on selected files, demonstrating the complementary nature of these approaches and their combined efficacy in addressing the challenges we faced.

and

> Our benchmark for quality was set by the standards achieved by the frontend developers based on our quality rubric that covers imports, rendering methods, JavaScript/TypeScript logic, and Jest assertions. We aimed to match their level of quality. The evaluation revealed that 80% of the content within these files was accurately converted, while the remaining 20% required manual intervention.

(So I guess the "80% conversion success rate" is this percentage of files?)

The Infoq title "Slack Combines ASTs with Large Language Models to Automatically Convert 80% of 15,000 Unit Tests" certainly more accurately reflects the underlying article than this comment.

Edit: they do have a diagram that talks about 22% of the subset of manually inspected files being 100% complete. This doesn't appear to be what Slack considers their success rate because they manually inspect files anyway.

[1] https://slack.engineering/balancing-old-tricks-with-new-feat...


> No reasonable reading of that makes it a "22% success rate".

Well, 500/2300 is 22%, so calling it 22% seems pretty reasonable.

From what I get from the rest, the 78% remaining tests (the ones that failed to convert) were "80% accurately converted", I guess they had some metric for measuring that.

So it looks like it depends on how you interpret "automatically converted 80%". If it's taken to mean "80% could be used without manual intervention", then it's clearly false. If you take it to mean "it required manual intervention on just 20% of the contents to be usable", then it's reasonable.



Reminds me of the Hitchhiker's Guide… they had to figure out the right question to ask.


Is it true that they give out free BMWs in Moscow? Yes, it is true! But it's not Moscow but St. Petersburg. And it's not BMWs but Ladas. And they don't give them out, they steal them.


This is a variation of the famous jokes about Radio Yerevan, very popular in former Soviet states.

I live in Poland and I know this version:

Is it true that they give away cars on Red Square?

Radio Yerevan answers: not cars, only bicycles, not on Red Square, but near the Warsaw station, and they don't give them away, they steal them.


So all in all, still a mostly accurate news reporting, as they go.


infoq has gone to pure shit


This is from the actual Slack blog post:

> We examined the conversion rates of approximately 2,300 individual test cases spread out within 338 files. Among these, approximately 500 test cases were successfully converted, executed, and passed. This highlights how effective AI can be, leading to a significant saving of 22% of developer time. It’s important to note that this 22% time saving represents only the documented cases where the test case passed.

So the blog post says they converted 22% of tests, which they claim as saving 22% of developer time, which InfoQ interpreted as converting 80% of tests automatically?

Am I missing something? Or is this InfoQ article just completely misinterpreting the blog post it’s supposed to be reporting on?

The topic itself is interesting, but between all of the statistics games and editorializing of the already editorialized blog post, it feels like I’m doing heavy work just to figure out what’s going on.


My reading of this is that the examination covered a subset of the full set, which they manually examined.

From the source:

> It’s so compelling that we at Slack decided to convert more than 15,000 of our frontend unit and integration Enzyme tests to RTL, as part of the update to React 18.

and

> Our benchmark for quality was set by the standards achieved by the frontend developers based on our quality rubric that covers imports, rendering methods, JavaScript/TypeScript logic, and Jest assertions. We aimed to match their level of quality. The evaluation revealed that 80% of the content within these files was accurately converted, while the remaining 20% required manual intervention.

There is a diagram that mentions that 22% of the subset of manually inspected files were 100% converted. But Slack is manually checking all converted test cases anyway, so they don't seem to consider this the success rate.

https://slack.engineering/balancing-old-tricks-with-new-feat...


Having "the test passed" as the success criterion is a huge red flag. So the test can be:

print("Passed")

(or some more subtle variation on that) and we succeeded.
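
In Jest/RTL terms, a hypothetical illustration of the same worry: a "converted" test can be green while asserting nothing.

    import { render } from '@testing-library/react';
    import SettingsPanel from './SettingsPanel'; // hypothetical component

    test('renders the settings panel', () => {
      render(<SettingsPanel />);
      // No assertions at all: as long as render() doesn't throw, this passes
      // and would count as "successfully converted, executed, and passed".
    });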


[flagged]




