Those people seem to have a passion for developing package managers (instead of just seeing them as tools that need to do a job), and as long as that is the case, I don't see how we wouldn't end up with one new package manager every year.
Aren't you screwed from the moment you have a malicious user in your workspace? That user can change their picture/name and directly ask for the API key, send a phishing link, or run whatever social engineering is fundamentally possible in any instant-messaging system.
There are a lot of public Slacks for SaaS companies. Phishing can be detected by serious users (especially when the messages seem phishy), but an indirect AI leak does not put you in "defense mode"; all it takes is one accidental click.
It can take an arbitrary amount of time. Modules are just code executed top to bottom, and might contain anything beyond mere constants and function declarations.
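A minimal Python sketch of that (the module name and the slow work are made up for illustration): importing a module runs everything at its top level, so the import can block for an arbitrary amount of time.

    # slow_module.py -- hypothetical module; all top-level code runs on import
    import time
    import urllib.request

    RETRIES = 3                      # a mere constant

    def helper():                    # a function declaration
        return RETRIES

    # ...but nothing stops a module from doing real work at import time:
    time.sleep(5)                                                 # arbitrary delay
    PAGE = urllib.request.urlopen("https://example.com").read()   # network I/O on import

A plain `import slow_module` elsewhere blocks until all of that has finished.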
It's easy to imagine why something could never work.
It's more interesting to imagine what just might work. One thing that has plagued programmers for decades is the difficulty of writing correct multi-threaded software. You need fine-grained locking, otherwise your threads will waste time waiting on mutexes. But color-coding your program to constrain which parts of your code can touch which data, and when, is tedious and error-prone. If LLMs can annotate code sufficiently for a SAT solver to prove thread safety, that's a huge win. And that's just one example.
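To make the locking half of that concrete, here is a minimal Python sketch (the Counter class is made up): one lock per piece of state instead of one big lock, so threads touching different data don't queue behind each other. Keeping track of which lock guards which data is exactly the tedious, error-prone bookkeeping that annotations plus a solver could take over.

    import threading

    class Counter:
        """Fine-grained locking: each counter carries its own lock,
        so threads updating different counters never contend."""
        def __init__(self):
            self._lock = threading.Lock()   # guards only self._value
            self._value = 0

        def increment(self):
            with self._lock:
                self._value += 1

    a, b = Counter(), Counter()
    threads = [threading.Thread(target=a.increment) for _ in range(4)]
    threads += [threading.Thread(target=b.increment) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(a._value, b._value)   # 4 4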
Code coverage exists. Shouldn't be hard at all to tune the parameters to get what you want. We have really good tools to reason about code programmatically - linters, analyzers, coverage, etc.
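For instance, coverage.py (the `coverage` package) can be driven from Python directly; a minimal sketch, where `my_module` and `do_something` are placeholders for whatever code is under test:

    import coverage

    cov = coverage.Coverage()
    cov.start()

    import my_module          # placeholder: the code under test
    my_module.do_something()  # placeholder: exercise it

    cov.stop()
    cov.save()
    cov.report(show_missing=True)   # per-file line coverage, with missed lines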
In my experience they are OK (not excellent) at checking whether some code will crash or not. But checking whether the code logic is correct with respect to the requirements is far from being automated.
But for writing tests that's less of an issue.
You start with known good/bad code and ask it to write tests against a spec for some code X - then the evaluation criterion is something like: did the test cover the expected lines and produce the expected outcome (success/fail)? Pepper in lint rules for preferred style, etc.
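A rough Python sketch of that scoring loop, just to pin down the idea. Everything here is made up: the PYTHONPATH wiring for swapping implementations, the file paths, and the equal weighting; a coverage threshold could be bolted on the same way (e.g. via pytest-cov).

    import os
    import subprocess

    def run_pytest(test_file, impl_dir):
        """Run the generated test against one implementation and return the exit code.
        (Selecting the implementation via PYTHONPATH is just one possible wiring.)"""
        env = {**os.environ, "PYTHONPATH": impl_dir}
        return subprocess.run(["pytest", "-q", test_file], env=env).returncode

    def score_generated_test(test_file, good_impl_dir, bad_impl_dir):
        """Reward for a generated test: pass on the known-good code,
        fail on the known-bad code, and keep the linter happy."""
        passes_good = run_pytest(test_file, good_impl_dir) == 0
        catches_bad = run_pytest(test_file, bad_impl_dir) != 0
        lint_ok = subprocess.run(["ruff", "check", test_file]).returncode == 0
        return (passes_good + catches_bad + lint_ok) / 3.0   # crude reward in [0, 1]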
But this will lead you to the same problem the tweet is talking about! You are training a reward model based on human feedback (whether the code satisfies the specification or not). This time the human feedback may seem more objective, but in the end it's still non-exhaustive human feedback, which will leave the reward model vulnerable to some adversarial inputs that the other model will likely pick up pretty quickly.
The input data is still human produced. Who decides what is code that follows the specification and what is code that doesn't? And who produces that code? Are you sure that the code that another model produces will look like that? If not then nothing will prevent you from running into adversarial inputs.
And sure, coverage and lints are objective metrics, but they don't directly imply the correctness of a test. Some tests can reach a high coverage and pass all the lint checks but still be incorrect or test the wrong thing!
Whether the test passes or not is mostly correlated with whether it's correct or not. But similarly, for an image recognizer, the question of whether an image is a flower or not is also objective and correlated, and yet researchers keep finding adversarial inputs for image recognizers due to bias in their training data. What makes you think this won't happen here too?
So are the rules for the game of Go or chess? Specifying code that satisfies (or doesn't satisfy) the spec is a problem statement - evaluation is automatic.
> but they don't directly imply the correctness of a test.
I'd be willing to bet that if you start with an existing coding model and continue training it with coverage/lint metrics and evaluation as feedback, you'd get better at generating tests. It would be slow, and figuring out how to build a problem dataset from existing codebases would be the hard part.
The rules are well defined and you can easily write a program that will tell whether a move is valid or not, or whether a game has been won or not. This allows you to generate a virtually infinite amount of data to train the model on without human intervention.
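A toy illustration of the contrast, using tic-tac-toe instead of Go or chess to keep it tiny: legality and win detection are a few lines of plain Python, so a self-play loop can label unlimited positions with no human in the loop.

    WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
                 (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
                 (0, 4, 8), (2, 4, 6)]              # diagonals

    def is_legal(board, square):
        """A move is legal iff the target square is empty. board is a list of 9 cells."""
        return board[square] == ""

    def winner(board):
        """Return "X" or "O" if someone has won, else None."""
        for a, b, c in WIN_LINES:
            if board[a] != "" and board[a] == board[b] == board[c]:
                return board[a]
        return None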
> Specifying code that satisfies (or doesn't satisfy) is a problem statement
This would be true if you fix one specific program (just like in Go or chess you fix the specific rules of the game and then train a model on those) and want to know whether that specific program satisfies some given specification (which will be the input of your model). But if instead you want the model to work with any program, then the program will have to become part of the input too, and you'll have to train it on a number of programs, which will have to be provided somehow.
> and figuring out how to build a problem dataset from existing codebases would be the hard part
This is the "Human Feedback" part that the tweet author talks about and the one that will always be flawed.
In the end, you are replacing the application code with a spec, which needs to have a comparable level of detail in order for the AI not to invent its own criteria.
If you have a test that completes with the expected outcome and hits the expected code paths you have a working test - I'd say that heuristic will get you really close with some tweaks.
That's a good point. A model that is capable of implementing a nonsense test is still better than a model that can't. The implementer model only needs a good variety of tests. They don't actually have to translate a prompt into a test.
The update affected less than 1% of all Windows machines. [1] Although it was maybe the biggest software failure in history, it was far from the biggest possible one. Given the level of cloud connectivity in the world, a failure like this could basically break the world if we didn't have diversity.
> Those two flight deck pilots had breathed-up all the oxygen in their breathing packs by the time they hit the sea, something confirmed by the empty breathing packs that were recovered. Which means they were alive when they hit the sea!
I don't understand how this follows. The best scenario is that they had their last drops of oxygen around hitting the sea; in other scenarios they died from lack of oxygen before hitting the sea.
> The best scenario is that they had their last drops of oxygen around hitting the sea; in other scenarios they died from lack of oxygen before hitting the sea.
See [0] for a summary. It appears that at least one unidentified crew member activated the air pack for Smith (the pilot) but not Scobee (the commander). Smith operated some switches after the break-up so was certainly conscious. The crew compartment was tumbling but not so fast as to cause blackouts.
> they would be in air dense enough not to even lose consciousness.
Assuming that they didn't need to take any action to switch from bottled oxygen to external air. Or that, if an action was required (like turning a valve or opening their visors), it was performed by them.
I do not know how that subsystem worked. Maybe someone else here knows?
I bet it becomes unique far, far less often than most people think.
Computing the number of permutations is thoroughly unconvincing.
For instance, there are 20 possible first moves, and of those probably only 2 are played 95% of the time. You can certainly compute what the rarest opening is, the rarest response that's actually played, or the rarest response to the most common opening.
> After a candidate's defeat in an election, you will be supplied with the "cause" of the voters' disgruntlement. Any conceivable cause can do. The media, however, go to great lengths to make the process "thorough" with their armies of fact-checkers. It is as if they wanted to be wrong with infinite precision (instead of accepting being approximately right, like a fable writer).
> they've gone from barely stringing together a TODO app to structuring and executing large-scale changes in entire repositories in 3 years.
No, they haven't. They're still at the step of barely stringing together a TODO app, and mostly because it's as simple as copying the gazillionth TODO app from GitHub.
I’ve used Copilot recently in my work codebase and it has absolutely no idea what’s going on in it. At best it’ll look at the currently open file. Half the time it can’t seem to comprehend even the current file fully. I’d be happy if it were better, but it’s simply not.
I did use ChatGPT, most recently today, to build a GitHub Actions YAML file based on my spec, and it saved me days of work. Not perfect, but close enough that I can fill in some details and be done. So sometimes it’s a good tool. It’s also an excellent rubber duck, often better than most of my coworkers. I don’t really know how to extrapolate what it’ll be in the future. I would guess we’ll hit some kind of limit that will be tricky to get past, because nothing scales forever.