Those people seem to have a passion for developing package managers (instead of just seeing them as tools that need to do a job), and as long as that is the case, I don't see how we wouldn't end up with one new package manager every year.
Aren't you screwed from the moment you have a malicious user in your workspace? That user can change their picture/name and directly ask for the API key, send a phishing link, or run whatever social engineering is fundamentally possible in any instant-messaging system.
There are a lot of public Slacks for SaaS companies. Phishing can be detected by serious users (especially when the messages seem phishy), but an indirect AI leak does not put you in "defense mode"; all it takes is one accidental click.
It can take an arbitrary amount of time. Modules are just code executed top to bottom, and might contain anything beyond mere constants and function declarations.
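A minimal Python sketch of that (the module name and the slow work are made up for illustration): importing a module runs everything at its top level, so the import can block for an arbitrary amount of time.

    # slow_module.py -- hypothetical module; all top-level code runs on import
    import time
    import urllib.request

    RETRIES = 3                      # a mere constant

    def helper():                    # a function declaration
        return RETRIES

    # ...but nothing stops a module from doing real work at import time:
    time.sleep(5)                                                 # arbitrary delay
    PAGE = urllib.request.urlopen("https://example.com").read()   # network I/O on import

A plain `import slow_module` elsewhere blocks until all of that has finished.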
It's easy to imagine why something could never work.
It's more interesting to imagine what just might work. One thing that has plagued programmers for decades is the difficulty of writing correct multi-threaded software. You need fine-grained locking, otherwise your threads will waste time waiting on mutexes. But color-coding your program to constrain which parts of your code can touch which data, and when, is tedious and error-prone. If LLMs can annotate code sufficiently for a SAT solver to prove thread safety, that's a huge win. And that's just one example.
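To make the locking half of that concrete, here is a minimal Python sketch (the Counter class is made up): one lock per piece of state instead of one big lock, so threads touching different data don't queue behind each other. Keeping track of which lock guards which data is exactly the tedious, error-prone bookkeeping that annotations plus a solver could take over.

    import threading

    class Counter:
        """Fine-grained locking: each counter carries its own lock,
        so threads updating different counters never contend."""
        def __init__(self):
            self._lock = threading.Lock()   # guards only self._value
            self._value = 0

        def increment(self):
            with self._lock:
                self._value += 1

    a, b = Counter(), Counter()
    threads = [threading.Thread(target=a.increment) for _ in range(4)]
    threads += [threading.Thread(target=b.increment) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(a._value, b._value)   # 4 4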
Code coverage exists. Shouldn't be hard at all to tune the parameters to get what you want. We have really good tools to reason about code programmatically - linters, analyzers, coverage, etc.
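For instance, coverage.py (the `coverage` package) can be driven from Python directly; a minimal sketch, where `my_module` and `do_something` are placeholders for whatever code is under test:

    import coverage

    cov = coverage.Coverage()
    cov.start()

    import my_module          # placeholder: the code under test
    my_module.do_something()  # placeholder: exercise it

    cov.stop()
    cov.save()
    cov.report(show_missing=True)   # per-file line coverage, with missed lines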
In my experience they are OK (not excellent) at checking whether some code will crash or not. But checking whether the code logic is correct with respect to the requirements is far from being automated.
But for writing tests that's less of an issue.
You start with known good/bad code and ask it to write tests against a spec for some code X - then the evaluation criterion is something like: did the test cover the expected lines and produce the expected outcome (success/fail)? Pepper in lint rules for preferred style, etc.
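A rough Python sketch of that scoring loop, just to pin down the idea. Everything here is made up: the PYTHONPATH wiring for swapping implementations, the file paths, and the equal weighting; a coverage threshold could be bolted on the same way (e.g. via pytest-cov).

    import os
    import subprocess

    def run_pytest(test_file, impl_dir):
        """Run the generated test against one implementation and return the exit code.
        (Selecting the implementation via PYTHONPATH is just one possible wiring.)"""
        env = {**os.environ, "PYTHONPATH": impl_dir}
        return subprocess.run(["pytest", "-q", test_file], env=env).returncode

    def score_generated_test(test_file, good_impl_dir, bad_impl_dir):
        """Reward for a generated test: pass on the known-good code,
        fail on the known-bad code, and keep the linter happy."""
        passes_good = run_pytest(test_file, good_impl_dir) == 0
        catches_bad = run_pytest(test_file, bad_impl_dir) != 0
        lint_ok = subprocess.run(["ruff", "check", test_file]).returncode == 0
        return (passes_good + catches_bad + lint_ok) / 3.0   # crude reward in [0, 1]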
But this will lead you to the same problem the tweet is talking about! You are training a reward model based on human feedback (whether the code satisfies the specification or not). This time the human feedback may seem more objective, but in the end it's still non-exhaustive human feedback, which will leave the reward model vulnerable to some adversarial inputs that the other model will likely pick up pretty quickly.
The input data is still human produced. Who decides what is code that follows the specification and what is code that doesn't? And who produces that code? Are you sure that the code that another model produces will look like that? If not then nothing will prevent you from running into adversarial inputs.
And sure, coverage and lints are objective metrics, but they don't directly imply the correctness of a test. Some tests can reach a high coverage and pass all the lint checks but still be incorrect or test the wrong thing!
Whether the test passes or not is mostly correlated with whether it's correct or not. But similarly, for an image recognizer, the question of whether an image is a flower or not is also objective and correlated, and yet researchers keep finding adversarial inputs for image recognizers due to bias in their training data. What makes you think this won't happen here too?
So are the rules for the game of Go or chess? Specifying code that satisfies (or doesn't satisfy) the spec is a problem statement - evaluation is automatic.
> but they don't directly imply the correctness of a test.
I'd be willing to bet that if you start with an existing coding model and continue training it with coverage/lint metrics and evaluation as feedback, you'd get better at generating tests. It would be slow, and figuring out how to build a problem dataset from existing codebases would be the hard part.
The rules are well defined and you can easily write a program that will tell whether a move is valid or not, or whether a game has been won or not. This allows you to generate a virtually infinite amount of data to train the model on without human intervention.
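A toy illustration of the contrast, using tic-tac-toe instead of Go or chess to keep it tiny: legality and win detection are a few lines of plain Python, so a self-play loop can label unlimited positions with no human in the loop.

    WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
                 (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
                 (0, 4, 8), (2, 4, 6)]              # diagonals

    def is_legal(board, square):
        """A move is legal iff the target square is empty. board is a list of 9 cells."""
        return board[square] == ""

    def winner(board):
        """Return "X" or "O" if someone has won, else None."""
        for a, b, c in WIN_LINES:
            if board[a] != "" and board[a] == board[b] == board[c]:
                return board[a]
        return None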
> Specifying code that satisfies (or doesn't satisfy) is a problem statement
This would be true if you fix one specific program (just like in Go or chess you fix the specific rules of the game and then train a model on those) and want to know whether that specific program satisfies some given specification (which will be the input of your model). But if instead you want the model to work with any program, then the program will have to become part of the input too, and you'll have to train it on a number of programs, which will have to be provided somehow.
> and figuring out how to build a problem dataset from existing codebases would be the hard part
This is the "Human Feedback" part that the tweet author talks about and the one that will always be flawed.
In the end, you are replacing the application code with a spec, which needs to have a comparable level of detail in order for the AI not to invent its own criteria.
If you have a test that completes with the expected outcome and hits the expected code paths you have a working test - I'd say that heuristic will get you really close with some tweaks.
That's a good point. A model that is capable of implementing a nonsense test is still better than a model that can't. The implementer model only needs a good variety of tests. They don't actually have to translate a prompt into a test.
The update affected less than 1% of all Windows machines. [1] Although it was maybe the biggest software failure in history, it was far from the biggest possible one. Given the level of cloud connectivity in the world, a failure like this could basically break the world if we didn't have diversity.
> Those two flight deck pilots had breathed-up all the oxygen in their breathing packs by the time they hit the sea, something confirmed by the empty breathing packs that were recovered. Which means they were alive when they hit the sea!
I don't understand how this follows. The best scenario is that they had their last drops of oxygen around hitting the sea; in other scenarios they died from lack of oxygen before hitting the sea.
> The best scenario is that they had their last drops of oxygen around hitting the sea; in other scenarios they died from lack of oxygen before hitting the sea.
See [0] for a summary. It appears that at least one unidentified crew member activated the air pack for Smith (the pilot) but not Scobee (the commander). Smith operated some switches after the break-up so was certainly conscious. The crew compartment was tumbling but not so fast as to cause blackouts.
> they would be in air dense enough not to even lose consciousness.
Assuming that they didn't need to take any action to switch from bottled oxygen to external air. Or that, if an action was required (like turning a valve or opening their visors), it was performed by them.
I do not know how that subsystem worked. Maybe someone else here knows?
I bet it becomes unique far, far less often than most people think.
Computing the number of permutations is thoroughly unconvincing.
For instance, there are 20 possible first moves, and of those probably only 2 are played 95% of the time. You can certainly compute what the rarest opening is, the rarest response that's actually played, or the rarest response to the most common opening.
> After a candidate's defeat in an election, you will be supplied with the "cause" of the voters' disgruntlement. Any conceivable cause can do. The media, however, go to great lengths to make the process "thorough" with their armies of fact-checkers. It is as if they wanted to be wrong with infinite precision (instead of accepting being approximately right, like a fable writer).
> they've gone from barely stringing together a TODO app to structuring and executing large-scale changes in entire repositories in 3 years.
No, they haven't. They're still at the step of barely stringing together a TODO app, and mostly because it's as simple as copying the gazillionth TODO app from GitHub.
I’ve used Copilot recently in my work codebase and it has absolutely no idea what’s going on in it. At best it’ll look at the currently open file. Half the time it can’t seem to comprehend even the current file fully. I’d be happy if it were better, but it’s simply not.
I did use ChatGPT, most recently today, to build a GitHub Actions YAML file based on my spec, and it saved me days of work. Not perfect, but close enough that I can fill in some details and be done. So sometimes it’s a good tool. It’s also an excellent rubber duck, often better than most of my coworkers. I don’t really know how to extrapolate what it’ll be in the future. I would guess we’ll hit some kind of limit that will be tricky to get past, because nothing scales forever.