
Tracking in an external system adds overhead not only for filing the issue, but also for triaging it, managing the backlog, re-triaging to see whether it's still a problem, and then closing it when it's finished. Issues in an external system may also be overlooked by developers working on this particular code.

There are plenty of small things that are worth fixing, but not worth as much as the overhead of tracking them.

TODO in code is easy to spot when someone is working on this code, and easy to delete when the code is refactored.


I think the key distinction is that tracking in an external system exposes the task for triage/management/prioritization to people who aren't reading the code, while a TODO comment often leaves the message in exactly the spot where a programmer would read it if the possibility of a problem became an actual problem that they had to debug.

In my experience, these can often be replaced with logger.Error("the todo") or throw new Exception("the todo"), which read about as well as //TODO: the todo, but also can bubble up to people not actually reading the code. Sometimes, though, there's no simple test to trigger that line of code, and it just needs to be a comment.


I've also seen good use of automation tools that monitor the codebase for TODOs and, if they last for more than a couple of weeks, escalate them into a "real" ticketing system.

(I've also seen that backfire and used to punish engineering.)

SonarQube, for instance, will flag TODOs as "Code Smells" and then has some automation capabilities to eventually escalate those to tickets in certain systems if plugins are configured.

I've also seen people do simpler things with GitHub Actions to auto-create GitHub Issues.


> I've also seen good use of automation tools that monitor the codebase for TODOs and, if they last for more than a couple of weeks, escalate them into a "real" ticketing system

I'm sorry, but that's exactly the kind of automation that sounds helpful in theory but ends up creating bloat and inefficiency in practice. Like the article says, the second a TODO gets a timer/deadline attached, it stops being a quick, lightweight note and turns into process overhead (note the distinction between something that is urgent and needs fixing now, and something that really is just a TODO).

Maybe a weird way to put it, but it's like a TODO that used to be lean and trail-ready - able to carry itself for miles over tough terrain with just some snacks and water - suddenly steps on a scale, gets labeled "overweight" and "bloated", flagged as a problem, and sent into the healthcare system. It loses its agility and becomes a burden.

"But the TODO is a serious problem that does need to get addressed now" Ok then it was never actually a TODO, and thats something to take up with the dev who wrote it. But most TODOs are actually just TODOs - not broken code, but helpful crumbs left by diligent, benevolent devs. And if you start to attack/accuse every TODO as "undone work that needed to be done yesterday" then youll just create a culture where devs are afraid to write them, which is really stupid and will just create even more inefficiency/pitfalls down the road - way more than if you had just accepted TODOs as natural occurrences in codebases


> Tracking in an external system adds overhead not only for filing the issue, but also for triaging it, managing the backlog, re-triaging to see whether it's still a problem, and then closing it when it's finished.

Which is already what you're doing in that system, and what the system is designed for.

Source code is not designed to track and manage issues and make sure they get prioritized, so you shouldn't be using your source code to do this.

We add TODOs during development, and then during review we either add a ticket and remove the TODO, or fix the issue as part of the PR and remove the TODO.


> Which is already what you're doing in that system, and what the system is designed for.

No it isn't. The system is designed to get managers to pay for it and it does that very well, it's very ineffective at tracking or triaging issues.

> Source code is not designed to track and manage issues and make sure they get prioritized, so you shouldn't be using your source code to do this.

Most things that people build systems to manage outside of the source code repo end up being managed less effectively that way.


This is just intellectually lazy IMO. "Ticket management software isn't good at managing tickets, it's just good at getting stupid CTOs to pay for it because execs are stupid didn't you guys know?"

I'm sure that's true for enterprise bloatware, but there are dozens of excellent open source and/or low cost issue trackers. Hell, Trello will do about 90% of what you need out of the box if you're operating with 1-3 people.


> This is just intellectually lazy IMO. "Ticket management software isn't good at managing tickets, it's just good at getting stupid CTOs to pay for it because execs are stupid didn't you guys know?"

Far more intellectually lazy to assume that because people pay for it it does something useful. Have you actually tried working without a ticketing system, not just throwing up your hands as soon as anything went wrong but making a serious attempt?


The main problem I have with ticketing (and project management) systems is that I can't get the people asking me to do things to use the system. I'll set it up and show them how to use it, and then they tell me about issues via email or text message or voice call. I end up entering the tickets/tasks myself, at which point I might as well be using my own org-mode setup.

Where I'm at, we have a bot that automatically creates a ticket for IT any time someone posts a message to the #it-help channel on Slack. It even automatically routes the ticket based on the content of the message with decent accuracy.

From all I've seen, heard, and read about that problem over the decades (yup, I think it's not mere years any more):

The only solution is to be rock-solid in refusing to do anything if there isn't a ticket for it. Your nine-thousand-percent-consistent reply to those emails, text messages, and voice calls needs to be "Yeah, make a ticket about it. I've shown you how, and that's the way we do it. No ticket, no action from me."

If you can't be that "mean" about it, you'll have to be a make-my-own-tickets doormat forever. In that perspective, doesn't feel all that "mean" any more, does it?


Oh well you use emacs, that's the problem. /s

Perhaps I've been spoiled having worked almost exclusively in organizations where it's completely acceptable to get a message on Slack, or Teams, or email, or whatever, with some bug or issue, and respond with "please create a ticket" and the person... creates a ticket.

Yeah, if nobody uses the system, or if you have to expend organizational capital to get them to do it (they view it as doing something for you instead of just doing their job), the system will definitely be worth less and be less helpful.


Has anyone ever made a language or extended a language with serious issue tracking? I can definitely imagine a system where every folder and file can have a paired ticket file where past, future and current tasks are tracked. Theoretically, it could even bind to a source management extension for the past ones. It won't ever be as powerful and manager-friendly as JIRA, but it would be good enough for many projects.

Fossil has project management features built in: https://fossil-scm.org/home/doc/trunk/www/index.wiki

There are various things built on git (the issues don't necessarily need representation in the current state of the source, after all), but I'm not aware of any with traction - they're hobby Show HN things that appeal to us, but not to product teams.

I think it'd be cooler to have it as part of the source and kind of build incrementally. So you'd have the bits in code that get added to the pair file that will then be added to the directory file... Then you can add other pairs for things like test results, and it could be decent. Some Lego, Logseq-esque thing :)

I wouldn't use git as a basis for it since then management is completely out. Hell, I'm probably out as well since I see git as a necessary evil.


> Source code is not designed to track and manage issues and make sure they get prioritized, so you shouldn't be using your source code to do this.

Indeed. Who in their right mind would think it is reasonable to track relevant tasks purposely outside of a system designed and used explicitly to get tasks done?

Also, no one prevents a developer from closing a ticket before triaging it. If you fix a TODO, just post a comment and close it. I mean, will your manager complain about effortlessly clearing the backlog? Come on.


You can leave the TODO in the comments; e.g., the ruff linter has an optional rule to disallow TODO comments unless they're followed by an issue URL.

If you put that in the CI, then you can use TODOs either as blockers you wish to fix before merging, or as long term comments to be fixed in a future ticket.


Some years ago, I started to use FIXME to indicate that something is blocking the PR and needs to be done before merging, and TODO if something can be done at a later point in time. Then CI only needs to grep for FIXME to block merging the PR, which works for practically any language. Works pretty well for me; maybe that tip can help others as well.

> There are plenty of small things that are worth fixing, but not worth as much as the overhead of tracking them

THANK YOU


> Tracking in an external system adds overhead not only for filing the issue, but also for triaging it, managing the backlog, re-triaging to see whether it's still a problem, and then closing it when it's finished.

Filing the issue can take as long as writing the TODO message.

Triaging it, backlog management, re-triaging to see if it's still a problem... It's called working on the issue. I mean, do you plan on working on a TODO without knowing if it is still a problem? Come on.

> Issues in an external system may also be overlooked by developers working on this particular code.

I stumbled upon TODO entries that were over a decade old. TODOs in the code are designed to be overlooked.

The external system was adopted and was purposely designed to help developers track issues, including bugs.

You are also somehow assuming that there is no overhead in committing TODO messages. I mean, you need to post and review a PR to update a TODO message? How nuts is that.

> There are plenty of small things that are worth fixing, but not worth as much as the overhead of tracking them.

If those small things are worth fixing, they are worth filing a ticket.

If something you perceive as an issue is not worth the trouble of tracking, it's also not worth creating a comment to track it.


This gives me an idea for a source control/task tracking system where TODOs in code get automatically turned into tickets in your tracker, and then removed from your code automatically.

That way you don't fill your code with a list of TODOs and you'll still be able to track what you want to improve in your codebase.

It might not be the right tool for everyone, but I'd love it.


> TODOs in code get automatically turned into tickets in your tracker, and then removed from your code automatically.

Better yet (IMO), not removed but replaced by a ticket number or link to the issue.


Check out Puzzle Driven Development.

There are some artificial limitations, but I love the upside: I don't need defensive programming!

When my function gets an exclusive reference to an object, I know for sure that it won't be touched by the caller while I use it, but I can still mutate it freely. I never need to make deep copies of inputs defensively just in case the caller tries to keep a reference to somewhere in the object they've passed to my function.

And conversely, as a user of libraries, I can look at an API of any function and know whether it will only temporarily look at its arguments (and I can then modify or destroy them without consequences), or whether it keeps them, or whether they're shared between the caller and the callee.

All of this is especially important in multi-threaded code where a function holding on to a reference for too long, or mutating something unexpectedly, can cause painful-to-debug bugs. Once you know the limitations of the borrow checker, and how to work with or around them, it's not that hard. Dealing with a picky compiler is IMHO still preferable to dealing with mysterious bugs from unexpectedly-mutated state.

In a way, the borrow checker also makes interfaces simpler. The rules may be restrictive, but the same rules apply to everything everywhere. I can learn them once, and then know what to expect from every API using references. There are no exceptions in libraries that try to be clever. There are no exceptions for single-threaded programs. There are no exceptions for DLLs. There are no exceptions for programs built with -fpointers-go-sideways. It may be tricky like a game of chess, but I only need to consider the rules of the game, and not odd stuff like whether my opponent glued pieces to the chessboard.


Yes! One of the worst bugs to debug in my entire career boiled down to a piece of Java mutating a HashSet that it received from another component. That other component had independently made the decision to cache these HashSet instances. Boom! Spooky failure scenarios where requests only start to fail if you previously made an unrelated request that happened to mutate the cached object.

This is an example where ownership semantics would have prevented that bug. (references to the cached HashSets could have only been handed out as shared/immutable references; the mutation of the cached HashSet could not have happened).

The ownership model is about much more than just memory safety. This is why I tell people: spending a weekend to learn rust will make you a better programmer in any language (because you will start thinking about proper ownership even in GC-ed languages).


A weekend?

Yeah that's definitely optimistic. More like 1-6 months depending on how intensively you learn. It's still worth it though. It easily takes as long to learn C++ and nobody talks about how that is too much.

Yes. I learned Rust in a weekend. Basic Rust isn't that complicated, especially when you listen to the compiler's error messages (which are 42x as helpful compared with C++ compiler errors).

Damn. You are a smart person. It’s taken me months and I’m still not confident. But I was coming from interpreted languages (+ small experience with c).

> This is an example where ownership semantics would have prevented that bug.

It’s also a bug prevented by basic good practices in Java. You can’t cache copies of mutable data and you can’t mutate shared data. Yes it’s a shame that Java won’t help you do that but I honestly never see mistakes like this except in code review for very junior developers.


The whole point is that languages like Java won't keep track of what's "shared" or "mutable" for you. And no, it doesn't just trip up "very junior developers in code review", quite the opposite. It typically comes up as surprising cross-module interactions in evolving code bases, that no "code review" process can feasibly catch.

Speak for yourself. I haven't seen any bug like this in Java for years. You think you know better and my experience is not valid? Ha. Ok. Keep living in your dreams.

Yes I think he knows better and your experience is not valid.

Well, maybe not valid, but insufficient at least.


> When my function gets an exclusive reference to an object, I know for sure that it won't be touched by the caller while I use it, but I can still mutate it freely.

I love how this very real problem can be solved in two ways:

1. Avoid non-exclusive mutable references to objects

2. Avoid mutable objects

The former approach results in pervasive complexity and rigidity (Rust); the latter results in pervasive simplicity and flexibility (Clojure).


Shared mutable state is the root of all evil, and it can be solved either by completely banning sharing (actors) or by banning mutation (functional), but Rust gives fine-grained control that lets you choose on a case-by-case basis, without completely giving up either one. In Rust, immutability is not a property of an object, but a mode of access.

It's also silly to blame Rust for not having the flexibility of a high-level, GC-heavy, VM-based language. Rust deliberately focuses on the extreme opposite: the low-level, high-performance systems programming niche, where Clojure isn't an option.


There may be lots of uninformed post-hoc rationalizations now, but it couldn't have started with everyone collectively deciding to irrationally dislike Ada, and not even try it. I suspect it's not even the ignorant slander that is the cause of Ada's unpopularity.

Other languages survive being called designed by committee or having ugly syntax. People talk shit about C++ all the time. PHP is still alive despite getting so much hate. However, there are rational reasons why these languages are used, they're just more complicated than beauty of the language itself, and are due to complex market forces, ecosystems, unique capabilities, etc.

I'm not qualified to answer why Ada isn't more popular, but an explanation implying there was nothing wrong with it, only everyone out of the blue decided to irrationally dislike it, seems shallow to me.


The related "Belgium is unsafe for CVD" post explains that if you discover any vulnerability in anything in Belgium, it automatically creates a legal obligation on you, with a 24h deadline, to report this secretly and exclusively to Belgian authorities, with logs of everything you've done, even if you're not a Belgian citizen and don't reside in Belgium.

This is a very short deadline, with onerous requirements. They most likely won't give you permission to share any information about this vulnerability with anyone else. If it's a common vulnerability affecting non-Belgian entities, you'll be required to leave them uninformed and vulnerable.

The most rational response for law-abiding vulnerability researchers is to stay away from everything Belgian and never report anything to them.


Unfortunately, this sounds like very wise advice.

You'd think you'd rather encourage and reward researchers who ethically hack your systems than have GCHQ do it, as happened recently.

(https://www.infosecurity-magazine.com/news/how-gchq-hacked-b...)


Sharing one SQLite connection across the process would necessarily serialize all writes from the process: the writes within the process would no longer be concurrent. It also won't do anything for contention with external processes.

Basically, it adds its own write lock outside of SQLite, because the pool can implement the lock in a less annoying way.


I don't understand, all writes to a single sqlite DB are going to be serialized no matter what you do.

> Basically, it adds its own write lock outside of SQLite, because the pool can implement the lock in a less annoying way.

Less annoying how? What is the difference?


SQLite's lock is blocking, with a timeout that aborts the transaction. An async runtime can have a non-blocking lock that allows other tasks to proceed in the meantime, and is able to wait indefinitely without breaking transactions.


What's the benefit of this over just doing PRAGMA busy_timeout = 0; to make it non-blocking?

After all, as far as I understand, the busy timeout only occurs at the beginning of a write transaction, so it's not like you have to redo a bunch of queries.


Buying has no accountability, no judges. Probably not even a proper paper trail.

They're spending public money, so the cost doesn't matter to them either. With this administration they can get unlimited funding.


Why would they be less likely to be bombed? Zaporizhzhia Nuclear Power Plant got bombed in 2022.

There's no strong deterrent there. These plants don't blow up like nukes, or even Chernobyl. Nuclear disasters require very precise conditions to sustain the chain reaction. Blowing up a reactor with conventional weapons will spread the fuel around, which is nasty pollution, but localized enough that it's the victim's problem, not the aggressor's.

Why do you even mention transformers and cables as an implied alternative to nuclear power plants? Power plants absolutely require power distribution infrastructure, which is vulnerable to attacks.

From the perspective of resiliency against military attacks, solar + batteries seem the best - you can have them distributed without any central point of failure, you can move them, and the deployments can be as large or small as you want.

(BTW, this isn't an argument against nuclear energy in general. It's safe, and we should build more of it, and build as much solar as we can, too.)


Nuclear plants and their cooling towers tend to be made of reinforced concrete. That makes them harder to bomb. If you want to take out power, you bomb the transmission lines or substations instead, as they are far less durable.

I recall hearing in school that the 9/11 masterminds had considered flying planes into nuclear power plants but abandoned it after doing the math and realizing it would do little damage. Not sure how true that is, admittedly.


Depends what you're trying to protect yourself from.

Reinforced concrete is great if they're just shelling you. Sure, all the outdoor infrastructure will be toast but your reactor probably won't get damaged. It'll take a bit to get back on the grid but you don't need to rebuild the plant.

Bunker busters, on the other hand, eat reinforced concrete for breakfast. A pinpoint strike into each reactor hall and you're down for good.

The former is cheaper, less risky for the attacker, and hurts you bad enough for most military purposes, so the latter isn't really worth worrying about unless you're Iran or North Korea.


The US is not that exceptional nor principled. The concept of "freedom of speech" is absolute when Republicans want to say Republican things, but it's a "national security issue" when Muslims make too much noise. When sexual minorities want to speak, the priority is to "protect family values" instead. Corporations have "freedom of speech", but TikTok boosting black-green-red flags isn't protected speech, but an agent of the enemy corrupting the youth.

European countries have their own dogmas and hypocrisy; they only draw the line at different topics (especially where everyone had their grandparents traumatized in a war started by Grok's favorite character).


Could you give examples of when a U.S. citizen's speech rights were legally taken away? Let's go with one of your examples, "when sexual minorities want to speak". Please elaborate.

None of the examples you gave are actually examples of speech being restricted. It's people (sometimes politicians) freely voicing their opinions on others' speech; that is not restriction.


Literally in the last week, the Supreme Court ruled that books featuring gay couples need to be opt-out in schools. They've quite literally taken the stance that someone literally just seeing the existence of a gay couple in a children's picture book is a violation of their freedom.


> They've quite literally taken the stance that someone literally just seeing the existence of a gay couple in a children's picture book is a violation of their freedom.

No.

They've taken the stance that parents get to decide what books their kids see.

Other parents are free to make a different decision.

Do you really think that there's a "right" to force others to read books that you choose?


> They've taken the stance that parents get to decide what books their kids see.

So why draw the line at books depicting gay couples, rather than literally all books? Because this has nothing to do with the ban, except for being a “family-friendly” bullshit justification.


They didn't draw the line there; that's the case that was brought forth. That's how the courts work.


> that's case that was brought forth

That's not how the Supreme Court works. They are selective about the cases they hear. Especially looking at a 6-3 ruling with this court it's clear to see this was an ideological selection.


So that case was not brought before the Supreme Court for them to rule on? They rule on that specific case.


Yes, the case was appealed to the Supreme Court who chose to hear it instead of choosing not to hear it. That is ultimately why they ruled on the case.

Given that, it really does seem that the court ruled 6-3 in favor of the plaintiffs who are trying to draw a line around gay couples because the court is trying to draw a line around gay couples.


Other parents making a different decision doesn't matter if the schools find it virtually impossible to have these books because of the logistical requirements of allowing kids to leave the classroom every time certain books are read.

> Do you really think that there's a "right" to force others to read books that you choose?

Do I really think that public schools have a right to assign reading of certain books for classes? Is this even a real question? How do you think English classes work?


I don't think memory mapping does anything to prevent false sharing. All threads still get the same data at the same address. You may get page alignment for the file, but the free-form data in the file still crosses page boundaries and cache lines.

Also you don't get contention when you don't write to the memory.

The speedup may be from just starting the work before the whole file is loaded, allowing the OS to prefetch the rest in parallel.

You probably would get the same result if you loaded the file in smaller chunks.


You could get 100% on the benchmark with an SQL query that pulls the answers from the dataset, but it wouldn't mean your SQL query is more capable than LLMs that didn't do as well in this benchmark.

We want benchmarks to be representative of performance in general (in novel problems with novel data we don't have answers for), not merely of memorization of this specific dataset.


My question, perhaps asked in too oblique of a fashion, was why the other LLMs — surely trained on the answers to Connections puzzles too — didn't do as well on this benchmark. Did the data harvesting vacuums at Google and OpenAI really manage to exclude every reference to Connections solutions posted across the internet?

LLM weights are, in a very real sense, lossy compression of the training data. If Grok is scoring better, it speaks to the fidelity of their lossy compression as compared to others.


There's a difficult balance between letting the model simply memorize inputs and forcing it to figure out generalisations.

When a model is "lossy" and can't reproduce the data by copying, it's forced to come up with rules to synthesise the answers instead, and this is usually the "intelligent" behavior we want. It should be forced to learn how multiplication works instead of storing every combination of numbers as a fact.

Compression is related to intelligence: https://en.wikipedia.org/wiki/Kolmogorov_complexity


You're not answering the question. Grok 4 also performs better on the semi-private evaluation sets for ARC-AGI-1 and ARC-AGI-2. It's across-the-board better.


If these things are truly exhibiting general reasoning, why do the same models do significantly worse on ARC-AGI-2, which is practically identical to ARC-AGI-1?


It's not identical. ARC-AGI-2 is more difficult, both for AI and for humans. In ARC-AGI-1 you kept track of one (or maybe two) kinds of transformations or patterns. In ARC-AGI-2 you are dealing with at least three, and the transformations interact with one another in more complex ways.

Reasoning isn't an on-off switch. It's a hill that needs climbing. The models are getting better at complex and novel tasks.


This simply isn’t the case. Humans actually perform better on ARC-AGI-2, according to their website: https://arcprize.org/leaderboard


The 100.0% you see there just verifies that all the puzzles got solved by at least 2 people on the panel. That was calibrated to be so for ARC-AGI-2. The human panel averages for ARC-AGI-1 and ARC-AGI-2 are 64.2% and 60% respectively. Not a huge difference, sure, but it is there.

I've played around with both, yes, I'd also personally say that v2 is harder. Overall a better benchmark. ARC-AGI-3 will be a set of interactive games. I think they're moving in the right direction if they want to measure general reasoning.


There are many basic techniques in machine learning designed specifically to avoid memorizing training data. I contend any benchmark which can be “cheated” via memorizing training data is approximately useless. I think comparing how the models perform on say, today’s Connections would be far more informative despite the sample being much smaller. (Or rather any set for which we could guarantee the model hasn’t seen the answer, which I suppose is difficult to achieve since the Connections answers are likely Google-able within hours if not minutes).

