I can't find a GitHub or email for Hannah - if you're reading this, I'd like to add Australian energy price data via Open Electricity[0] to the data (reach out via my profile)
So we're all going to hold onto Sequoia like we did Snow Leopard. The only reason I'm not buying a new Mac at the moment is that it would force me to upgrade.
The situation is absurd.
FWIW, switching to the Sequoia beta channel in System Settings killed the nag notifications for me (I believe the profile as defined in the OP will stop all updates, which you probably don't want)
From my understanding, Anthropic is now hiring a lot of experts in different fields who write content used to post-train models to make these decisions, and the results are constantly adjusted by the Anthropic team itself.
This is why the stacks in the report, and what CC suggests, closely match the latest developer "consensus".
Your suggestion would degrade the user experience and be noticed very quickly.
I guess that's why I'm not seeing anyone trying to build a marketplace for agent skills files. The LLM API will read in any skills you add to context as plain text, and then use your content to help populate its own skills files.
That's how Google search worked back when it was at its most useful. They had a large "editorial team" that manually tweaked page ranks on a site-by-site basis.
The core graph reputation based page ranking algorithm lasted for a hot second before people started gaming it. No idea what they do these days.
The providers have all found ways to optimize for higher cache hit rates within their own harness: common system prompts and so on. The more users hitting the cache, the more dramatically the cost of inference goes down.
What bothers me about much of the discussion around here about providers disallowing other harnesses on subscription plans is the complete lack of awareness that economies of scale from common caching practices across more users are what enable the higher, cheaper quotas subscriptions give you.
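To make the economics concrete, here's a rough back-of-the-envelope sketch. All prices and the cache discount are made up for illustration; they're not any provider's actual rates:

```python
# Hypothetical per-token prices, to illustrate why prefix caching lowers
# inference cost when many users share the same harness system prompt.
PRICE_UNCACHED = 3.00 / 1_000_000  # $/input token (invented rate)
PRICE_CACHED = 0.30 / 1_000_000    # cached prefix tokens (assumed 90% discount)

def request_cost(prefix_tokens: int, unique_tokens: int, cache_hit: bool) -> float:
    """Cost of one request; the shared prefix is cheap only on a cache hit."""
    prefix_price = PRICE_CACHED if cache_hit else PRICE_UNCACHED
    return prefix_tokens * prefix_price + unique_tokens * PRICE_UNCACHED

# A harness with a common 10k-token system prompt and 2k tokens of user input:
miss = request_cost(10_000, 2_000, cache_hit=False)
hit = request_cost(10_000, 2_000, cache_hit=True)
print(f"cache miss: ${miss:.4f}, cache hit: ${hit:.4f}")
# → cache miss: $0.0360, cache hit: $0.0090
```

With a fleet of users on an identical harness, almost every request after the first is a cache hit on the shared prefix, which is why the provider can afford to sell subscription quotas below nominal per-token pricing.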
I feel like a lot of this would go away if they made a different API for the "only for use with our client" subscriptions. A different API from the generic one, one that moves some of their client behaviors up to the server, seems like it would solve all this. People would still reverse engineer it to use it in other tools, but it would be less useful (due to the forced scaffolding instead of an entirely generic completions API) and would also ease the burden on their inference compute.
I'm sure they went with reusing the generic completions API to iterate faster and to make it easier to support both subscription and pay-per-token users in the same client, but it feels like they're burning trust/goodwill when a technical solution could at least be attempted.
> I feel like a lot of this would go away if they made a different API for the “only for use with our client” subscriptions.
They literally did exactly that. That's what people who "reverse engineer to use it in other tools" are being cut off from (Antigravity access, i.e. the private "only for use with our client" subscription - not the whole account, btw).
Nothing here is new or surprising; the problem has been the same since Anthropic released Claude Code and the Max subscriptions. The first thing people did then was try to auth regular API use with Claude Code tokens, so they wouldn't have to pay the API prices they were supposed to.
What I was getting at is that the current API is still a generic inference endpoint, just with OAuth instead of an API key. What I'm suggesting is that they move some of the client logic up to the OAuth endpoint so it's no longer a generic inference endpoint (e.g. the system prompt is static, context management is done on the server, etc.). I assume they could get it to a point where it's no longer useful for a general-purpose client like OpenClaw.
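A toy sketch of what such a restricted endpoint could look like, purely hypothetical; every name here is invented and none of this reflects any provider's actual API. The point is that the client sends only the new user turn, while the system prompt and conversation state live server-side:

```python
# Hypothetical server-side handler for a subscription-only endpoint.
# Because the system prompt is fixed on the server and context is
# assembled there, the endpoint can't be driven as a generic
# completions API by a third-party client.
from dataclasses import dataclass, field

STATIC_SYSTEM_PROMPT = "You are the first-party coding client."  # fixed server-side

@dataclass
class Session:
    turns: list = field(default_factory=list)

SESSIONS: dict = {}

def handle_turn(session_id: str, user_message: str) -> list:
    """Append the user turn and build the full prompt server-side."""
    session = SESSIONS.setdefault(session_id, Session())
    session.turns.append({"role": "user", "content": user_message})
    # Server-side context management would go here (truncation, tool
    # injection, etc.) - none of it controllable by the client.
    prompt = [{"role": "system", "content": STATIC_SYSTEM_PROMPT}] + session.turns
    return prompt  # handed to inference; never echoed back to the client

prompt = handle_turn("session-1", "fix the failing test")
print(len(prompt))  # → 2 (server-held system prompt + one user turn)
```

A reverse-engineered client could still call this, but it would be stuck with the forced scaffolding rather than free-form completions, which is the trade-off described above.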
> What we've found is that giving LLM security agents access to good tools (Semgrep, CodeQL, etc.) makes them significantly better
100% agree - I spun out an internal tool I've been using to close the loop on website audits in agents (focused more on website security + perf + SEO etc. rather than appsec), and the results so far have been remarkable:
Human-written rules, with an agent step that dynamically updates config to squash false positives (with verification) and find issues, while still allowing the LLM to reason.
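A minimal sketch of that loop, with all rule names and data invented; the real tool's rules and agent step are obviously more involved. Static rules flag issues, the agent triages likely false positives, and each proposed suppression is verified by re-running the rules before it lands in the config:

```python
def run_rules(rules, page, suppressions):
    """Apply human-written rules, skipping suppressed finding IDs."""
    return [r["id"] for r in rules
            if r["id"] not in suppressions and r["check"](page)]

def agent_triage(finding_id, page):
    """Stand-in for the LLM step that judges false positives.
    Assumption for this sketch: missing alt text on a decorative
    image is a false positive."""
    return finding_id == "missing-alt" and page.get("decorative", False)

rules = [
    {"id": "missing-alt", "check": lambda p: not p.get("alt")},
    {"id": "no-https", "check": lambda p: not p["url"].startswith("https")},
]
page = {"url": "https://example.com", "alt": "", "decorative": True}

suppressions = set()
for fid in run_rules(rules, page, suppressions):
    if agent_triage(fid, page):
        candidate = suppressions | {fid}
        # Verification: only keep the suppression if re-running the
        # rules confirms it actually removes the finding.
        if fid not in run_rules(rules, page, candidate):
            suppressions = candidate

print(sorted(run_rules(rules, page, suppressions)))  # → [] (remaining findings)
```

The verification re-run is what keeps the agent honest: a suppression the agent hallucinates but that doesn't change the rule output never makes it into the config.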
My understanding is that all recent gains come from post-training, and no one (publicly) knows how much scaling pretraining will still help at this point.
Happy to learn more about this if anyone has more information.
I still remember Gemini 1.5 Ultra and GPT-4.5 as extremely strong in some areas that no benchmark captures. It was probably not economical to serve them on a $20 subscription, but they felt different, and smarter in some ways. The benchmarks seem to be missing something, because Flash 3 was very close to 3 Pro on some benchmarks, but much, much dumber.
[0] https://explore.openelectricity.org.au/