wordpad · 2025-10-05T19:26:59 1759692419

TLDR for others:
* DeepSeek's cutting-edge models are still far behind
* At comparable performance, DeepSeek costs 35% more to run
* DeepSeek models are 12 times more susceptible to jailbreaking and malicious instructions
* DeepSeek models follow strict censorship

I guess none of these are a big deal to non-enterprise consumers.

nylonstrung · 2025-10-06T08:31:01 1759739461

Saying DeepSeek is more expensive is FUD

Token price on 3.2 exp is <5% of what the US LLMs charge, and it's very close in benchmarks, which we know that ChatGPT, Google, Grok and Claude have explicitly gamed to inflate their capabilities

ACCount37 · 2025-10-06T08:37:04 1759739824

And we "know" that how, exactly?

nylonstrung · 2025-10-06T09:30:26 1759743026

Read a study called "The Leaderboard Illusion", which credibly alleged that Meta, Google, OpenAI and Amazon got unfair treatment from LM Arena that distorted the benchmarks

They gave them special access to privately test and let them benchmark over and over without showing the failed tests

Meta got to privately test Llama 4 27 times to optimize it for high benchmark scores and was then allowed to report only the highest, cherry-picked score

Which makes sense because in real world applications Llama is recognized to be markedly inferior to models that scored lower
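To make the selection effect above concrete, here is a minimal sketch, assuming each private variant's leaderboard score is the same underlying score plus independent Gaussian noise. The function name and every number in it (the 1200 base score, the noise of 15 points) are hypothetical and chosen only for illustration; only the count of 27 comes from the comment above.

```python
import random
import statistics


def simulate_best_of_n(true_score, noise_sd, n_variants, rng):
    """Score n_variants noisy runs of one model; return (best, mean).

    Assumption: each run's score is true_score plus Gaussian noise.
    Reporting only the best run overstates the typical result.
    """
    scores = [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]
    return max(scores), statistics.mean(scores)


# Hypothetical numbers, not taken from any real leaderboard.
rng = random.Random(0)
best, mean = simulate_best_of_n(
    true_score=1200.0, noise_sd=15.0, n_variants=27, rng=rng
)
print(f"best of 27: {best:.1f}  mean of 27: {mean:.1f}")
```

The gap between the two printed numbers is pure selection: nothing about the model changed between runs.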

ACCount37 · 2025-10-06T14:03:03 1759759383

Which is one study that touches exactly one benchmark - and "credibly alleged" is being way too generous to it. The only case that was anywhere close to being proven LMArena fraud is Meta and Llama 4. Which is a nonentity now - nowhere near SOTA on anything, LMArena included.

Not that it makes LMArena a perfect benchmark. By now, everyone who wanted to push LMArena ratings at any cost knows what the human evaluators there are weak to, and what they should aim for.

But your claim of "we know that ChatGPT, Google, Grok and Claude have explicitly gamed <benchmarks> to inflate their capabilities" still has no leg to stand on.

nylonstrung · 2025-10-06T19:00:14 1759777214

There are a lot of other cases that extend well beyond LMArena where it was shown certain benchmark performance increases by the major US labs were only attributable to being over-optimized for the specific benchmarks. Some in ways that are not explainable by the benchmark tests merely contaminating the corpus.

There are cases where merely rewording the questions or assigning different letters to the answers dropped models like Llama by 30% in the evaluations while others were unchanged
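The letter-reassignment check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the method used in any of the evaluations mentioned: the two toy "models", the question set, and all function names are hypothetical, and a cyclic shift of the options stands in for "assigning different letters".

```python
def rotate_options(options, answer_index, shift):
    """Cyclically shift the options and return (options, new_answer_index)."""
    n = len(options)
    shift %= n
    rotated = options[shift:] + options[:shift]
    return rotated, (answer_index - shift) % n


def accuracy(model, questions, shift=0):
    """Fraction of questions answered correctly after shifting the options."""
    correct = 0
    for prompt, options, answer_index in questions:
        options, answer_index = rotate_options(options, answer_index, shift)
        if model(prompt, options) == answer_index:
            correct += 1
    return correct / len(questions)


# Hypothetical test set where the right answer always sits in slot A.
QUESTIONS = [
    ("2 + 2 = ?", ["4", "3", "5", "22"], 0),
    ("Capital of France?", ["Paris", "Rome", "Berlin", "Madrid"], 0),
]
ANSWER_KEY = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}


def always_first(prompt, options):
    """Toy model that memorised the answer position, not the content."""
    return 0


def reads_content(prompt, options):
    """Toy model that actually knows the answer text."""
    return options.index(ANSWER_KEY[prompt])


for model in (always_first, reads_content):
    print(model.__name__, accuracy(model, QUESTIONS), accuracy(model, QUESTIONS, shift=1))
```

A model that depends on answer position loses accuracy once the letters move, while one that reads the options does not.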

Open-LLM-Leaderboard had to rate limit because a "handful of labs" were doing so many evals in a single day that it hogged the entire eval cluster

“Coding Benchmarks Are Already Contaminated” (Ortiz et al., 2025).
“GSM-PLUS: A Re-translation Reveals Data Contamination” (Shi et al., ACL 2024).
“Prompt-Tuning Can Add 30 Points to TruthfulQA” (Perez et al., 2023).
“HellaSwag Can Be Gamed by a Linear Probe” (Rajpurohit & Berg-Kirkpatrick, EMNLP 2024).
“Label Bias Explains MMLU Jumps” (Hassan et al., arXiv 2025).
“HumanEval-Revival: A Re-typed Test for LLM Coding Ability” (Yang & Liu, ICML 2024 workshop).
“Data Contamination or Over-fitting? Detecting MMLU Memorisation in Open LLMs” (IBM, 2024).

And yes, I relied on an LLM to summarize these instead of reading the full papers

xpe · 2025-10-05T19:45:33 1759693533

> I urge everyone to go read the original report and _then_ to read this analysis and make up their own mind. Step away from the clickbait, go read the original report.

>> TLDR for others...

Facepalm.

wordpad · 2025-10-03T05:22:49 1759468969

Looks pretty useful.

How do you choose content? Do you manually identify high-quality resources and then just scrape and AI-summarize all the content they post? Or is it more granular?

andrewamurphy · 2025-10-03T05:25:35 1759469135

I seeded it with a few hundred items that I collated over the years. Then others added content

You can add via a URL and it attempts an AI summary of the website, or you can write your own.

wordpad · 2025-10-01T21:43:41 1759355021

Respecting and engaging with company politics in order to push good engineering decisions is one thing, but learning and playing a sport, I think, falls outside of "other duties as assigned" for an engineer.

wordpad · 2025-10-01T21:10:28 1759353028

The emergent behavior of LLMs being amazing at accurately predicting tokens in previously unseen conditions might be more powerful than more rigorous machine learning extrapolations.

Especially when you throw noisy subjective context at it.

mikepurvis · 2025-10-01T21:54:06 1759355646

The “prediction” in this case is I think some approximation of “ingest today’s news and social media buzz as it’s happening and predict what the financial news tomorrow morning will be.”

wordpad · 2025-09-21T03:43:30 1758426210

And not far from that are sentient toasters and doorbells.

wordpad · 2025-09-14T13:04:08 1757855048

What are you using for image generation? It works faster than anything I've seen

wordpad · 2025-09-10T15:29:59 1757518199

Why can't crawling be crowdsourced? It would solve IP rotation and spread the load

6510 · 2025-09-10T15:51:56 1757519516

https://yacy.net

catlikesshrimp · 2025-09-10T16:24:32 1757521472

Too bad it doesn't support Android. An Android device is much more energy efficient than anything else I can spare (for 100% uptime contribution)

Poomba · 2025-09-10T15:51:27 1757519487

That’s how residential proxies work, in a perverse way

chiefsearchaco · 2025-09-10T22:10:54 1757542254

Common Crawl sort of serves this function. I use it. It's a really good foundation.

wordpad · 2025-09-09T01:38:19 1757381899

I'm sure there is some art you'd shell out for. Maybe an original prop from your favorite movie or a collector's item from a time period you're nostalgic for...

wordpad · 2025-09-04T17:19:32 1757006372

What did you use to do it?

deadbabe · 2025-09-04T17:57:04 1757008624

wordpad · 2025-08-24T03:05:54 1756004754

Have you tried giving your AI productivity tool a personality so it can guilt trip you the same way?

karmakaze · 2025-08-24T03:40:42 1756006842

That's an interesting idea, adding an AI so it could be you pair programming with a colleague remotely over Tuple, a truple. Seriously though it would take out the frustration and flow-breaking pauses while interacting with an LLM.