There is also a tradeoff between different vocabulary sizes (how many entries exist in the token -> embedding lookup table) that informs the current shape of tokenizers and LLMs. (Below is my semi-armchair stance, but you can read more in depth here[0][1].)
If you tokenized at the character level ('a' -> embedding), your vocabulary size would be small, but you'd need many more tokens to represent most content. (And attention cost scales non-linearly with sequence length, roughly n^2.) It would also be a bit more 'fuzzy' in terms of teaching the LLM what a specific token should 'mean': the letter 'a' appears in a _lot_ of different words, so it's far more ambiguous for the LLM.
On the flip side: what if you had one entry in the tokenizer's vocabulary for every word that exists? Well, it'd be far more than the ~100k entries used by popular LLMs, and that has computational costs: the softmax over the 'next' token has to be computed across every entry in the vocabulary, and the embedding and output layers grow with vocabulary size (more memory and compute per token, basically).
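To make the two ends of that tradeoff concrete, here's a toy sketch (pure Python, with whitespace-split 'words' standing in for a real tokenizer and an inline sample string as the corpus; all numbers are purely illustrative):

```python
# Toy illustration of the vocab-size vs. sequence-length tradeoff.
# Whitespace splitting stands in for word-level tokenization; real BPE
# tokenizers (~100k entries) sit between the two extremes shown here.
sample = (
    "There is a tradeoff between vocabulary size and sequence length: "
    "a small vocabulary means long sequences, while a huge vocabulary "
    "means short sequences but a wide softmax and big embedding tables."
)

char_vocab, char_tokens = set(sample), list(sample)   # tiny vocab, long sequence
word_tokens = sample.split()                          # short sequence...
word_vocab = set(word_tokens)                         # ...but vocab grows with the corpus

print(f"char-level: vocab={len(char_vocab)}, sequence length={len(char_tokens)}")
print(f"word-level: vocab={len(word_vocab)}, sequence length={len(word_tokens)}")

# On a real corpus the character vocab stays around ~100 symbols while a
# word vocab easily reaches hundreds of thousands of entries. The next-token
# softmax is O(vocab_size) per step and the embedding/output matrices are
# (vocab_size x d_model), so vocabulary size trades sequence length against
# parameter count and softmax width.
d_model = 4096  # illustrative hidden size
print(f"100k-entry vocab at d_model={d_model}: {100_000 * d_model:,} embedding params")
```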
Additionally, you run into a new problem: 'Rare Tokens'. Basically, if you have infinite tokens, you'll run into specific tokens that only appear a handful of times in the training data and the model is never able to fully imbue the tokens with enough meaning for them to _help_ the model during inference. (A specific example being somebody's username on the internet.)
Fun fact: These rare tokens, often called 'Glitch Tokens'[2], have been used for all sorts of shenanigans[3] as humans learn to break these models. (This is my interest in this as somebody who works in AI security)
As LLMs have improved, models have pushed towards the largest vocabulary they can get away with without hurting performance. This is about where my knowledge on the subject ends, but there have been many analyses done to try to compute the optimal vocabulary size. (See the links below)
One area I have been spending a lot of time thinking about is what tokenization looks like if we start trying to represent 'higher order' concepts without using human vocabulary for them. One example: tokenizing on LLVM bitcode (to represent code more 'densely' than UTF-8 source), or tokenizing directly against the final layers of state in a small LLM (using the small LLM to 'grok' the meaning and hoist it into a denser, almost compressed latent space that the large LLM can understand).
It would be cool if Claude Code, when it's talking to the big, non-local model, was able to make an MCP call to a model running on your laptop to say 'hey, go through all of the code and give me the general vibe of each file, then append those tokens to the conversation'. It'd be a lot fewer tokens than just directly uploading all of the code, and it _feels_ like it would be better than uploading chunks of code based on regex like it does today...
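As a rough sketch of what that local 'vibe summarizer' could look like (hypothetical: it assumes the official Python MCP SDK's FastMCP helper and a small model served by Ollama on localhost; the tool name and prompt are mine, and none of this is something Claude Code does today):

```python
# Hypothetical MCP server exposing a "summarize the repo with a local model" tool.
# Assumes: `pip install mcp requests`, and Ollama running at localhost:11434.
from pathlib import Path

import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("local-vibe-summarizer")


@mcp.tool()
def summarize_repo(root: str, pattern: str = "**/*.py") -> str:
    """Return a one-line 'vibe' summary per file, produced by a small local model."""
    lines = []
    for path in Path(root).glob(pattern):
        text = path.read_text(errors="ignore")[:4000]  # keep the prompt small
        resp = requests.post(
            "http://localhost:11434/api/generate",  # Ollama's generate endpoint
            json={
                "model": "qwen2.5:3b",  # any small local model will do
                "prompt": f"In one sentence, what does this file do?\n\n{text}",
                "stream": False,
            },
            timeout=120,
        )
        lines.append(f"{path}: {resp.json()['response'].strip()}")
    return "\n".join(lines)


if __name__ == "__main__":
    mcp.run()  # stdio transport by default, so a CLI client can spawn it
```

You'd register that as an MCP server in whatever client you use and let the big model call `summarize_repo` instead of reading every file itself.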
This immediately makes the model's inner state (even more) opaque to outside analysis, though. It's the same reason using gRPC as the protocol for your JavaScript front-end sucks: humans can't debug it anymore without extra tooling. JSON is verbose as hell, but it's simple, and I can debug my REST API with just the network inspector; I don't need the underlying Protobuf files, the way I would to understand what each byte means in my gRPC messages. That's a nice property to have when reviewing my ChatGPT logs too :P
Short answer: it's hard to say definitively, but most people treat the start of the Holocene as a lower bound on this sort of thing. I'll try to explain, but keep in mind that I'm massively simplifying a huge field of open questions.
The fundamental assumption underlying most archaeology is that changes in material culture broadly reflect people reacting to the world around them in intelligent ways. Most archaeologists therefore believe that Pleistocene people didn't build permanent structures out of stone because nomadic or seminomadic lifestyles were a better fit for the chaotic Pleistocene environments found globally. A few people disagree with the universality of this idea, most famously the authors of The Dawn of Everything, who argue for a more diverse family of lifeways among early humans, but that's quibbling at the edges of the overall narrative rather than rewriting it.
And we'd expect to have more evidence than we do if the Holocene boundary weren't the effective start date for this kind of structure. Cave environments are much more stable, and that's where much of our evidence comes from. Göbekli Tepe (GT) and the other Taş Tepeler sites are made of local limestone, an extremely erosion-prone rock. We have sites buried under existing cities like Jericho, the earliest layers of which date from around the same time as GT. We also have older structures, like the Epigravettian mammoth-bone huts, and a fairly good idea of the forager-to-farmer transition in the Near East across the Natufian culture; GT is actually thought to be part of that transition.
But yes, a lot of organic material from the Pleistocene is gone. Organics were probably the dominant building material, so that leaves a huge gap we're still struggling with. Not really sure where I'm going with this, so I guess I'll stop here?
There are specialized architectures (the Tolman-Eichenbaum Machine)* that are able to complete this kind of task. Interestingly, once trained, their activations look strikingly similar to place and grid cells in real brains. The team were also able to show (in a separate paper) that the TEM is mathematically equivalent to a transformer.
And then the providers ship a landmark feature or overhaul themselves, especially as their models advance.
Wrappers are stuck perpetually chasing the support and feature parity of today.
Anthropic's Claude Code will look a hell of a lot different a year from now, probably more like an OS for developers, with Claude Agent for non-technical users. Regardless, they are eating the stack.
Pricing/usage will be very simple: a fixed subscription. We will no longer see the tokenomics, because the provider will have abstracted and optimized the cost per token, favoring a model where they can optimize margin against a fixed revenue floor.
I've been contributing to an open source mobile app [1] that takes two swings at offering something that Roo does not have.
1. Real-time sync of CLI coding agent state to your phone. Granted, this doesn't give you any new coding capabilities; you won't be making any different changes from your phone, and I would still choose to make a code change on my computer. But the fact that it's only slightly worse (you just wish you had a bigger screen) is still an innovation. Making Claude Code usable from anywhere changes when you can work, even if it doesn't change what you can do. I wrote a post trying to explain why this matters in practice. https://happy.engineering/docs/features/real-time-sync/
2. Another contributor is experimenting with a separate voice agent that sits between you and Claude Code. I've found it usable and maybe even nice? The voice agent acts as a buffer to collect and compact half-baked, think-out-loud ideas into slightly better commands for Claude Code. Another contributor wrote a blog post about why voice coding on your phone while out of the house is useful; they explain it better than I can. https://happy.engineering/docs/features/voice-coding-with-cl...
I agree. I find even Haiku good enough at managing the flow of the conversation and consulting larger models - Gemini 2.5 Pro or GPT-5 - for programming tasks.
For the last few days I've been experimenting with using Codex (via MCP: `codex mcp`) from Gemini CLI, and it works like a charm. Gemini CLI mostly uses Flash underneath, but that's good enough for formulating problems and re-evaluating answers.
Same with Claude Code: I ask it (via MCP) to consult Gemini 2.5 Pro.
I've never had much success using Claude Code as an MCP server itself, though.
The original idea of course comes from Aider: using main, weak, and editor models all at once.
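For anyone who wants to replicate the Codex-inside-Gemini-CLI part, the wiring is just an MCP server entry. Here's a rough sketch that writes one; the `~/.gemini/settings.json` path and the `mcpServers` shape are my assumptions from memory, so double-check the Gemini CLI docs:

```python
# Rough sketch: register Codex as an MCP server for Gemini CLI by adding an
# mcpServers entry to ~/.gemini/settings.json (path and key names assumed).
import json
from pathlib import Path

settings_path = Path.home() / ".gemini" / "settings.json"
settings = json.loads(settings_path.read_text()) if settings_path.exists() else {}

settings.setdefault("mcpServers", {})["codex"] = {
    "command": "codex",
    "args": ["mcp"],  # expose Codex as an MCP server over stdio
}

settings_path.parent.mkdir(parents=True, exist_ok=True)
settings_path.write_text(json.dumps(settings, indent=2))
print(f"Wrote {settings_path}")
```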
This is fantastic work. The focus on a local, sandboxed execution layer is a huge piece of the puzzle for a private AI workspace. The `coderunner` tool looks incredibly useful.
A complementary challenge is the knowledge layer: making the AI aware of your personal data (emails, notes, files) via RAG. As soon as you try this on a large scale, storage becomes a massive bottleneck. A vector database for years of emails can easily exceed 50GB.
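For a rough sense of where numbers like 50GB come from (illustrative back-of-the-envelope values, not LEANN's internals):

```python
# Back-of-the-envelope for why stored embeddings get heavy.
# The chunk count and dimension are illustrative assumptions.
num_chunks = 10_000_000      # e.g. years of email, chunked for retrieval
dim = 1024                   # a common embedding dimension
bytes_per_float = 4          # float32

embedding_bytes = num_chunks * dim * bytes_per_float
print(f"{embedding_bytes / 1e9:.1f} GB of raw embeddings")  # ~41 GB

# Add an HNSW-style graph plus the original text and you clear 50 GB quickly,
# which is the motivation for recomputing embeddings instead of storing them.
```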
(Full disclosure: I'm part of the team at Berkeley that tackled this). We built LEANN, a vector index that cuts storage by ~97% by not storing the embeddings at all. It makes indexing your entire digital life locally actually feasible.
Combining a local execution engine like this with a hyper-efficient knowledge index like LEANN feels like the real path to a true "local Jarvis."
Heads up, there’s a fair bit of pushback (justified or not) on r/LocalLLaMA about Ollama’s tactics:
Vendor lock-in: AFAIK it now uses a proprietary llama.cpp fork and has built its own registry on ollama.com in a Docker-like way (I've heard Docker people are actually behind Ollama), and it's a bit difficult to reuse model binaries with other inference engines because of the hashed filenames on disk, etc.
Closed-source tweaks: Many llama.cpp improvements haven't been upstreamed or credited, raising licensing and attribution concerns. They have since switched to their own inference backend.
Mixed performance: The same models often run slower or give worse outputs than plain llama.cpp. A tradeoff for convenience, I know.
Opaque model naming: It rebrands or filters community models without transparency. The biggest fail was calling the smaller DeepSeek-R1 distills just "DeepSeek-R1", which added to massive confusion on social media and from "AI content creators" claiming you can run "THE" DeepSeek-R1 on any potato.
Difficult-to-change context window default: Using Ollama as a backend, it is difficult to change the default context window size on the fly, leading to hallucinations and looping output, especially for agents / thinking models.
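(If you control the requests yourself, the context size can at least be overridden per call through Ollama's documented `options` field; a rough sketch below, with the model name just a placeholder. The pain point is that many frontends using Ollama as a backend don't expose this.)

```python
# Per-request workaround: set num_ctx explicitly via Ollama's REST API.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",  # placeholder; use whatever model you have pulled
        "messages": [{"role": "user", "content": "Summarize this repo layout..."}],
        "options": {"num_ctx": 16384},  # override the small default context window
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```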
---
If you want better (and in some cases more open) alternatives:
llama.cpp: Battle-tested C++ engine with minimal dependencies, faster thanks to many optimizations
ik_llama.cpp: High-perf fork, even faster than default llama.cpp
llama-swap: YAML-driven model swapping for your endpoint.
LM Studio: GUI for any GGUF model, no proprietary formats, with llama.cpp's optimizations available from a GUI
Open WebUI: Front-end that plugs into llama.cpp, Ollama, and other OpenAI-compatible backends
I keep returning to Zachtronics games endlessly in my free time, despite doing engineering work for 8-10 hours a day for the last few months. Sure, they're a bit of a facsimile of a programming challenge, but they're pretty tough problems, especially the ones that are basically assembly programming. I even had someone comment that my latest Opus Magnum creation looks like cellular automata.
If you can simplify the problem/solution space into a puzzle, give me a leaderboard to compete against, more specifically let me compete against the people I care about, and give it the barest amount of polish, it's the kind of thing someone like me would obsess over.
I put this in my system prompt: "Never compliment me. Critique my ideas, ask clarifying questions, and offer better alternatives or funny insults" and it works quite well. It has frequently told me that I'm wrong, or asked what I'm actually trying to do and offered better alternatives.
It's the most high-influence, low-exposure essay I've ever read. As far as I'm concerned, this dude is a silent prescient genius working quietly for DARPA, and I had a sneak peek into future science when I read it. It's affected my thinking and trajectory for the past 8 years.
It's a turducken of crap from everyone but ngxson and Hugging Face and llama.cpp in this situation.
llama.cpp did have multimodal support; I've been maintaining an integration for many moons now (Feb 2024? original LLaVA through Gemma 3).
However, this was not for mere mortals. It was not documented and had gotten unwieldy, to say the least.
ngxson (HF employee) did a ton of work to get gemma3 support in, and had to do it in a separate binary. They dove in and landed a refactored backbone that is presumably more maintainable and on track to be in what I think of as the real Ollama, llama.cpp's server binary.
As you well note, Ollama is Ollamaing - I joked, once, that the median llama.cpp contribution from Ollama is a driveby GitHub comment asking when a feature will land in llama-server, so it can be copy-pasted into Ollama.
It's really sort of depressing to me because I'm just one dude, it really wasn't that hard to support (it's one of a gajillion things I have to do; I'd estimate 2 SWE-weeks at 10 YOE, then 1.5 SWE-days for every model release), and it's hard to get attention for detailed work in this space with how much everyone exaggerates and rushes to PR.
EDIT: Coming back after reading the blog post, and I'm 10x as frustrated. "Support thinking / reasoning; Tool calling with streaming responses" --- this is table-stakes stuff that was possible eons ago.
I don't see any sign of them doing anything specific in any of the code they link, the whole thing reads like someone carefully worked with an LLM to present a maximalist technical-sounding version of the llama.cpp stuff and frame it as if they worked with these companies and built their own thing. (note the very careful wording on this, e.g. in the footer the companies are thanked for releasing the models)
I think it's great that they have a nice UX that helps people run llama.cpp locally without compiling, but it's hard for me to think of a project I've been more turned off by in my 37 years on this rock.
I have been a heavy user of Claude but cancelled my Pro subscription yesterday. The usage limits have been quietly tightened up like crazy recently and 3.7 is certainly feeling dumber lately.
But the main reason I quit is the constant downtime. Their status page[0] is like a Christmas tree but even that only tells half the story - the number of times I have input a query only to have Claude sit, think for a while then stop and return nothing as if I had never submitted at all is getting ridiculous. I refuse to pay for this kind of reliability.
I can’t stop thinking about this article. I spent a long time in ad tech before switching to broader systems engineering. The author captures something I've struggled to articulate to friends and family about why I left the industry.
The part that really struck me was framing advertising and propaganda as essentially the same mechanism - just with different masters. Having built targeting systems myself, this rings painfully true. The mechanical difference between getting someone to buy sneakers versus vote for a candidate is surprisingly small.
What's frustrating is how the tech community keeps treating the symptoms while ignoring the disease. We debate content moderation policies and algorithmic transparency, but rarely question the underlying attention marketplace that makes manipulation profitable in the first place.
The uncomfortable truth: most of us in tech understand that today's advertising systems are fundamentally parasitic. We've built something that converts human attention into money with increasingly terrifying efficiency, but we're all trapped in a prisoner's dilemma where nobody can unilaterally disarm.
Try this thought experiment from the article - imagine a world without advertising. Products would still exist. Commerce would still happen. Information would still flow. We'd just be freed from the increasingly sophisticated machinery designed to override our decision-making.
Is this proposal radical? Absolutely. But sometimes the Overton window needs a sledgehammer.
P.S. If you are curious about the relationship between Sigmund Freud, propaganda, and the origins of the ad industry, check out the documentary “Century of the Self”.
To an extent, I think the determination of the Apologetics Project also shows the tendency of people to go into denial about the limits of the technology. There is a lovely SF short story, The Quest for Saint Aquin, about how a true AI might feel about religious belief, but we are a long way short of that.
It worries me a lot more that governments and the like will also be in denial about what they can do with AI. I can ignore low quality apologetics, I cannot ignore the government (I have to pay taxes, for example).
Meshtastic is another project that has recently made serious strides[0] in its UX on the Lilygo T-Deck (and similar ESP32 devices), specifically regarding LoRa-enabled devices.
It's still on a branch, but I compiled and ran it, and now I have two T-Decks that can communicate with each other off the grid without a smartphone attached to send messages; it's actually usable in emergencies now, which is why I bought the devices in the first place.
I'm currently in the process of deploying a mesh between me and my parents and family.
Founder of Gouach here, the repairable (and fireproof!) e-bike battery mentioned in the article; happy to answer any questions!
- we salvaged 100s of discarded e-bike batteries
- we found that 90% of components were like new
- batteries were thrown away because of the spot-welding and glue, which prevent repairability
- we spent 2 years (and 5 patents) designing a robust, safe, and easy-to-assemble system that requires nothing but a screwdriver
Our batteries have been in use for 2 years in the streets of France, on micro-mobility e-bikes, in the harshest possible conditions (rain, snow, cold, heat, shocks), and we're very happy with their performance!
We're now opening it to the general public (for conversion kits, and to replace old batteries that are no longer manufactured)
We plan to open-source at least part of the embedded software, so people can write extensions (to let their battery "talk" to any e-bike system) and share them, as embeddable WASM code, with other people on the web!
I've always been partial to systolic arrays. I iterated through a bunch of options over the past few decades and settled on what I think is the optimal solution: a Cartesian grid of cells.
Each cell would have 4 input bits, 1 each from the neighbors, and 4 output bits, again, one to each neighbor. In the middle would be 64 bits of shift register from a long scan chain, the output of which goes to 4 16:1 multiplexers, and 4 bits of latch.
Through the magic of graph coloring, a checkerboard pattern would be used to clock all of the cells to allow data to flow in any direction without preference, and without race conditions. All of the inputs to any given cell would be stable.
This allows the flexibility of an FPGA, without the need to worry about timing issues or race conditions, glitches, etc. This also keeps all the lines short, so everything is local and fast/low power.
What it doesn't do is be efficient with gates, nor give the fastest path for logic. Every single operation happens effectively in parallel. All computation is pipelined.
I've had this idea since about 1982... I really wish someone would pick it up and run with it. I call it the BitGrid.
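Here's a toy software sketch of the cell structure, in case it makes the idea more concrete (illustrative only: random LUT contents, wrap-around edges, and an 8x8 grid):

```python
# Toy BitGrid simulation: a Cartesian grid of cells, each with four 16-entry
# lookup tables (one per output direction) addressed by the four neighbor
# inputs, clocked in two checkerboard phases so every sampled input is stable.
import random

W, H = 8, 8
N_DIRS = 4  # 0=N, 1=E, 2=S, 3=W

# 64 config bits per cell = 4 LUTs x 16 bits (here filled randomly).
luts = [[[random.randint(0, 1) for _ in range(16)] for _ in range(N_DIRS)]
        for _ in range(W * H)]
# 4 latched output bits per cell, one toward each neighbor.
outputs = [[0] * N_DIRS for _ in range(W * H)]

def idx(x, y):
    return (y % H) * W + (x % W)   # wrap around at the edges for simplicity

def neighbor_inputs(x, y):
    """Gather the bit each neighbor drives toward cell (x, y) as a 4-bit LUT address."""
    n = outputs[idx(x, y - 1)][2]  # north neighbor's south-facing output
    e = outputs[idx(x + 1, y)][3]  # east neighbor's west-facing output
    s = outputs[idx(x, y + 1)][0]  # south neighbor's north-facing output
    w = outputs[idx(x - 1, y)][1]  # west neighbor's east-facing output
    return (n << 3) | (e << 2) | (s << 1) | w

def step(phase):
    """Update only cells whose checkerboard color matches this phase."""
    for y in range(H):
        for x in range(W):
            if (x + y) % 2 != phase:
                continue
            addr = neighbor_inputs(x, y)  # neighbors are the other color, so stable
            cell = luts[idx(x, y)]
            outputs[idx(x, y)] = [cell[d][addr] for d in range(N_DIRS)]

for cycle in range(4):
    step(0)  # "black" cells latch while "white" cells hold
    step(1)  # then the other color
print(outputs[idx(0, 0)])
```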
I so relate to that out-of-body experience. My story is different though: 10 years ago, I was driving along the SF-LA scenic route (I forget its name). I'd rarely driven in the US at that time and was brand new to the country. Suddenly I felt like I knew the route. I'd seen these underpasses before, surely? It was so surreal. I could even predict some road features that were going to show up next.
Turns out the game Road Rash, which I played as a kid, had that route :)
- 4K HDR video, not whatever the heck the buggy client delivers.
- Atmos/TrueHD audio track that actually works, not whatever the broken app delivers (I'm looking at you, Sky, and the rest of the ilk that still deliver HBO content in stereo).
- Subtitles for ALL the languages, not just one or two. And those languages don't disappear when I go on vacation, leaving me stuck with German audio and French subtitles.
- Properly functioning offline playback for when I'm traveling, not a randomly broken and disappearing offline mode (Netflix, Spotify and YouTube have all blessed me with the "all your downloaded content is gone" experience on long flights).
- Works on all my devices, not a random subset depending on which way greedy execs tried to extract "ecosystem" money from my playback device manufacturer. Looking at you, ATV+.
- Is actually available in my region and doesn't randomly disappear from my devices just because I decided to travel to visit my parents or have some time off.
- Doesn't randomly disappear six months after I started watching the series because some license expired.
As you can see, I really tried to pay to get content from these people. And all I got was a bunch of frustration. F'em, they brought this upon themselves by being user-hostile arseholes. Again.