
Great to see our paper here again! Since the paper release, we've also released model weights for anyone interested in building on top of it: https://huggingface.co/facebook/blt. We also added HF Hub code to easily load the model: https://github.com/facebookresearch/blt?tab=readme-ov-file#l....


The thing that stood out for me was the use of n-gram hashes as an additional feature set. My understanding is that this is typically used as a positional feature.

Is this a limitation of the byte patches in that the positional information needs to be augmented?


(Author here)

If I understand your question right, this is one of the reasons BPE is nice and why the parent liked it: for any character sequence, provided the characters are in the alphabet used to create the BPE vocab, there are no unknown words/sequences. One downside of some previous tokenization methods, e.g., dictionary-based ones, is that you could end up with unknown/UNK tokens.

In our paper with bytes, we also avoid the UNK issue, since we can have an embedding for every possible byte; there aren't that many (and for sequences of bytes we use hash embeddings, although we did test n-gram lookups for the top-K most frequent byte n-grams in the training data).
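
For anyone curious what the hash-embedding part looks like mechanically, here is a minimal sketch in PyTorch; the bucket count, n-gram sizes, and hash function are illustrative assumptions, not the exact BLT configuration:

    import torch

    class HashNgramEmbedding(torch.nn.Module):
        """Map each byte n-gram ending at position i to one of `num_buckets`
        embedding rows via hashing, accepting collisions. Illustrative sketch."""

        def __init__(self, num_buckets=100_000, dim=256, ngram_sizes=(3, 4, 5)):
            super().__init__()
            self.num_buckets = num_buckets
            self.ngram_sizes = ngram_sizes
            self.table = torch.nn.Embedding(num_buckets, dim)

        def forward(self, byte_ids):  # byte_ids: list[int], values 0-255
            rows = []
            for i in range(len(byte_ids)):
                acc = torch.zeros(self.table.embedding_dim)
                for n in self.ngram_sizes:
                    if i - n + 1 >= 0:
                        ngram = tuple(byte_ids[i - n + 1 : i + 1])
                        bucket = hash(ngram) % self.num_buckets  # collisions are tolerated
                        acc = acc + self.table(torch.tensor(bucket))
                rows.append(acc)
            return torch.stack(rows)  # (T, dim), added to the per-byte embeddings

Compared to an explicit top-K n-gram table, hashing gives every n-gram some embedding without storing the n-grams themselves, at the cost of collisions.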


Nice work. Thank you for commenting on HN!

Did you guys try using an RNN or some other kind of DNN to encode the patches?


I don't believe so, or at least if someone tried it, it didn't work well enough for me to remember :). Some of the motivation for the architecture changes in encoding patches came from finding FLOP-efficient ways to express relationships between byte sequences. E.g., a long context window makes sense when dealing with tokens, but you don't need as long an attention window when attending over byte sequences to build patch representations, since the patch representations are implicitly part of a longer context window measured in number of patches.
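
To make the shorter-window idea concrete, a minimal sketch of a windowed causal attention mask over bytes; the window size and masking details are illustrative, not the exact BLT local-encoder setup:

    import torch

    def local_causal_mask(seq_len: int, window: int) -> torch.Tensor:
        """Boolean mask where byte position i may attend to positions
        (i - window, i]. Illustrative of a short byte-level attention window."""
        idx = torch.arange(seq_len)
        lookback = idx[:, None] - idx[None, :]        # distance from query i back to key j
        return (lookback >= 0) & (lookback < window)  # causal and limited lookback

    print(local_causal_mask(seq_len=6, window=3).int())

Even with a small window at the byte level, information can still propagate over long ranges once the patch representations attend to each other in the global model.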


Thanks for the quick reply!

Interesting. I would have thought one of those "minimum viable" RNNs (like https://arxiv.org/abs/2410.01201) would have been ideal for this. I might tinker a bit with this :-)


(Author Here)

Good description! Maybe what the parent got mixed up on is that an alternate way to view this is as trying to chunk bytes so that each patch carries roughly similar information. We initially tried a bunch of patching schemes, e.g., keeping a running total of entropy until the total exceeds a threshold, but ended up finding that simpler things worked better.

I'll see if we can add more information about the small CNN in the next update to the arXiv paper.


I'm curious if you're aware of some papers from around 2005 on using contextual entropy to do unsupervised word segmentation on Chinese, and other languages that don't use spaces for word boundaries.

https://aclanthology.org/Y03-1017/ https://aclanthology.org/I05-1009/ https://aclanthology.org/P06-2056/

Exactly the same approach of segmenting a word when the entropy goes up compared to the previous byte.


It is also quite similar to Carl de Marcken's work for segmenting text and speech. He phrased everything in terms of minimum description length (MDL), but that is trivially the same thing as local entropy.

https://dspace.mit.edu/handle/1721.1/7191?show=full


I at least wasn't aware of this work, but thanks for the refs! I'm always curious to read papers from 10-20+ years ago with similarly inspired ideas. If it makes sense, we'll mention those in the next related-work update.


One way of thinking about the "Approximate Monotonic Constraint" is that you're running a quick-and-dirty edge detector on the entropy. I.e., you're clipping based on the gradient of per-byte entropy with respect to timestep, much like detecting an edge based on the gradient of per-pixel intensity with respect to pixel coordinates. It would be interesting to look at the raw sequences of per-byte entropies to see how strongly these sorts of "edges" correlate with human-interpretable boundaries (words, prefixes, suffixes, etc.).
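
A minimal sketch of that edge-detector view, assuming per-byte entropies from the small entropy model are already available; the rule and threshold here are illustrative, and the paper defines the exact constraints:

    import torch

    def entropy_patch_starts(entropies: torch.Tensor, theta: float) -> list[int]:
        """Return indices of bytes that open a new patch: wherever entropy jumps
        up relative to the previous byte by more than theta, i.e. an "edge" in
        the per-byte entropy sequence. Illustrative, not the exact BLT rule."""
        starts = [0]  # the first byte always starts a patch
        for t in range(1, len(entropies)):
            if entropies[t] - entropies[t - 1] > theta:
                starts.append(t)
        return starts

    # Entropy typically spikes at the start of hard-to-predict spans and decays inside them.
    H = torch.tensor([2.9, 0.4, 0.3, 0.2, 3.1, 0.5, 0.4, 2.8, 0.3])
    print(entropy_patch_starts(H, theta=1.0))  # -> [0, 4, 7]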


Figure 4 plots the entropy of each byte in "Daenerys Targaryen is in Game of Thrones, a fantasy epic by George R.R. Martin."


(Author Here)

Related thought: I think BPE is quite a good, cheap inductive bias to have in a model, which is part of what made it challenging to beat on scaling. I also suspect this is part of why BPE is better at lower training FLOPs (left side of Figure 1): BLT has to expend some of its FLOP budget to recover/learn some of this useful bias. With more training FLOPs this becomes a smaller fraction of the budget, leading to better scaling.


(Author Here)

There is at least some work on character-based modeling, but it hasn't scaled well before. The challenge with something more ad hoc for exceptional tokens is that it's hard to see gains, since they are, by definition, infrequent. If the text is rare enough, BPE should produce many single-byte tokens, so current models actually expend more compute on these rare sequences.

BLT scales well because it expends less compute (by patching) on more predictable (low-entropy) byte sequences. Current models only get this benefit to some degree, when a predictable sequence happens to be covered by a larger BPE token, but that only goes so far.

So it’s really two related, but different motivations.


(Author Here)

In editing we couldn't find a good place for this, so we cut it from the current version, but at one point we had discussed a parallel with the information density of speech as described by one paper. Essentially, the paper found that in languages that are less information dense per syllable, speakers speak faster to achieve an information rate similar to languages with higher density per syllable. You could see patching by entropy as paralleling this, if you consider that low-entropy bytes (in terms of Shannon entropy) are less information dense.


(Author Here)

Not sure what you mean by implicit? If you mean just treating bytes as tokens, one issue you run into is that your sequence lengths get quite long, so compared to a regular token LLM you can't pack as many bytes into a batch, which makes you pretty FLOP-inefficient and scale worse. You could make the model smaller to compensate, but then the model isn't as good.
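
As a rough back-of-the-envelope sketch of that FLOP argument (the 4 bytes per BPE token figure is an assumed average for English-like text, not a number from the paper):

    def byte_level_cost_multipliers(bytes_per_token: float = 4.0) -> tuple[float, float]:
        """Approximate cost multipliers for modeling raw bytes instead of BPE
        tokens over the same text: per-position (FFN-like) work grows linearly
        with sequence length, attention roughly quadratically. Illustrative only."""
        return bytes_per_token, bytes_per_token ** 2

    print(byte_level_cost_multipliers())  # (4.0, 16.0): ~4x positions, ~16x attention pairs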


Author here :). I do think it's a good direction to look into! That said, aside from it being a bit too much to do at once, you'd also have to be careful about how you distribute your FLOP budget across the hierarchy. With two levels, you can make one level (bytes/local encoder) FLOP efficient and the other (patches/global encoder) FLOP intensive. You'd also need to find a way to group patches into larger units. But ya, there are many directions to go from here!


In a way I'm kinda sad if tokenizers go the way of the dinosaurs, since asking someone to give me a Unicode character from the private use area was one of the last ways you could actually distinguish a co-operative human from an LLM online. They simply don't have those characters tokenized, so they can't output them. (But this is technically moot if the LLM has a Python interpreter handy.)


How do you ask someone to give you a Unicode character from the private use area?
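
For reference, the Basic Multilingual Plane's private use area is U+E000 through U+F8FF, so the request amounts to "paste any character whose code point falls in that range"; a quick illustrative check:

    # Any code point in U+E000..U+F8FF has no assigned meaning, so a cooperative
    # human can simply type or paste one (the glyph shown is font-dependent).
    ch = chr(0xE123)
    print(ch, hex(ord(ch)), 0xE000 <= ord(ch) <= 0xF8FF)  # -> <glyph> 0xe123 True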


While you're here: a killer feature for me would be the ability to privately host Obsidian sites (similar to Publish). Even if it required subscribing to Publish to download a tarball of the site (one that isn't public), it could still be worth it. My use case is sharing Obsidian notes with non-users (e.g., coworkers) in a private way.


There are a few options for this already. A good one just came out a few days ago called Quartz: https://github.com/jackyzha0/quartz


I tried a few a while back. What I really want is as close to a 1:1 match with the Obsidian UI as possible. I found that with some of the plugins it could be hit or miss whether they worked correctly. If I were doing only markdown notes, then I wouldn't need Obsidian ;)


Any thoughts on how to access/modify it on mobile without making it too cumbersome? I often think of todos on a walk or the train, but I could see making it a computer-only thing.


Absurdly, comically simple is the way to go. In these situations I usually just send myself an email, or write something down in the current physical notebook.

About 20 years ago I spent a while writing a server to do a bunch of automation and reminding and so forth, and found that I was spending more time on that thing than I was doing real work, so I ditched it, just like I did OneNote and all the other apps. There's no need to fix bugs in a text file, nor do you need to upgrade your version of Python or whatever.

Simple will survive decades with zero effort. Complicated pieces of code (especially ones that use networking or browser tech) will need to be maintained, patched, debugged and ported. Life is short. I'd rather actually be productive than have to spend time keeping some productivity app running, or worrying if the vendor of the online notes app I chose to use is going to be bought by Oracle or something.


You're making me think that an even more absurd method is to use mail drafts as your files. Everything is synced between all your devices, you have a built-in modification date, a full editor in the web version or the apps for a more integrated experience, high portability and easy backup (transferring from IMAP account to IMAP account is easy), labels and/or folders depending on your provider, efficient search, attachments...

Actually, in the past I used my email account as my RSS reader. It had integrated read/unread status, mass "mark as read" as needed, and was synced on all the platforms I cared about without anything to install. I didn't really change my subscriptions a lot, so configuration was not complicated.

Sometimes a webmail is just a good enough UI for text-related things.


Set up an email alias that cats (appends) your note to your org file.

Then, just send your notes (via sms) to the email.
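
A minimal sketch of the receiving side, assuming a mail alias that pipes each incoming message into a script; the alias name, file path, and script are all hypothetical:

    #!/usr/bin/env python3
    # Hypothetical /usr/local/bin/append-note.py, wired up via an alias like:
    #   notes: "|/usr/local/bin/append-note.py"
    # Reads one message from stdin and appends it to an org file as a heading.
    import sys
    from email import message_from_binary_file
    from email.policy import default

    ORG_FILE = "/home/me/notes/inbox.org"  # illustrative path

    msg = message_from_binary_file(sys.stdin.buffer, policy=default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content().strip() if body else ""

    with open(ORG_FILE, "a") as f:
        f.write(f"* {msg['Subject'] or 'note'}\n{text}\n")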


If you version control it using Fossil, and it's in markdown, you can access it directly from the repo website.


You can with GitHub as well, and make changes to it via the website. Not on mobile though, unless you switch to the desktop site.


But you can't make changes to it online.


Depends. You're right if you use a conventional version control approach where you write on your local machine and then push to the server. But that's not the only way to use Fossil to store your notes.

If you're using the wiki (one or more wiki pages), everything is entered in the browser, whether online or using the built-in server running locally. Other options are to create an issue for each note, or a new forum thread for each note. The nice thing about these options is that you get the search and querying power of an SQLite database applied to your notes for free.

Disclaimer: I haven't used this approach because I don't use mobile devices for note taking. Just pointing out that it is possible.


On Android, I've been using Markor with Syncthing for years now and never had any issues.

