More

Smerity · 2025-01-01T02:05:31 1735697131

You can download Common Crawl data for free using HTTPS with no credentials. If you don't store it (streamed processing or equivalent) and you have no cost for incoming data (which most clouds don't) you're good!

You can do so by adding `https://data.commoncrawl.org/` instead of `s3://commoncrawl/` before each of the WARC/WAT/WET paths.

Smerity · 2024-10-04T01:31:57 1728005517

Excited to see more people working on RNNs but wish their citations were better.

In 2016 my team from Salesforce Research published our work on the Quasi-Recurrent Neural Network[1] (QRNN). The QRNN variants we describe are near identical (minGRU) or highly similar (minLSTM) to the work here.

The QRNN was used, many years ago now, in the first version of Baidu's speech recognition system (Deep Voice [6]) and as part of Google's handwriting recognition system in Gboard[5] (2019).

Even if there are expressivity trade-offs when using parallelizable RNNs they've shown historically they can work well and are low resource and incredibly fast. Very few of the possibilities regarding distillation, hardware optimization, etc, have been explored.

Even if you need "exact" recall, various works have shown that even a single layer of attention with a parallelizable RNN can yield strong results. Distillation down to such a model is quite promising.

Other recent fast RNN variants such as the RWKV, S4, Mamba et al. include citations to QRNN (2016) and SRU (2017) for a richer history + better context.

The SRU work has also had additions in recent years (SRU++), doing well in speech recognition and LM tasks where they found similar speed benefits over Transformers.

I note this primarily as the more data points, especially when strongly relevant, the better positioned the research is. A number of the "new" findings from this paper have been previously explored - and do certainly show promise! This makes sure we're asking new questions with new insights (with all the benefit of additional research from ~8 years ago) versus missing the work from those earlier.

[1] QRNN paper: https://arxiv.org/abs/1611.01576

[2] SRU paper: https://arxiv.org/abs/1709.02755

[3]: SRU++ for speech recognition: https://arxiv.org/abs/2110.05571

[4]: SRU++ for language modeling: https://arxiv.org/abs/2102.12459

[5]: https://research.google/blog/rnn-based-handwriting-recogniti...

[6]: https://arxiv.org/abs/1702.07825

Smerity · on Jan 11, 2024

I think you've done a great explanation expansion except I believe it's ALiBi ("Attention with Linear Biases Enables Input Length Extrapolation"), a method of positional encoding (i.e. telling the Transformer model how much to weight a distant token when computing the current output token). This has been used on various other LLMs[2].

[1]: https://arxiv.org/abs/2108.12409

[2]: n.b. Ofir Press is co-creator of ALiBi https://twitter.com/OfirPress/status/1654538361447522305

benreesman · on Jan 11, 2024

This is indeed what I was referring to and along with RoPE and related techniques is a sort of "meta-attention" in which a cost-effective scalar pointwise calculation can hint the heavyweight attention mechanism with super-linear returns in practical use cases.

In more intuitive terms, your bog-standard transformer overdoes it in terms of considering all context equally in the final prediction, and we historically used rather blunt-force instruments like causally masking everything to zero.

These techniques are still heuristic and I imagine every serious shop has tweaks and tricks that go with their particular training setup, but the Rope shit in general is kind of a happy medium and exploits locality at a much cheaper place in the overall computation.

lhl · on Jan 11, 2024

My understanding is that Mistral uses a regular 4K RoPE that is "extends" the window size with SWA. This is based on looking at the results of Nous Research's Yarn-Mistral extension: https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k and Self-Extend, both of which only apply to RoPE models.

There are quite a few recent attention extension techniques recently published:

* Activation Beacons - up to 100X context length extension in as little as 72 A800 hours https://huggingface.co/papers/2401.03462

* Self-Extend - a no-training RoPE modification that can give "free" context extension with 100% passkey retrieval (works w/ SWA as well) https://huggingface.co/papers/2401.01325

* DistAttention/DistKV-LLM - KV cache segmentation for 2-19X context length at runtime https://huggingface.co/papers/2401.02669

* YaRN - aforementioned efficient RoPE extension https://huggingface.co/papers/2309.00071

You could imagine combining a few of these together to basically "solve" the context issue while largely training for shorter context length.

There are of course some exciting new alternative architectures, notably Mamba https://huggingface.co/papers/2312.00752 and Megabyte https://huggingface.co/papers/2305.07185 that can efficiently process up to 1M tokens...

Kerbonut · on Jan 11, 2024

imo mistral-medium is worse than mixtral. Do you have API access?

EmilStenstrom · on Jan 11, 2024

Thank you!

Smerity · on Dec 22, 2023

In the distant past I was the lone engineer of Common Crawl almost a decade ago. Common Crawl heavily leverages the WARC format.

My favorite capability of the WARC format borrows from the fact that most compression formats can be written to allow random access. Compression formats such as `gzip` and `zstandard` allow multiple compressed streams to be stuck together and act during decompression as if it's one contiguous file.

Hence you can create multiple compressions and literally stick them together:

  $ echo cat > cat.txt
  $ echo dog > dog.txt
  $ zstd cat.txt dog.txt
  $ cat cat.txt.zst dog.txt.zst > catdog.zst
  $ zstdcat catdog.zst
  cat
  dog

For files composed of only a textual / clearly delimited format that means you can fairly trivially leap to a different offset assuming each of the inputs is compressed individually. You lose out on some amount of compression but random lookup seems a fairly reasonable tradeoff. Common Crawl was able to use this to allow entirely random lookups into web crawl datasets dozens / hundreds of terabytes in size without any change in file format for example and utilizing Amazon S3's support for HTTP Range requests[1].

Trading compression for random lookup is even more forgiving if you create a separate compression dictionary tailored toward your dataset. For web crawling you'd likely get you the majority of the compression gains back unless pages from the same website are sequentially written which is unlikely in most situations. The website's shared template/s would result in very high compression gains across files which you'd lose by allowing random lookup but most crawlers don't don't operate sequentially so local compression gains are likely smaller than larger.

[1]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requ...

electroly · on Dec 22, 2023

Isn't this a benefit you'd trivially get just by using .zip? I pull individual files out of large .zip archives in S3 using HTTP range requests; works exactly as you'd expect. You know the zip header is at the end of the file, and the header tells you the offset and length of the compressed entry data so you can request the range. Two requests if you've never seen the .zip before, one if you've got the zip header cached.

Smerity · on Dec 22, 2023

As mentioned it's trivial across the spread of compression algorithms supporting this type of behaviour (`gzip`, `zstandard`, `zip`, ...), the header in `zip` making it even more convenient as you note!

WARC as a format essentially states that unless you have good reason "record at a time" compression is the preferred[1]. The mixture of "technically possible" and "part of spec" is what makes it so useful - any generic WARC tool can support random access, there are explicit fields to index over (URL), and even non-conforming WARC files can be easily rewritten to add such a capability.

[1]: https://iipc.github.io/warc-specifications/specifications/wa...

stavros · on Dec 22, 2023

It occurs to me that you could stick a few bytes of header in the beginning of the ZIP file, to tell you the exact location of the header at the end of it, thus avoiding multiple lookups. It would even still be ZIP-compatible.

electroly · on Dec 22, 2023

Definitely. I take an alternative but similar approach: since I control the zip files, I can guarantee that the header is always within the last N kilobytes of the zip file (configurable value of N). I spend a HEAD request to get the length of the zip file and then walk backwards by N kilobytes. You would request the few bytes at the beginning instead of using that request to get the file length.

stavros · on Dec 22, 2023

How can you guarantee that it's always within N kilobytes, when N depends on the number of files in the zip?

electroly · on Dec 23, 2023

If you're creating the zips in the first place, you can just check and see how big the headers are when you create them. If you happen to get N wrong, you can request another chunk, but obviously it's nice to avoid multiple requests to get the header. For my use case, the number of files is small and relatively consistent between zips so a generous value of 64KB ended up working great.

Smerity · on Sept 27, 2023

That section of the text is ambiguous but it's $250k for 8 x H100s (PCI-e) and $350k for 8 x H100 (SXM).

The prices above are plus or minus from Lambda Labs[1].

[1]: https://lambdalabs.com/products/scalar

Smerity · on Nov 17, 2022

To note, Figma use something inspired by but not quite CRDTs:

> Figma isn't using true CRDTs though. CRDTs are designed for decentralized systems where there is no single central authority to decide what the final state should be. There is some unavoidable performance and memory overhead with doing this. Since Figma is centralized (our server is the central authority), we can simplify our system by removing this extra overhead and benefit from a faster and leaner implementation.

> It’s also worth noting that Figma's data structure isn't a single CRDT. Instead it's inspired by multiple separate CRDTs and uses them in combination to create the final data structure that represents a Figma document.

https://www.figma.com/blog/how-figmas-multiplayer-technology...

Smerity · on Aug 29, 2022

I have had my XPS 13 laptop die four times in the past two years, needing a replacement logic board each time.

The laptop is amazing when it's working, default Linux support too, but so far have literally had to ship it back to Dell for at least two months total since I've used it. Usually the turn around is only a few days but they were waiting on a part at one stage for weeks.

I am now looking to do a system exchange with them but am not certain if it's going to go through. At this stage I can't imagine they'd be profitable on this machine given how many times I've had to ship it in for repair.

Hence I sadly cannot recommend the XPS 13 even though I truly love it when it's working and laugh with glee at having 32GB of RAM =]

Smerity · on Aug 10, 2022

The author is focused on GPU texture compression for games[1] so they're not too concerned with fast compression speed. Given a game will be played on (at a minimum) tens of thousands of machines, some of which may be quite limited (e.g. handheld consoles or mobile devices), the game developers are only more than happy to trade off once-off compression times for decompression size / speed. They'd likely only perform this on a later "release build" of their game too.

The author mentions in a tweet going from minutes to seconds for compression when switching CPU for GPU[2]. From memory he has made other references to a few seconds for compression being entirely reasonable for such tasks but I can't find a direct reference.

[1]: http://www.binomial.info/

[2]: https://twitter.com/richgel999/status/1476325003662667777

Smerity · on July 1, 2022

I saw a version of this myself at Geary and Mason around midnight on June 20th[1]. That it happened again but at a larger scale ten days later is deeply concerning.

Given I had no clue what state the vehicles were in, and that they'd start moving without indication, it felt pretty damn worrying.

[1]: https://twitter.com/Smerity/status/1539144641206882304

octoberfranklin · on July 1, 2022

> I had no clue what state the vehicles were in

"Murderbot mode". Or rather, transitioning into it. But they got stuck, which is why you're still alive.

Balgair · on July 1, 2022

Oh, I can see the screenplay being written.

"Coming Halloween, 2024! Killer Kaarr!.

Aiden, malicious disgraced hacker gets into the networks of, Joel, his successful rival's self-driving taxi company, Kaarr. Joel and Aiden were CIA hackers, but fell out over their mutual love of Klara. Joel takes Klara to the heights of the elite, but she still pines for Aiden. In revenge, Aiden hacks into Kaarr's networks and makes all the cars into murderbots, locking his rival Joel inside as the killing commences. At the end of the night, all the Killer Kaarrs are set to drive into the ocean, killing a traumatized Joel. But there's only one problem, Klara is locked inside a Killer Kaarr too. Can Aiden reverse his sadistic code in time to save Klara? Can Joel escape to stop the killing? Find Out, Halloween 2024!

A horror/gore/race film with lots of body horror as people are alternatively hit, crushed, bonked, exploded, and otherwise tortured by cute little electric cars with really tinted windshields and choice camera angles. Think the hokey rubbery special effects of The Thing with a lot of cell phones and car-centric jump scares.

Smerity · on April 8, 2022

I just want to note the replies to this thread are excessively dismissive and toxic. You may not agree with the wording of their advertising ("world's most powerful NLP toolkit" is marketing speak, sure) but going from that to implying the technical side is "only Min-GPT" is tremendously weird. As someone who works in machine learning and specifically language models this is a team I'm keeping an eye on.

For anyone who wanted more technical discussion re: ML / LM (though the author notes this work "[does] not reflect the architectures or latencies of my employer's models" i.e. it's an exploratory technical breakdown of general model characteristics) I've appreciated the technical write-ups from @kipperrii (ML ops @ Cohere) recently:

- Transformer Inference Arithmetic: https://carolchen.me/blog/transformer-inference-arithmetic/

- Breakdown of H100s for Transformer Inferencing: https://carolchen.me/blog/h100-inferencing/

hnhg · on April 8, 2022

They should put that content on their website. I also thought that the comments were a bit harsh but then visited the site and was immediately put off myself. They have a really great team and could do a great job of conveying that through content.