+1, had the fortune to work with him at a previous startup and to meet up in person. Our convo very much broadened my perspective on engineering as a career and a craft, always excited to see what he's working on. Good luck Simon!
Had this exact problem (Heroku Postgres to RDS) at my old co. The data migration went as badly as it possibly could (dropped indices, foreign keys, everything but the data itself). This would have saved us months of pain.
Karpathy covers this in Makemore, but the tl;dr is that if you don't normalize the batch (essentially center and scale your activations so they're roughly normally distributed), then at gradient/backprop time, you may get values that are significantly smaller or greater than 1. This is a problem because as you stack layers in sequence (passing outputs to inputs), the gradients compound (because of the Chain Rule), so what may have been a well-behaved gradient at the final layers has either vanished (the upstream gradients were 0<x<1 at each layer) or exploded (the gradients were x>>1 upstream). Batch normalization helps control the vanishing/exploding gradient problem in deep neural nets by normalizing the values passed between layers.
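A minimal sketch of the idea in PyTorch (not the Makemore code itself; the `batch_norm` helper and shapes here are illustrative assumptions): each feature is centered and scaled using batch statistics, then shifted back by learnable gamma/beta.

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, hidden_dim) pre-activations.
    # Center and scale each feature with batch statistics so the values
    # passed to the next layer are roughly zero-mean, unit-variance.
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, keepdim=True, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # Learnable scale and shift let the network undo the normalization if useful.
    return gamma * x_hat + beta

# Example: badly scaled pre-activations get pulled back to ~N(0, 1).
x = torch.randn(32, 100) * 5 + 3
gamma, beta = torch.ones(100), torch.zeros(100)
out = batch_norm(x, gamma, beta)
print(out.mean().item(), out.std().item())  # ~0 and ~1
```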
Interpreting Modular Addition in MLPs https://www.lesswrong.com/posts/cbDEjnRheYn38Dpc5/interpreti...
Paper Replication Walkthrough: Reverse-Engineering Modular Addition https://www.neelnanda.io/mechanistic-interpretability/modula...