
Hi, author here. The goal of this post was to do a data-driven analysis of the investment trends of companies that are leveraging LLMs. I used YC company descriptions to get a granular sense of the use cases in which LLMs are gaining and losing momentum, with YC as a proxy for the VC industry. I do question how representative it is of the VC industry as a whole, but my impression is that it's a good leading indicator.

I’ve actually run into the exact same issue. At the time we similarly had to scrap bandits. Since then I’ve had the opportunity to do a fair amount of research into hierarchical Dirichlet processes in an unrelated field.

Then one day a light went on in my head: hierarchy perfectly addresses the stratification-vs-aggregation problem that arises in bandits. Unfortunately I’ve never had a chance to apply this (and thus see the issues) in a relevant setting since.
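
To make the idea concrete, here is a toy sketch (my own construction, with a made-up shrinkage strength kappa, not something I've deployed) of how per-segment Thompson sampling can borrow strength from pooled data instead of forcing a choice between full stratification and full aggregation:

    import numpy as np

    rng = np.random.default_rng(0)
    n_arms, n_segments = 3, 5
    wins = np.zeros((n_segments, n_arms))
    pulls = np.zeros((n_segments, n_arms))
    kappa = 10.0  # assumed shrinkage strength toward the pooled estimate

    def choose_arm(segment: int) -> int:
        # Pooled per-arm win rates act as a shared prior mean across segments.
        pooled = (wins.sum(axis=0) + 1) / (pulls.sum(axis=0) + 2)
        # Thompson sampling from Betas whose prior pseudo-counts come from the
        # pool: sparse segments fall back toward the aggregate, while
        # data-rich segments let their own observations dominate.
        alpha = wins[segment] + kappa * pooled
        beta = (pulls[segment] - wins[segment]) + kappa * (1.0 - pooled)
        return int(np.argmax(rng.beta(alpha, beta)))

    def update(segment: int, arm: int, reward: int) -> None:
        pulls[segment, arm] += 1
        wins[segment, arm] += reward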


You can do fairly well here with ridge regression as a poor man's hierarchical model. We've used this library's Bayesian ridge regression to support a geo-pricing strategy (and it contains the Dirichlet-Multinomial approach as well): https://github.com/bayesianbandits/bayesianbandits
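
To illustrate the "poor man's hierarchical model" idea generically (this uses scikit-learn's BayesianRidge, not the linked library's API, and made-up geo data): one-hot region indicators let the shared prior shrink per-region effects toward the global intercept.

    import numpy as np
    from sklearn.linear_model import BayesianRidge

    rng = np.random.default_rng(0)
    n, n_regions = 500, 10
    region = rng.integers(0, n_regions, size=n)
    X = np.eye(n_regions)[region]                  # one-hot region features
    true_effects = rng.normal(0.0, 0.5, size=n_regions)
    y = 20.0 + true_effects[region] + rng.normal(0.0, 1.0, size=n)

    model = BayesianRidge()                        # learns prior precision from the data
    model.fit(X, y)
    mean, std = model.predict(np.eye(n_regions), return_std=True)
    # Regions with little data get estimates pulled toward the intercept,
    # and come with wider posterior standard deviations.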

Ahh, hierarchical Dirichlet processes. Sounds like you were reading the literature on Bayesian diffusion modelling / diffusion trees. I studied that stuff almost 20 years ago now; it really takes me back.

Haha, I’ve actually never heard of that field. My work was focused on applying Chinese restaurant process models to text analysis. But I’m very curious what you were working on?

I was using it for bioinformatics to incorporate measurement uncertainty from fluid microarrays into genotype cluster estimates.

Hi all, author here. I was originally just planning to put together a few helper functions to support sparse arrays in DuckDB, but there ended up being a lot of interesting lessons in the process of benchmarking storage and optimizing queries.

For those not wanting to read the whole story, GitHub repo with basic benchmarks and utilities is here: https://github.com/Sturdy-Statistics/duckdb-sparse-array-lis...
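
Not the repo's actual helpers (which may differ), but a minimal sketch of the core representation: store only the nonzero indices and values as parallel lists, then UNNEST them back out for computation (DuckDB zips multiple UNNESTs positionally).

    import duckdb

    con = duckdb.connect()
    con.execute("""
        CREATE TABLE docs (
            doc_id INTEGER,
            idx INTEGER[],   -- positions of nonzero entries
            val DOUBLE[]     -- corresponding nonzero values
        )
    """)
    con.execute("INSERT INTO docs VALUES (1, [0, 5, 9], [0.3, 1.2, 0.7]), (2, [5, 9], [2.0, 0.5])")

    # Sparse dot products between documents: flatten the parallel lists and
    # join on shared indices so only overlapping nonzeros are multiplied.
    print(con.execute("""
        WITH flat AS (
            SELECT doc_id, UNNEST(idx) AS i, UNNEST(val) AS v FROM docs
        )
        SELECT a.doc_id, b.doc_id, SUM(a.v * b.v) AS dot
        FROM flat a JOIN flat b ON a.i = b.i AND a.doc_id < b.doc_id
        GROUP BY a.doc_id, b.doc_id
    """).fetchall())   # [(1, 2, 2.75)]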


This is my favorite book on statistics. Full stop. The author Andrew Gelman created a whole new branch of Bayesian statistics through both his theoretical work on hierarchical modeling and his publication of Stan, which enabled practical applications of hierarchical models.

It took me about a year to work through this book on the side (including the exercises), and it provided the foundation for years of fruitful research into hierarchical Bayesian models. It’s definitely not an introductory read, but for anyone looking to advance their statistical toolkit, I cannot recommend this book highly enough.

As a starting point, I’d strongly suggest the first 5 chapters for an excellent introduction to Gelman’s modeling philosophy, and then jumping around the table of contents to any topics that look interesting.


“The author Andrew Gelman created a whole new branch of Bayesian statistics ...” Love Gelman, but this is playing fast and loose with the facts.


His book on hierarchical modeling with Hill has 20,398 citations on Google Scholar https://scholar.google.com/scholar?cluster=94492350364273118... and Wikipedia calls him "a major contributor to statistical philosophy and methods especially in Bayesian statistics[6] and hierarchical models.[7]", which sounds like the claim is more true than false.


He co-wrote the reference textbook on the topic and made interesting methodological contributions, but Gelman acknowledges other people as creators of the theoretical underpinnings of multilevel/hierarchical modeling, including Stein or Donoho [1]. The field is quite old; one can find hierarchical models in articles published many decades ago.

Also, IMHO, his best work has been describing how to do statistics. He has written somewhere (I cannot find it now) that he sees himself as a user of mathematics, not as a creator of new theories. His book Regression and Other Stories is elementary but exceptionally well written. He describes how great Bayesian statisticians think and work, and this is invaluable.

He is updating Data Analysis Using Regression and Multilevel/Hierarchical Models to the same standard, and I guess BDA will eventually come next. As part of the refresh, I imagine everything will be ported to Stan. Interestingly, Bob Carpenter and others working on Stan are now pursuing ideas on variational inference to scale things further.

[1] https://sites.stat.columbia.edu/gelman/research/unpublished/...


Totally agree, and great point that hierarchical models have been around for a long time; however, these were primarily analytical, leveraging conjugate priors or requiring pretty extensive integration.

I would say his work with Stan and his writings, along with theorists like Radford Neal, really opened the door to a computational approach to hierarchical modeling. And I think this is a meaningfully different field.


I give Gelman a lot of credit for popularizing hierarchical models, but you give him too much.

Before Stan existed we used BUGS [1] and then JAGS [2]. And most of the work on computation (by Neal and others) was entirely independent of Gelman.

[1] https://en.wikipedia.org/wiki/Bayesian_inference_using_Gibbs...

[2] https://en.wikipedia.org/wiki/Just_another_Gibbs_sampler


What book or course on statistics can I go through before this one so that I can understand it?


Here is one path to learning Bayesian statistics from the basics, assuming a modern R path with tidyverse (recommended):

First learn some basic probability theory: Peter K. Dunn (2024). The theory of distributions. https://bookdown.org/pkaldunn/DistTheory

Then frequentist statistics:

Chester Ismay, Albert Y. Kim, and Arturo Valdivia - https://moderndive.com/v2/

Mine Çetinkaya-Rundel and Johanna Hardin - https://openintrostat.github.io/ims/

Finally, Bayesian: Johnson, Ott, Dogucu - https://www.bayesrulesbook.com/ This is a great book; it will teach you everything from the very basics to advanced hierarchical Bayesian modeling, all using reproducible code and Stan/rstanarm.

Once you master this, the next level may be using brms. Solomon Kurz has worked through the full Regression and Other Stories book using tidyverse/brms; his knowledge of tidyverse and brms is impressive and demonstrated in his code. https://github.com/ASKurz/Working-through-Regression-and-oth...


I would include Richard McElreath's _Statistical Rethinking_ here, after or in combination with _Bayes Rules!_. A translation of the code parts into the tidyverse is available free online, as are lecture videos based on the book.

I don’t mean for the bar to sound too high. I think working through Khan Academy’s full probability, calculus, and linear algebra courses would give you a strong foundation. I worked through this book having just completed the equivalent courses in college.

It’s just a relatively dense book. There are some other really good suggestions in this thread, most of which I’ve heard good things about. If you have a background in programming, I’d suggest Bayesian Methods for Hackers as a really good starting point. But you can also definitely tackle this book head on, and it will be very rewarding.


Highly recommend Stats 110 from Blitzstein. Lectures and textbook are all online https://stat110.hsites.harvard.edu/


Bayesian Statistics the Fun Way is probably the best place to start if you're coming at this from 0. It covers the basics of most of the foundational math you'll need along the way and assumes basically no prerequisites.

After that, Statistical Rethinking will take you much deeper into more complex experiment design using linear models and beyond, as well as deepen your understanding of the other areas of math required.


Regression and Other Stories. It’s also co-authored by Gelman, and it reads like an updated version of his previous book Data Analysis Using Regression and Multilevel/Hierarchical Models.

Statistical Rethinking is a good option too.


Can second Regression and Other Stories. It's freely available here: https://users.aalto.fi/~ave/ROS.pdf, and you can access additional information such as data and code (including Python and Julia ports) here: https://avehtari.github.io/ROS-Examples/index.html


If you are near Columbia, the visiting students post-baccalaureate program (run by the SPS, last I recall) allows you to take for-credit courses in the Social Sciences department. Professor Ben Goodrich has an excellent course on Bayesian statistics in the social sciences which teaches it using R (now it might be in Stan).

That course is a good balance between theory and practice. It gave me a practical intuition for why the posterior distributions of parameters and data are important and how to compute them.

I took the course in 2016 so a lot could have changed.


I found David MacKay's book Information Theory, Inference, and Learning Algorithms to be well written and easy to follow. Plus, it is freely available from his website: https://www.inference.org.uk/itprnn/book.pdf

It goes through fundamentals of Bayesian ideas in the context of applications in communication and machine learning problems. I find his explanations uncluttered.


Really sad he died of cancer a few years ago.

There is a collection of curated resources here: https://www.pymc.io/projects/docs/en/stable/learn.html


I would really love to have the story of PyMC told, especially its technical evolution: how it was first implemented and how it changed over the years.

For effectively and efficiently learning the calculus, linear algebra, and probability underpinning these fields, Math Academy is going to be your best resource.

Statistical Rethinking by Richard McElreath. He even has a youtube series covering the book if you prefer that modality.


Doing Bayesian Data Analysis by John Kruschke (get the 2nd edition). The name is even an homage to the original.

Can you explain to me in simple terms how your fruitful research benefited you in a concrete way? Is this simply an enlightening hobby, or do you have significant everyday applications? What kind of cool job has you employing Bayesian data analysis day to day, and for what benefit? How do the suits relate to such knowledge and its beneficial application, which may be well beyond their ken?


My applications have focused on noisy, high-dimensional, small datasets in which it is either very expensive or impossible to get more data.

One example is rare class prediction on long-form text data, e.g. phone calls, podcasts, transcripts. Other approaches, including neural networks and LLMs, are either not flexible enough or require far too much data to achieve the necessary performance. Structured hierarchical modeling strikes a balance between those two extremes.

Another example is genomic analysis. Similarly high-dimensional, noisy, and data-poor. Additionally, you don’t actually care about the predictions; you want to understand which genes or sets of genes are driving phenotypic behaviors.

I’d be happy to go into more depth via email or chat (contact info on my profile) if this is something you are interested in.

Some useful reads:

[1] https://sturdystatistics.com/articles/text-classification

[2] https://pmc.ncbi.nlm.nih.gov/articles/PMC5028368/


Is there a good book that covers statistics as applied to testing - like for medical research, process optimization, manufacturing, or whatever?


The key insight to recognize is that within the Bayesian framework, hypothesis testing is parameter estimation. Your certainty in the outcome of the test is your posterior probability over the test-relevant parameters.

Once you realize this, you can easily develop very sophisticated testing models (if necessary) that are also easy to understand and reason about. This dramatically simplifies the analysis.
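
As a minimal sketch of that framing (with made-up conversion numbers): the "test" for an A/B comparison is just the posterior probability that one rate parameter exceeds the other.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical conversions / trials for two variants.
    a_success, a_trials = 120, 1000
    b_success, b_trials = 140, 1000

    # Beta(1, 1) prior + binomial likelihood -> Beta posterior for each rate.
    a_post = rng.beta(1 + a_success, 1 + a_trials - a_success, size=100_000)
    b_post = rng.beta(1 + b_success, 1 + b_trials - b_success, size=100_000)

    # The hypothesis "B beats A" is answered directly by the posterior.
    print((b_post > a_post).mean())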

If you're looking for a specific book recommendation Statistical Rethinking does a good job covering this at length and Bayesian Statistics the Fun Way is a more beginner friendly book that covers the basics of Bayesian hypothesis testing.


I might check out Statistical Rethinking given how frequently it is being recommended!

Edit: Haha, I just found the textbook, and I’m remembering now that I actually worked through sections of it back when I was working through BDA several years ago.


This book is very relevant to those fields. There is a common choice in statistics between stratifying and aggregating your dataset.

There is an example in his book discussing efficacy trials across seven hospitals. If you stratify the data, you lose a lot of confidence; if you aggregate the data, you end up just modeling the difference between hospitals.

Hierarchical modeling allows you to split your dataset under a single unified model. This is really powerful for extracting signal from noise, because you can split your dataset according to potential confounding variables, e.g. the hospital from which the data was collected.

I am writing this on my phone, so apologies for the lack of links, but in short the approach in this book is extremely relevant to medical testing.
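
For a concrete feel, here is a minimal PyMC sketch of partial pooling across hospitals (my own toy construction with simulated data, not the book's code):

    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(0)
    n_hospitals = 7
    patients = rng.integers(50, 200, size=n_hospitals)   # trials per hospital
    successes = rng.binomial(patients, rng.beta(2, 8, size=n_hospitals))

    with pm.Model():
        mu = pm.Normal("mu", 0.0, 1.5)          # population-level log-odds
        sigma = pm.HalfNormal("sigma", 1.0)     # between-hospital variation
        theta = pm.Normal("theta", mu, sigma, shape=n_hospitals)
        pm.Binomial("obs", n=patients, p=pm.math.sigmoid(theta), observed=successes)
        idata = pm.sample()
    # Each hospital's estimate is shrunk toward the population mean, with
    # sparsely observed hospitals shrunk the most: neither fully stratified
    # nor fully aggregated.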


It’s unclear which post you’re referring to - can you clarify which book you mean by “this book”?


It’s even worse with parlays: the events are potentially negatively correlated. There may be a 70% chance that Giannis has 3 or more turnovers and a 70% chance the Bucks win, but the odds of both happening are less than 49%, because more turnovers directly reduce the likelihood of a Bucks win.
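
The arithmetic, with an assumed covariance purely for illustration:

    # P(A and B) = P(A) * P(B) + Cov(1_A, 1_B); negative correlation
    # between the legs drags the parlay below the naive product.
    p_a, p_b = 0.7, 0.7
    cov = -0.05                 # assumed negative covariance between the events
    print(p_a * p_b + cov)      # 0.44, below the independent 0.49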


This is a great summary of why, despite the discovery of so many tricks and so much incremental progress, so little progress is made on the core limitations of LLMs.


These issues are often attributed to a bad implementation of AI, but I think the problem is a little more fundamental.

The potential of AI that causes VCs and investors to swap their eyes for dollar signs is the ability to take unstructured, unpredictable inputs and convert them into structured actions or data: in this case, a drive-through conversation into a specific order. However, the ability to generalize to unseen inputs (what we call common sense) is neural networks' glaring weakness. LLMs can look amazingly capable in internal testing, but there is a long and ever-growing tail of unseen interactions when it comes to human conversation.

As a data scientist in the contact center industry, I’ve seen this play out repeatedly with neural networks over the last decade.


I think the children's programming is a really underdiscussed aspect of this. Some investments don't have immediately measurable outcomes. But as someone whose parents worked long hours growing up, I'm really grateful that my exposure to television was PBS rather than cable children's shows.


If your parents couldn't afford cable, then you couldn't get round-the-clock children's content from Nickelodeon. Your daytime content would have been stuff like the Maury Show, all the Judge shows, soap operas, and daytime talk shows. PBS would have been the thing that offered free children's content.


I built a public tool a while back for some of my friends in grad school to support this sort of deep academic research use case. Sharing in case it is helpful: https://sturdystatistics.com/deepdive?search_type=external&q...


This is exactly the challenge. When embeddings were first popularized with word2vec, they were interpretable because the word2vec model was revealed to be a batched matrix factorization [1].

LLM embeddings are so abstract and far removed from any human-interpretable or statistical counterpart that even as the embeddings contain more information, that information becomes less accessible to humans.

[1] https://papers.nips.cc/paper_files/paper/2014/hash/b78666971...
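
To illustrate the factorization view (a toy construction of mine in the spirit of the cited analysis, not a reproduction of it): build a positive-PMI co-occurrence matrix and factor it with SVD, so every embedding dimension traces back to explicit co-occurrence statistics.

    import numpy as np

    # counts[i, j] = co-occurrence count of word i with context j (toy data)
    counts = np.array([[10.0, 2.0, 0.0], [2.0, 8.0, 1.0], [0.0, 1.0, 6.0]])
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total

    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    ppmi = np.maximum(pmi, 0.0)             # clamp the log(0) = -inf cells

    U, S, Vt = np.linalg.svd(ppmi)
    embeddings = U[:, :2] * np.sqrt(S[:2])  # rank-2 word vectors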

