Im pretty deep into this topic and what might be interesting to an outsider is t...

nerdponx · 2025-11-17T21:39:28 1763415568

> Essentially you add random noise to the inputs and train by minimizing the regular loss (like l1) and at the same time maximizing the difference between 2 members with different random noise initialisations. I wonder if this will be applied to more traditional genai at some point.

We recently had a situation where we specifically wanted to generate 2 "different" outputs from an optimization task and struggled to come up with a good heuristic for doing so. Not at all a GenAI task, but this technique probably would have helped us.

albertzeyer · 2025-11-18T08:53:01 1763455981

This idea is often used for self-supervised learning (SSL). E.g. see DINO (https://arxiv.org/abs/2104.14294).

albertzeyer · 2025-11-18T08:49:24 1763455764

The random noise is added to the model parameters, not the inputs, right?

This reminds me of variational noise (https://www.cs.toronto.edu/~graves/nips_2011.pdf).

If it is random noise on the input, it would be like many of the SSL methods, e.g. DINO (https://arxiv.org/abs/2104.14294), right?

lysecret · 2025-11-18T13:27:55 1763472475

Yes, you are right, it's applied to the parameters, but other models (like ngcm) applied it to the inputs. IMO it shouldn't make a huge difference; the main point is that you maximize the differences between models.
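
For anyone who wants to see the idea concretely, here is a minimal sketch in plain Python. It is not the actual FGN or NeuralGCM code: the toy linear model, the `noisy_forward` and `two_member_loss` names, and the `sigma` value are all made up for illustration. The loss is the two-member form of the fair CRPS estimator, i.e. the mean L1 error of each member minus half the L1 distance between the two members.

```python
import random


def noisy_forward(weights, x, rng, sigma=0.1):
    """Hypothetical toy model: a linear map whose weights get fresh Gaussian noise per call."""
    noisy = [w + rng.gauss(0.0, sigma) for w in weights]
    return [noisy[0] * xi + noisy[1] for xi in x]


def two_member_loss(pred_a, pred_b, target):
    """Mean L1 error of both members minus half their mean L1 distance."""
    n = len(target)
    skill = sum(abs(a - y) + abs(b - y) for a, b, y in zip(pred_a, pred_b, target)) / (2 * n)
    spread = sum(abs(a - b) for a, b in zip(pred_a, pred_b)) / n
    return skill - 0.5 * spread


x = [0.0, 1.0, 2.0, 3.0]
target = [1.0, 3.0, 5.0, 7.0]
weights = [2.0, 1.0]  # assumed "trained" weights for the toy model

# Two ensemble members: same weights and inputs, different noise seeds.
member_a = noisy_forward(weights, x, random.Random(1))
member_b = noisy_forward(weights, x, random.Random(2))
loss = two_member_loss(member_a, member_b, target)
```

Swapping the parameter noise for input noise only changes `noisy_forward`; the loss stays the same, which fits the point that where the noise enters matters less than rewarding members for differing.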

cleak · 2025-11-17T20:41:13 1763412073

That’s pretty neat. It reminds me of how VAEs work: https://en.wikipedia.org/wiki/Variational_autoencoder

rytill · 2025-11-17T19:28:30 1763407710

What is the goal of doing that vs using L2 loss?

counters · 2025-11-17T23:10:25 1763421025

To add to the existing answers - L2 losses induce a "blurring" effect when you autoregressively roll out these models. That means you not only lose important spatial features, you also truncate the extrema of the predictions - in other words, you can't forecast high-impact extreme weather with these models at moderate lead times.
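
A toy illustration of the truncation, as a sketch only: the numbers are invented, and this is a single step, not an autoregressive rollout. If one of four equally plausible futures contains an extreme value, the single prediction that minimizes squared error is the mean of the futures, which never reaches the extreme.

```python
def mse(pred, outcomes):
    """Mean squared error of one point prediction against all plausible outcomes."""
    return sum((pred - y) ** 2 for y in outcomes) / len(outcomes)


# Hypothetical setup: an extreme value (10.0) occurs in 1 of 4 equally likely futures.
outcomes = [0.0, 0.0, 0.0, 10.0]

# Brute-force search over candidate point predictions from 0.0 to 10.0 in steps of 0.1.
candidates = [i / 10 for i in range(0, 101)]
best = min(candidates, key=lambda p: mse(p, outcomes))

# The L2-optimal prediction sits at the mean of the outcomes, well below the extreme.
mean_outcome = sum(outcomes) / len(outcomes)
```

In a rollout, each step feeds a smoothed state like this into the next one, so the loss of extremes compounds with lead time.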

lysecret · 2025-11-18T13:29:15 1763472555

Yes, very good point. To me this is one of the most magical elements of this loss: how it suddenly makes the model "collapse" on one output and the predictions become sharp.

counters · 2025-11-18T17:20:31 1763486431

Yeah, it's underplayed in the writeup, but the context here is important. The "sharpness" issue was a major impediment to improving the skill and utility of these models. When GDM published GenCast two years ago, there was a lot of excitement because the generative approach seemed to completely eliminate this issue. But there was a trade-off: GenCast was significantly more expensive to train and run inference with, and there wasn't an obvious way to make improvements there. It was still faster than an NWP model, but the edge starts to dull.

FGN (and NVIDIA's FourCastNet-v3) show a new path forward that balances inference/training cost without sacrificing the sharpness of the outputs. And you get well-calibrated ensembles if you run them with different random seeds for their noise vectors, too!

This is a much bigger deal than people realize.

lysecret · 2025-11-17T19:33:44 1763408024

To encourage diversity between the different members in an ensemble. I think people are doing very similar things for MoE networks, but I'm not that deep into that topic.

sunshinesnacks · 2025-11-17T21:41:22 1763415682

The goal of using CRPS is to produce an ensemble that is a good probabilistic forecast without needing calibration/post processing.

[edit: "without", not "with"]
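
To make that concrete, here is a small sketch using the standard ensemble estimator of CRPS (mean absolute error of the members minus half the mean absolute difference between members). The Gaussian toy setup, the `mean_crps` helper, and the spread values are assumptions for illustration, not anything from the paper. Because CRPS is a proper scoring rule, the ensemble whose spread matches the truth should score best on average, with no post-processing step.

```python
import random


def crps_ensemble(members, obs):
    """Ensemble CRPS estimate: E|X - y| - 0.5 * E|X - X'| over the members."""
    m = len(members)
    skill = sum(abs(x - obs) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return skill - spread


def mean_crps(forecast_sigma, rng, trials=1000, size=20):
    """Average CRPS when the truth is N(0, 1) and members are N(0, forecast_sigma)."""
    total = 0.0
    for _ in range(trials):
        obs = rng.gauss(0.0, 1.0)
        members = [rng.gauss(0.0, forecast_sigma) for _ in range(size)]
        total += crps_ensemble(members, obs)
    return total / trials


rng = random.Random(0)
scores = {
    "underdispersed": mean_crps(0.1, rng),  # overconfident, too little spread
    "calibrated": mean_crps(1.0, rng),      # spread matches the truth
    "overdispersed": mean_crps(3.0, rng),   # too much spread
}
```

Lower is better here, and with a single member the score reduces to plain absolute error.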

jasonmarks_ · 2025-11-18T01:02:21 1763427741

> Im pretty deep into this topic and what might be interesting to an outsider is that the leading models like neuralgcm/weathernext 1 before as well as this model now are all trained with a "crps" objective which I haven't seen at all outside of ml weather prediction.

That is a bit misleading. The model is trained on historical data, but each run from new instrument readings is generated several times to form an ensemble.