
It's funny because GPT-4 is actually a pile of 3.5s. You just need to set it up correctly.



I guess it's the difference between an ensemble and a mixture of experts, i.e. aggregating outputs from models trained on the same data vs. models trained on different data (GPT-4). Though GPT-4 presumably doesn't aggregate; it routes.
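
For what it's worth, here's a minimal toy sketch of that distinction (nothing to do with GPT-4's actual architecture, just the textbook difference): an ensemble runs every expert on every input and averages, while a hard mixture of experts uses a learned gate to route each input to a single expert.

    import torch
    import torch.nn as nn

    class Expert(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.ff = nn.Linear(dim, dim)

        def forward(self, x):
            return self.ff(x)

    class Ensemble(nn.Module):
        """All experts see every input; their outputs are averaged."""
        def __init__(self, dim, n_experts):
            super().__init__()
            self.experts = nn.ModuleList(Expert(dim) for _ in range(n_experts))

        def forward(self, x):
            return torch.stack([e(x) for e in self.experts]).mean(dim=0)

    class MixtureOfExperts(nn.Module):
        """A gate picks one expert per input; only that expert runs."""
        def __init__(self, dim, n_experts):
            super().__init__()
            self.experts = nn.ModuleList(Expert(dim) for _ in range(n_experts))
            self.gate = nn.Linear(dim, n_experts)

        def forward(self, x):  # x: (batch, dim)
            choice = self.gate(x).argmax(dim=-1)  # one expert index per input
            out = torch.empty_like(x)
            for i, expert in enumerate(self.experts):
                mask = choice == i
                if mask.any():
                    out[mask] = expert(x[mask])
            return out

In the ensemble, compute grows with the number of experts; in the MoE, each input only pays for one expert, which is the usual argument for why routing scales better.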


> GPT-4 is actually a pile of 3.5s

I understand the intention and the reference you're making, and I'd bet the implementation of GPT-4 is something along those lines. However, spreading speculation in definitive language like that when the truth is unknown is dishonest, wouldn't you agree?


Sure, I could put it less definitively, but realistically, what else could it be? The transformer won't change much, and all of the models use it at their core. It's a closely guarded secret precisely because it would be easy to replicate.



