OpenAI GPT-4 vs. Groq Mistral-8x7B (serpapi.com)
105 points by tanyongsheng 7 months ago | 133 comments



The prompt, for those interested. I find it pretty underspecified, but maybe that's the point. For example, "Business operating hours" could be expanded a little, because "Closed - Opens at XX" is still non-processable in both cases.

  You are an expert in Web Scraping, so you are capable to find the information in HTML and label them accordingly. Please return the final result in JSON.

  Data to scrape: 
  title: Name of the business
  type: The business nature like Cafe, Coffee Shop, many others
  phone: The phone number of the business
  address: Address of the business, can be a state, country or a full address
  years_in_business: Number of years since the business started
  hours: Business operating hours
  rating: Rating of the business
  reviews: Number of reviews on the business
  price: Typical spending on the business
  description: Extra information that is not mentioned yet in any of the data
  service_options: Array of shopping options from the business, for example, in store shopping, delivery and many others. It should be in format -> option_name: true
  is_operating: Whether the business is operating
  
  HTML: 
  {html}


This should be higher up. This whole blog post is mostly worthless because the way they are extracting data is less than optimal.

Lower-end models don't have the attention to complete tasks like this; GPT-4 Turbo generally will. But for an optimal pipeline you should really be splitting these tasks into individual units: extract each attribute you want independently, then combine the results back together however you want. Asking for JSON upfront is equally suboptimal in the whole process.

I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.

Edit: I am not suggesting that an LLM is more optimal than whatever traditional parsing methods they may use, only that the way they are doing it is wrong from an LLM-workflow perspective.


> I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.

Cool, cool. I'm super interested. Please share the process and the results.


Also, my (limited) experience with prompts tells me that you want to invest more into the “You are” part. I’ll share my understanding; corrections are appreciated.

LLMs aren’t people, even in a chat-roleplaying sense. They complete a “document” that can be a plot, a book, a transcript of a conversation. The “AI” side in the chat isn’t the LLM itself, it’s a character (and so are you: it completes your “You: …” replies too; that’s where the driver app stops it and lets you respond). So everything you put in that header is very important. There are two places where you can do that: right in the chat, as in TFA, or in the “character card” (idk if GPTs have it, no GPT access for me). I found that properly crafting a character card makes a huge difference and can resolve whole classes of issues.

Idk what will work best in this case, but I’d start with describing what sort of bot it is, how it deals with unclear or incomplete information, how amazing it is (yes, really), its soft/tech skills and problem-solving abilities, what other people think of it, their experience with it, and so on. Maybe I’d add a few examples of interactions in free form. Then in the task message I’d give it more specific details about that JSON.
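
To make this concrete, here is a made-up example of what such a card could look like (wording entirely invented, just to illustrate the idea):

  You are ParseBot, a meticulous data-extraction assistant. You read messy
  HTML and report the requested fields exactly as they appear on the page.
  If a field is genuinely absent you output null; you never guess. Colleagues
  describe you as precise, literal and unflappable.

  Example interaction:
  User: Extract the phone number from: <span class="tel">+1 555-0100</span>
  ParseBot: +1 555-0100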

One more note: at least for 8x7B, the “You are” in the chat is a much weaker instruction than a character card, even if the context is still empty. I low-key believe that’s because it’s a second-class prompt, i.e. the chat document effectively starts with “This is a conversation with a helpful AI bot which yada yada” in… mind, and then in that chat the AI character gets asked to turn into something else, which poisons the setting.

Simply asking the default AI card represents 0.1% of what’s possible and doesn’t give the best results. Prompt Engineering is real.

> I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.

Same. I think that no matter how good a model is, this prompt just isn’t a professional task statement and leaves too much to decide. It’s a task that you, as a regular human, would hate to receive.


Do you have an example of a more optimal prompt to share?


The prompt does not matter as much as the workflow described above: 1) Extract one attribute at a time. 2) Don't ask for JSON during extraction (for small binary attributes it might not matter as much). 3) Combine the data later.

There are differences in how different models perform against the same raw prompt, but generally the workflow is what matters more. The raw text prompt will depend on which model you are using, but I don't think it's the level of "prompt engineering" we had a year ago.
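
As a rough sketch of that flow (illustrative only; ask_llm is a placeholder for whichever model/API you actually call):

  # Illustrative sketch of the per-attribute workflow described above.
  # ask_llm is a placeholder for a call to whatever model you use.
  import json

  ATTRIBUTES = {
      "title": "What is the name of the business? Answer with the value only.",
      "phone": "What is the business's phone number? Answer with the value only.",
      "rating": "What is the business's rating? Answer with the value only.",
  }

  def ask_llm(question: str, html: str) -> str:
      # Placeholder: replace with one small, focused completion call per attribute.
      return ""

  def extract(html: str) -> str:
      record = {}
      for key, question in ATTRIBUTES.items():
          answer = ask_llm(question, html).strip()
          record[key] = answer or None
      # Combine into JSON yourself instead of asking the model for JSON upfront.
      return json.dumps(record, indent=2)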


Brave new world, where our machines are sometimes wrong but by gum they are quick about it.


I too am a big fan of having my computer hallucinate incorrect information.


Yesterday I asked my locally running gpt4all "What model are you running on?"

Answer: "I'm running on Toyota Corolla"

Which was perhaps the funniest thing I heard that day.


>> print(“Hello, world!”.ai_reverse()) world, Hello!


First few versions of Swift kept changing how strings work because it's not entirely obvious what most people intend from the nth element of a string.

Used to be easy, when it was ASCII.

Reverse the bytes of UTF-8 and it won't always be valid UTF-8.

Reverse the code-points, and the Canadian flag gets replaced with the Ascension Island flag.
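
A quick Python illustration of the code-point case (Python strings index by code point, so slicing reverses code points):

  # Reversing code points turns the Canadian flag (regional indicators C + A)
  # into the Ascension Island flag (A + C).
  flag = "\U0001F1E8\U0001F1E6"   # Canada
  print(flag[::-1])               # Ascension Island

  # Reversing raw UTF-8 bytes usually isn't even valid UTF-8 anymore:
  print("é".encode("utf-8")[::-1].decode("utf-8", errors="replace"))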


Character-level operations are difficult for LLMs. Because of tokenization they don't really "perceive" strings as a list of characters. There are LLMs that ingest bytes, but they are intended to process binary data.


Finally, something more offensive than parsing HTML with regular expressions: parsing HTML with LLMs.


I for one am glad I can offload all the regex to LLMs. Powerful? Yes. Human readable for beginners? No.


Why though? To me, it seems more prone to issues (hallucinations, prompt injections, etc.). It is also slower and more expensive at the same time. I also think it is harder to implement properly, and you need to add way more tests in order to be confident it works.


Personally when I am parsing structured data I prefer to use parsers that won't hallucinate data but that's just me.

Also, don't parse HTML with regular expressions.


Generally I agree with your point, but there is some value in a parser that doesn’t have to be updated when the underlying HTML changes.

Whether or not this benefit outweighs the significant problems (cost, speed, accuracy and determinism) is up to the use case. For most use cases I can think of, the speed and accuracy of an actual parser would be preferable.

However, in situations where one is parsing highly dynamic HTML (e.g. if each business type had slightly different output, or you are scraping a site that updates its structure frequently and breaks your hand-written parser), this could be worth the accuracy loss.


You could employ an LLM to give you updated queries when the format changes. This is something where they should shine. And you get something that you can audit and exhaustively test.


Deterministic? No.


There are so many applications for LLMs where having a perfect score is much more important than speed, because getting it wrong is so expensive, damaging, or time consuming to resolve for an organisation.


If you need a perfect score, don't use LLMs. This seems obvious to me, even given the state-of-the-art LLMs. I am a heavy user of GPT-4 and I wouldn't bet $1000 on it being 100% reliable for any non-trivial task.


They'll get better. Humans are far from perfect, and I have no doubt that LLMs will eventually outperform them for non-trivial tasks consistently.


Maybe so, but at this stage I wouldn't be betting a business model on it.


Businesses do bet on imperfect and even criminal models all the time (way before LLMs existed)... they call it cost of doing business when they get it wrong or get caught.


> Humans are far from perfect

Humans running multi-shot with a mixture of experts are close to perfect. You can't compare a multi-shot, mixture-of-experts AI to a single human; humans don't work in isolation.


Machine learning models will get better for sure. We don't know if LLMs are the end game, though, and it's not clear whether this particular technique is what we'll need to reach the next level.


Or they might not get better. It could be that we are at a local optimum for that sort of thing, and major improvements will have to wait (perhaps for a very long time) for radical new technologies.


Maybe, but it certainly hasn’t been the arc of the past few years. I don’t know how anyone could look at this and assume that it’s likely to slow down.


They already have superhuman image classification performance.


I remember talking to a radiologist who, something like ten years ago, said he was sure something like this was coming: instead of a radiologist looking at scans manually, a machine would go through a lot of images and flag some for manual review.

We haven't even gotten there yet, have we?


Yes, we absolutely are there: https://youtu.be/D3oRN5JNMWs?feature=shared

My professor (Sir Michael Brady) at university 14 years ago set up a company to do this very thing, and he already had reliable models back before 2010. I believe their company was called Oxford Imaging or something similar.


Yep, everyone seems to forget that ML was available before 2021. Had a conversation recently with my former colleague who learned about some plastic packaging company which used "AI" to predict client orders and inform them about scheduling implications. When I told him that you don't need Transformers and 30GB models for that, he was quasi-confused, cause he kinda knew it but the hype just overtook his knowledge.


In ML courses, you’re taught to try simpler methods and models before turning to more complex ones. I think that’s something that hasn’t made it into the mainstream yet.

A lot of people seem to be using GPT-4 for tasks like text classification and NER, and they’d be much better off fine-tuning a BERT model instead. In vision, too, transformers are great but a lot of times, a CNN is all you really need.
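
For the classification case, a fine-tune with the Hugging Face transformers library is only a handful of lines. A rough sketch, where the dataset, label count and hyperparameters are placeholders:

  # Rough sketch of fine-tuning BERT for text classification with Hugging Face
  # transformers; dataset, label count and hyperparameters are placeholders.
  from datasets import load_dataset
  from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                            Trainer, TrainingArguments)

  tok = AutoTokenizer.from_pretrained("bert-base-uncased")
  model = AutoModelForSequenceClassification.from_pretrained(
      "bert-base-uncased", num_labels=2)

  ds = load_dataset("imdb")  # any labelled text dataset works here
  ds = ds.map(lambda batch: tok(batch["text"], truncation=True), batched=True)

  trainer = Trainer(
      model=model,
      args=TrainingArguments(output_dir="bert-clf", num_train_epochs=1,
                             per_device_train_batch_size=16),
      train_dataset=ds["train"].shuffle(seed=42).select(range(2000)),
      eval_dataset=ds["test"].select(range(500)),
      tokenizer=tok,  # enables dynamic padding via the default collator
  )
  trainer.train()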


> We haven't even gotten there yet, have we?

Yes and no. Countless teams have solved exactly this problem at universities and research groups across the world. Technically it's pretty much a solved problem. The hard part is getting the systems out of the labs, certified as an actual product, and convincing hospitals and doctors to actually use them.


Maybe it's a liability issue, not a competency issue.


Until a single pixel makes a cat a dog or something like that.


Changing a single pixel is usually not enough to confuse convolutional neural networks. Even so, human supervision will probably always be quite important.


I've tried applying it to parsing HTML, as in this article, in a pretty long pipeline. I'm using DeepInfra with Mistral 8x7B and I'm still unsure if I'm going to use it in production.

The problem I'm finding is that the time I wanted to save maintaining selectors and the like is time that I'm now spending writing wrapper code and dealing with the mistakes it makes. Some are OK and I can deal with them; others are pretty annoying because it's difficult to handle them in a deterministic manner.

I've also tried with GPT-4 but it's way more expensive, and despite what this guy got, it also makes mistakes.

I don't really care about inference speed, but I do care about price and correctness.


Might be a silly question, but if you want determinism in this, why don't you get the LLM to write the deterministic code, and use that instead? Interesting experiment, though!

In fact, what about a hybrid of what you're doing now? Initially, you use an LLM to generate examples. And then from those examples, you use that same LLM to write deterministic code?


Have you tried swapping Mistral 8x7B with either command-r 34B, Qwen 1.5 70B, or miqu 70B? Those are all superior in my experience, though suited for slightly different tasks, so experimentation is needed.


Parsing HTML and tag soup is IMHO not the right application for LLMs, since these are ultimately structured formats. LLMs are for NLP tasks, like extracting meaning out of unstructured and ambiguous text. The computational cost of an LLM chewing through even a moderately-sized document could be more efficiently spent on sophisticated parser technologies that have been around for decades, which can also, to a degree, deal with ambiguous and irregular grammars. LLMs should be able to help you write those.


Yeah I agree - just an hour ago I was dealing with an LLM that was missing a "not" thus inverting the meaning of a rather important simulation parameter!


It makes much more sense to me to have the LLM infer the correct query for extracting data on the page. Much faster and more reliable, and it wouldn't really be a problem to have a human in the loop every now and then.


All the places I see AI being applicable to my work don't require a perfect score, and a threshold is actually much more useful, especially where multiple factors come together and make it hard to reduce the evaluation to a single value.


If you have speed you can generate multiple answers and have another model pick the best one.


If I ask an LLM a very complex and specific question 500 times and it just doesn't know the facts, you'll still get the wrong answer 500 times.

That's understandable. The real problem is when the AI lies/hallucinates another answer with confidence instead of saying "I don't know".


The problem is asking for facts. An LLM is not a database; it knows stuff, but that knowledge is compressed, so expect wrong facts, wrong names, wrong dates, wrong anything.

We will need the LLM as a front end: it generates a query to fetch the facts from the internet or a database, then maybe formats the facts for your consumption.


This is called Retrieval Augmented Generation (RAG). The LLM driver recognizes a query, it gets sent to a vector database or to an external system (could be another LLM...), and the answer is placed in the context. It's a common strategy to work around their limited context length, but it tends to be brittle. Look for survey papers.
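
A bare-bones sketch of the retrieval step (the embedder here is a stand-in for a real embedding model, and the final LLM call is left out):

  # Bare-bones RAG sketch: embed documents, retrieve the closest one for a
  # query, and prepend it to the prompt that goes to the LLM.
  import numpy as np

  def embed(text: str) -> np.ndarray:
      # Stand-in: a real system would call an embedding model here.
      rng = np.random.default_rng(abs(hash(text)) % 2**32)
      return rng.standard_normal(64)

  docs = ["The cafe opens at 9am on weekdays.", "The cafe's phone is +1 555-0100."]
  doc_vecs = np.stack([embed(d) for d in docs])

  def retrieve(query: str, k: int = 1):
      q = embed(query)
      sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
      return [docs[i] for i in np.argsort(-sims)[:k]]

  def build_prompt(query: str) -> str:
      context = "\n".join(retrieve(query))
      return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

  print(build_prompt("When does the cafe open?"))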


That's exactly it. It's ok for LLMs to not know everything, because they _should_ have a means to look up information. What are some projects where this obvious approach is implemented/tried?


But then you need an LLM that can separate grammar from facts. Current LLMs don't know the difference; that is the main source of these issues. These models treat facts like grammar, and that worked well enough to excite people but probably won't get us to a good state.


The weird problem with LLM hallucinations is that the model usually will acknowledge its mistake and correct itself if you call it out. My question is why can't LLMs include a sub-routine to check themselves before answering, simply asking something like "this answer may not be correct, are you sure you're right?"


> The weird problem with LLM hallucinations is that the model usually will acknowledge its mistake and correct itself if you call it out.

From what I've tested, all of the current models will see a prompt like "are you sure that's correct" and respond "no, I was incorrect [here's some other answer]", irrespective of the accuracy of the original statement.


In my experience the corrections can be additional hallucinations, one after another, even after pointing out inaccuracies multiple times in a row.


> My question is why can't LLMs include a sub-routine to check themselves before answering.

Because LLMs, operated on their own, don't work in a way that makes that possible.

Here is the debug output from my local instance of Mistral-Instruct 8x7B. The prompt from me was 'What is poop spelled backwards?'. It answered 'puoP'. Let's see how it got there, starting with how it processed my prompt into tokens:

   'What (3195)', ' is (349)', ' po (1627)', 'op (410)', ' sp (668)', 'elled (6099)', ' backwards (24324)', '? (28804)', '\n (13)', '### (27332)', ' Response (12107)', ': (28747)', '\n (13)',
It tokenized 'poop' as two tokens: 'po', number 1627, and 'op', number 410.

Next it comes up with its response:

   Generating (1 / 512 tokens) [(pu 4.43%) (The 66.62%) (po 11.96%) (p 4.99%)]
   Generating (2 / 512 tokens) [(o 89.90%) (op 10.10%)]
   Generating (3 / 512 tokens) [(P 100.00%)]
   Generating (4 / 512 tokens) [( 100.00%)]
It picked 'pu' even though it was only a ~4% chance of being correct, then instead of picking 'op' it picked 'o'. The last token was a 100% probability of being 'P'.

   Output: puoP
At no time did it write 'puoP' as a complete word nor does it know what 'puoP' is. It has no way of evaluating whether that is the right answer or not. You would need a different process to do that.


The problem is that if you call it out, it will frequently change its answer, even if it was correct. LLMs currently lack chutzpah.


They definitely stand their ground if they were aligned to do so.


But then they stand their ground when wrong too.


That is a common bullshitting strategy: talk a lot of bullshit, then backtrack and acknowledge you were wrong when people push back. That way they will think you know way more than you do. Some people will see through that, but most will just think you are a humble expert who can acknowledge when you are wrong, rather than someone who acknowledges being wrong because they so often are.

People have a really hard time catching such bullshitting from humans, which is why free-form interviews don't work.


It's because there's no entity that is actually acknowledging anything. It's generating an answer to your prompt. You can gaslight it into treating anything as wrong or correct.


They simply don't work that way. You are asking it for an answer; it will give you one, since all it can do is extrapolate from its training data.

Good prompting and certain adjustments to the text-generation parameters might help prevent hallucinations, but it's not an exact science since it depends on how the model was trained. Also, frankly, an LLM's training data contains a lot of bulls*t.


> If I ask an LLM a very complex and specific question 500 times and it just doesn't know the facts, you'll still get the wrong answer 500 times.

I think the commenter meant using another model/LLM that could give a different answer, then letting them vote on the result, like "old-fashioned AI" did with ensemble learning.
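
Roughly like this, as a sketch (each entry in models is a placeholder callable that sends the question to a different LLM):

  # Rough majority-vote sketch across several models; each callable wraps a
  # different model/client and returns its answer as a string.
  from collections import Counter
  from typing import Callable, List

  def vote(question: str, models: List[Callable[[str], str]]) -> str:
      answers = [m(question).strip().lower() for m in models]
      best, count = Counter(answers).most_common(1)[0]
      return best if count > len(models) // 2 else "no consensus"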


This test is interesting as a general high-level metric, but overall the way they are extracting data using an LLM is suboptimal, so I don't think the takeaway means much. You could extract this type of data using a low-end model like 8x7B with a high degree of accuracy.


The better way would be to ask it to generate a program that uses CSS selectors to parse the HTML.
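
i.e. something along these lines, generated once from a sample page and then reused. The selectors below are invented for illustration, not taken from the actual results HTML:

  # Sketch of a deterministic extractor using CSS selectors with BeautifulSoup.
  # The selectors are invented; an LLM would propose the real ones from a
  # sample page, and you review and test them once.
  from bs4 import BeautifulSoup

  def parse_listing(html: str) -> dict:
      soup = BeautifulSoup(html, "html.parser")

      def text(selector: str):
          node = soup.select_one(selector)
          return node.get_text(strip=True) if node else None

      return {
          "title": text(".business-name"),   # hypothetical selector
          "rating": text(".rating-value"),   # hypothetical selector
          "phone": text(".contact-phone"),   # hypothetical selector
      }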


Mixtral works very well with JSON output in my personal experience. The GPT family is excellent of course, and I would bet Claude and Gemini are pretty good. Mixtral, however, is the smallest of these models and the most efficient.

Especially running on Groq's infrastructure it's blazing fast. In some examples I ran on Groq's API, the query completed in 70ms. Groq has released API libraries for Python and JavaScript; I wrote a simple Rust example of how to use the API here [1].
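
For reference, the Python library follows the familiar OpenAI-style interface. A minimal sketch (the model id is the one Groq listed for Mixtral at the time and may have changed):

  # Minimal sketch of the Groq Python client (OpenAI-style interface).
  # The model id is the one Groq listed for Mixtral at the time; it may change.
  import os
  from groq import Groq

  client = Groq(api_key=os.environ["GROQ_API_KEY"])
  resp = client.chat.completions.create(
      model="mixtral-8x7b-32768",
      messages=[{"role": "user", "content": 'Reply with valid JSON: {"ok": true}'}],
  )
  print(resp.choices[0].message.content)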

Groq's API reports how long it takes to generate the tokens for each request. 70ms for a page-long document is well over 100 times faster than GPT, and faster than every other capable model. Accounting for internet latency and whatever queueing might exist, the user receives the response in about a second. But how fast would this model run locally? Fast enough to generate natural-language tokens, generate a synthetic voice, listen again and decode the next request the user speaks to it, all in real time.

With a technology like that, why not talk to internet services with just APIs and no web interface at all? Just functions exposed on the internet that take JSON as input, validate it, and send JSON back to the user. Or think of every other interface and button around: why press buttons on every electric appliance instead of just talking to the machine via a JSON schema? Why should users on an internet forum have to press the "add comment" button every time a comment is added, instead of just saying "post it"? Pretty annoying, actually.

[1] https://github.com/pramatias/groq_test


Groq will soon support function calling. At that point, you would want to describe your data specification and use function calling to do extraction. Tools such as Pydantic and Instructor are good starting points.

I am collecting these approaches and tools here: https://github.com/imaurer/awesome-llm-json
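
For illustration, the Pydantic + Instructor pattern looks roughly like this (field set trimmed, html is a placeholder snippet, and the exact wrapper call depends on the Instructor version):

  # Sketch of schema-driven extraction with Pydantic + Instructor. The field
  # set is trimmed and the wrapper call (from_openai) depends on the
  # Instructor version; html is a placeholder page snippet.
  import instructor
  from openai import OpenAI
  from pydantic import BaseModel

  class Business(BaseModel):
      title: str
      phone: str | None = None
      rating: float | None = None

  html = "<div>...page snippet...</div>"
  client = instructor.from_openai(OpenAI())
  business = client.chat.completions.create(
      model="gpt-4-turbo",
      response_model=Business,
      messages=[{"role": "user", "content": f"Extract the business from:\n{html}"}],
  )
  print(business.model_dump())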


Interesting post, but the prompt is missing? How do the LLMs generate the keys? It's likely the mistakes could be corrected with a better prompt or a post check?

Also, Google SERP page is deterministic (always has the same structure for the same kind of queries), so it would probably be much more effective to use AI to write a parser, and then refine it and use that?


I initially thought the blog post was about scraping using screenshots and multi-modal LLMs.

Scraping is quite complex by now (front-end JS, deep and irregular nesting, obfuscated html, …).


There are lots of comments here about how stupid it is to parse HTML using LLMs.

Have you ever had to scrape multiple sites with wildly varying HTML?


The example here has HTML with a somewhat fixed format. It would indeed have been better to have samples with different formats and to aim for a low error rate.

If you are scraping a limited amount of sites, you could for each site ask the LLM for parsing code from some samples, review that, and move on.


Sorry to be nit-picky, but that's the essence of these benchmarks: Mistral putting "N/A" for "not available" is weird. N/A means "not applicable" in every use I have ever seen, and they DON'T mean the same thing. I would expect null for not available and N/A for not applicable.

Impressive inference speed difference though


I have always known N/A as not available.


Curious, where are you from? If I Google N/A, every single hit on the first page explains that it means "Not applicable".

Are you from a non-English-speaking country? Maybe it's cultural?


The first entry on Google is Wikipedia [1] for me:

> N/A (or sometimes n/a or N.A.) is a common abbreviation in tables and lists for the phrase not applicable, not available, not assessed, or no answer.

[1] https://en.wikipedia.org/wiki/N/A


That's interesting, Wikipedia is not on the first page for me; my first hit is the Cambridge dictionary (and then a bunch of other dictionaries). I'm flying right now, but IP geolocation puts me in the US.

Meaning of n/a in English written abbreviation for not applicable: used on a form to show that you are not giving the information asked for because the question is not intended for you or your situation: If a question does not apply to you, please put N/A in the box provided. COMMERCE.

TIL


In a data table, "not available" is usually the right term for it; for example, if you have a list of national statistics, some of the values won't be available for political reasons, etc. But all of those mean basically the same thing to the end user: this value isn't there.


I'm from Northern Europe, so not a native English speaker, but based on my experience the first idea that comes to mind is that it's Not Available.

If I was to code something and for whatever reason some data wasn't available I would use N/A.

"Not applicable" doesn't feel right to me about N/A.

For instance, if there is a comparison table and for whatever reason data is missing for some entity, even though it should be there, I would use N/A. So "not applicable" feels wrong to me for that reason alone.

This all is coming from intuition though.


It means all of these.


Can somebody explain why this Groq is more performant than Microsoft's infrastructure? Is an LPU better than a TPU/GPU?


LLM performance is about parallelism but also memory bandwidth.

Groq delivers this kind of speed by networking many, many chips together with high-bandwidth interconnect. Each chip has only 230MB of SRAM [0].

From the linked reference:

"In the case of the Mixtral model, Groq had to connect 8 racks of 9 servers each with 8 chips per server. That’s a total of 576 chips to build up the inference unit and serve the Mixtral model."

That's eight racks with ~132GB of memory for the model. A single H100 has 80GB and can serve Mixtral without issue (albeit at lower performance).

If you consider the requirements of actual real-world inference-serving workloads (you need to serve multiple models, multiple versions of models, LoRA adapters, sentence-embedding models for RAG, etc.), the economics and physical footprint alone get very challenging.

It's an interesting approach and clearly very, very fast but I'm curious to see how they do in the market:

1) This analysis uses cloud GPU costs for Nvidia pricing. Cloud providers make significant margin on their GPU instances. If you look at qty 1 retail Nvidia DGX, Lambda Hyperplane, etc. and compare it to cloud GPU pricing (inference needs to run 24x7), break-even on hardware vs. cloud is less than seven months, depending on what your costs are for hosting the hardware.

2) Nvidia has incredibly high margins.

3) CUDA.

There are some special cases where tokens per second and time to first token are incredibly important (as the article states - real time agents, etc) but overall I think actual real-world production use or deployment of Groq is a pretty challenging proposition.

[0] - https://www.semianalysis.com/p/groq-inference-tokenomics-spe...


The Mistral mixture-of-experts model (Mixtral) has way fewer parameters active during inference, and Groq has special-purpose hardware (and probably less concurrent demand).


> probably less concurrent demand

This is a significant understatement. ChatGPT has an estimated 100m monthly active users.

Groq gets featured on HN from time to time but is otherwise almost completely unknown. According to their stats they have done something like 15m requests total since launch. ChatGPT likely does this in hours (or less).


It's a totally different approach to inference.

In short:

Groq - AI chip
Microsoft etc. - Nvidia GPU


A bit off-topic but maybe not? Any words on GPT-5? Is that coming? Or is OpenAI just focusing on the Sora model?


There's no reason for OpenAI to release the model. They have close to 100% of the market anyway, and releasing GPT-5 likely won't increase the total market as it is an incremental leap. And it's an open secret that most other models used GPT-4 synthetic data for training to come close to it.

They will likely wait until some model performs better than GPT-4 for the same price.


The same reasoning would have applied to GPT-3.5. In hindsight, you can say that it was obviously a good idea to build and ship GPT-4. But hindsight is 20/20.


There are a few differences. Firstly, GPT-3.5 wasn't ahead of PaLM etc. from Google, which was published at the same time as GPT-4.

Secondly, GPT-4 increased the overall AI market; I doubt GPT-5 would do that. According to all the sources, interviews and leaks, GPT-5 won't be a big leap over GPT-4, as the model size and training data won't be significantly larger. (I could be wrong in my assumption that GPT-5 will just be an incremental gain, though.)


By any chance did you used to work in leadership at Nokia or Research in Motion? :-D


Nokia wasn't that far ahead in technology and RIM wasn't that far ahead in the market. GPT-4 is ahead in both.


There is reason to release new models if said models would be capable of grabbing a significant portion of the job market currently occupied by humans.


100%?

Claude 3 Opus is in the capability ballpark of GPT-4, GPT-3.5 has alternatives that are cheaper (Claude 3 Haiku) or cheaper and work offline (Qwen 1.5, Mixtral, …).


100% market share.

A competitor will likely need to be 10x better than ChatGPT in order to get significant market share, not just marginally better in certain scenarios.


Is Claude 3 Opus generating more profit and taking a considerable number of customers from OpenAI? I'm not seeing that yet. Granted, I'm in Europe (outside the EU) so I can't pay for Opus, but I guess that kinda confirms my statement. GPT-4 is still a good product and there are no market pressures to release GPT-5.


I hear it should be dropped this summer


According to Sam Altman in a podcast with Lex Fridman this week, there is no real indication that it will be dropped this year. They will release a new model, but it might not be GPT-5


Fair enough, I got the info from this article

https://web.archive.org/web/20240319224624/https://www.busin...


Which is an indication of nothing. In which world would Sam A. drop any kind of info about such a sensitive topic? If anything, this could just be deception before a massive drop.


Could also be resetting expectations for people who've been expecting GPT-5 (or just GPT-4.5) sooner - been a year now since GPT-4 was released.

The other odd thing from Altman was saying that GPT-4 sucks.

I think the context for both announcements is the recent release of Anthropic's Claude-3, which in its largest "Opus" form beats GPT-4 across the board in benchmarks.

I personally think OpenAI/Altman is a bit scared that any moat/lead they had has disappeared and they are now being out-competed by Anthropic (Claude). Remember that Anthropic as a company was only formed (by core members of the OpenAI LLM team) around the same time GPT-3 was released, so in the same time it took OpenAI to go from GPT-3 to GPT-4, Anthropic have gone from nothing -> Claude-1 -> Claude-2 -> Claude-3, which beats GPT-4!

Anthropic have also had quite a bit of success attracting corporate business, quite a bit of which is more long-term in nature (sharing details of expected future model capabilities so that partners can target those).

So, I think OpenAI is running a bit scared, and I'd interpret this non-announcement of some model (4.5 or 5) "coming soonish" to be them just waving the flag and saying "we'll be back on top soon", which they presumably will be, briefly, when their next release(s) do come out. Altman's odd "GPT-4 sucks" statement might be meant to downplay Claude-3 "Opus" which beats it.


My understanding from the Lex podcast: they will release a lot of new models this year, but they will release intermediate models before GPT-5.


For all the posturing and crypto hate on HN, we're entering a world where it's socially acceptable to use 1000W of computing power and 5 seconds of inference time to parse a tiny HTML fragment which would take microseconds with traditional methods - and people are cheering about it. Time for some self-reflection? That's not very green.


Crypto energy requirements go up as the currency gets more traction.

TFA shows that Groq is many times faster than GPT-4 (up to 18x, Groq claims). Faster means less energy. So I think it's just a matter of time until these things become ridiculously power efficient (e.g. run on phones in sub-second times).


How does faster mean less energy? That's only true if you're running faster on the same hardware…


Presumably: less time the giant chip has to draw power for computation. The point is that everyone's interested in making AI power efficient, while crypto's proof of work is a competition to burn more power hashing and throw away the result.


I think they are talking about the case where, hypothetically, there is a 10x increase in speed but only 2x increase in power consumption


I’m just pointing out that this is not a given…


Bitcoin energy requirements will be cut in half in a few days..


It's still a monstrosity compared to a traditional parser. You can even be fancy and use complex parsers that backtrack and can deal with mildly context-sensitive languages (as required for HTML, XML, and many programming languages), and you'd still be more efficient.


This is a valid point, but we are still in the early stages of AI/LLMs, so one would expect the speed and efficiency to improve drastically (perhaps accuracy too) over the coming years.

At least AI & LLMs have large scale practical applications as opposed to crypto (IMO).


AI is a lot older than blockchain. There were full-fledged neural networks in the 40s and the perceptron was implemented in hardware in the 50s.


It's also interesting to think that IBM released an 8-trillion parameter model back in the 1980s [0]. Granted it was an n-gram model so it's not exactly an apples-to-apples comparison with today's models, but still, quite crazy to think about.

[0]: https://aclanthology.org/J92-4003.pdf


Interesting to see Robert Mercer the former CEO of Renaissance Technology is one of the authors on that paper. He is a former IBMer. If his name is unfamiliar he is a reclusive character who was a major funder of Breitbart, Cambridge Analytica and the Republican candidate in the 2016 presidential election.


I wouldn't call the early McCulloch & Pitts work quite "full-fledged". Also, backpropagation, essential for multi-layer perceptrons, was not a thing until the 1980s.


Backprop is just applied calculus. People simply hadn't thought of using it for neural networks yet.


It was thought of as early as the 1960s by Rosenblatt, but he did not come up with a practical implementation at the time. Lotsa things look obvious in hindsight.


You're partially right. It's obvious that the solution is to combine traditional programming with AI, using traditional programming wherever possible because it's greener. Assuming you want things to turn out well in every possible future scenario, your decisions only matter if AGI isn't right around the corner. So assume it isn't right around the corner. Then there's going to be some interesting combining-together of manual human intervention, traditional software, and AI. We'll need to charge more for some uses of electricity, to incentivise turning AI into traditional software wherever possible.

Crypto is nearly pure waste.


> We'll need to charge more for some uses of electricity, to incentivise turning AI into traditional software wherever possible.

I don't understand this. This adds bureaucracy and I don't see why different uses need to be charged differently if they all use energy the same.

In other words, if energy costs X per unit, and an inefficient (AI) software takes 30 units and an efficient (traditional) software takes 10 units, then it is already cheaper to run the efficient software, and thus people are already incentivised to do so. There's no need to charge differently. If one day AI turns out to only need 5 units, turning more efficient, then just charge them for 5X. People will gravitate towards the new, efficient AI software naturally then.


Websites will never be fast, will they? Even with 1000x more compute than now, they will just perform everything with LLM calls and things will be just as slow as they are now.


It would take microseconds after a complete program was written by a human?

It no longer requires an expert human


And if this use case hit any kind of scale, we'd just have an LLM generate a parser and be back to microseconds.

This was just a blog post to generate traffic for the site, not to showcase some new use case for an LLM.


Any amount of energy spent on useful work is vastly superior to whatever proof-of-work crypto burn does.

>For all the posturing and forest fire hate on HN, it’s now socially acceptable to run a toy steam engine to power a model car? Not very green of you.


It's almost a fallacy at this point to declare something bad simply because of the existence of carbon emissions, without first comparing the benefits of what is being produced, and the alternative tradeoffs.

To be fair to GP, they did compare it to alternatives (dumb HTML parsing), but failed to consider versatile HTML parsing or other uses for Groq LLM.


While you are not wrong, crypto is not what this is being compared with.


While energy remains cheap and human minds remain expensive, it always makes sense to use AI to reduce human effort.

If one cares about the environment, a carbon cap/tax is what you should campaign for. Then carbon-based energy sources will be curtailed, energy costs will go up, and AI like this will be encouraged to become more energy efficient, or other methods will be used instead.


It is a nice idea in principle but ends up being a political tool and a tariff on goods and services of your own country. A global and corruption free carbon tax might work but that is impossible to achieve.


The only way it's gonna work is if a bunch of countries get together, agree a carbon cap/tax, and then tell other countries that they need to join the scheme if they want to trade goods with the group.

One way to combat corruption is to ask an international panel of experts to assess how many extra emissions came from non-official sources in each country and reduce next year's cap by that amount. Then countries have an incentive to stamp out corruption.


I don't know. Corruption gets easier with increased centralization. I think a far better approach is to innovate our way out of it. If carbon-free energy sources are less expensive, then the problem will essentially solve itself. A global carbon tax will inevitably lose some portion of global GDP to corruption. That money would likely be better spent in other ways.

Basically, carbon tax is the accountant's solution, innovation is the engineer's.


Carbon-free energy will take a really long time to become cheaper.

As soon as demand for oil starts to drop, so will oil prices, and I suspect they could go down by a factor of 10 or more and oil-rich nations would still think it worthwhile to exploit at least some reserves.


Because crypto has very little real world use.

There is a lot of business value happening in the AI space, and it's only going to get better.


One is actually useful day to day though.


What a ridiculous complaint. Energy efficiency won't remain static, and even if it were, it's not up to you to decide how to best leverage the available electricity.


> it's not up to you to decide

Unless you live in a dictatorship it's definitely up to us to decide... Otherwise you leave your voice to the top 0.0001% business owners and expect them to work for your good and not for their own interests

Also read about the rebound effect. Planes are twice as efficient as they were 100 years ago yet they pollute infinitely more as a whole.

There is nothing ridiculous about the comment you're replying to


Yes, you are right, and the future depends on innovation and on using more electricity with a large percentage of it coming from renewable sources. I don't want to go live on the farm myself.


Ok, then let's start by doing away with all the wasteful animal farming.


AND it's not even reliable.



