
Aside from the weirdness of calling something released 17 months ago "good old" :-D I mean, deep learning is evolving at a crazy pace, but you still can't assume a good paper gets written in days.

That said, as others have pointed out, and as the blog post also notes, they are entirely different methods. QLoRA requires access to the full training data, while in principle you can apply SpinQuant to any given model. For example, they also apply it to Mistral, not only to their own LLaMA.

(QLoRA also takes some time and compute to apply, but since SpinQuant likewise involves learning some weights, I don't know whether it's actually faster or cheaper.)


I know nothing about what makes an industry succeed or fail, and also nothing about web tech, but working in the field I can comment on:

> tensorflow looks like currently loosing to pytorch - seems like google got bored and more development is for JAX, Keras wrapper

Well, TensorFlow doesn't "look like it's currently losing"; it lost a long time ago. I haven't seen a decent paper release its code in TensorFlow in years, and all the references to TF I see online are job posts from "older" companies (to the point that, if you are looking for a job in data science, seeing TF mentioned in the job post is kind of a red flag for a place you don't want to be).

That said, I am quite certain this has only a small impact on why Google is losing ground, and even on why it is behind in AI (which is also debatable: narrative aside, Gemini is not that far behind its competitors). Certainly, if TensorFlow + TPUs had turned out to be better than PyTorch + GPUs, Google would have had a lead to start from, but if that were so important, Meta or NVIDIA would have created the first LLM, not OpenAI.

Simply, sometimes stuff happens, you can't predict it all.


I know this is HN and this is not a popular opinion here, but maximum security is _not_ always a good idea. Even setting aside the problem, mentioned below, of many different actors having to access these details, there's value in a simple login process. Specifically for airplane tickets, the situations I've struggled with multiple times are retrieving reservations bought from a different computer, or through a travel agency. In all of these, it was exactly the simple approach that saved me. If 2FA were mandatory, the best-case scenario is that the travel agency would have to send you a separate e-mail with details on how to access their portal, where this 2FA would somehow work. The number of systems multiplies, and so does the number of credentials to remember. If you are not at your usual workplace (and chances are, if you are travelling, you are not) or you are on a shaky connection (same), you have a real problem. In a time-critical scenario, which makes it even worse.

Implementing a "secure" connection here would be a sure road to pain: at the very least it would require the airline to scale up customer support a lot, and it would likely mean a lot of bad publicity every time something fails. Delays cost money, especially in this industry. And what would you get for it? The assurance that, if you publish a picture of your reservation or boarding pass online, nobody can log in with your credentials and cancel your flight? That's a rather niche and very targeted risk, which is better handled by a single customer support agent who simply issues you a new ticket.

(by the way, by the time you have checked in and your boarding pass has been issued, a lot of companies just don't allow you to cancel anymore, so it's really a non-issue?)


> (by the way, by the time you have checked in and your boarding pass has been issued, a lot of companies just don't allow you to cancel anymore, so it's really a non-issue?)

Which companies have a cancellation policy that is contingent upon getting a boarding pass? I've cancelled checked-in tickets before. If the flight is operated by a different airline than the ticket issuer, you just have to call the operating airline first to undo the check-in (a few airlines can even do this online). After that it should be possible to cancel the ticket with the ticket issuer without any problems.


Do you have sources for "The MFU can be above 40% and certainly well above the 35 % in the estimate"?

Looking at [1], the authors there claim that their improvements were needed to push BERT training beyond 30% MFU, and that "default" training only reaches 10%. Certainly the numbers don't translate exactly; it might well be that with a different stack, model, etc., it is easier to do better, but 35% doesn't seem like a terribly off estimate to me. Especially if you are training a whole suite of different models (with different parameters, sizes, etc.), so you can't realistically optimize all of them.

It might be that the real estimate is around 40% instead of the 35% used here (frankly it might be that it is 30% or less, for that matter), but I would doubt it's so high as to make the estimates in this blog post terribly off, and I would doubt even more that you can get that "also for small models with plain pytorch and trivial tuning".

[1] https://www.databricks.com/blog/mosaicbert
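
For reference, MFU is just the training FLOPs per second you actually achieve divided by the hardware's peak FLOPs per second. A rough sketch of the arithmetic (every number below is made up for illustration, not taken from the blog post or from [1]; the 6 * params FLOPs-per-token rule is the usual approximation for dense transformers):

    params = 1.3e9            # hypothetical model size
    tokens_per_sec = 110_000  # hypothetical measured training throughput
    peak_flops = 312e12       # A100 BF16 peak from the spec sheet
    n_gpus = 8

    achieved = 6 * params * tokens_per_sec     # FLOPs/s actually spent on the model
    mfu = achieved / (peak_flops * n_gpus)
    print(f"MFU ~ {mfu:.1%}")                  # ~34% with these made-up numbers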


Please look at any of the plain PyTorch code by Karpathy that complements llm.c. If you want scalable code, please look at Megatron-LM.


I'm into AI but not into sound, so I might be saying something stupid here, but I think using something like this at very high volumes, like concerts, might be outright impossible, and even if it isn't, it would certainly be dangerous enough not to be commercializable.

My understanding is that to "mute" a sound, you need to inject another wave that is exactly the opposite, with the exact same volume and in perfect sync, so that the two waves interfere destructively. However, in general but especially with AI, you can never guarantee 100% accuracy. If you use this technology to "silence" a background fountain and something goes wrong, at worst you get a lot of noise that makes you grimace and take the headphones off. If, at a concert with 100+ dB of music, you get an error and your headphones start producing a similarly loud but not perfectly aligned noise right into your ears, you probably won't have time to take them off before damaging your hearing.

In general, I think that having a tool that drives 100+ dB straight into your head is probably not a wise idea :-)
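
To make the "not perfectly aligned" point concrete, here is a toy numpy sketch (my own illustration, nothing to do with the actual product) of how quickly a small phase error eats into the cancellation:

    import numpy as np

    fs = 48_000
    t = np.arange(fs) / fs
    tone = np.sin(2 * np.pi * 1000 * t)        # the 1 kHz tone we want to cancel

    for phase_err_deg in (0.0, 1.0, 5.0, 20.0):
        anti = -np.sin(2 * np.pi * 1000 * t + np.radians(phase_err_deg))
        residual = tone + anti
        rel_db = 20 * np.log10(np.std(residual) / np.std(tone) + 1e-12)
        print(f"{phase_err_deg:5.1f} deg error -> residual {rel_db:6.1f} dB vs. original")

A 1-degree error still gives you ~35 dB of cancellation, but at 20 degrees you are down to ~9 dB; on top of a 100+ dB source, the leftover is still dangerously loud.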


You could probably achieve the same outcome by combining two approaches though. Use the traditional timing and phase management that existing noise-cancelling headphones already do. Then, using the data from that same set of microphones, use AI to extract the conversation of interest (maybe using timing differences between left/right to determine who's "in front" of you) and inject that as the thing to overlay on top of the inversion. This way there's no risk of AI error in the noise cancellation, and you can rely on existing solutions.


Even putting 50 dB of sound in the opposite direction might help take something from the volume of a nightclub down to the volume of a refrigerator [1]. Not perfectly muting it, but perhaps good enough for many scenarios.

Disclaimer - I also have no technical experience of sound

[1] Going by the sounds levels in this post: https://lexiehearing.com/us/library/decibel-examples-noise-l...


It probably wouldn't work for in-ear setups. However, if you have over-ear headphones with good passive noise cancelling (35 dB), then you would need less active cancelling (65 dB) to make it quiet and safe.
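
Attenuations expressed in dB add up, so the split suggested here works out on paper; a quick sanity check with the numbers from this thread (the rest is just unit conversion):

    import math

    def db_to_ratio(db):                      # dB -> amplitude (pressure) ratio
        return 10 ** (db / 20)

    passive_db = 35                           # over-ear passive isolation
    active_db = 65                            # active cancellation on top

    total_ratio = db_to_ratio(passive_db) * db_to_ratio(active_db)
    total_db = 20 * math.log10(total_ratio)
    print(f"{total_db:.0f} dB combined; a 110 dB club is left at ~{110 - total_db:.0f} dB")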


You can get earplugs with ~30 dB reduction and built-in in-ear monitors. Slap some microphones and such on the outside, and you can probably work with it.


Yep, that also sounded weird to me. I had, IIRC, three of my wisdom teeth removed as a teenager; I was living in Italy back then. I think two of them came out in a single session. General anaesthesia wasn't even an option; the whole thing happened in a normal dentist's office, with local anaesthesia on the relevant half of the mouth. I distinctly recall the dentist complaining that for one of the teeth the roots were particularly strongly attached to the bone, and he had to push and lean on it, _hard_; it didn't really feel painful, except that my jaw was aching on the opposite side (the mostly non-sedated one) due to the pressure he put on it.

In fact, I think people and doctors alike tend to sedate much less in Italy - maybe not completely unjustified, judging from a few things I've read in this thread. Back then, the usual drilling and filling of tooth cavities mostly happened without any anaesthesia at all, local or otherwise. Frankly, that was quite painful whenever the drilling happened to touch a nerve, and I really don't feel like experiencing it again :-) and I think at least this has changed since.


> due to the pressure he put on it.

This is logical and within my experience.

An old dentist once told me that she had noticed patients complaining about pain on the other side in these situations. She didn't have an explanation for it at the time.


Variety matters a lot. If you pay 1000 trained labellers, you get 1000 POVs for a good amount of money, and you likely can't even think of 1000 good questions to have them ask. If you let 1,000,000 people give you feedback on random topics for free, and then pay 100 trained people to go through all of it and retain only the most useful 1%, you get ten times more variety for a tenth of the cost.

Of course the numbers are pretty random, but they give an idea of how these things scale. This is my experience with my own company's internal models (deep learning, but not LLMs), for which we had to buy data instead of collecting it. If you can't tap into data "from the wild" (in our case, for legal reasons) you can still get enough data (if measured in GB), but it's depressingly more repetitive, and that's not quite the same thing when you want to generalize.
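
Spelled out with those same made-up numbers (every figure below is a placeholder; only the ratios matter):

    labeller_cost = 20.0      # hypothetical $ per purpose-written example
    curator_cost = 0.002      # hypothetical $ per wild example triaged

    # Option A: 1000 trained labellers writing one example each.
    a_examples, a_cost = 1_000, 1_000 * labeller_cost

    # Option B: 1,000,000 free wild examples; curators keep the best 1%.
    b_collected = 1_000_000
    b_examples, b_cost = int(b_collected * 0.01), b_collected * curator_cost

    print(a_examples, a_cost)     # 1000 examples, ~$20000
    print(b_examples, b_cost)     # 10000 (more varied) examples, ~$2000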


If climate change were visible at that scale (tiny resolution between 0 and 40 degrees), we'd all have been boiled a while ago.

Still, you can see signs: the maximum temperature until 1990 or so seems to be around 35 degrees, while since then there are several peaks above that value, and in 2016 (?) it looks to be 38-39. It's certainly less visible in the lows, since the absolute lowest values appear to be in the 1990-2000 decade, but then again, the minima in the 2010-2020 decade all seem to be slightly higher than the minimum temperature in any other decade.

That said, there is massive downscaling involved at such a scale, so I wouldn't be too surprised if some details were just skipped and not visible. I wouldn't trust this interpretation much - if a visualization it needs to be, I'd rather plot a moving average with a window of at least 6 months (or even 1 year, to rule out seasonality entirely) and see whether that has an upward trend or not (I bet it does).

[EDIT] I now see the post below with the yearly averages since 1979. It does indeed seem that 1995-1997 were abnormally cold years, and also that 2010-2020 is the warmest decade since then (and likely since quite a bit longer). So the outlier analysis above seems to stand :-)
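
In case anyone wants to try the moving-average idea, a minimal pandas sketch (the file name and the "date"/"temp" column names are assumptions, not the dataset's actual schema):

    import pandas as pd

    df = pd.read_csv("durban_daily_temps.csv", parse_dates=["date"])  # hypothetical file
    df = df.set_index("date").sort_index()

    # a 1-year rolling mean removes the seasonal cycle entirely
    yearly_smooth = df["temp"].rolling("365D", min_periods=300).mean()

    # decade averages make a slow ~1 degree drift much easier to spot
    decade_means = df["temp"].groupby(df.index.year // 10 * 10).mean()
    print(decade_means)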


Tech lead for WEMC here - see https://tealtool.earth for straightforward charts of climate-related data for different countries and regions around the globe.

For temperature and a few other variables, it shows historical data from the EU Copernicus service (C3S) along with three different projected series out to 2100.

For CO2, it shows the latest historical data.

The charts are concerning, and I am sure my co-workers are not hell-bent on faking data to scare people just to get more funding; they work too much and go to too many meetings.


The signs are too small, in very noisy data, to justify panicking the world and its people.


I have not analyzed any data yet and the purpose of plotting the time series was to show an example of the data as a function of time. As others have already mentioned, the swing in Durban temperatures over the seasonal cycle is ~25°C while global temperature increases due to climate change so far are on the order of 1°C.

Plus, weather data tends to be quite noisy - just think how variable the weather can be day-to-day, and we're squishing 80 years of that into one plot. Also worth noting that different places may experience climate change differently. Some places may see the average temperature go up, some maybe only in the summer, so you'll have to look at averages. Some places may see more extreme summer highs, so then you can't just look at averages but at the average extremes or the tail end of the temperature distribution.

So it'll be hard to discern any climate change from just a cursory glance. I'm not saying it's there, just that it requires more analysis.
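
For instance, something along these lines would separate the "average went up" signal from the "hot tail went up" signal (again assuming a daily series with "date" and "temp" columns, which may not match the actual data):

    import pandas as pd

    df = pd.read_csv("durban_daily_temps.csv", parse_dates=["date"]).set_index("date")

    summer = df[df.index.month.isin([12, 1, 2])]   # southern-hemisphere summer
    by_year = summer["temp"].groupby(summer.index.year)

    print(by_year.mean())            # shift in the seasonal average
    print(by_year.quantile(0.95))    # shift in the hot tail, which can move differently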


Have you read any climate science? These are not the (only) numbers that the knowledge is based on.


I am an engineer, I read a lot and numbers are numbers, not a religion.


Huh, I wonder if climatologists might have based their analyses on more than just this single time series. No way of knowing.


IIRC, GPT-4 would actually be a bit _smaller_ to visualize than GPT-3. Details are not public, but from the leaks, GPT-4 (at least some by-now-old version of it) was a mixture of experts, with every expert having around 110B parameters [1]. So, while the total number of parameters is bigger than GPT-3's (1800B vs. 175B), it is "just" 16 copies of a smaller (110B-parameter) model. So if you wanted to visualize it in any meaningful way, the plot wouldn't grow bigger - or it would, if you included all the different experts, but they are just copies of the same architecture with different parameters, which is not all that useful for visualization purposes.

[1] https://medium.com/@daniellefranca96/gpt4-all-details-leaked...


Mixture of Experts is not just 16 copies of a network; it's a single network where, for the feed-forward layers, tokens are routed to different experts, while the attention layers are still shared. There are also interesting choices around how the routing works, and I believe the exact details of what OpenAI is doing are not public. In fact, I believe someone making a visualization of that would dispel a ton of myths around what MoEs are and how they work.
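
A rough PyTorch sketch of the routed feed-forward idea (top-1 routing for brevity; this is a generic illustration, not OpenAI's architecture, whose details are indeed not public):

    import torch
    import torch.nn as nn

    class MoEFeedForward(nn.Module):
        # Only the feed-forward block is split into experts; the attention
        # layers elsewhere in the transformer stay shared.
        def __init__(self, d_model=512, d_ff=2048, n_experts=16):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):                      # x: (tokens, d_model)
            scores = self.router(x).softmax(-1)    # routing probabilities per token
            choice = scores.argmax(-1)             # top-1 expert per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = choice == i
                if mask.any():
                    # scale by the routing probability so the router gets gradients
                    out[mask] = expert(x[mask]) * scores[mask][:, i].unsqueeze(-1)
            return out

    tokens = torch.randn(10, 512)
    print(MoEFeedForward()(tokens).shape)          # torch.Size([10, 512])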


The weights are different, because the model is different.

As jzbontar mentions below, the crucial point is that the random noise mask is the same. Diffusion models are trained to turn random noise into an image, and they are deterministic at that - the same noise leads to the same image.

What the authors did here was find a smart way of training a new model able to "simulate" in a single step what diffusion achieves in many; to do so, they took many triplets of (prompt, noise, image) generated starting from random noise and a (fixed) pretrained Stable Diffusion checkpoint. The new model is trained to replicate those results.

So, it is surprising that this works at all at creating meaningful images, but it would be _really_ surprising (i.e. probably impossible) if it generated meaningful images seriously different from the ones produced by the model it was distilled from!
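
For the "same noise leads to the same image" part, this is roughly what collecting one such triplet looks like if you use the diffusers library (an assumption on my side; the authors' data-collection code may look quite different):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

    prompt = "a photo of a red fox in the snow"    # arbitrary example prompt
    noise = torch.randn(1, 4, 64, 64)              # fixed starting latents for SD 1.5

    # With a deterministic scheduler (e.g. the default PNDM, or DDIM), running the
    # pipeline twice from the same latents gives identical images, so
    # (prompt, noise, image) is a well-defined training triple for the student.
    image = pipe(prompt, latents=noise.clone(), num_inference_steps=50).images[0]
    image_again = pipe(prompt, latents=noise.clone(), num_inference_steps=50).images[0]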


Oh, the images and prompts we see in the article are from the training data?

Pardon my ignorance ...

Does the MIT model then not work as a general text-to-image model, able to generate novel images from arbitrary new text prompts it has not seen before?


Nothing to pardon, asking questions is always the right thing to do :-) I also didn't look into the paper in great detail, and although I'm quite sure I am not fooling myself, still take this with a grain of salt.

My understanding is that this paper by MIT doesn't train any new model from scratch. It takes a pretrained model (e.g. StableDiffusion), which however is trained to do only "a small step": you fix a number of steps (e.g. 1000 in the MIT paper) and ask the model to predict how to "enhance" an image by a certain step (e.g. of size 1/1000); the constants are adjusted so that, if the model were "perfect", you would get from pure white noise to an image in exactly the number of steps you set. If I remember correctly how diffusion works, in theory you could set this number to any value, including 1, but in practice you need several hundred steps to get a good result, i.e. the original StableDiffusion model is only able to fit a small adjustment.

This new paper shows how to "distil" the original model (in this case, StableDiffusion) into another model. However, unlike typical distillation, which is used to compress a big model into a smaller one, in this case the distilled model is basically the same size as the one you start with; but it has been trained with a different objective, namely to transform random noise into the prediction that the original model (StableDiffusion) would make in 1000 steps. To do so, it is trained on a very large number of triples (text, noise, image). But I don't think you can incorporate into this training procedure other "real" images that were not generated by the model you start with, because you don't have a corresponding noise (abstractly, there is no such concept as the "corresponding noise" for a given image, because the relation noise -> image depends on the specific model you start with, and this map is nowhere near invertible, since not all images can be generated by StableDiffusion, or any other model).

Once the model is trained, you can of course give it a new prompt and, in theory, it should generate something rather similar to what StableDiffusion would generate with the same prompt (hopefully, the examples displayed on their web page are not from the training set! Otherwise it would be totally useless). But you should never obtain something "totally different" from what StableDiffusion would give you, so in that sense it's not "general"; it is "just" a model that imitates StableDiffusion very well while being much faster. Which is already great, of course :-)
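
Schematically, the distillation objective then looks something like this (a sketch of the general idea only; the paper's actual loss and architecture differ, and "student" here is just any image-generation network with this interface):

    import torch
    import torch.nn.functional as F

    def distill_step(student, optimizer, batch):
        # batch holds precomputed (noise, prompt embedding, teacher image) triples,
        # where the teacher image came from running StableDiffusion for ~1000 steps.
        noise, prompt_emb, teacher_image = batch
        pred = student(noise, prompt_emb)          # one forward pass, no iterative sampling
        loss = F.mse_loss(pred, teacher_image)     # stand-in loss; the paper uses a richer one
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()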

