I used it to train a CNN-based CIFAR10 classifier, which worked well (only a tiny bit worse than Adam, and the difference might go away with hyperparameter tuning), but the optimizer failed completely (loss -> infinity) when training a U-Net for an image segmentation task. I had to increase eps to 1e-4 and decrease lr to 1e-3 to keep it from exploding, but that made convergence very slow.
My summary is that the memory savings might be great if it works, but it does not work everywhere.
We are indeed Norwegian! The domain is owned by a squatter now. He wants EUR 3000 to release it back to us, which is simply too much right now. So I wish this HN post could be changed to point to GitHub instead of the squatter's URL.
It is great to see a limitations section. What would be even more honest is a very large list of videos generated without any cherry-picking, so the average user could judge the expected quality. Anyway, the lack of more videos suggests that there might be something wrong somewhere.
> Directly charge users for it. This is effectively a non-starter, because the vast majority of people aren't willing to pay for it.
I would gladly pay for Firefox, but I only found a way to donate to Mozilla, which also finances many other things that I am not interested in.
> Insert ads or sell user data - users also hate this, it's probably not legal in the EU, and it may not be legal in most of the US in the future either.
Ads are legal, but "selling user data" is trickier. Many news websites currently make access conditional on either accepting tracking or paying a monthly fee. As far as I know, there has been no ruling yet on whether that is legal.
> Use the browser as a platform to push some product that does make money - a non-Google search engine? A social network? An LLM interface?
Firefox tried to sell a VPN product that way, but it was not priced competitively.
I agree. Below are a few errors. I also asked ChatGPT to check the summaries, and it found all the errors (and even made up a few more, which weren't actual errors but just points not expressed with perfect clarity).
Spoilers ahead!
First novel: The Trisolarans did not contact Earth first. It was the other way round.
Second novel: Calling the conflict between humans and Trisolarans a "complex strategic game" is a bit of a stretch. Also, the "water drops" do not disrupt ecosystems. I am not sure whether "face-bearers" is an accurate translation. I've only read the English version.
Third novel: Luo Ji does not hold the key to the survival of the Trisolarans, and there were no "micro-black holes" racing towards Earth. The Trisolarans were also not shown colonizing other worlds.
I am also not sure whether Luo Ji faced his "personal struggle and psychological turmoil" in this novel or in an earlier one. He was certainly very sure of his role by the end; even the Trisolarans judged his deterrence at over 92%.
Yeah describing Luo Ji as having "struggles with the ethical implications of his mission" is the biggest whopper.
He's like God's perfect sociopath. He wobbles between total indifference to his mission and interplanetary murder-suicide, and the only things that seem to really get to him are a stomachache and being ghosted by his wife.
And this example does not even illustrate long-context understanding well, since smaller Qwen2.5 models can already recall parts of the Three-Body Problem trilogy without pasting the three books into the context window.
And multiple summaries of each book (in multiple languages) are almost definitely in the training set. I'm more confused how it made such inaccurate, poorly structured summaries given that and the original text.
Although, I just tried with normal Qwen 2.5 72B and Coder 32B and they only did a little better.
It seems like a very difficult problem to produce a response based just on the text given and not on past training. An LLM that could do that would be considerably more advanced than what we have today.
Though I would say humans would have difficulty too -- say, having read The Three-Body Problem before, then reading a slightly modified version (without being aware of the modifications), and having to recall specific details.
This problem is poorly defined; what would it mean to produce a response JUST based on the text given? Should it also forgo all the logic skills and intuition gained in training because they are not in the text given? Where in the N-dimensional semantic space do we draw a line (or rather, a surface) between general, universal understanding and specific knowledge about the subject at hand?
That said, once you have defined what is required, I believe you will have solved the problem.
I am happily running qwen2.5-coder-7b-instruct-q3_k_m.gguf with a context size of 32768 via llama.cpp [2] on an RTX 3060 Mobile with 6GB VRAM. With 16GB VRAM, you could use qwen2.5-7b-instruct-q8_0.gguf, which is basically indistinguishable from the fp16 variant.
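For what it's worth, roughly the same setup can be sketched through the llama-cpp-python bindings instead of the llama.cpp CLI (that choice, the model path, and the prompt below are just placeholder assumptions):

    # Rough sketch using the llama-cpp-python bindings; path and prompt are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./qwen2.5-coder-7b-instruct-q3_k_m.gguf",
        n_ctx=32768,      # context size mentioned above
        n_gpu_layers=-1,  # offload all layers to the GPU
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])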
A more restrictive TLD list would have prevented this, but I certainly don't want to be the one to add new TLDs all the time, so I can see why the code looks like it does.
I was wondering why Figure 1 shows a HumanEval score of 61.6 for Qwen2.5-Coder-7B while Table 1 shows a score of 88.4, i.e. better than this new model's score of 66.5.
The reason is that those are actually two different models (Qwen2.5-Coder-7B-Base with 61.6, Qwen2.5-Coder-7B-Instruct with 88.4).
You can do math with word embeddings. A famous example (which I now see has also been mentioned in the article) is to compute the "woman vector" by subtracting "man" from "woman". You can then add the "woman vector" to e.g. the "king" vector to obtain a vector which is somewhat close to "queen".
To adapt this to your problem of ignoring writing style in queries, you could collect a few text samples with different writing styles but the same content to compute a "style direction". Then, when you do a query for some specific content, subtract the projection of your query embedding onto the style direction to eliminate the style:
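Sketched with numpy, it could look something like this (the embed() function here is only a stand-in for whatever embedding model you actually use, and the sample texts are made up):

    import hashlib
    import numpy as np

    def embed(text: str, dim: int = 64) -> np.ndarray:
        """Placeholder embedding: a deterministic pseudo-random unit vector per text.
        In practice, call your real embedding model here."""
        seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
        v = np.random.default_rng(seed).standard_normal(dim)
        return v / np.linalg.norm(v)

    def remove_direction(vec: np.ndarray, direction: np.ndarray) -> np.ndarray:
        """Subtract the component of vec that lies along direction."""
        d = direction / np.linalg.norm(direction)
        return vec - np.dot(vec, d) * d

    # Pairs with the same content but different writing styles (toy data).
    formal = [embed(t) for t in [
        "Dear Sir or Madam, please find the quarterly report attached.",
        "We regret to inform you that the shipment has been delayed.",
    ]]
    casual = [embed(t) for t in [
        "hey, here's the quarterly report",
        "heads up, the shipment is late",
    ]]

    # "Style direction": mean difference between styled variants of the same content.
    style_direction = np.mean([f - c for f, c in zip(formal, casual)], axis=0)

    # Strip the style component from a query embedding before searching.
    query = embed("report on delayed shipments")
    query_no_style = remove_direction(query, style_direction)
    query_no_style /= np.linalg.norm(query_no_style)  # re-normalize for cosine search

The mean of the pairwise differences is the simplest possible estimate of the direction; with more samples you could instead take e.g. the first principal component of the differences.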
I suspect this also works with text embeddings, but you might have to train the embedding network in some special way to maximize the effectiveness of embedding arithmetic. Vector normalization might also be important, or maybe not. Probably depends on the training.
Another approach would be to compute a "content direction" instead of a "style direction" and eliminate every aspect of a query that is not content. Depending on what kind of texts you are working with, data collection for one or the other direction might be easier or have more/fewer biases.
And if you feel especially lazy when collecting data to compute embedding directions, you can generate texts with different styles using e.g. ChatGPT. This will probably not work as well as carefully handpicked texts, but you can make up for it with volume to some degree.
Interesting, but your hypothesis assumes that 'tone' is one-dimensional, i.e. that there is a single axis you can remove. I think tone is very multidimensional; I'd expect to be removing multiple 'directions' from the embedding.
No, I don’t think the author is saying one-dimensional - the vectors are represented by magnitudes in almost all of the embedding dimensions.
They are still a “direction” in the way that [0.5, 0.5] in x,y space points at a 45-degree angle and has a magnitude of around 0.7.
So of course you could probably define some other vector space where many of the different labeled vectors are translated to magnitudes in the original embedding space, letting you do things like have a “tone” slider.
I think GP is saying that GGP assumes "tone" is one direction, in the sense there exists a vector V representing "tone direction", and you can scale "tone" independently by multiplying that vector with a scalar - hence, 1 dimension.
I'd say this assumption is both right and wrong. Wrong, because it's unlikely there's a direction in embedding space corresponding to a platonic ideal of "tone". Right, because I suspect that, for a sufficiently large embedding space (on the order of what goes into current LLMs), any continuous concept we can articulate will have a corresponding direction in the embedding space that's roughly as sharp as our ability to precisely define the concept.
I would say rather that the "standard example" is simplified, but it does capture an essential truth about the vectors. The surprise is not that the real world is complicated, that nothing is simply expressible as a vector, or that treating it as such doesn't 100% work in every way in every circumstance all of the time. That's obvious. Everyone who might work with embeddings gets it, and if they don't, they soon will. The surprise is that it does work as well as it does and does seem to be capturing more than a naive skepticism would expect.
You could of course compute multiple "tone" directions for every "tone" you can identify and subtract all of them. It might work better, but it will definitely be more work.
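If you go down that route, one detail worth handling is that the individual "tone" directions will usually overlap, so orthonormalizing them first (e.g. via a QR decomposition) avoids subtracting shared components more than once. A minimal numpy sketch, assuming you already have the direction vectors from somewhere:

    import numpy as np

    def remove_directions(vec: np.ndarray, directions: list[np.ndarray]) -> np.ndarray:
        """Project out several 'tone' directions from an embedding at once."""
        D = np.stack(directions, axis=1)   # shape: (dim, n_directions)
        Q, _ = np.linalg.qr(D)             # orthonormal basis of the tone subspace
        return vec - Q @ (Q.T @ vec)       # subtract the projection onto that subspace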