> Yi-34B-Chat model landed in second place (following GPT-4 Turbo), outperforming other LLMs (such as GPT-4, Mixtral, Claude) on the AlpacaEval Leaderboard (based on data available up to January 2024).
> Yi-34B model ranked first among all existing open-source models (such as Falcon-180B, Llama-70B, Claude) in both English and Chinese on various benchmarks, including Hugging Face Open LLM Leaderboard (pre-trained) and C-Eval (based on data available up to November 2023).
- Just saying that it came 2nd is quite misleading; the difference in score is significant.
- Not sure what's up with this benchmark, I've never seen GPT-4-Turbo vs GPT-4 performing so differently.
- The Snorkel model is impressive with just 7B parameters. The Yi authors claim that their success is based on good training data cleaning. This seems to be key at least for this benchmark. Snorkel has also always been all about that, using programmatic methods to generate lots of quality training data.
Model creators train models on data that includes open-source benchmarks, either intentionally to achieve better scores or inadvertently through leaks from various sources.
Anytime you see that, you should assume the newer models might have been trained on either the benchmarks themselves or something similar to them. If I were an evaluator, I'd keep a secret pile of tests that I know aren't in any LLM's training data, do the evaluations privately, and not publish raw scores either: just the ranking plus how far apart the models are.
The best test of these models is people who want to use AI to solve real problems actually attempting to do so with various models. If it works, report that it worked. Also, publish the task-and-result pairs permissively when possible, both to evaluate this and to use it for fine-tuning.
1) Your use of the Yi Series Models must comply with the Laws and Regulations as
well as applicable legal requirements of other countries/regions, and respect
social ethics and moral standards, including but not limited to, not using the
Yi Series Models for purposes prohibited by Laws and Regulations as well as
applicable legal requirements of other countries/regions, such as harming
national security, promoting terrorism, extremism, inciting ethnic or racial
hatred, discrimination, violence, or pornography, and spreading false harmful
information.
2) You shall not, for military or unlawful purposes or in ways not allowed by
Laws and Regulations as well as applicable legal requirements of other
countries/regions, a) use, copy or Distribute the Yi Series Models, or b) create
complete or partial Derivatives of the Yi Series Models.
“Laws and Regulations” refers to the laws and administrative regulations of the
mainland of the People's Republic of China (for the purposes of this Agreement
only, excluding Hong Kong, Macau, and Taiwan).
Not really very open. Though it makes me wonder what the model might say about Tiananmen Square, Uyghurs, reeducation camps, or any other thing you're not really supposed to talk about in China.
Is there a benchmark for bias in model outputs? No doubt China has one, somewhere, except it's not skewed towards prevention.
Weights are trained on copyrighted data. I think that ethically, weights should be public domain unless all of the data [1] is owned or licensed by the training entity.
I'm hopeful that this is where copyright law lands. It seems like this might be the disposition of the regulators, but we'll have to wait and see.
In the meantime, maybe you should build your product in this way anyway and fight for the law when you succeed. I don't think a Chinese tech company is going to find success in battling a US startup in court. (I would also treat domestic companies with model licenses the same way, though the outcome could be more of a toss up.)
"Break the rules."
"Fake it until you make it."
Both idioms seem highly applicable here.
[1] I think this should be a viral condition. Finetuning on a foundational model that incorporates vast copyrighted data should mean downstream training also becomes public domain.
I gave it 3 tries and each time, Yi picked one of the cars as the winner.
I've been watching for many months now how LLMs have gotten better and better at solving it. Many still struggle with it, but the top ones nowadays mostly get it right.
My son's answer was "Neither win", and it took him 1 minute and 24 seconds, using no pre-defined algorithm or heuristic.
He said his process of thoughts was:
"I figured it would take 10 hours for car A to finish 100 miles and it would take twice that long for car B. Since Car B is already halfway there when car A starts, then they would arrive together"
I, as a 40-year-old man, approached it intentionally naively (e.g., I did not go looking for an optimal solver first) by making a drawing and attempting to derive the algorithm. It took me ~3 minutes to come to the same conclusion, but at the end I had a series of equations rather than an algebraic proof.[1]
So now you have a human child reference metric if you want it.
the "son" model might just be the future of LLMs!
Please release "son" and the weights used to train him it on github (with a permissive license, if possible)
Interestingly, GPT-4 also fails to correctly solve this prompt, choosing car A each time after multiple tries for me. I tend to find that models struggle with such logic puzzles when using less common phrasing (e.g., two cars "having" a race instead of participating in one, "headstart" instead of "head-start", etc).
GPT-4 correctly solved the problem when it was reworded to: "There is a 100 mile race with two participants: car A and car B. Car A travels at 10 miles per hour but does not begin driving immediately. Car B travels at 5 miles per hour and is given a 10 hour head-start. After 10 hours, car A begins to move as well. Who wins the race?"
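For what it's worth, the arithmetic is easy to check directly. A quick sketch using the numbers stated in the puzzle:

```python
# Quick arithmetic check of the race puzzle, using the numbers stated above.
race_distance = 100       # miles
speed_a, speed_b = 10, 5  # miles per hour
head_start_b = 10         # hours car B drives before car A starts

# Measure time from the moment car A starts moving.
time_a = race_distance / speed_a                             # 100 / 10 = 10 hours
time_b = (race_distance - speed_b * head_start_b) / speed_b  # (100 - 50) / 5 = 10 hours

print(time_a, time_b)  # 10.0 10.0 -> both finish at the same time, so neither wins
```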
You can tell ChatGPT that it's brilliant at reasoning, ask it to rephrase the problem in its own words, and then have it solve it while avoiding any traps. I have special instruction sets for inducing these chain-of-thought behaviors. There is more output in the end, but it helps the model think more carefully before coming to a conclusion.
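Something along these lines works as a wrapper; the exact wording below is illustrative only, not the actual instruction set:

```python
# A hypothetical "rephrase, then solve" instruction wrapper in the spirit of the
# comment above. The wording is illustrative, not the commenter's actual prompts.
COT_PREAMBLE = (
    "You are brilliant at careful reasoning. First restate the problem in your "
    "own words, noting any unusual phrasing or possible traps. Then work through "
    "it step by step, showing intermediate calculations. Only state the final "
    "answer at the very end."
)

def wrap_prompt(problem: str) -> str:
    """Prepend the chain-of-thought instructions to a raw problem statement."""
    return f"{COT_PREAMBLE}\n\nProblem: {problem}"
```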
On one hand, I don’t really understand why anyone would expect an LLM to solve logic puzzles. The only way it can do so is not through reasoning, but by having been trained on a structurally similar puzzle.
On the other hand, it does feel fun that the top ones appear to solve it, and I understand why it feels cool to have a computer that appears to be capable of solving these puzzles. But really, I think this is just specificity in training. There is no theoretical or empirical basis for LLMs having any reasoning capability. The only reason a model can solve it is because the creators of these top models specifically trained them on problems like this to give the appearance of intelligence.
There might be no reasoning in a single pass which outputs a single token. But in the loop where the output of the LLM repeatedly gets fed back into its input, reasoning is clearly happening:
The LLMs lay out how to go about figuring out the answer, do a series of calculation steps and then come up with an answer.
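A minimal sketch of that loop, with the actual model stubbed out as a placeholder:

```python
# Minimal sketch of the feedback loop described above: every generated token is
# appended to the context and fed back in as input, so earlier "thinking" tokens
# condition later ones. `next_token` is a placeholder, not a real model.
def next_token(context: list[str]) -> str:
    """One forward pass of the LLM that samples a single token (stubbed out here)."""
    raise NotImplementedError("plug in a real model call here")

def generate(prompt: list[str], max_new_tokens: int = 256) -> list[str]:
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)  # a single pass outputs just one token
        if token == "<eos>":
            break
        context.append(token)        # the output becomes part of the next input
    return context
```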
If you add "Please answer in just one short sentence." to the prompt, even the top ones get it wrong.
Yep, humans too have to think before answering most non-trivial questions, and especially the ones that include calculations.
So it seems "obvious" that we should try to to give LLMs too some time to think before answering, for example with the popular methods of asking for step-by-step thinking, thinking out loud, and only giving the final answer at the end, and also asking it to proofread and correct it's answers at the end all can help with that.
Pause tokens (thinking tokens) are also an interesting method to achieve that and seem to have a positive effect on performance:
> There is no theoretical or empirical basis for LLMs having any reasoning capability.
Deep learning models are specifically designed for automatic pattern recognition. That includes patterns of reasoning and problem solving.
> The only reason it can solve it is because side the creators of these top models specifically trained the models on problems like this to give the appearance of intelligence.
That's not how deep learning works, and not how machine learning works in general. The models can automatically recognize patterns of reasoning and then apply those patterns to problems they have never seen before.
> The only way it can do so is not through reasoning, but by having been trained on a structurally similar puzzle.
This is a fundamental misunderstanding of how it works. The large deep learning models have 100+ layers, modelling extremely abstract features of the data, which include abstract patterns of problem solving and reasoning. They are not simply regurgitating training examples.
> There is no theoretical or empirical basis for LLMs having any reasoning capability.
Yes there is. Learning to predict the next token implies a lot of things, among which is also logical reasoning. The chain-of-thought approach shows that when you stimulate this behavior, you get higher accuracies.
"The paper, titled "Mapping Part-Whole Hierarchies into
Connectionist Networks" (1990), demonstrated how neural networks can learn to represent conceptual hierarchies and reason about relations like family trees.
Specifically, Hinton showed that by training a neural network on examples of family relationships (parent-child, grandparent-grandchild, etc.), the network was able to accurately model the inherent logical patterns and reason about new family tree instances it had not encountered during training.
This pioneering work highlighted that instead of just memorizing specific training examples, neural networks can extract the underlying logical rules and reasoning patterns governing the data. The learned representations captured abstract concepts like "parent" that enabled generalizing to reason about entirely new family tree configurations."
They did not, in part because it would reveal the data-filtering routines (particularly the political censorship - Chinese LLM papers sometimes mention the ban list but never reveal it), and also in part because it might reveal things they'd rather keep secret.
For example, Bytedance has already been caught using the OA API to generate data for their models because they are having such a hard time catching up to OA - and evading bans for doing that, and also instructing employees on how to lie & cover it up: https://www.theverge.com/2023/12/15/24003151/bytedance-china...
Do you think that a small Chinese startup like 01.AI, which by their own admission had to "bet the farm" to buy enough GPUs to train the Yi models at all https://www.bloomberg.com/news/articles/2023-11-05/kai-fu-le... , and which were completely silent about cloning the American LLaMA architecture until people analyzed the released checkpoints and noticed it looked awfully familiar, is going to be above such tactics...? In this economic/geopolitical context? Especially when everyone seems to be doing it, not just Bytedance?* (01.AI claims that, the architecture aside, they didn't simply further train LLaMA models but trained from scratch. You can decide for yourself how much you are willing to believe this.) I wouldn't bet a lot of money on it, and that's why I don't expect to see any large comprehensive data releases from 01.AI for the Yi models.
* This is one of my theories for why so many disparate models by so many different groups all seem to weirdly converge on the same failure modes like 'write a non-rhyming poem', and why GPT-3.5, and then GPT-4, seemed to be oddly difficult to surpass, as if there were some magnetic force which made reaching near 3.5/4 quality easy for 'independent' models, but then surpassing somehow difficult. Everyone is lying or mistaken about 3.5/4 data getting into their corpus, and the sugar-rush of imitation learning fools you into thinking you're making a lot of progress, even when your overall approach sucks. (As Andrej Karpathy notes, neural nets want to work, and so even if you have serious bugs in your code, they will still work pretty well - and simply permanently fall short of their true potential. Cautionary recent example: https://twitter.com/karpathy/status/1765473722985771335 )
> 01.AI claims that, the architecture aside, they didn't simply further train LLaMA models but trained from scratch. You can decide for yourself how much you are willing to believe this.
You can't hide this. The latent space remains mostly fixed after pre-training. It all depends on the seed for the initial random init. Further pre-training won't move it enough. Because of this property, you can even average two fine-tunes of the same parent model, but never models trained from different seeds.
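A rough sketch of the kind of parameter averaging being described, assuming plain PyTorch state-dict checkpoints with identical keys, shapes, and floating-point parameters:

```python
import torch

def average_finetunes(path_a: str, path_b: str, out_path: str) -> None:
    """Average two fine-tunes of the *same* parent model, parameter by parameter.

    As argued above, this only works when both checkpoints descend from the same
    pre-trained init; averaging models trained from different seeds gives noise.
    """
    sd_a = torch.load(path_a, map_location="cpu")
    sd_b = torch.load(path_b, map_location="cpu")
    assert sd_a.keys() == sd_b.keys(), "checkpoints must share an architecture"
    # Assumes floating-point parameters throughout.
    merged = {name: (sd_a[name] + sd_b[name]) / 2 for name in sd_a}
    torch.save(merged, out_path)
```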
I don't know that anyone has properly analyzed this, nor how robust such methods are if one is trying to cover it up. Also, I doubt anyone has analyzed the scenario where the warm-started model is then extensively trained for trillions of tokens (possibly with a cyclical LR), particularly in Chinese - the Chinese and English latent spaces are not perfectly aligned and I'd expect that to change things a lot. (The point of this would be that 'cheating' by warm-starting should let you avoid a lot of training instabilities and issues early in training, and may get you better quality at the end.)
Sounds like an interesting direction of research. There is a tool called mergekit, and the models produced by merging different models are called frankenmerges. Search for "frankenmerge" and you can find a lot of interesting results, models, and discussions on what works and what doesn't. Might be a good idea to check previous experiments with these keywords.
Yi 34B Chat has not done well on my new NYT Connections benchmark and it's only in the 22nd place on the LMSYS Elo-based leaderboard (151 Elo below GPT 4 Turbo). It's doing better in Chinese. When it comes to models with open-sourced weights, Qwen 72B is clearly stronger.
Ooh I also use connections as a benchmark! It tends to favour things with 'chain of thought' style reasoning in the training mix somewhere since directly producing the answer is hard. Do you have public code you could share?
In the past year or so, arxiv has become more of an advertising platform than a scientific resource. The title of this “paper” makes that quite clear: their company URL is right there!
Yi-34B is the LLM used by LLaVA-1.6 (also known as LLaVA-NeXT), which is by far the best open-source large multimodal model; demo: https://llava.hliu.cc/
Seeing models like this work so well gives me hope that mobile-first LLMs for things like better voice-to-text and typing prediction will not just 'work' in 2-3 years but will actually not kill your battery either.
If it's fast enough to be useful, it also physically can't use that much power. Your phone's CPU and GPU have a maximum wattage they can pull at any one time, and if inference only runs for a few seconds, that bounds the energy it can use.
If it maxes out all cores and memory for 30 minutes, then it won't really work for anything.
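Some rough back-of-the-envelope numbers (the power draw and battery capacity below are assumed round figures, not measurements of any particular phone):

```python
# Back-of-the-envelope battery math; both constants are assumptions, not measurements.
SOC_POWER_W = 8.0   # assumed peak draw of a phone SoC during inference, in watts
BATTERY_WH = 15.0   # assumed battery capacity, in watt-hours

def battery_fraction(seconds: float) -> float:
    """Fraction of the battery consumed by running inference flat-out for `seconds`."""
    return (SOC_POWER_W * seconds / 3600) / BATTERY_WH

print(f"{battery_fraction(5):.2%}")        # ~0.07%: a few-second response is negligible
print(f"{battery_fraction(30 * 60):.2%}")  # ~26.67%: pegging the SoC for 30 minutes is not
```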
MLC Chat is already on the App Store and allowed to be used. I haven't used Yi with it, but a quantized Mistral or Llama runs quite well on an iPhone 15. See https://llm.mlc.ai. "Apple GPT" is also rumored to be coming too.
It is processor- and therefore battery-intensive, but it already won't kill your battery inside of 30 minutes. Obviously it will be worse for resource usage than a regular app if it's always kept running by some OS-level process and set as the processing layer for every trivial thing, but it seems like a cheaper input handler could decide whether or not to promote some input up to being evaluated by an LLM.
I've also tried a few Android LLM apps, all of which ran for more than 30 minutes.
Current LLM models are not running constantly on the phones to drain your battery. They just run when responding to a prompt. It by no means consumes more battery than a heavy game.
I understand that all these new models are an attempt to catch up with GPT-4, but frankly speaking, in the current shape and form, they're almost entirely useless.
I frantically tried everything available on Groq to improve the performance of my GPT-4-based chatbot - they're incomparably bad - and the more of them I see, the more I believe OpenAI fundamentally has no competition at all at the moment.
The above is no exception; it's also pretty bad (IMHO worse than GPT-3.5).
Potentially interesting on the alignment front: In my experience the yi-6b model running on ollama is more likely to refuse politically sensitive queries (relating to Tiananmen Square, Peng Shuai’s disappearance, etc) when asked in Chinese, and more likely to provide information when asked in English. I wonder if this difference falls out naturally from available training data, is a deliberate internationalization choice, or is just noise from the queries I happened to run.
I noticed similar behaviour in an older model (Skywork 13B) a few months back. When asked in Chinese, it would politely say that nothing of note occurred when responding to queries about Tiananmen Square, etc. In English, it would usually respond truthfully. It was deliberate in the case of Skywork, based on their model card (https://huggingface.co/Skywork/Skywork-13B-base):
> We have developed a data cleaning pipeline with great care to effectively clean and filter low-quality data and eliminate harmful information from text data.
That's a huge jump from that line in the model card to it being intentional on the model creators' part.
China censors those events. They pre-trained with a specific focus on Chinese text, and integrated more native Chinese text than most models do.
It doesn't require any additional filtering on their part for the model to reflect that, and if anything the fact that the events are mentioned in English implies the opposite of your hypothesis.
If they were going to filter Tiananmen Square, the lift to filter it in English would not be any higher.
This may be a useful workaround, but it also forms the strongest argument I've seen so far against claims that LLMs do something like "understanding" or have "an underlying world model". Maybe whether models know the same facts in different languages, especially across politically controversial topics, would make a good benchmark to evaluate that.
I wonder if you could use the multilingual capabilities to work around its own censorship? I.e., what would happen if you asked it to translate the query into English, asked the question in English, and then asked it to translate the answer back into Chinese?
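A hypothetical sketch of that round trip; `chat` here is a stand-in for however you actually query the model:

```python
# Hypothetical round-trip sketch of the idea above. `chat` is a stand-in for
# whatever interface you use to query the model (an ollama call, an HTTP API, ...).
def chat(prompt: str) -> str:
    raise NotImplementedError("plug in the actual model call here")

def ask_via_english(query_zh: str) -> str:
    query_en = chat(f"Translate the following into English. Output only the translation:\n{query_zh}")
    answer_en = chat(query_en)  # the model appears more forthcoming when asked in English
    return chat(f"Translate the following into Chinese. Output only the translation:\n{answer_en}")
```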
In theory couldn’t you download the application but then it just makes API calls to OpenAI?
I am trying to use Hugging Face and suspect I will get answers similar to GPT-4's. Can someone who knows how to download it ask it, for example, which country Justin Trudeau is the president of?
An AI model takes a specific input and gives a specific output. In the case of a language model, it takes in text and outputs text. It cannot just execute arbitrary code.
When you download from Hugging Face, you are downloading the parameters of the model. The parameters are basically just a bunch of numbers, and they were found by training the model.
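To make that concrete, here is a minimal sketch of running the downloaded weights locally with Hugging Face transformers; once the files are on disk, nothing in it calls any external API (the repo id is assumed to be the chat checkpoint discussed here):

```python
# Minimal local-inference sketch with Hugging Face transformers. Once the weights
# are downloaded, no external API is involved. The repo id below is assumed to be
# the Yi chat checkpoint discussed in this thread.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "01-ai/Yi-34B-Chat"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Which country is Justin Trudeau the president of?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```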
Since you haven't had a response to the second part, here it is:
Justin Trudeau is not the president of any country. He is, however, the Prime Minister of Canada, which he has been since November 4, 2015. The prime minister is the head of government in Canada's federal parliamentary democracy and constitutional monarchy.
I’m sorry, but you really need to read up on the fundamentals of what a model is, how weights work, and so on before making comments like this. No, the model weights are not secretly making calls to the OpenAI API.
I don't think this is an appropriate comment on HN. He was asking a question. Not fully grasping a technical topic (that most people don't really understand), shouldn't be mocked.
Given that this is a Chinese model, I'm genuinely curious whether researchers have been evaluating the risk that these models could be used for soft propaganda or similar purposes.
As others have reported, English and Chinese queries return different replies on topics that are not kosher in China.
What's the risk that such models could be used for nefarious purposes by providing propaganda/biased/incorrect/… responses that at a cursory glance seem factual?
It’s a fair question, but one we should be asking about all models, perhaps especially our own. It’s of course easier to see the propaganda of foreign cultures, and this should be investigated, but let’s not let ourselves believe that a model is more likely to contain propaganda because it is Chinese. It will just contain propaganda that is easier for us to see as propaganda.
Noam Chomsky and Edward Herman wrote extensively about propaganda in democratic societies in their 1988 book Manufacturing Consent. A nice introductory excerpt is here, and the first two or three paragraphs are enough to begin to see the argument:
Put as briefly as possible: propaganda in totalitarian societies is simpler. They just use force to remove people who say the wrong things, and state media to broadcast the “right things”. In democratic societies, institutional power still wants to protect itself, and this is achieved through more complex means, but it is nonetheless still rather effective.
> It’s a fair question, but one we should be asking about all models, perhaps especially our own. It’s of course easier to see the propaganda of foreign cultures, and this should be investigated, but let’s not let ourselves believe that a model is more likely to contain propaganda because it is Chinese. It will just contain propaganda that is easier for us to see as propaganda.
Yes, but in this particular case I'm coming from a viewpoint where I view China as a hostile power. So, at the moment, my worry is about that.
In the future, if the US slips into authoritarianism, which TBH it might depending on the outcome of the next election, what you note would become a very real problem.
So, putting it differently and more neutrally: is there any research being done on evaluating political and other biases in a model, or is it all just put in the bucket of hallucination?
The point of Chomsky’s work in this case is to show that authoritarianism does not make propaganda more or less likely, it just changes the means by which propaganda is created and reinforced. Chinese propaganda is easier to identify as a foreigner, but the propaganda of your home country has a much more significant effect on your life. The nature of living with pervasive propaganda is that it is hard to see or consider how your life would be different without the propaganda, and that’s what makes it so dangerous.
> is there any research being done on evaluating political and other biases in a model, or is it all just put in the bucket of hallucination?
It’s a better question, and again one that we should ask regardless of the model’s origins.
Wouldn't it have more to do with propaganda that reinforces and caters to your cognitive biases being less detectable than propaganda that doesn't? Even inside America, I'm pretty resistant to Fox News propaganda, but if CNN has any, it isn't registering much on my propaganda detectors.
Yes, certainly. When I say that foreign propaganda is easier to detect, it is because of the biases you mention.
And yes it makes sense that you don’t see the propaganda in the news you watch. My dad is a very smart man who has always liked Fox News and he genuinely can’t see the propaganda in it.
There are also ways that both Fox and CNN share the same views, and this narrow window of thought on the specific subjects they share in common is a key aspect of Chomsky's Propaganda Model. In areas that get the left and the right emotionally charged at each other, there can be a pretty wide range of opinions expressed. And then in areas that affect certain ways that state and corporate power influence our lives, there is often complete agreement and zero discussion of dissenting ideas. These ideas are presented as base assumptions about the fabric of society that are considered so obviously true as to melt into the backdrop of the discussion, and be invisible to regular viewers.
Those are the ideas that are so vital for us to understand. Nothing will ever change as long as we keep arguing about trans people in bathrooms, gay marriage, and Taylor Swift’s political opinions. That lack of change is what the institutions in power want. What we need to understand to truly change things is ideas of radical democracy, worker power, equitable distribution of resources and universal rights for people and nature.
Those subjects will never be seriously discussed on Fox News or CNN.
Fox News has gone straight tabloid, but CNN still looks somewhat like news. It's when they get into the tabloidy stories that I begin to see something is off. But CNN is just trying to make money, and ironically, is copying Fox News's formula of preaching to the choir and stoking biases to rake in that ad money (because at the end of the day, both really only care about making money, and changing minds rather than reinforcing them isn't very profitable).
At the very least, models will exhibit the bias present in the underlying training text, and on top of that there will be a bias imposed by those wanting to correct the unwanted bias present in the underlying training text, possibly swinging the pendulum too far to the other side.
I have the feeling you're asking something more specific: direct interference coming from politics, not just the natural "point of view" on various topics that is present in the Chinese training corpora and is understandably different from Western corpora.
Do you have anything specific in mind about something that you expect the Chinese government to feed as propaganda that is not already widely being sculpted into the chinese text corpora available on the internet?
> I have the feeling you're asking something more specific...
> Do you have anything specific in mind about something that you expect the Chinese government to feed as propaganda that is not already widely being sculpted into the chinese text corpora available on the internet?
I don't have anything specific, and it doesn't have to be different from the "chinese text corpora available on the internet"; it's just that these models can become yet another distribution channel for those corpora, especially if they are unknowingly/naively picked up and used as the foundation by others to build their offerings.
The PRC will definitely weaponize this for mass foreign propaganda, which up until now the PRC has been thoroughly deficient in, despite all the reeing about 50-centers on the western net. The pre-LLM reality is that PRC propaganda on western social media platforms has been very limited in scale, for the simple reason that they are not wasting valuable English fluency to shitpost on western platforms en masse: low 100s-1000s of accounts, most of which target the diaspora in spambot efforts, frequently in Chinese. Now that LLMs have made it cheap to spam passable English/foreign languages, I'd expect increased volumes of PRC propaganda on western social media, where anonymous posting is asymmetrically easier. But then again, they don't need a PRC LLM for that; there's plenty of "US bad" posting on western platforms from international audiences and the US herself.
> they are not wasting valuable English fluency to shitpost on western platforms en masse
How are the economics of that different for the Russian campaigns? Do they have a larger pool of English fluency to draw from, or is the urgency of the operation higher in their case?
A PRC person with English fluency good enough to blend in with native English speakers on a western platform has much better job opportunities. Even in the PRC, 50c posts are largely written by civil servants told to post a few perfunctory platitudes on domestic platforms. The MO is to overwhelm with spam, not to engage where the effort:return ratio is low. Even the Ministry of Foreign Affairs and most of the think tanks that also publish in English can rarely find people to write "casual" English. You'd have to write 1000s of "50c" comments to earn what 1 hour of English tutoring pays. The economics of it didn't make sense pre-LLM.
Does it mean that in the Russia of 10 years ago a person with the same English language skills would not be able to find a better job, or does it mean that the troll farms pay more? Or is it just patriotism?
(I genuinely would like to learn more about this topic)
I don't know much about RU specifically, but a cursory search suggests RU has 7/144 (about 5%) English speakers while the PRC has 10/1400 (about 0.7%), so that's the supply and demand happening behind the scenes. It's why there were so many expats with no Mandarin skills randomly making a living teaching English in the PRC. If a PRC citizen has fluent English and also speaks Mandarin, they can charge stupid amounts for private tutoring or work in the private sector. I would say most pro-PRC content comes from either tankies or the Chinese diaspora abroad, including non-Chinese citizens who don't agree with the MSM narrative. There are about 10M in Anglo/FVEY countries; that's a large enough base to have people who are organically pro-PRC, or rather not anti-PRC.
In terms of state-directed actions, RU/USSR has a longer history of and more experience with foreign subversive influence operations. They're not taking out full-page ads to run editorials in western newspapers like the PRC, which is just clunky. The bulk of PRC work is focused on the now largely defunct United Front presence in the west, or on PRC social media platforms, in Chinese, to target the diaspora. It's a running joke in the PRC that the PRC foreign propaganda department is staffed by incompetent old guards who can't even out-influence the anti-PRC Epoch Times.
https://github.com/01-ai/yi