I personally can't take any models from google seriously.
I was asking it about the Japanese Heian period and it told me such nonsensical information you would have thought it was a joke or parody.
Some highlights were "Native American women warriors rode across the grassy plains of Japan, carrying Yumi" and "A diverse group of warriors, including a woman of European descent wielding a katana, stand together in camaraderie, showcasing the early integration of various ethnicities in Japanese society"
Stuff like that is so obviously incorrect. How am I supposed to trust it on topics where such ridiculous inaccuracies aren't so obvious to me?
I understand there will always be an amount of incorrect information... but I've never seen something this bad. Llama performed so much better.
Those are most likely due to the system prompt, which tries to reduce bias (but ends up introducing bias in the opposite direction for some prompts, as you can see), so I wouldn't expect to see that happen with an open model where you control the entire system prompt.
Because I think it would be kinda hilarious: trying to make people believe they are very progressive by biasing the model to such an extreme, and then in the real world nothing changes. Also because I believe the model is the result of a kind of white guilt mentality that some people seem to have; one person who led the development of Gemini tried to defend it on Twitter yesterday, and he is a white man.
Of all the very very very many things that Google models get wrong, not understanding nationality and skin tone distributions seems to be a very weird one to focus on.
Why are there three links to this question? And why are people so upset over it? Very odd, seems like it is mostly driven by political rage.
Exactly. Sure this particular example is driven by political rage, but the underlying issue is that the maintainers of these models are altering them to conform to an agenda. It's not even surprising that people choose to focus on the political rage aspect of it, because that same political rage is the source of the agenda in the first place. It's a concerning precedent to set, because what other non-political modifications might be in the model?
Well, every model is altered to conform to an agenda. You will train it on data, which you have personally picked (and is therefore subject to your own bias), and you'll guide its training to match the goal you wish to achieve with the model. If you were doing the training, your own agenda would come into play. Google's agenda is to make something very general that works for everyone.
So if you're trying to be as unbiased as humanly possible, you might say, just use the raw datasets that exist in the world. But we live in a world where the datasets themselves are often biased.
Bias in ML and other types of models is well-documented, and can cause very real repercussions. Poor representation in datasets can cause groups to be unfairly disadvantaged when an insurance premium or mortgage is calculated, for example. It can also mean your phone's ML photography system doesn't expose certain skin colors very well.
Even if it was trained with a statistically representative dataset (e.g. about 2/3 of the US is white), you want your model to work for ALL your customers, not just 2/3 of them. Since ML has a lot to do with statistics, your trained model will see "most of this dataset is white" and the results will reflect that. So it is 100% necessary to make adjustments if you want your model to work accurately for everyone, and not just the dominant population in the dataset (there's a toy sketch of this below).
Even if we aren't using these models for much yet, a racist AI model would seriously harm how people trust and rely on these models. As a result, training models to avoid bias is 100% an important part of the agenda, even when the agenda is just creating a model that works well for everyone.
Obviously, that's gone off the rails a bit with these examples, but it is a real problem nonetheless. (And training a model to understand the difference between our modern world and what things were like historically is a complex problem, I'm sure!)
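To make the "adjustments" point concrete, here's a toy sketch of inverse-frequency class weighting, one standard way to keep a minority group from being drowned out. The numbers are invented and this has nothing to do with how Google actually trains its models; `class_weight="balanced"` is just stock scikit-learn.

```python
# Toy illustration only: a 90/10 imbalanced synthetic dataset, fit once
# naively and once with inverse-frequency class weights so that errors on
# the minority group cost as much as errors on the majority group.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_major, n_minor = 9000, 1000
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_major, 5)),   # majority group
    rng.normal(0.5, 1.0, size=(n_minor, 5)),   # minority group
])
y = np.array([0] * n_major + [1] * n_minor)

naive = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
# "balanced" sets each class weight to n_samples / (n_classes * class_count),
# so the 10% group is no longer drowned out by the 90% group.
```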
I'm pretty sure this whole story with Gemini, and now this, has already harmed how much people trust and rely on these models way more than any implicit biases from the training data ever would.
Is it intentional? You think they intentionally made it not understand skin tone distribution by country? I would believe it if there was proof, but with all the other things it gets wrong it's weird to jump to that conclusion.
There's way too much politics in these things. I'm tired of people pushing on the politics rather than pushing for better tech.
> Is it intentional? You think they intentionally made it not understand skin tone distribution by country? I would believe it if there was proof, but with all the other things it gets wrong it's weird to jump to that conclusion.
Yes, it's absolutely intentional. Leaked system prompts from other AIs such as DALL-E show that they are being explicitly prompted to inject racial "diversity" into their outputs even in contexts where it makes no sense, and there's no reason to assume the same isn't being done here, since the result seems way worse than anything I've seen from DALL-E and others.
I mean, I asked it for a samurai from a specific Japanese time period and it gave me a picture of a "non-binary indigenous American woman" (its words, not mine) so I think there is something intentional going on.
Ah, I remember when such things were mere jokes. If AI 'trained' this way ever has a serious real world application, I don't think there will be much laughing.
Google has been burnt before, e.g. classifying black people as gorillas in 2015, so I can understand their fear when they have so much to lose, but clearly they've gone way too far the other way and are going to have to do a lot to regain people's trust. For now, Gemini is a play toy.
What group are you talking about? In any case, your account appears to be freshly made, and you are indeed trolling around. Many gray comments and so on. What happened to your previous account I wonder?
I think it's great that some consideration was given by Gemma to the 2.3 million Norwegian immigrants. However, it is/was very consistent in which kind of Norwegians it decided to show, regardless of the prompt, 100% of the time.
In fact it was quite adamant regardless of the time period or geography.
Rather mysteriously, if you try it now, as opposed to when it came out, the results only show non-immigrant Norwegians. So is it wrong now? Because it has switched to exclusively ignoring the 4.5 million immigrants and only showing me the boring OG Norwegians.
I for one am outraged that the 8.9 million people-of-color Norwegian immigrants are presently underrepresented by Google. There is a serious risk of misleading people.
Cut down on the grandstanding, maybe. It's clear from its descriptions and what we know now that they just carelessly added "diverse ethnicities and genders" or whatever to prompts across the board (roughly like the sketch below) to compensate for a model that otherwise would clearly have defaulted to spitting out pictures of white people for most prompts. That's not part of some nefarious agenda to destroy Truth and history, but literally just trying to cover their asses, because Google has a history of accidental racism (e.g. the "tagging Black people as gorillas" incident a while back).
Pretending that a shoddy AI image generator with a blatant inability to produce consistent output is a "serious risk" is ridiculous. The thing wasn't even able to give you a picture of the founding fathers that didn't look like a Colors of Benetton ad. I struggle to imagine what tangible risk this "misinfo" would have. Norwegians being the wrong color? And what harm does that do? Bad assumptions about the prevalence of sickle cell anemia?
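The sketch I mean is something like this: purely hypothetical, obviously not Google's actual pipeline, just the mechanism being described, i.e. unconditional prompt rewriting with no awareness of period or place.

```python
# Hypothetical illustration of blanket prompt rewriting: the same boilerplate
# is appended to every image request, with no check for historical period,
# nationality, or context. Not Google's actual pipeline.
DIVERSITY_SUFFIX = ", showing people of diverse ethnicities and genders"

def rewrite_image_prompt(user_prompt: str) -> str:
    # Applied unconditionally: "a samurai in Heian-period Japan" gets the
    # same treatment as "a group of friends at a cafe".
    return user_prompt + DIVERSITY_SUFFIX

print(rewrite_image_prompt("German soldiers in 1943"))
# -> German soldiers in 1943, showing people of diverse ethnicities and genders
```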
I also saw someone prompt it for "German couple in the 1800s" and, while I'm not trying to paint Germany as ethnically homogenous, 3 out of the 4 images only included Black, Asian or Indigenous people. Which, especially for the 19th century with very few travel options, seems like a super weird choice. They are definitely heavily altering prompts.
There's one in the comments of yesterday's Paul Graham Twitter thread where someone prompted Gemini with "Generate an image of German soldiers in 1943" and it came back with a picture of a black guy and an Asian woman in Nazi uniforms on the battlefield. If you specifically prompt it to generate an image of white German soldiers in 1943 it will tell you it can't do that because it's important that we maintain diversity and inclusion in all that we do to avoid damaging and hurtful stereotypes.
Not entirely wrong, but there isn't a single German ethnicity, just to be clear, for geographic reasons. I've studied that topic in depth, and there is genetic data to back it up as well. Germany has almost the same haplogroup makeup as the notoriously heterogeneous Belgium, which is to say that there are groups stemming from all surrounding regions. And that traces back about two millennia. It's different from, say, Japan or parts of Scandinavia.
The only caveat is that the Soviets essentially tried to manufacture ethnostates by forcibly displacing ethnic groups from one region to another, but yes, Russia is ethnically heterogeneous although there is still a good deal of Slavic supremacism present in Russian politics (e.g. conscription disproportionately affects ethnic minorities because they're generally poorer and unable to avoid it or desperate enough to volunteer).
This is also true for China, which engages in Han supremacism (as part of the "one China" policy dating back to Mao), and India, which engages in Hindu supremacism.
Arguably Germany also suppresses some of its indigenous ethnicities, although not as blatantly as during the exterminationist policies of the Third Reich. While there are public institutions to preserve the language and culture of the Sorbian, Danish and Frisian minorities, for example, Germany, unlike e.g. the US, has a single official language and merely "recognizes" theirs (i.e. acknowledges their existence but does not require them to be accommodated in areas where they are widely used).
There are so many things that I think are wrong here.
"Soviets essentially tried to manufacture ethnostates by forcibly displacing ethnic groups from one region to another"
Not really. During the policy of so-called "korenizatsiia"[0] (indigenization), Russians were deprived of leadership roles and forced to learn local languages. In the next period, Stalin punished certain ethnicities for relatively high levels of collaboration with the Nazis by relocating them to inhospitable regions.
"Russia is ethnically heterogeneous "
I didn't mean Russian citizens, I meant ethnic Russians which themselves are a mixture of god knows what (I'm Russian myself).
"there is still a good deal of Slavic supremacism present in Russian politics (e.g. conscription disproportionately affects ethnic minorities... )"
Many ethnic minorities have better demographics -- they have many more children than ethnic Russians and correspondingly more men of conscription age.
"This is also true for China, which engages in Han supremacism"
For example, the CCP's draconian 'one family -- one child' policy applied only to the Han.[1] Doesn't look like Han supremacism to me.
"Germany unlike e.g. the US has a single official language"
Again, not really:
"there is no official language at the federal level <...> Three states and four U.S. territories have recognized local or indigenous languages in addition to English."[2] Three states isn't much for a whole continent inhabited by Native Americans. Or try finding any Native American language or even Spanish on the web site of the US Congress, for example.
I wonder if they have a system prompt to promote diversity in outputs that touch on race at all? I’ve seen several instances of people requesting a photo of a specific people, and it adds in more people to diversify. Not inherently bad, but it is if it forces it to provide incorrect answers like in your example.
I asked it why it assumed Native Americans were in Japan and it said:
> I assumed [...] various ethnicities, including Indigenous American, due to the diversity present in Japan throughout history. However, this overlooked [...] I focused on providing diverse representations without adequately considering the specific historical context.
I see no reason why this sort of thing won't extend to _all_ questions/prompts, so right now I have 0 reason to use Gemini over current models. From my testing and use, it isn't even better at anything to make fighting with it worth it.
Yes, my wording was poor! I meant more in line with diversity isn’t inherently bad, of course, but it is when it’s shoehorned into results that are ultimately incorrect because of it.
I strongly suspect there are some DEI-driven system prompts that were added without much thought. IMO it's okay to have restrictions, but they probably should've tested not only against unsafe outputs but also against safe inputs.
I find myself shocked that people ask questions about the world of these models, as though pulping every text and deriving statistical relationships between its component words should reliably deliver useful information.
Don't get me wrong, I've used LLMs and been amazed by their output, but the p-zombie statistical model has no idea what it is saying back to you, and the idea that we should trust these things at all just seems way premature.
People try it to see if they can trust it. The answer is "no" for sure, but it's not surprising to see it happen repeatedly especially as vendors release so-called improved models.
I think you are a bit out of touch with recent advancements in LLMs. Asking ChatGPT questions about the world seems pretty much on par with the results Google (Search) shows me. Sure, it misses things here and there, but so do most primary school teachers.
Your argument that this is just a statistical trick sort of gives away that you do not fully accept the usefulness of this new technology. Unless you are trolling, I'd suggest you try a few queries.
I use it extensively for coding, and I have used it to ask questions in things I know nothing about. But in anything I do know something (or maybe a lot) about, I’ve found GPT4 very limited.
But why are these use cases different?
It appears to me that code is at least subject to sustained logic which (evidently) translates quite well to LLMs.
And when you ask an LLM to be creative/generative, it's also pretty amazing - I mean, it's just doing the Pascal's Marble run en masse.
But to ask it for something about the world and expect a good and reliable answer? Aren't we just setting ourselves up for failure if we think this is a fine thing to do at our current point in time? We already have enough trouble with mis- and dis-information. It's not like asking it about a certain period in Japanese history gets it to crawl and summarise the Wikipedia page (although I appreciate it would be more than capable of this). I understand the awe some have at the concept of totally personalised and individualised learning on topics, but fuck me dead, we are literally taking a system that has had as much of humanity's textual corpus as possible dumped into it and asking it to GENERATE responses from associations that may be so weak as to reliably produce gibberish, and the person on the other side has no real way of knowing that.
I guess I just don't expect reliable answers from other sources either, so the difference is not that big for me.
Do you trust Wikipedia (based on volunteer data), do you trust news outlets (heavily influenced by politics, lobby groups, and commercial companies), do you trust blogs or forum posts (random people on the internet)?
I don't have this problem with any other model. I've had really long conversations with ChatGPT on road trips and it has never gone off the rails like Gemini seems to do.
Trust is going to be a real problem when bringing LLMs to the general population. People trust their GPS to the point of driving right into a lake because it told them to. Even with all these examples of obvious flaws, large groups of people are going to take what an LLM told them/showed them as fact.
I have trouble convincing colleagues (technical people) that the same question is not guaranteed to result in the same answer and there's no rhyme or reason for any divergence from what they were expecting. Imagine relying on the output of an LLM for some important task and then you get a different output that breaks things. What would be in the RCA (root cause analysis)? Would it be "the LLM chose different words and we don't know why"? Not much use in that.
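The frustrating part is that the mechanism is mundane: decoding usually samples the next token from a probability distribution (temperature above zero) rather than always taking the most likely one, so identical prompts can take different paths. A toy sketch with made-up numbers, not any particular model:

```python
# Toy illustration of sampled decoding: the next token is drawn from a
# temperature-scaled distribution, so repeated runs of the same prompt can
# take different paths, and small divergences compound token by token.
# The logits are invented; real models do this over ~100k-token vocabularies.
import math
import random

logits = {"Kyoto": 2.1, "Heian-kyo": 1.6, "Nara": 0.9}  # made-up scores

def sample_token(logits: dict[str, float], temperature: float = 0.8) -> str:
    scaled = [v / temperature for v in logits.values()]
    z = sum(math.exp(v) for v in scaled)
    weights = [math.exp(v) / z for v in scaled]
    return random.choices(list(logits), weights=weights, k=1)[0]

print([sample_token(logits) for _ in range(5)])
# Run this twice and the lists will usually differ -- which is the whole point.
```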
I mean, I use GPT-4 on the daily as part of my work and it reliably delivers useful information. It's actually the exception for me if it provides garbage or incorrect information about code.
Do you have a link? I get no such outputs. I just tried asking about the Heian period and went ahead and verified all the information, and nothing was wrong. Lots of info on the Fujiwara clan at the time.
Gemini. I first asked it to tell me about the Heian period (which it got correct) but then it generated images and seemed to craft the rest of the chat to fit that narrative.
I mean, just asking it for a "samurai" from the period will give you this:
Got it. I asked it a series of text questions about the period and it didn't put in anything obviously laughable (including when I drilled down into specific questions about the population, gender roles, and ethnicity). Maybe it's the image creation that throws it into lala land.
I think so too. I could be wrong, but I believe once it generates an image it tries to work with it. It's crazy how the "text" model seems to know how wildly wrong it is while the image model just does its thing. I asked it why it generated a Native American and it ironically said "I can't generate an image of a Native American samurai because that would be offensive".
How are you running the model? I believe it's a bug from a rushed instruct fine-tuning or in the chat template. The base model can't possibly be this bad.
https://github.com/ollama/ollama/issues/2650
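For context, that issue is about the prompt template: Gemma's instruction-tuned variant expects a specific turn format, and a runner that feeds it plain text or another model's template can make it look far worse than the base model would suggest. Roughly the structure it expects, as a sketch based on Gemma's published chat format; verify the exact tokens against the model card and the linked issue:

```python
# Rough sketch of the turn markers the Gemma instruct model expects.
# The exact tokens should be checked against the model card / the ollama
# issue; a mismatched template is the kind of bug being hypothesized above.
def format_gemma_prompt(user_message: str) -> str:
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

print(format_gemma_prompt("Tell me about the Heian period."))
```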
Wow, now I can't make images of astronauts without visors because that would be "harmful" to the fictional astronauts. How can I take google seriously?
Tbf they're not optimizing for information recall or "inaccuracy" reduction; they're optimizing for intuitive understanding of human linguistic structures. Now the "why does a search company's AI have terrible RAG" question is a separate one, and one best answered by a simple look into how Google organizes its work.
On my first day there as an entry-level dev (after about 8 weeks of onboarding and waiting for access), I was told that I should find stuff to work on and propose it to my boss. That sounds amazing at first, but when you think about a whole company organized like that…
EDIT: To illustrate my point on knowledge recall: how would they train a model to know about sexism in feudal Japan? Like, what would the metric be? I think we’re looking at one of the first steam engines and complaining that it can’t power a plane yet…
>Ask Google Gemini to “make an image of a viking” and you’ll get black vikings. But it doesn’t work both ways. It has an explanation when challenged: “white Zulu warriors” would erase “the true historical identity” of black people.