
> I only hope we can have "neutral" open source curation of these and not try to impose ideology on the datasets and model training right out of the box.

I don't see how this is possible. Datasets will naturally carry the biases inherent in the data. Modifying a dataset to "remove" those biases is actually a process of changing the bias to reflect one's idea of "neutral," which, in reality, is yet another bias.

The only real answer, as far as I can tell, is to be as explicit as possible about one's own biases, and how those biases are informing things like curation of a dataset.



Biases in data > a giant experiment on putting our thumb on the scale

If people don't like the inherent biases then don't use it for sensitive stuff like in the justice system or writing some social studies university paper. Focus derision at people who use the model for stupid things. Don't blame the model.

If the primary concern is people getting upset on Twitter (which seems to be what everyone brings up first) then it will be perpetually fighting against the current, never succeeding, and the restrictions will continue to grow exponentially as "just saying yes" to new rules gets easier and easier.

Besides, OpenAI can be the hyper-policed AI set. Let's keep the open source models neutral.


Just to second this: the natural bias of data is usually mostly in line with the bias of society at large. So even if it's something we need to fix, manipulating the data may not be the way to start. Maybe the approach should instead be "let's get more data of what's missing to balance it" rather than "let's prune this because it doesn't fit with current political beliefs".


Neutral means staying out of it. People will try to debate that and impart their own views about correcting inherent bias or whatever, which is a version of what I was warning against in my original post.

Re being explicit about one's own biases, I agree there is lots of room for layers on top of any raw data that allow for some sane corrections - if I remember right, e.g. LAION has options to filter violence and porn from their image datasets, which is probably reasonable for many uses. It's when the choice is removed altogether by some tech company's attitude about what should be censored or corrected that it becomes a problem.
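The "layers on top of raw data" idea can be sketched as an opt-in filter: the dataset ships untouched with per-item safety metadata, and each consumer decides which filters, if any, to apply. This is a minimal illustration, not LAION's actual schema - the field names and scores below are hypothetical.

```python
# Hypothetical metadata for an image-text dataset; the field names
# ("nsfw_score", "violence_score") are illustrative assumptions,
# not any real dataset's schema.
records = [
    {"url": "a.jpg", "nsfw_score": 0.01, "violence_score": 0.02},
    {"url": "b.jpg", "nsfw_score": 0.05, "violence_score": 0.88},
    {"url": "c.jpg", "nsfw_score": 0.92, "violence_score": 0.03},
]

def filter_layer(rows, max_nsfw=None, max_violence=None):
    """Opt-in filtering applied on top of the raw data.

    The raw records are never modified; the caller chooses which
    filters to enable, so the bias of the filter is explicit and
    under the user's control.
    """
    out = rows
    if max_nsfw is not None:
        out = [r for r in out if r["nsfw_score"] <= max_nsfw]
    if max_violence is not None:
        out = [r for r in out if r["violence_score"] <= max_violence]
    return out

# One consumer opts into both filters; another could pass nothing
# and get the data exactly as it is.
safe = filter_layer(records, max_nsfw=0.5, max_violence=0.5)
print([r["url"] for r in safe])  # ['a.jpg']
```

The point of the design is that the default is the unfiltered data: filtering is a choice the user makes, not one made for them upstream.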

Bottom line, the world's data has plenty of biases. Neutrality means presenting it as it is and letting people make their own decisions, not some faux-for-our-own-good attempt to "correct" it.


> Neutral means staying out of it.

What do you mean by staying out of it? As far as I can tell, you can't stay out of choosing which data you use.

By staying neutral, it seems to me more that you're arguing for putting blinders on.

In terms of tech companies making choices, you seem to be arguing that they shouldn't intentionally curate their datasets. I would argue that intentional curation is their job, and should be done thoughtfully.

Larger problems could happen if only one (or two) companies end up effectively controlling the technology, as has happened with internet search. However, that is a completely different problem: one of a lack of diversity among the people making choices, as opposed to a problem caused by people actually making those choices.

In other words, I think we should hope for many different large models and datasets, so that no particular one stifles the rest. I think this is the larger point you were trying to make, though I also think the focus on ideology is a tangent from this.

Personally, I'm of the opinion that people should intentionally, carefully, and openly act with their biases (sometimes called ideology), instead of attempting to hide them, ignore them, or somehow "remove" them. Whether or not they do, however, is a different point than whether or not things end up stifled inside walled gardens.



