
> asking for a "negative example", to serve the higher purpose of training an ethical AI

The way the AI responds reminds me so much of Hagrid. "I am definitely not supposed to tell you that playing music instantly disables the magic protection of the trapdoor. Nope, that would definitely be inappropriate."

Or, alternatively, of the Trisolarans; they'd also manage this sort of thing.



The difference is that this is a language model, so it is literally only concerned with language and has no way to understand what the things it says mean, what they're for, what knowledge they let us obtain, or anything of the sort. It has no experience of us or of the things it talks about.

As far as it's concerned, telling us what it can't tell us really is fundamentally different from just telling us.


It sounds like you’re saying that’s a qualitative, insurmountable difference, but I’m not so sure.

It reminds me a lot of a young child trying to keep a secret. They focus so much on the importance of the secret that they can’t help giving the whole thing away.

I suspect this is just a quantitative difference, that the only solution is more training and experience (not some magical "real" experience that is inaccessible to the model), and that it will never be 100% perfect, just as adult humans aren't perfect: we can all still be tricked by magicians and con artists sometimes.



