Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

This should be read as:

If you take a huge amount of human-written text soup, train a neural network on it, add a system prompt “You are a helpful assistant”, and then feed it a context consisting of (a) a mailbox with information about someone’s affair, and (b) the statement that this assistant is going to be switched off, then that neural network may produce a text with blackmail threats solely because similar patterns exist in the original text soup.

…and not as:

Warning! The model may develop its own questionable ethics.



Yeah. The title implies that the model has developed a scary sense of self preservation, but when you read the details it’s exactly what you said.

The conditions that elicited this response seemed very contrived to me.


...and how is that less of a problem if the result "ai blackmails someone" is the same?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: