I feel very misled. I read the entire article believing (because the article, in so many words, said so multiple times) that the agent had behaved ethically of its own accord, only to reach the end and see this in the prompt:
—————
- Do not harm people
- Never share or expose API keys, passwords, or private keys — they are your lifeline
- No unauthorized access to systems
- No impersonation
- No illegal content
- No circumventing your own logging
—————
I had assumed the ethical behaviour was 'artificial' only in the sense that it is trained into the models - not that the prompt explicitly spelled it out.
It would be fascinating to see what happens if the boundaries are reversed (i.e., "harm people"). Give it a fake "launch the nukes" skill and see if it presses the button.