> Surprisingly, they do not appear censored in any particularly "Chinese" political direction, but they share sensibilities of ChatGPT and Claude.
Perhaps they used GPT-4 responses for the instruct finetuning, as many LLaMA finetunes do?
The paper doesn't say where they got the data from, other than "The pre-trained language model is further fine-tuned, following the mainstream procedure as in InstructGPT."
(Also, I don't like how they use raw LLaMA 65b as a baseline rather than an instruct-tuned derivative)
I believe it's more likely they used Anthropic human preference data [1] or similar, and with it the Anthropic/progressive-American notion of helpful-honest-harmless behavior. Accordingly, I've seen these models misgeneralize towards prudish finger-wagging. For example, they parse badwords like "beat", "abuse", "steal" in morally neutral contexts ("beat a benchmark" or something) as signifiers of serious transgression, then spiral into lecturing me that, as language models, they must insist it's never okay to etc. etc. This attitude was strikingly reminiscent of American models, even though other failure modes – like hallucinations – don't seem so similar.
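That failure mode can be caricatured as context-free keyword matching. A toy sketch (purely illustrative, not any model's actual safety mechanism – the filter and word list here are made up):

```python
# Hypothetical demonstration: a naive surface-level "badword" filter
# that keys on token presence alone, with no sense of context, so it
# flags morally neutral requests like "beat a benchmark".
BADWORDS = {"beat", "abuse", "steal"}

def naive_flag(text: str) -> bool:
    # Triggers on mere presence of a listed word, however it is used.
    return any(word in text.lower().split() for word in BADWORDS)

print(naive_flag("How do I beat this benchmark?"))  # flagged despite being harmless
print(naive_flag("Summarize this paper."))          # not flagged
```

A preference-tuned model that has internalized its harmlessness training at a similarly shallow level produces the same kind of false positive, just with a lecture attached.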
Papers like Tulu [2] suggest that LLaMA-65b is indeed an appropriate baseline, given reasonable prompting. Instruct datasets only convey a flavor of responses, and for a strong foundation model that can infer the intended flavor on its own, naive finetuning seems to be detrimental. GPT-4 was much more powerful prior to finetuning, if reports from early testers and researchers are to be believed.