Not sure you're really getting the idea. In this scenario, there wouldn't be human llm and dolphin llm, you would train human speech and dolphin communication in one model. sound tokens are sound tokens. pre-trained transformers don't discriminate.
The concern would be that it would gnerate a bunch of spurious correlations between the two. You would be mixing two kinds of datasets, where it’s not clear that the dolphin sounds are linguistic. It’s also not clear that LLMs would correctly map human language to a non-human one. Depends on your theory of meaning. Wittgenstein would say if a lion could talk, we would not understand it, because lmeaning is use, and lions would use language differently than humans.