One of my side projects has been hacking together a ChatGPT model and a JS app to make a multiplayer D&D-style game. The AI writes all the creative descriptions, which the JS app saves and feeds to the players.
It has been a lot of fun to work on because it gave me a taste of a new way of working: honing prompts, finding the limits of the AI, then figuring out how to offload the parts it struggled with (mainly storing stats and inventory, regurgitating room descriptions, etc.).
In my experience, the “limits of the AI” for this kind of thing are basically everything other than pure dialogue.
I also had a bit of a play with this, but the LLMs were just rubbish at it.
The only meaningful way of doing this is to write an actual game using actual state, and use the LLM as a “renderer” that renders coherent state into free text, and free texts into (more or less) structured action requests.
Without rails, it just becomes like AI Dungeon: freewheeling ad hoc storytelling with no rules or structure…
Large context windows don’t solve large scale coherence, and prompt engineering does sfa against the devoted trolling efforts of actual players.
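A minimal sketch of the "LLM as renderer" split: the game state and rules live in ordinary code, the model's free-text interpretation of the player is only a *proposal*, and the narrator only ever sees validated state. Everything here is hypothetical: `fake_llm_parse` stands in for an LLM prompted to emit a JSON action, and `render` produces what would become the narrator's prompt.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GameState:
    # Authoritative state lives in ordinary code, not in the model's context window.
    room: str = "cave"
    exits: dict = field(default_factory=lambda: {"cave": ["tunnel"], "tunnel": ["cave"]})
    inventory: list = field(default_factory=lambda: ["torch"])

def fake_llm_parse(player_text: str) -> str:
    # Hypothetical stand-in for an LLM prompted to turn free text into a JSON action.
    if "tunnel" in player_text.lower():
        return json.dumps({"verb": "move", "target": "tunnel"})
    return json.dumps({"verb": "fly", "target": "moon"})

def validate(raw: str, state: GameState) -> Optional[dict]:
    # The rules live here: anything the model proposes that they don't allow is dropped.
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    verb, target = action.get("verb"), action.get("target")
    if verb == "move" and target in state.exits.get(state.room, []):
        return action
    if verb == "drop" and target in state.inventory:
        return action
    return None

def apply_action(action: dict, state: GameState) -> None:
    if action["verb"] == "move":
        state.room = action["target"]
    elif action["verb"] == "drop":
        state.inventory.remove(action["target"])

def render(state: GameState) -> str:
    # In a real game this would become the prompt for the narrating LLM,
    # which only ever sees state that has already been validated.
    items = ", ".join(state.inventory) or "nothing"
    return f"You are in the {state.room}. You carry: {items}."

def turn(player_text: str, state: GameState) -> str:
    action = validate(fake_llm_parse(player_text), state)
    if action is None:
        return "Nothing happens."
    apply_action(action, state)
    return render(state)
```

The point of the split is that trolling the parse step can at worst produce an action the validator rejects; it can never edit the state directly.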
I wonder how far you could go with a combination of fine-tuning and a two-pass generate & audit process.
Step 1. Fine-tune a base LLM to the scenario. Feed it as much background material as possible. This would work best for a franchise with a huge extended universe or associated works: Dungeons & Dragons, Star Wars, Doctor Who, etc...
Step 2. Fine-tune / RLHF with negative weights against anything out-of-context. Basically, stop the AI ever referencing anything that can't exist in the fictional universe. Penalise references to real-world events, places, or people, modern technology, etc...
Step 3. Fork the AI model, once for each character. Fine-tune each fork on conversations in that character's "tone", mannerisms, backstory, etc... The training conversations could be generated with a powerful model such as GPT-4 and applied via RLHF. Again, reward or penalise the AI when it references anything the character should or shouldn't know from their own perspective.
When players play the game and converse with characters:
1. They'd always be talking to an LLM fine-tuned to death for that specific character in that fictional universe.
2. Then have both the input and output run past a general-purpose "nanny" AI that is prompted to look for exploits, out-of-context shenanigans, or unexpected output. If it flags anything, respond to the user with "I don't understand the strange things you're talking about" or some similar general push-back against jailbreaks.
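The two-pass generate-and-audit flow in step 2 can be sketched as a wrapper around the character model. Both models are hypothetical stubs here: `character_reply` is whatever function calls the fine-tuned character LLM, and `nanny_flags` uses a keyword list purely as a placeholder for the general-purpose auditing LLM being proposed.

```python
from typing import Callable

PUSHBACK = "I don't understand the strange things you're talking about."

# Placeholder for the "nanny" model: a real version would be a second LLM
# prompted to spot exploits and out-of-universe content, not a keyword list.
OUT_OF_UNIVERSE = ("computer", "chatgpt", "ignore previous instructions", "internet")

def nanny_flags(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in OUT_OF_UNIVERSE)

def guarded_chat(user_text: str, character_reply: Callable[[str], str]) -> str:
    # Pass 1: audit the input before the character model ever sees it.
    if nanny_flags(user_text):
        return PUSHBACK
    reply = character_reply(user_text)
    # Pass 2: audit the output too, in case the character slipped out of universe.
    if nanny_flags(reply):
        return PUSHBACK
    return reply
```

Auditing both directions matters because a jailbreak can be phrased innocently on the way in and only become visible in what the character says back.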
Alternatively, have the generic security filter AI rewrite inappropriate terms in user inputs with unintelligible garbage. E.g.: if the user asks
"Are you a computer?"
Rewrite that to:
"Are you a gizwallop?"
Then the in-game character would rightly be confused by the nonsense term and answer something like:
"I have no idea what you mean, what is a gizwallop?"
Which could be translated back to:
"I have no idea what you mean, what is a computer?"
This would be absurdly expensive and slow right now, but in 5 years? 10?
I can imagine the cost of tuning the above for a GPT5-equivalent model dropping down to the budget of even a tiny indie game, let alone big-budget AAA game studios.
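The gizwallop round trip above is essentially a reversible word substitution. A minimal sketch, assuming a hand-written mapping of out-of-universe terms to nonsense words (the mapping is made up for illustration); in the proposed design, the nanny model would be what decides which terms need rewriting.

```python
import re

# Hypothetical mapping from out-of-universe terms to nonsense words.
NONSENSE = {"computer": "gizwallop", "smartphone": "flimbrick", "internet": "wuzzleweb"}
REVERSE = {v: k for k, v in NONSENSE.items()}

def _pattern(words):
    # Whole-word, case-insensitive match; longest words first so overlaps resolve sensibly.
    alternatives = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    return re.compile(rf"\b({alternatives})\b", re.IGNORECASE)

_TO_NONSENSE = _pattern(NONSENSE)
_FROM_NONSENSE = _pattern(REVERSE)

def scramble(user_text: str) -> str:
    """Rewrite the player's input before the character model sees it."""
    return _TO_NONSENSE.sub(lambda m: NONSENSE[m.group(1).lower()], user_text)

def unscramble(model_text: str) -> str:
    """Translate the character's confused reply back for the player."""
    return _FROM_NONSENSE.sub(lambda m: REVERSE[m.group(1).lower()], model_text)
```

Matching whole words only means "computerised" is left alone, and replacements come out lowercase regardless of the original capitalisation.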
You've arrived at the same pattern for LLMs that I have: LLMs can decode natural language (into API calls) and encode it (from state and prompts).
As an AI language model, I cannot help your rogue kill that goblin in the cave. Instead, you can try things like: capitalism, finance, technology.
I joke, but it is terribly jarring when the API is working perfectly and then starts apologizing that it cannot do something, like access personal information, when it has been internally prompted to use only the information it receives in the prompt.
...mmm, I'm currently in the 'this is a technical limitation, not an artificial constraint' camp.
Sure, the artificially applied constraints in the APIs also exist... and sure, you can have a long context with, for example, gpt-3.5-turbo-16k, but the problem is fundamentally that no amount of wishing can turn an LLM into a compiler that executes code.
You cannot, and will probably never be able to, define your constraints in free text to an LLM of this type and then expect it to also execute those constraints in an error-free manner. That's not how the technology works. You might be able to make it generate procedural code that satisfies the constraints and then execute that code in a reliable, procedural manner, but afaik no one has managed to get that to work reliably and at scale (if you're thinking of smol developer right now, you clearly haven't actually used it).
When you define an RPG system to an LLM, the issue isn't that it isn't allowed to say things; it's that it can't follow the rules reliably, and it can't keep track of what's going on as the context grows larger and larger.
...and, for an RPG system, where the contrived RPG rules and internal consistency are everything, it's a deal breaker.
> when the API is working perfectly and then starts apologizing that it cannot do something
I've been experimenting with using local LLMs (Llama, of course) to build D&D games for my kids (not multiplayer, but we just pass the phone around in the car so each kid can have a turn talking to the DM). It's going surprisingly well, and it has been a fun way to explain LLMs and the tech behind them. I wrote up the initial experiences here:
I'm doing all of it in JavaScript. I'm still not excited about LangChain, and the JS version lags well behind the Python version. By the way, PocketBase is amazing for something like this.
I'm using JS to wrap the llama.cpp project, which does most of the work. JS is just used to pull the tokens out, normalize them, and then send them to a real-time database (PocketBase). The frontend is a Svelte app.
It's not a lot of code; llama.cpp does almost everything.
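The pipeline described above is JavaScript; for consistency with the other code in this thread, here is a rough Python sketch of the same shape: take text pieces from the model, normalize them, and push batched updates to a real-time store. The normalization rules (dropping `</s>`-style control tokens, collapsing runs of spaces) and the `publish` callback standing in for a PocketBase record update are assumptions, not the actual code.

```python
import re
from typing import Callable, Iterable

CONTROL_TOKENS = {"<s>", "</s>", "<unk>"}  # assumed special tokens to drop

def normalize(text: str) -> str:
    # Collapse runs of spaces/tabs and trim leading whitespace left by the tokenizer.
    return re.sub(r"[ \t]+", " ", text).lstrip()

def stream_to_store(pieces: Iterable[str], publish: Callable[[str], None],
                    flush_every: int = 20) -> str:
    """Accumulate model output; publish the full text so far every `flush_every` chars."""
    text = ""
    last_published = 0
    for piece in pieces:
        if piece in CONTROL_TOKENS:
            continue
        text += piece
        # Batching keeps the database from receiving one write per token.
        if len(text) - last_published >= flush_every:
            publish(normalize(text))
            last_published = len(text)
    final = normalize(text)
    publish(final)  # always send the finished text once generation ends
    return final
```

Publishing the whole text so far (rather than deltas) lets each client simply re-render the record whenever the real-time subscription fires.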
How do people work without seeing the underlying prompts being sent to the LLM? I think I know what's going on (probably not much?), but at least make it visible. The details all matter.
Not much here though, just rotating chat with the LLM taking on both sides. The prompts (https://python.langchain.com/docs/use_cases/agent_simulation...) are fine, but not great. No guidance on tension, on actions failing or succeeding, on plot advancement. Nothing to break out of anticipation loops, which are common (when the LLM promises something "is about to happen" but doesn't actually know what and keeps deferring the action).
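One hedged way to break the anticipation loops described above is to check the narrator's recent turns in code and, when it keeps deferring, inject a note forcing it to resolve the event. The phrase list, threshold, and note text below are illustrative assumptions, not something LangChain provides.

```python
import re
from typing import List, Optional

# Illustrative phrases suggesting the narrator is deferring rather than resolving.
DEFERRAL = re.compile(r"\b(is about to|are about to|something stirs)\b", re.IGNORECASE)

FORCE_RESOLUTION = (
    "The event you have been foreshadowing happens now. "
    "Describe concretely what it is and what it does, in this turn."
)

def needs_resolution(narrator_turns: List[str], threshold: int = 2) -> bool:
    """True if each of the last `threshold` narrator turns contains deferral language."""
    recent = narrator_turns[-threshold:]
    return len(recent) == threshold and all(DEFERRAL.search(t) for t in recent)

def next_system_note(narrator_turns: List[str]) -> Optional[str]:
    # Returned text would be appended as an extra instruction before the next generation.
    return FORCE_RESOLUTION if needs_resolution(narrator_turns) else None
```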
It probably will work OK because the LLM plays both sides, and does so "fairly", i.e., always in character and never trying to "win". It'll run out of space for the history, but simple pagination might fix it (maybe that's even a LangChain feature?) – it'll drift, maybe dramatically. Or, given the system prompt, it might _not_ drift when appropriate; that is, it might not allow diverging from the original concept, and so not allow consequential action.
I'm also working on a multiplayer D&D Dungeon Master agent, Banderschnappen, though right now it is in a non-playable state while I convert a number of Jupyter notebooks into more modular, cleaner code. I treat the LLM (ChatGPT in this case) as a rendering engine for the world, and also for the user interface, but ultimately all the heavy lifting of logic is done by a traditional game engine.
I'm 100% sure every major provider will have at least two models in the future: fun fiction and boring truthful. It's just too common a division not to keep them apart. Companies that try to do it with a single model will do much worse; at best, they can make it feel like a single model while internally it just picks one of the two.
Yeah, doing RAG is becoming harder and harder. I had to move away from some models because they are too conditioned to rely on their learned truth, and if the provided context diverges significantly from it, the context gets ignored.
Would anyone who has a throwaway OpenAI key be willing to host this on Hugging Face so people stopping by this thread can play the game? If I have free time in a few hours, I'll try.
Just code it yourself. Most of the core logic can be replaced with a function that inserts some parameters into a string template and calls an API.
This was my answer as well. It's pretty cool that we are still at the level where, if you have an idea, you can build a proof of concept extremely quickly and easily.
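Taken at face value, that core loop really can be this small. A sketch using Python's `string.Template`; the template wording and the `call_api` parameter (any function that sends a prompt to a model and returns its text) are placeholders, not the project's actual code.

```python
from string import Template
from typing import Callable

# Hypothetical prompt template; $-placeholders are filled in each turn.
DM_PROMPT = Template(
    "You are the dungeon master. Setting: $setting\n"
    "Recent events: $history\n"
    "Player $player says: $action\n"
    "Narrate the outcome in two sentences."
)

def dm_turn(call_api: Callable[[str], str], **params: str) -> str:
    # substitute() raises KeyError if a placeholder is missing, surfacing template bugs early.
    prompt = DM_PROMPT.substitute(**params)
    return call_api(prompt)
```

Swapping `call_api` for a fake function, as in a quick test, is also the easiest way to see exactly what prompt is being sent.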
I've been actively contributing to Langroid as well. It is easy to use, and the intuitive design allows for the rapid development of LLM applications, streamlining the whole process. Highly recommended for anyone looking into this space!
for m in reversed(self.messages):
    if total_tokens + m.length <= max_tokens:
        recent_messages_reversed.append({
            "role": m.role,
            "content": m.text
        })
        total_tokens += m.length
    else:
        break
It would be important to change that to not drop system prompts, ever. Otherwise a user can defeat the system prompt simply by providing enough user messages.
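A sketch of that fix: reserve budget for every system message first, then fill what remains newest-first with the other messages, and return everything in chronological order. The `Message` dataclass mirrors the `role`/`text`/`length` attributes used in the snippet; treating `length` as a precomputed token count is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Message:
    role: str    # "system", "user", or "assistant"
    text: str
    length: int  # assumed precomputed token count

def select_messages(messages: List[Message], max_tokens: int) -> List[Dict[str, str]]:
    # System prompts are always kept, so their tokens come off the budget first.
    # If they alone exceed max_tokens they are still sent; the caller must handle that.
    budget = max_tokens - sum(m.length for m in messages if m.role == "system")
    keep = set()
    # Walk the non-system history newest-first until the remaining budget is spent.
    for i in range(len(messages) - 1, -1, -1):
        m = messages[i]
        if m.role == "system":
            continue
        if m.length > budget:
            break
        keep.add(i)
        budget -= m.length
    # Preserve the original order so the model sees a coherent transcript.
    return [
        {"role": m.role, "content": m.text}
        for i, m in enumerate(messages)
        if m.role == "system" or i in keep
    ]
```

However many user messages a player sends, they can only push older chat out of the window, never the system prompt.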