the 1M context is cool but tbh the token cost problem nobody's talking about is tool schema bloat. before the model writes a single line of code it's already consumed thousands of tokens just ingesting function definitions. i've seen agent setups where 30-40% of the context window is tool descriptions before any actual work happens. the per-token price war is nice but if your schema is 10k tokens of boilerplate you're still burning money
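quick way to see how bad it is, rough sketch using tiktoken to count what your tool definitions cost (the example schema and the 200k window are placeholders, and providers serialize/count tool schemas their own way, so treat the number as an estimate):

```python
import json
import tiktoken  # pip install tiktoken

# hypothetical example: one OpenAI-style tool definition; real agent
# setups often ship dozens of these
tools = [
    {
        "type": "function",
        "function": {
            "name": "refund_order",
            "description": "Issue a refund for an order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string", "description": "The order to refund"},
                    "reason": {"type": "string", "description": "Why the refund was requested"},
                },
                "required": ["order_id"],
            },
        },
    },
]

enc = tiktoken.get_encoding("cl100k_base")
schema_tokens = len(enc.encode(json.dumps(tools)))
context_window = 200_000  # swap in your model's actual window

print(f"tool schemas: {schema_tokens} tokens "
      f"({schema_tokens / context_window:.2%} of the window)")
```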
what do you mean nobody is talking about tool schema bloat? everybody is talking about it, it's why the general recommendation is to just use the CLI whenever possible.
on the tautological test problem someone mentioned: i've found the easiest fix is to literally make the test fail first, before letting the agent fix the code
like don't ask it to "write tests for this function", instead give it a function that's deliberately broken in a specific way, make it write a test that catches that bug, verify the test actually fails, THEN fix the function
this forces the test to be meaningful because it has to detect a real failure mode. if the agent can't make the test fail by breaking the code, the test is useless
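rough sketch of the loop, with a made-up slugify function as the deliberately broken code:

```python
# step 1: deliberately broken version you hand to the agent
# (hypothetical example; the bug is that it never lowercases)
def slugify(title: str) -> str:
    return title.replace(" ", "-")

# step 2: the test the agent writes has to catch that specific bug
def test_slugify_lowercases():
    assert slugify("Hello World") == "hello-world"

# step 3: run pytest and confirm this test FAILS against the broken code.
# only then let the agent fix slugify, e.g.
#     return title.lower().replace(" ", "-")
# if the test still passes with the bug in place, it's not testing anything
```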
the other thing that helps is being really specific about edge cases upfront. instead of "write tests for this API endpoint", say "write tests that verify it returns 400 when the email field is missing, returns 409 when the email already exists, returns 422 when the email is malformed" etc
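concretely, something like this for a hypothetical POST /users signup endpoint (the `client` fixture is assumed, e.g. fastapi's TestClient wired to your app):

```python
# scenario-specific tests: each one names the exact failure mode it covers
def test_missing_email_returns_400(client):
    resp = client.post("/users", json={"name": "Ada"})
    assert resp.status_code == 400

def test_duplicate_email_returns_409(client):
    client.post("/users", json={"name": "Ada", "email": "ada@example.com"})
    resp = client.post("/users", json={"name": "Ada", "email": "ada@example.com"})
    assert resp.status_code == 409

def test_malformed_email_returns_422(client):
    resp = client.post("/users", json={"name": "Ada", "email": "not-an-email"})
    assert resp.status_code == 422
```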
agents are weirdly good at implementing specific test scenarios but terrible at figuring out what scenarios actually matter. which honestly is the same problem junior devs have lol
the mock tool platform thing is smart. testing agents against real APIs is a nightmare, you get flakiness, you burn through rate limits, and you can't reproduce failures
one thing i'm curious about: how do you handle testing the tool selection logic itself? like the agent choosing WHICH tool to call is often where things break, not the tool execution
we had a support agent that would sometimes call the "refund order" tool when the user just wanted to check order status. the tool worked perfectly, the LLM just kept picking the wrong one. your mock platform lets you verify the tool returns the right data, but does it catch when the agent calls the wrong tool entirely?
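what i'd want is something like this, where the assertion is on WHICH tool got called, not what it returned (`run_agent` is a stand-in for however you invoke the agent and collect its attempted tool calls):

```python
# tool-selection eval: pairs of (user message, tool the agent should pick)
CASES = [
    ("where is my order #1234?", "get_order_status"),
    ("i want my money back for order #1234", "refund_order"),
    ("can you check if order #1234 shipped yet", "get_order_status"),
]

def test_agent_picks_the_right_tool():
    for message, expected_tool in CASES:
        calls = run_agent(message)  # e.g. [{"name": "refund_order", "args": {...}}]
        called = [c["name"] for c in calls]
        assert expected_tool in called, f"{message!r} -> called {called}"
        # and specifically catch the scary case: a refund fired when it shouldn't
        assert "refund_order" not in called or expected_tool == "refund_order", (
            f"{message!r} triggered a refund it shouldn't have"
        )
```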
also the full-session evaluation vs turn-by-turn is spot on. had a similar issue with a verification flow where each individual turn looked fine in langsmith but the overall flow was completely broken. you'd see "assistant asked for name" (good), "assistant asked for phone" (good), "assistant processed request" (good), but it never actually verified the phone number matched the account
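the fix for us was a session-level check over the whole transcript instead of grading turns, roughly like this (the transcript format and tool names are made up):

```python
# walk the full session and assert the flow actually verified the phone
# number before processing, rather than scoring each turn in isolation
def verified_before_processing(transcript):
    verified_at = None
    processed_at = None
    for i, turn in enumerate(transcript):
        if turn.get("tool") == "verify_phone_matches_account":
            verified_at = i
        if turn.get("tool") == "process_request":
            processed_at = i
    # pass only if verification happened at all, and happened first
    return (
        verified_at is not None
        and processed_at is not None
        and verified_at < processed_at
    )
```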
tbh this feels like one of those problems that's obvious in hindsight but nobody builds the tooling for until they get burned in production
In that case I think you can have a refund subagent that is responsible for checking if the user really asked for a refund before doing these dangerous things. But that only minimizes errors, LLMs are non-deterministic by nature.
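Or put a deterministic guard in front of the dangerous tools instead of leaving the check entirely to the LLM. Just a sketch, the names here are made up:

```python
# force an explicit user confirmation before any destructive tool runs;
# execute_tool / ask_user_to_confirm are whatever your dispatcher and UI provide
DESTRUCTIVE_TOOLS = {"refund_order", "cancel_subscription"}

def dispatch_tool(name, args, execute_tool, ask_user_to_confirm):
    if name in DESTRUCTIVE_TOOLS:
        if not ask_user_to_confirm(f"Run {name} with {args}?"):
            return {"status": "cancelled", "reason": "user did not confirm"}
    return execute_tool(name, args)
```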
the test generation loop is brutal. i've been burned by this exact thing, you ask the agent to write code, then ask it to write tests for that code, and surprise, they all pass because the tests are literally just "does the code do what the code does"
honestly think the answer isn't more tests, it's stricter contracts. like if your API has an OpenAPI spec, you can validate requests/responses against it automatically. the spec becomes the source of truth, not the tests, not the implementation
we've been doing this backwards for years. write code, write tests that match the code, realize six months later that both the code and tests were implementing the wrong behavior. but if you have a machine-readable contract (openapi, json schema, whatever), at least you can verify one dimension automatically
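minimal version of the idea with jsonschema, toy schema for illustration (with a real openapi spec you'd pull the schema out of the components section, or use something like schemathesis to generate cases straight from the spec):

```python
# validate the actual response body against the contract, not against
# hand-written assertions that just mirror the implementation
from jsonschema import validate, ValidationError

user_schema = {
    "type": "object",
    "required": ["id", "email"],
    "properties": {
        "id": {"type": "integer"},
        # note: "format" is annotation-only unless you pass a format checker
        "email": {"type": "string", "format": "email"},
    },
    "additionalProperties": False,
}

def check_response(body: dict) -> bool:
    try:
        validate(instance=body, schema=user_schema)
        return True
    except ValidationError:
        return False
```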
ngl this is why i'm skeptical of "AI will write all the code" takes. without formal specs, you're just getting really confident garbage that happens to pass its own tests. which tbh describes a lot of human-written code too lol