More

JamesSwift · 2026-04-16T17:29:03 1776360543

Its especially concerning / frustrating because boris’s reply to my bug report on opus being dumber was “we think adaptive thinking isnt working” and then thats the last I heard of it: https://news.ycombinator.com/item?id=47668520

Now disabling adaptive thinking plus increasing effort seem to be what has gotten me back to baseline performance but “our internal evals look good“ is not good enough right now for what many others have corroborated seeing

rkuska · 2026-04-17T08:19:11 1776413951

For 4.7 it is no longer possible to disable adaptive thinking. Which is weird given the comment from Boris followed with silence (and closed github issue). So much for the transparency.

> Claude Opus 4.7 (claude-opus-4-7), adaptive thinking is the only supported thinking mode. Thinking is off unless you explicitly set thinking: {type: "adaptive"} in your request; manual thinking: {type: "enabled"} is rejected with a 400 error.

https://platform.claude.com/docs/en/build-with-claude/adapti...

For my claude code I went with following config:

* /effort xhigh (in the terminal cli) - To avoid lazying

* "env": {"CLAUDE_CODE_DISABLE_1M_CONTEXT": "1"} (settings.json) - It seems like opus is just worse with larger context

* "display": "summarized" (settings.json) - To bring back summaries.

* "showThinkingSummaries": true (settings.json) - Should show extended thinking summaries in interactive sessions

Freaking wizardry.

arcanemachiner · 2026-04-17T09:51:02 1776419462

It's early days for Opus 4.7, but I will say this: Today, I had a conversation go well into the 200K token range (I think I got up to 275K before ending the session), and the model seemed surprisingly capable, all things beings considered.

Particularly when compared to Opus 4.6, which seems to veer into the dumb zone heavily around the 200k mark.

It could have just been a one-off, but I was overall pleased with the result.

captainregex · 2026-04-17T11:44:18 1776426258

I’m super envious. I can’t seem to do anything without a half a million tokens. I had to create a slash command that I run at the start of every session so the darn thing actually reads its own memory- whatever default is just doesn’t seem to do it. It’ll do things like start to spin up scripts it’s already written and stored in the code base unless I start every conversation with instructions to go read persistence and memory files. I also seem to have to actively remind it to go update those things at various parts of the conversation even though it has instructions to self update. All these things add up to a ton of work every session.

I think i’m doing it wrong

hombre_fatal · 2026-04-17T14:01:10 1776434470

Something sounds very wrong with your setup or how you use it.

Is your CLAUDE.md barren?

Try moving memory files into the project:

    (In your project's .claude/settings.local.json)

    { ...
      "plansDirectory": "./plans/wip",
      "autoMemoryDirectory": "/Users/foo/project/.claude/memory"
    }

(Memory path has to be absolute)

I did this because memory (and plans) should show up in git status so that they are more visible, but then I noticed the agent started reading/setting them more.

3371 · 2026-04-17T13:23:56 1776432236

This does kind of smell like the wrong way to use it. Not trying to self-promote here, but the experiences you shared really made me think I headed the right direction with my prompting framework ("projex" - I once made a post about it).

I straight up skip all the memory thing provided by harnesses or plugins. Most of my thread is just plan, execute, close - Each naturally produce a file - either a plan to execute, a execution log, a post-work walkthrough, and is also useful as memory and future reference.

pwinnski · 2026-04-17T16:54:41 1776444881

Something seems wrong. A half-million tokens is almost five times larger than I allow even long-running conversations to get too. I've manually disabled the 1M context, so my limit is 200K, and I don't like it to get above 50%.

Is it... not aware of its current directory? Is its current directory not the root of your repo? Have you maybe disabled all tool use? I don't even know how I could get it to do what you're describing.

Maybe spend more time in /plan mode, so it uses tools and the Explore sub-agent to see what the current state of things is?

arcanemachiner · 2026-04-17T17:32:51 1776447171

Two quick thoughts:

- Use the Plan mode, create a thorough plan, then hand it off to the next agent for execution.

- Start encapsulating these common actions into Skills (they can live globally, or in the project, per skill, as needed). Skills are basically like scripts for LLMs - package repeatable behavior into single commands.

JamesSwift · 2026-04-17T14:22:22 1776435742

If i had to guess i think you have probably overstuffed the context in hopes of moulding it and gotten worse outcomes because of that. I keep the default context _extremely_ small (as small as possible) and rely on invoked slash commands for a lot of what might have been in a CLAUDE.md before

j_bum · 2026-04-17T14:54:57 1776437697

Your display and thinking summery settings aren’t working for me (v2.1.112 on macOS). Any advice?

rkuska · 2026-04-17T19:08:25 1776452905

It seems like the correct way is to use:

`claude --thinking-display summarized`

The thinking is then visible with ctrl+o in the claude cli (shortcut available at least on mac).

Well you can't really trust the documentation I guess. I can't edit my original comment anymore.

beaker52 · 2026-04-17T09:11:20 1776417080

It doesn’t really come as a surprise to me that these companies are struggling to reliably fix issues with software which relies on a central component which is nondeterministic.

But they made their own bed with that one.

ljm · 2026-04-17T12:09:21 1776427761

I've noticed a lack of product cohesion in general and it does make me wonder if it's a result of dogfooding AI.

For example, chat, cowork and code have no overlap - projects created in one of the modes are not available in another and can't be shared.

As another example, using Claude with one of their hosted environments has a nice integration with GitHub on the desktop, but some of it also requires 'gh' to be installed and authenticated, and you don't have that available without configuring a workaround and sharing a PAT. It doesn't use the GH connector for everything. Switch to remote-control (ideal on Windows/WSL) or local and that deep integration is gone and you're back to prompting the model to commit and push and the UI isn't integrated the same.

Cowork will absolutely blow through your quota for one task but chat and code will give you much more breathing room.

Projects in Code are based on repos whereas in Chat and Cowork they are stateful entities. You can't attach a repo to a cowork project or attach external knowledge to a code project (and maybe you want that because creating a design doc or doing research isn't a programming task or whatever)

Use Claude Code on the CLI and you can't provide inline comments on a plan. There is a technical limitation there I suppose.

The desktop app is very nice and evolving but it's not a single coherent offering even within the same mode of operation. And I think that's something that is easy to do if you're getting AI to build shit in a silo.

randall · 2026-04-17T13:31:10 1776432670

this is "you ship your org chart" not ai.

https://en.wikipedia.org/wiki/Conway%27s_law

ljm · 2026-04-17T15:03:18 1776438198

Even a distributed or silo'd org chart has some affinity across the hierarchy in order to keep things in overall alignment. You wouldn't expect to use a product suite that is, holistically, not fully compatible with its own ecosystem, even down to not having a single concept of a project. Or requiring a CLI tool in an ephemeral environment that you cannot easily configure.

That's clearly a trade-off that Anthropic have accepted but it makes for a disappointing UX. Which is a shame because Claude Desktop could easily become a hands-off IDE if it nailed things down better.

JamesSwift · 2026-04-17T15:53:37 1776441217

And the multiple concepts of subscriptions for products, and the idea of MCPs/connectors that arent shared between the different modalities, and the idea of api key vs subscription, and two different inbound websites (claude.ai and claude.com)...

lilytweed · 2026-04-17T13:51:44 1776433904

Agreed. I use the Claude desktop app almost every day, and have used Code and Cowork since their respective launch dates, and even I still have a really hard time grokking what each is for. It becomes even more confusing when you enable the (Anthropic-provided) filesystem extension for Chat mode. Anthropic really needs to streamline this.

notsydonia · 2026-04-17T16:26:52 1776443212

YES! I thought it was just me being a bit scattered. But uploading an important file to a project only to have it not there because....<garbled answer from Claude> is distracting to say the least. I don't know what I've enabled offhand but I hate having to stop and try to work out why Claude can't reference a file uploaded to the project in a chat within that project. I think they should pause on all the wild aspirations and devote some time to fundamentals.

harha · 2026-04-17T14:15:38 1776435338

Add to that that notion mcp works for the chat but not code. now my workflow has docs I comment with others in notion, while the actual work and source of truth is in GitHub.

Need to fall back to codex to keep things in sync, but that's a great opportunity to also make sure I can compare how things run - and it catches a lot of issues with Claude Code and is great at fixing small/medium issues.

JamesSwift · 2026-04-17T15:50:39 1776441039

Absolutely its dogfooding AI and vibing huge features on the house of cards. Its a fucking mess, and the product design is simultaneously confusing and infuriating. But the product is useful and Im more productive with it than without it now.

thaanpaa · 2026-04-17T10:49:05 1776422945

Well, the fun part is that the algorithms themselves are deterministic. They are just so afraid of model distillation that they force some randomness on top (and now hide thinking). Arguably for coding, you'd probably want temperature=0, and any variation would be dependent on token input alone.

hexaga · 2026-04-17T11:43:33 1776426213

Meh. Temp 0 means throwing away huge swathes of the information painstakingly acquired through training for minimal benefit, if any. Nondeterminism is a red-herring, the model is still going to be an inscrutable black box with mostly unknowable nonlinear transition boundaries w.r.t. inputs, even if you make it perfectly repeatable. It doesn't protect you from tiny changes in inputs having large changes in outputs _with no explanation as to why_. And in the process you've made the model significantly stupider.

As for distillation... sampling from the temp 1 distribution makes it easier.

LogicFailsMe · 2026-04-17T13:35:47 1776432947

Bringing up computational determinism in the early days of AI was absolutely career-limiting. But now, even if the model itself is deterministic for batch size 1, load balancing for MOE routing can make things non-deterministic any larger batch size. Good luck with that guys!

pkilgore · 2026-04-16T22:56:59 1776380219

Seconded. After disabling adaptive thinking and using a default higher thinking, I finally got the quality I'm looking for out of Opus 4.6, and I'm pleased with what I see so far in Opus 4.7.

Whatever their internal evals say about adaptive thinking, they're measuring the wrong thing.

hbbio · 2026-04-16T23:37:36 1776382656

Unless they're measuring capex

JamesSwift · 2026-04-17T00:05:46 1776384346

Its even more maddening for me because my whole team is paying direct API pricing for the privilege of this experience! Just charge me the cost and let me tune this thing, sheesh!

manmal · 2026-04-17T04:35:51 1776400551

Why don’t you switch to codex? The grass is greener here. Do use 5.3-codex though, 5.4 is not for coding, despite what many say.

JamesSwift · 2026-04-17T14:25:02 1776435902

Anthropic in general is miles ahead in “getting work done”, and its not just me on the team. Theres a lot of paper cuts to work through to be truly generic in provider

I did try out codex before claude went to shit and it was good, even uniquely good in some ways, but wasnt good enough to choose it over claude. Absolutely when claude was bad again it would have been better, but thats hindsight that I should have moved over temporarily.

pojzon · 2026-04-17T07:14:29 1776410069

If you get to pay X to YY $$ per each request (because thats the real cost for Anthropic), I strongly believe AI train would suddenly derail.

Currently we are all subsidied by investors money.

How long you can have a business that is only losing money. At some point prices will level up and this will be the end of this escapade.

JamesSwift · 2026-04-17T16:07:45 1776442065

Once local models hit claude code + opus 4.5 levels that is the new normal. That is a good-enough baseline of intelligence to sustain productivity for the next 10 years or more. We are still so close to this line in the sand that theres not a lot of margin for regression in the SOTA models before they become "worse than no AI" for getting real work done day-to-day. But eventually the local models and harnesses will catch up and there will no longer be a need to use the SAAS versions and still reap the benefits of AI in general.

FeepingCreature · 2026-04-17T09:03:09 1776416589

It's very unlikely that API use is subsidized.

jermaustin1 · 2026-04-17T11:41:27 1776426087

I keep hearing both sides of this "debate," but no one is providing any direct evidence other than "I do(n't) think that is true."

FeepingCreature · 2026-04-18T12:33:23 1776515603

Well there can't be direct evidence, it's a private corporation and we don't know how big the model is. But you can look on Openrouter for hosters that offer free models with known sizes, where there's no brand and so no incentive to subsidize, and they don't look wildly bigger than OpenAI/Anthropic API prices.

edit: example: GLM 5.1, a 751B model, is offered for 0.6$/m in, 4.43$/m out. Scuttlebutt (ie. I asked Google's AI) seems to think that Opus 4 is a 1T/5T MoE model, so you can treat it (with some effort) as a 1T model for pricing purposes. Its API pricing is $1.55 in, $25 out, ie. 2x to 5x more than GLM. Idk what to say other than this sounds about right, probably with healthy margin.

echelon · 2026-04-16T23:41:30 1776382890

That's why they put the cute animal in your terminal.

SV_BubbleTime · 2026-04-17T03:40:26 1776397226

Ok, side topic… but that little bastard cheerfully told me out of no where that I have a mall of without a null check AND a free inside a conditional that might not get called.

It didn’t give me a line number or file. I had to go investigate. Finally found what it was talking about.

It was wrong. It took me about 20 minutes start to finish.

Turned it off and will not be turning it back on.

darkwater · 2026-04-17T06:18:05 1776406685

I thought it just emitted tongue-in-cheek comments, not serious analysis. And I use the past tense because I had it enable explicitly and a few days ago it disappeared by itself, didn't touch anything.

c0wb0yc0d3r · 2026-04-17T11:38:46 1776425926

The buddies were Anthropics April fools day stunt. Buddies were removed from a newer version of Claude code. By default Claude code updates automatically.

SV_BubbleTime · 2026-04-17T17:35:02 1776447302

Maybe it was supposed to be tongue in cheek.

But I don’t know, man in my opinion you don’t fucking snicker about a malloc without a null check and only a conditional free that isn’t there.

Go to hell “Sprocket”.

TeMPOraL · 2026-04-17T08:48:50 1776415730

Except for the model weights themselves, they hardly have any!

robertfall · 2026-04-17T08:08:16 1776413296

As far as I understand Opus 4.7 disregards the disable adaptive thinking flag. So if you're seeing it perform well, perhaps their evals are inline?

misja111 · 2026-04-17T12:15:31 1776428131

Is 4.6 without adaptive thinking better than 4.5? Honest question. I switched back to 4.5 because 4.6 seemed mostly to take longer and consume more tokens, without noticeable improvement in the end result.

ai_slop_hater · 2026-04-16T18:26:38 1776363998

This matches my experience as well, "adaptive thinking" chooses to not think when it should.

andai · 2026-04-16T22:14:25 1776377665

I think this might be an unsolved problem. When GPT-5 came out, they had a "router" (classifier?) decide whether to use the thinking model or not.

It was terrible. You could upload 30 pages of financial documents and it would decide "yeah this doesn't require reasoning." They improved it a lot but it still makes mistakes constantly.

I assume something similar is happening in this case.

siva7 · 2026-04-17T08:16:51 1776413811

You're misunderstanding the purpose of "auto"-model-routing or things like "adaptive thinking". It's a solved problem for the companies. It solves their problems. Not yours ;)

ai_slop_hater · 2026-04-17T03:18:17 1776395897

Maybe it is an unsolved problem, but either way I am confused why Anthropic is pushing adaptive thinking so hard, making it the only option on their latest models. To combat how unreliable it is, they set thinking effort to "high" by default in the API. In Claude Code, they now set it to "xhigh" by default. The fact that you cannot even inspect the thinking blocks to try and understand its behavior doesn't help. I know they throw around instructions how to enable thinking blocks, or blocks with thinking summaries, or whatever (I am too confused by now, what it is that they allow us to see), but nothing worked for me so far.

siva7 · 2026-04-17T05:17:37 1776403057

Because with adaptive thinking they control compute, not you

solarkraft · 2026-04-17T01:33:04 1776389584

I find that GPT 5.4 is okay at it. It does think harder for harder problems and still answers quickly for simpler ones, IME.

nomel · 2026-04-17T00:42:15 1776386535

Is knowing how hard a problem is, before doing it, solved in humans?

biglost · 2026-04-17T01:09:58 1776388198

Yes, everyweek when assigning fking points to tasks on jira/s

arthurcolle · 2026-04-17T02:58:59 1776394739

As a unit this is funny, Jira points assigned per second (now possible with parallel tool calling AIs)

Gareth321 · 2026-04-17T07:50:22 1776412222

I don't think so. If the model used to analyse the complexity is dumb, it won't route correctly. They clearly don't want to start every query using the highest level of intelligence as this could undermine their obvious attempt at resource optimisation.

I faced the same issue using Open Router's intelligent routing mechanism. It was terrible, but it had a tendency to prefer the most expensive model. So 98% of all queries ended up being the most expensive model, even for simple queries.

mochomocha · 2026-04-17T03:05:40 1776395140

It makes me think of this parallel: often in combinatorial optimization ,estimating if it is hard to find a solution to a problem costs you as much as solving it.

With a small bounded compute budget, you're going to sometimes make mistakes with your router/thinking switch. Same with speculative decoding, branch predictors etc.

whateveracct · 2026-04-16T17:58:10 1776362290

you're using a proprietary blackbox

JamesSwift · 2026-04-16T18:03:04 1776362584

Sure, but that blackbox was giving me a lot of value last month.

mrandish · 2026-04-16T20:43:23 1776372203

Me too, but it was obviously wildly unsustainable. I was telling friends at xmas to enjoy all the subsidized and free compute funded by VC dollars while they can because it'll be gone soon.

With the fully-loaded cost of even an entry-level 1st year developer over $100k, coding agents are still a good value if they increase that entry-level dev's net usable output by 10%. Even at >$500/mo it's still cheaper than the health care contribution for that employee. And, as of today, even coding-AI-skeptics agree SoTA coding agents can deliver at least 10% greater productivity on average for an entry-level developer (after some adaptation). If we're talking about Jeff Dean/Sanjay Ghemawat-level coders, then opinions vary wildly.

Even if coding agents didn't burn astronomical amounts of scarce compute, it was always clear the leading companies would stop incinerating capital buying market share and start pushing costs up to capture the majority of the value being delivered. As a recently retired guy, vibe-coding was a fun casual hobby for a few months but now that the VC-funded party is winding down, I'll just move on to the next hobby on the stack. As the costs-to-actual-value double and then double again, it'll be interesting to see how many of the $25/mo and free-tier usage converts to >$2500/yr long-term customers. I suspect some CFO's spreadsheets are over-optimistic regarding conversion/retention ARPU as price-to-value escalates.

whateveracct · 2026-04-16T18:22:16 1776363736

so it's also a skinner box

slopinthebag · 2026-04-16T19:01:09 1776366069

Whoops haha. Surely that can't be how black boxes normally work right?

retinaros · 2026-04-16T18:30:10 1776364210

its a drug. that is how it works. they ration it before the new stuff. seeing legends of programming shilling it pains me the most. so far there are a few decent non insane public people talking about it :Mitchel Hashimoto, Jeremy Howard, Casei Muratori. hell even DHH drank the coolaid while most of his interviews in the past years was how he went away from AWS and reduced the bill from 3 million to 1millions by basically loosing 9s, resiliency and availability. but it seems he is fine with loosing what makes his business work(programming) to a company that sells Overpowered stack overflow slot machines.

heurist · 2026-04-16T19:24:29 1776367469

I work with some 'legends of programming' and they're all excited about it. I am too, though I am not a legend. It really is changing the game as a valid new technology, and it's not just a 'slot machine'. Anthropic is burning their goodwill though with their lack of QA or intentional silent degradation.

retinaros · 2026-04-16T19:41:50 1776368510

it is a slot machine. you win a lot if what you do is in the dataset. and yes most of enterprise software is likely in it as it is quite basic CRUD API/WebUI. the winning doesnt change the fact that it is a slot machine and you just need one big loss to end your work.

as long as you introduce plans you introduce a push to optimize for cost vs quality. that is what burnt cursor before CC and Codex. They now will be too. Then one day everything will be remote in OAI and Anthropic server. and there won't be a way to tell what is happening behind. Claude Code is already at this level. Showing stuff like "Improvising..." while hiding COT and adding a bunch of features as quick as they can.

dyauspitr · 2026-04-16T19:26:58 1776367618

The fact that they might gimp it in the future doesn’t mean it does offer very real world value right now. If you’re not using an LLM to code, you’re basically a dinosaur now. You’re forcing yourself to walk while everyone else is in a vehicle, and a good vehicle at that that gets you to your destination in one piece.

retinaros · 2026-04-16T19:36:43 1776368203

as an overpowered stack overflow machine this is quite good and a huge jump. As a prompt to code generator with yolo mode (the one advertised by those companies) it is alternating between good to trash and every single person that works away from the distribution of the SFT dataset can know this. I understand that this dataset is huge tho and I can see the value in it. I just think in the long term it brings more negatives.

If you vibecode CRUD APIs and react/shadcn UIs then I understand it might look amazing.

dyauspitr · 2026-04-16T20:29:10 1776371350

Yes, definitely CRUDs but also iPhone applications, highly performant financial software (its kdb queries are better than 95% of humans), database structure and querying and embedded systems are other things it’s surprisingly good at. When you take all of those into account there’s very little else left.

NobleLie · 2026-04-16T23:24:46 1776381886

The question is, are you getting value from your setups or not?

throwaway9980 · 2026-04-16T18:36:41 1776364601

[flagged]

bloppe · 2026-04-16T18:48:10 1776365290

I think you're loosing your ability to spell

retinaros · 2026-04-16T18:59:50 1776365990

never said he was a looser. just that his take on genAi coding doesnt align with his previous battles for freedom away from Cloud. OAI and Anthropic have a stronger lock in than any cloud infra company.

you got everything to loose by giving your knowledge and job to closedAI and anthropic.

just look at markets like office suite to understand how the end plays.

bloppe · 2026-04-16T19:47:37 1776368857

Is office suite supposed to be an example of lock-in? I haven't used it since middle school. I've worked at 3 companies and, to the best of my knowledge, not a single person at any of them used office suite. That's not to say we use pen and paper. We just use google docs, or notion, or (my personal favorite) just markdown and possibly LaTeX.

I think it's somewhat analogous with models. Sure, you could bind yourself to a bunch of bespoke features, but that's probably a bad idea. Try to make it as easy as possible for yourself to swap out models and even use open-weight models if you ever need to.

You will get locked into the technology in general, though, just not a particular vendor's product.

jibal · 2026-04-16T22:24:35 1776378275

loser

(Didn't you notice being mocked for the spelling error?)

throwaway9980 · 2026-04-16T19:03:02 1776366182

Those jobs are as good as loost already. There's no endgame where knowledge workers keep knowledge working they way they have been knowledge working. Adapt or be a loosing looser forever.

butlike · 2026-04-16T18:56:14 1776365774

And now it isn't. Pray they don't alter the deal any further.

chinathrow · 2026-04-16T18:56:37 1776365797

paying for - so some form of return is expected.

whateveracct · 2026-04-16T19:07:00 1776366420

the issue is the return is amorphous and unstructured

there's no contract. you send a bunch of text in (context etc) and it gives you some freeform text out.

chinathrow · 2026-04-16T19:14:23 1776366863

Sure, but I pay real money both to Antrophic and to JetBrains. I get a shitty in line completion full of random garbage or I get correct predictions. I ask Junie (the JetBrains agent) to do a task and it wanders off in a direction I have no idea why I pay for that.

gowld · 2026-04-16T19:19:37 1776367177

> I have no idea why I pay for that.

And Claude have no idea why it did that.

chinathrow · 2026-04-16T19:23:05 1776367385

Exactly, and we feel vindicated when it works but sold when it fails. Something will have to change.

SyneRyder · 2026-04-16T19:23:30 1776367410

> Sure, but I pay real money both to Antrophic...

I misread that as Atrophic. I hope that doesn't catch on...

iterateoften · 2026-04-16T18:03:50 1776362630

It’s the official communication that sucks. It’s one thing for the product to be a black box if you can trust the company. But time and time again Boris lies and gaslights about what’s broken, a bug or intentional.

CodingJeebus · 2026-04-16T18:28:42 1776364122

> It’s the official communication that sucks. It’s one thing for the product to be a black box if you can trust the company.

A company providing a black box offering is telling you very clearly not to place too much trust in them because it's harder to nail them down when they shift the implementation from under one's feet. It's one of my biggest gripes about frontier models: you have no verifiable way to know how the models you're using change from day to day because they very intentionally do not want you to know that. The black box is a feature for them.

bomewish · 2026-04-16T18:42:50 1776364970

If you cared so bad you could make your own evals.

whateveracct · 2026-04-16T19:08:25 1776366505

so pay anthropic money to maybe detect when the model is on a down week? lol

JamesSwift · 2026-04-15T17:45:05 1776275105

I tend to notice it around 4pm EST

JamesSwift · 2026-04-15T14:06:08 1776261968

Just wrap it with a script that handles the auth for you and the AI doesnt realize auth is even needed. I put my creds in ~/.config/ and write bash wrappers that read those and proxy the values into the api calls as needed.

JamesSwift · 2026-04-10T13:11:02 1775826662

Hard disagree. Apis and clis have been THOROUGHLY documented for human consumption for years and guess what, the models have that context already. Not only of the docs but actual in the wild use. If you can hook up auth for an agent, using any random external service is generally accomplished by just saying “hit the api”.

I wrap all my apis in small bash wrappers that is just curl with automatic session handling so the AI only needs to focus on querying. The only thing in the -h for these scripts is a note that it is a wrapper around curl. I havent had a single issue with AI spinning its wheels trying to understand how to hit the downstream system. No context bloat needed and no reinventing the wheel with MCP when the api already exists

anon84873628 · 2026-04-11T01:28:13 1775870893

By wrapping the API with a script and feeding that inventory to the LLM... You reinvented MCP.

Having service providers implement MCP saves everyone from having to do that work themselves.

Plus there are a lot more uses cases than developers running agents on their own machine.

JamesSwift · 2026-04-11T02:46:49 1775875609

Wrapping here is literally just

```

  #!/usr/bin/env bash

  creds={path to creds}
  basepath={url basepath}

  url={parse from args}

  curl -H "Authorization: #{creds}" "#{basepath}/#{url}" $rest_of_args

```

Just a way to read/set the auth and then calling curl. Its generalizable to nearly all apis out there. It requires no work by the provider and you can shape it however you need.

JamesSwift · 2026-04-06T21:12:03 1775509923

  a9284923-141a-434a-bfbb-52de7329861d
  d48d5a68-82cd-4988-b95c-c8c034003cd0
  5c236e02-16ea-42b1-b935-3a6a768e3655
  22e09356-08ce-4b2c-a8fd-596d818b1e8a
  4cb894f7-c3ed-4b8d-86c6-0242200ea333

Amusingly (not really), this is me trying to get sessions to resume to then get feedback ids and it being an absolute chore to get it to give me the commands to resume these conversations but it keeps messing things up: cf764035-0a1d-4c3f-811d-d70e5b1feeef

bcherny · 2026-04-06T23:02:57 1775516577

Thanks for the feedback IDs — read all 5 transcripts.

On the model behavior: your sessions were sending effort=high on every request (confirmed in telemetry), so this isn't the effort default. The data points at adaptive thinking under-allocating reasoning on certain turns — the specific turns where it fabricated (stripe API version, git SHA suffix, apt package list) had zero reasoning emitted, while the turns with deep reasoning were correct. we're investigating with the model team. interim workaround: CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 forces a fixed reasoning budget instead of letting the model decide per-turn.

nayroclade · 2026-04-07T09:18:04 1775553484

Hey bcherny, I'm confused as to what's happening here. The linked issue was closed, with you seeming to imply there's no actual problem, people are just misunderstanding the hidden reasoning summaries and the change to the default effort level.

But here you seem to be saying there is a bug, with adaptive reasoning under-allocating. Is this a separate issue from the linked one? If not, wouldn't it help to respond to the linked issue acknowledging a model issue and telling people to disable adaptive reasoning for now? Not everyone is going to be reading comments on HN.

unsupp0rted · 2026-04-07T10:25:48 1775557548

It's better PR to close issues and tell users they're holding it wrong, and meanwhile quietly fix the issue in the background. Also possibly safer for legal reasons.

liamsfr · 2026-04-12T02:02:17 1775959337

Isn’t that what they just did here? Close Stella’s Issue, cross post to hn, then completely sidestep an observation users are making, and attack the analyst of transcripts with a straw man attack blaming… thinking summaries….

kenmacd · 2026-04-07T13:28:47 1775568527

There's a 5 hour difference between the replies, and new data that came in, so the posts aren't really in conflict.

Also it doesn't sound like they know "there's a model issue", so opening it now would be premature. Maybe they just read it wrong, do better to let a few others verify first, then reopen.

diavelguru · 2026-04-07T01:04:32 1775523872

Love this. Responding to users. Detail info investigating. Action being taken (at least it seems so).

gilrain · 2026-04-07T12:21:38 1775564498

And all hidden in the comments of a niche forum, while the actual issue is closed and whitewashed? You got played.

jojobas · 2026-04-07T07:24:24 1775546664

Surely you realize it's AI responding? (not sure if /s)

allisdust · 2026-04-07T11:35:30 1775561730

I cannot provide the session ids but I have tried the above flag and can confirm this makes a huge amount of difference. You should treat this as bug and make this as the default behavior. Clearly the adaptive thinking is making the model plain stupid and useless. It is time you guys take this seriously and stop messing with the performance with every damn release.

tomaskafka · 2026-04-08T06:24:19 1775629459

My guess is there isn't enough hardware, so Anthropic is trying to limit how much soup the buffet serve, did I guess right? And I would absolutely bet the enterprise accounts with millions in spend get priority, while the retail will be first to get throttled.

JamesSwift · 2026-04-07T14:33:54 1775572434

Just set that flag and already getting similar poor results. new one: 93b9f545-716c-4335-b216-bf0c758dff7c

JamesSwift · 2026-04-07T19:42:52 1775590972

And another where claude gets into a long cycle of "wait thats not right.. hold on... actually..." correcting itself in train of thought. It found the answer eventually but wasted a lot of cycles getting there (reporting because this is a regression in my experience vs a couple weeks ago): 28e1a9a2-b88c-4a8d-880f-92db0e46ffe8

JamesSwift · 2026-04-08T16:05:06 1775664306

Another 1395b7d6-f2f1-4e24-a815-73852bcdeed2

It fails to answer my initial question and tells me what I need to do to check. Then it hallucinates the answer based on not researching anything, then it incorrectly comes to a conclusion that is inaccurate, and only when I further prompt it does it finally reach a (maybe) correct answer.

I havent submitted a few more, but I think its safe to say that disabling adaptive thinking isnt the answer here

onoesworkacct · 2026-04-07T02:09:07 1775527747

This kind of thing is harder for regular end-users to understand following the change removing reasoning details.

mangatmodi · 2026-04-07T08:13:18 1775549598

I am curious. Are you able to see our session text based on the session ID? That was big no in some of the tier-1 places I worked. No employee could see user texts.

rkangel · 2026-04-07T11:00:01 1775559601

IIRC for Enterprise, using /feedback or /bug is an exception to the "we promise not to use your data" agreement.

gilrain · 2026-04-07T12:19:21 1775564361

> The data points at adaptive thinking under-allocating reasoning on certain turns

Will you reopen the issue you incorrectly closed, then…? Or are you just playacting concern?

alexchen_dev · 2026-04-07T02:20:21 1775528421

[flagged]

pcjones1 · 2026-04-07T13:40:15 1775569215

Have you set effort to high or max?

ghusbands · 2026-04-07T14:55:27 1775573727

Even with high effort, the adaptive thinking can just choose no thinking. See bcherny's post they were replying to: https://news.ycombinator.com/item?id=47668520

pcjones1 · 2026-04-08T09:13:12 1775639592

Yeah I know but you can disable it as we saw

JamesSwift · 2026-04-06T19:49:29 1775504969

I commented on the GH issue, but Ive had effort set to 'high' for however long its been available and had a marked decline since... checks notes... about 23 March according to slack messages I sent to the team to see if I was alone (I wasnt).

EDIT: actually the first glaring issue I remember was on 20 March where it hallucinated a full sha from a short sha while updating my github actions version pinning. That follows a pattern of it making really egregious assumptions about things without first validating or checking. Ive also had it answer with hallucinated information instead of looking online first (to a higher degree than Ive been used to after using these models daily for the past ~6 months)

dev_l1x_be · 2026-04-06T20:25:48 1775507148

It hallucinated a GUID for me instead of using the one in the RFC for webscokets. Fun part was that the beginning was the same. Then it hardcoded the unit tests to be green with the wrong GUID.

viktorianer · 2026-04-07T13:36:16 1775568976

[flagged]

JamesSwift · 2026-04-07T14:06:33 1775570793

Well Ive never had the issue before and have hit that / similar issues every few days over the past couple weeks.

JamesSwift · 2026-04-06T19:30:06 1775503806

Opus 4.6 was definitely a mixed bag for me. Overall Id probably prefer 4.5 but only just barely and I stay on 4.6 just for the "default" nature of it. But if 4.5 is unchanged vs what Ive had on 4.6 lately then 100% I would move back to it. Ill have to test that

samtheprogram · 2026-04-06T20:57:15 1775509035

Same, I keep using 4.6 to get "used to it" but I find myself wanting semi-regularly.

JamesSwift · 2026-04-06T18:55:28 1775501728

Exact same timeline as me and my team. Its been maddening. Im a big believer in AI since late last year, but that is only because the models got so good. This puts us dangerously close to before that threshold was crossed so now Im having to do _way_ more work than before

JamesSwift · 2026-04-06T18:42:43 1775500963

Multiple people on our team independently have noticed a _significant_ drop in quality and intelligence on opus 4.6 the past few weeks. Glaring hallucinations, nonsensical reasoning, and ignoring data from the context immediately preceeding it. Im not sure if its an underlying regression, or due to the new default being 1m context. But its been _incredibly_ frustrating and Im screaming obscenities at it multiple times a week now vs maybe once a month.

JamesSwift · 2026-04-01T15:35:27 1775057727

Well, technically bun doesnt _prevent_ hooks. It just requires opting into them. And even that also includes a default set of pre-whitelisted packages. A much better system, but not perfect.

And actually just looking this up, it appears claude-code itself was just added to that whitelist : D

https://github.com/oven-sh/bun/commit/5c59842f78880a8b5d9c2e...