I’ve never been one to complain about new models, and also didn’t experience most of the issues folks were citing about Claude Code over the last couple months. I’ve been using it since release, happy with almost each new update.
Until Opus 4.7 - this is the first time I rolled back to a previous model.
Personality-wise it’s the worst of AI, “it’s not x, it’s y”, strong short sentences, in general a bulshitty vibe, also gaslighting me that it fixed something even though it didn’t actually check.
I’m not sure what’s up, maybe it’s tuned for harnesses like Claude Design (which is great btw) where there’s an independent judge to check it, but for now, Opus 4.6 it is.
I noticed the difference, but coming from Gemini and xAI models it wasn’t that glaring. I still find that Opus makes much better plans than anything else I’ve tried, and it’s been very good at catching my mistakes in using public-key cryptography, also finding out why my crsqlite queries were failing despite no official documentation on the topic.
I’d never use such an expensive model for coding, so that might explain why I have little to complain about.
I've been using it with `/effort max` all the time, and it's been working better than ever.
I think here's part of the problem, it's hard to measure this, and you also don't know in which AB test cohorts you may currently be and how they are affecting results.
Agree. I keep effort max on Claude and xhigh on GPT for all tasks and keep tasks as scoped units of work instead of boil the ocean type prompts. It is hard to measure but ultimately the tasks are getting completed and I'm validating so I consider it "working as expected".
It works better, until you run out of tokens. Running out of tokens is something that used to never happen to me, but this month now regularly happens.
Maybe I could avoid running out of tokens by turning off 1M tokens and max effort, but that's a cure worse than the disease IMO.
I would risk a guess that people have a wrong intuition about the long-context pricing and are complaining because of that.
Yeah, the per-token price stays the same, even with large context. But that still means that you're spending 4x more cache-read tokens in a 400k context conversation, on each turn, than you would be in a 100k context conversation.
We're a VC-funded startup (recently raised $51M Series C) building an infrastructure orchestrator and collaborative management platform for Infrastructure-as-Code – from OpenTofu, Terraform, Terragrunt, CloudFormation, Pulumi, Kubernetes, to Ansible.
On the backend we're using 100% Go with AWS primitives. We're looking for backend developers who like doing DevOps'y stuff sometimes (because in a way it's the spirit of our company), or have experience with the cloud native ecosystem. Ideally you'd have experience working with an IaC tool, i.e. Terraform, Pulumi, Ansible, CloudFormation, Kubernetes, or SaltStack.
Overall we have a deeply technical product, trying to build something customers love to use, and have a lot of happy and satisfied customers. We promise interesting work, the ability to open source parts of the project which don't give us a business advantage, as well as healthy working hours.
Hey! I believe I've been posting this (not verbatim, it's been evolving) almost every month since 2020 - we've been actively hiring since, and we are still hiring.
Honestly, for now they seem to be buying companies built around Open Source projects which otherwise didn't really have a good story to pay for their development long-term anyway. And it seems like the primary reason is just expertise and tooling for building their CLI tools.
As long as they keep the original projects maintained and those aren't just acqui-hires, I think this is almost as good as we can hope for.
Once you’re acquired you have to do what the boss says. That means prioritizing your work to benefit the company. That is often not compatible with true open source.
How frequently do acquired projects seriously maintain their independence? That is rare. They may have more resources but they also have obligations.
And this doesn’t even touch on the whole commodification and box out strategy that so many tech giants have employed.
> Once you’re acquired you have to do what the boss says.
Or quit, and take the (Open Source) project and community with you. Companies sometimes discover this the hard way; see, for instance, the story of how Hudson became Jenkins.
A reasonably cool part about this approach (duplicating the commits, though I suppose you could just add your own bookmark on the existing commit, too) is that you can easily diff the current pr state with what you last reviewed, even across rebases, squashes, fixups, etc. Will have to give that a go.
Unfortunately GitHub still doesn't make that easy, and branch `push --force`'s make it really hard to see what changed, would be amazing if they ever fixed that.
In general, I think with the rise of agentic coding, and more review work, I hope we see some innovation in the "code review tooling" space. Not AI reviewers (that's useful too but already works well enough)! I want tools that help the human review code faster, more effectively, and in a more pleasant way.
Of course can't end the comment without the obligatory "jj is great, big recommend, am not affiliated, check out the blog post I wrote a year ago for getting started with it[0]", ha! I'm still very happy with it, no going back.
> I hope we see some innovation in the "code review tooling" space.
There's some things happening, for sure, but GitHub working towards finally supporting Stacked Diffs will, I hope, accelerate the general demand for better things.
Definitely won't use it for prod ofc but may try it out for a side-project.
It seems that this is more or less:
- instead of modules, write specs for your modules
- on the first go it generates the code (which you review)
- later, diffs in the spec are translated into diffs in the code (the code is *not* fully regenerated)
this actually sounds pretty usable, esp. if someone likes writing. And wherever you want to dive deep, you can delve down into the code and do "microoptimizations" by rolling something on your own (with what seems to be called here "mixed projects").
That said, not sure if I need a separate tool for this, tbh. Instead of just having markdown files and telling cause to see the md diff and adjust the code accordingly.
We're a VC-funded startup (recently raised $51M Series C) building an infrastructure orchestrator and collaborative management platform for Infrastructure-as-Code – from OpenTofu, Terraform, Terragrunt, CloudFormation, Pulumi, Kubernetes, to Ansible.
On the backend we're using 100% Go with AWS primitives. We're looking for backend developers who like doing DevOps'y stuff sometimes (because in a way it's the spirit of our company), or have experience with the cloud native ecosystem. Ideally you'd have experience working with an IaC tool, i.e. Terraform, Pulumi, Ansible, CloudFormation, Kubernetes, or SaltStack.
Overall we have a deeply technical product, trying to build something customers love to use, and have a lot of happy and satisfied customers. We promise interesting work, the ability to open source parts of the project which don't give us a business advantage, as well as healthy working hours.
This seems to agree with my own previous tests of Sonnet vs Opus (not on this version). If I give them a task with a large list of constraints ("do this, don't do this, make sure of this"), like 20-40, Sonnet will forget half of it, while Opus correctly applies all directives.
My intuition is this is just related to model size / its "working memory", and will likely neither be fixed by training Sonnet with Opus nor by steadily optimizing its agentic capabilities.
I'd agree that this effect is probably mainly due to architectural parameters such as the number and dimensions of heads, and hidden dimension. But not so much the model size (number of parameters) or less training.
Saw something about Sonnet 4.6 having had a greatly increased amount of RL training over 4.5.
reply