

What I find fascinating is that there is so little substance in this article about the quality of produced code and the medium. Is the code documented and tested? Is it understandable and extendable? Is it secure? What language, framework, database was used? Author mentions judgement and taste - well, is the code tasteful? Will the model rearchitect the entire thing if I ask it to add new functionality, spending another 9.5h in tokens? I assume that the research part is domain knowledge = how different types of travel translate to time making it presentable; how did the author verify this?

These questions are not even about AI: if I were to give money to a human agency and were given something they tell me works, I would ask the same questions. If I did not know how to evaluate, I would hire people that do. With LLMs the verification part is what bothers me the most.


These posts are never written by software engineers, it’s always some tech exec, retired engineer, or VC. This author is apparently a professor at the Wharton School of Management? None of these people have to ship or maintain real products, they’re just making side projects.

The only decent software engineering perspective I’ve seen has been from Mitchell Hashimoto.


Relevant quote:

> I am sure it is not perfect (I only spent an hour working with the results), but a software engineer would iron out the remaining potential bugs that I could not find quickly [...]

People have said things like this many times in the past, and, in the past (perhaps not now), it's always been a misunderstanding of what is good and bad, what's difficult and easy.

For example, someone would draw a UI in a GUI painter that generates code (or a resource file), and a manager would see it and think the majority of the work towards the product is done. (Incidentally, then there seemed to be a reaction, towards making your UI mockups look abstract or otherwise different from runnable code, helping the nontechnical to understand that this isn't 90% of the finished product.)

Or a student intern hacks out a homework-grade demo, and a manager who understands neither software engineering nor product domain says "we just need some engineers to polish it up for production", and thinks the student is a star and why can't their engineers be as brilliant and productive. (I might have once been that energetic intern, who was happy for the encouragement, but then learned more, and saw it was a thing.)

This common misunderstanding was sometimes self-correcting -- when trying to ship became a disaster of misery and regretted-attrition, or the product was poorly received by the market because it wasn't thought through nor implemented well, or building subsequent functionality atop it was a nightmare. (But adverse effects of bad approaches are one of the reasons for management and ICs to job-hop, before the unwanted effects affect them personally.)

What might be different now is that some of these AI tools are outputting better-engineered work than some software engineers, and much faster.

At the back of my mind, I'm wondering how the really great software engineers will continue to stand out, as the discipline is being devalued in the minds of most leadership, and anyone can prompt an AI to generate something that superficially appears to them like what they assume a great software engineer would produce. (Even if the great engineer would do much better quality of implementation, have innovative ideas that ML from open source code would not, and maybe arrive at better product concepts as they worked through the problems.)


Well that’s kind of the point.

They can just summon bespoke software out of the ether that only handles the use cases of themselves and a few of their collaborators.

Making “side projects” was not possible for non-developers before powerful LLMs. Now it is.


I don’t think that’s true, I think these authors are making a much stronger claim that AI is proficient or even an expert at software engineering. This author describes how complex and sophisticated their software is, and the only value he’ll concede to “coders” is that there might be a few bugs they’d need to fix.

Imagine not being an architect and using Claude to put together a building plan, then concluding it’s basically done but we might need a real architect to double check the measurements. It may even be true but I’d be skeptical if it’s always non-architects saying this.


And - we kind of have been here before. The "proto"-type is almost complete. It's just a little slow, a little spaghettificated, just written in excel-vb, clicked together in node-graphs, or the next hot thing that makes coding unnecessary.

Why do they even need coders to fix these bugs? It would be an order of magnitude (at least) to ask Claude to find and fix them, and it will likely be successful.

Building in the physical world has physical and time constraints that cannot be overcome, which is one of the reasons architecture (and engineering) are so important in this domain. In software development these constraints were only inherent when people were writing the majority of the software. I feel like I’m seeing what I thought were fundamental constraints being eroded by the increasing speed and correctness of these tools and it’s making me reconsider the importance of some of the values that are held by software engineering.

It’s obviously dependent on the domain and solution, but if your software can be extremely rapidly rearranged, bugs found and fixed with little effort, and features added with only a minimum prompt, I think the entire definition of technical debt has changed. I’ve been sceptical of these tools and still approach their output with caution. I also worry that, as a software developer, if more can be accomplished in less time there will be less room on this planet for software developers.


It's quick to build a hut in a green field, but slow to remodel the expanded building after. I think that will remain true regardless of if a team of sw developers are doing it, or an AI with a product manager or somewhere in between.

> I think the entire definition of technical debt has changed. I’ve been sceptical of these tools and still approach their output with caution.

This very well summarizes my current thinking on the subject as well. And most of my career has been playing the role of technical debt nazi. Much to the detriment of my earning potential.

Does AI make incredibly inefficient code most of the time? Yup. But it does it at lightspeed with minimal effort.

I think many software engineers forget they exist to get real things done (in many cases at least) and they are a cost center for most businesses. If your end product is not selling software, very few people actually Doing the Thing(tm) will give a single solitary care about code quality or maintainability when they can just spend 30 minutes and $15 worth of tokens to fix it.

It won't take over everything, but I've already seen otherwise very intelligent go-getter type folks who are not technical and don't know how to code make extremely useful things for themselves and their small little enterprises. And this will seemingly only get better and more efficient.

For someone who really does love the idea of well architected and future-proof code this is just icky to even say or consider. But I'm coming around to the idea that this is the future for the majority of software for most places. And it may have the ability to seriously even the playing field for small enterprises in some industries.

I'm currently using it to implement a zillion side projects at home I've been "meaning to get to" for years. It makes incredibly silly unmaintainable code most of the time - but I learned to not care, and just tell the AI bot to fix it/add to it as I go along. Worst-case I spend a single night deleting it all and starting from zero to "refactor" an entire thing.


> I think many software engineers forget they exist to get real things done (in many cases at least) and they are a cost center for most businesses. If your end product is not selling software, very few people actually Doing the Thing(tm) will give a single solitary care about code quality or maintainability when they can just spend 30 minutes and $15 worth of tokens to fix it.

I am surprised to hear people so naive they expect their token usage to stay flat if code quality and maintainability start falling exponentially?

What if to fix 2 bugs your LLM starts adding 50 new ones? Will you tell your customers in the support channel "sorry software is finished, if we try fixing anything, everything else might break, not worth it". Or "we can probably fix it, but our AI usage will rise so much we need to up the subscription 3 fold, you choose".

The speed at which LLMs code is only comparable to the speed at which they add garbage to your repo. If you stop caring about maintainability, you also stop caring about your AI/LLM related bills and the viability of your project past the PoC stage.


The GP explicitly mentioned "end product is not selling software". But even then, bugfixes introducing new bugs are not unheard of before. Most code used to be mediocre quality so there's not a sea change with AI. Perhaps it even becomes better on average.

Another thing though is selling software in the first place will soon become a tough proposition outside of a few niches.


> I am surprised to hear people so naive they expect their token usage to stay flat if code quality and maintainability start falling exponentially?

There's no reason to think that quality and maintainability will start falling exponentially. On the contrary, these models get better every couple months, and 99% of software isn't actually that complicated. There's just no reason for the fear-mongering that fixing 2 bugs will cause the LLM to add 50 new ones.


Except that I witness it create new bugs while fixing existing ones?

Not 50:1 but it does happen


> I think many software engineers forget they exist to get real things done

One billion percent. I think the vast majority of the anti-AI sentiments I hear from software engineers comes down to them caring more about playing with their tools than actually solving the problem.


> Does AI make incredibly inefficient code most of the time? Yup. But it does it at lightspeed with minimal effort.

This hits the nail on the head.

Detractors often hang on to examples of coding assistants making mistakes or outputting subpar code, but they somehow miss the fact that coding assistants can also be prompted again and refactor whole swaths of code just as fast as they introduce oopsies. This means that the worst case scenario implies fast convergence to an acceptable outcome, and from there also fast iteration to improve upon that.


The problem is that this approach is not sustainable. Errors compound. The cost to fix one issue might seem small at first, but over a stretch of time all these "oopsies" become architectural spaghetti that can only be fixed with a complete rewrite, which will certainly become more expensive than getting the code "organically" developed.

The only way I see AI coding working in the long run is if we go back to a Waterfall/BDUF process and have actual engineering. Let engineers really own the architecture. Require that any new feature - no matter how small - be specced out with complete sequence diagrams. Ensure that every new software package is put on a UML component diagram for the team to review and see how each addition interacts with the whole system, etc.

If we do that, then we can just give all the documents to a coding agent and say "go ahead and implement this" with a minimal amount of confidence. But in doing this, I bet we will realize the following:

    - the "effort" has never been about writing code itself. The code is just the material manifestation of all the thought that went into working out a solution to the problems that the product is attempting to solve.

    - we will likely be better off by using code generation tools (i.e., UML-to-code) and a "weak" LLM (that can run locally) than by playing the token lottery at the Anthropic Casino.

I mirror your thoughts. I think we'll end up with "perfect map" paradox = you cannot be vague or indecisive on what you want (and if you are then these decisions don't matter) and you're creating a 1:1 representation of what the code needs to be.

I'd substitute "owner" for the team and in that sense the owner will not need to be human.

We're at this state where Claude is great at doing the "middle" part of work, but it's crap at gathering requirements and verification of what it has done. I also don't see people caring about these aspects of software development as shown in the article


> The problem is that this approach is not sustainable. Errors compound. The cost to fix one issue might seem small at first, but over a stretch of time all these "oopsies" become architectural spaghetti that can only be fixed with a complete rewrite, which will certainly become more expensive than getting the code "organically" developed.

That's so far been called software development.

All software developed by people suffers from this issue.

Where exactly is the novelty?

> The only way I see AI coding working in the long run is if we go back to a Waterfall/BDUF process and having actual engineering.

Nonsense. The problem is exactly the same.

With agents iterations are much faster, and this can mean things can get messier faster but can get in shape just as fast.

Ironically, agents improve the quality of the deliverable as well. Approaches such as spec-driven development do a far better job delivering features up to spec than manual coding by flesh and blood developers.

There's an awful lot of baseless scaremongering in your post. You make it sound like with AI assisted coding developers stopped paying any attention to quality.


> Where exactly is the novelty?

The compounding speed. Your devs might reach a point where they have to rewrite and refactor, in a decade.

Your LLM, with its higher throughput, may put you in that game breaking situation next week.


> All software developed by people suffers from this issue.

And that’s pretty much where you are wrong. Take any long running open source project and you can see the craftsmanship that goes into it. It may not be perfect, but hacks are clearly marked as such.


I haven’t used Fable/Mythos yet, but my experience with recent versions of Opus, GPT 5.5 and recent Chinese models is that prompting again isn’t guaranteed to fix the underlying issues, nor is it guaranteed to not introduce more issues. I’ve seen SOTA models make ridiculously stupid architectural decisions that they were then unable to back out of without being prompted very specifically, instead adding a patchwork of “fixes” on top.

I’m not saying that you can’t use AI to do it because I believe that with carefully controlled workflows and context management you can, but it’s not a simple prompt away, it requires guidance and understanding, and isn’t the speed demon that raw prompting is.


> I haven’t used Fable/Mythos yet, but my experience with recent versions of Opus, GPT 5.5 and recent Chinese models is that prompting again isn’t guaranteed to fix the underlying issues, nor is it guaranteed to not introduce more issues.

That's not really the point though. That presumes models are only useful if they are one-shot models. That is false.

I mean, what if your prompt successfully changes 20 source files and makes a mess in one? How much work did it save?

And the elephant in the room is when models actually outperform whatever the prompter is able to deliver, and faster. That is somehow left out.


> That presumes models are only useful if they are one-shot models

That’s not at all what I’m saying.

I’m saying that in my experience across multiple models, the follow up prompts don’t fix prior underlying issues. They usually patch on top instead, unless you give them significant and time consuming guidance.

I want them to be more useful outside of one-shot uses, but I find that they currently miss the mark.


> I’m saying that in my experience across multiple models, the follow up prompts don’t fix prior underlying issues. They usually patch on top instead, unless you give them significant and time consuming guidance.

That's not my experience at all, and I have been using models that are far from being cutting edge. Even in the cases where a model generates utter nonsense, a couple of clarifying questions is all it takes to get it back on track.

But that might be a factor of the project being worked on, and the extent of the changes being asked.


I think this is overlooking the fact that assigning a coding assistant to fix the bugs it re-introduces for all eternity just leads to spiraling token costs, which might cost more than just hiring a competent engineer in the first place.

Maybe. We will see.

I think computers are incredibly cheap compared to humans. These models and infrastructure to run them are going to only get more efficient in time. Right now we are still using (for the most part) entire hardware architectures mostly shoehorned from one purpose (graphics) into another. As purpose-built hardware becomes more prevalent and the SOTA starts to slow down I can't imagine a $100k hardware box not being able to handle a small team of developers' needs for many things.

I do think there will be a place for the top 20% of software engineers forever. But most people are not in that top 20%, and the quality when you get below average is not a linear progression. It will not be that difficult for AI generated code to beat the "bottom end" of the industry since tbh it's hard for me to tell the difference between LLM generated code and some of the shit I've seen over the years. I've run across code written by folks who don't know what an array is more than once.

Most software is not built by MIT and Stanford grads making $500k/yr in the Valley. It's built by work-a-day programmers in the middle of nowhere making $80k/yr to keep some niche small business going with hyper-specific software that was first designed for Windows 95. Or stuff like making horribly designed Wordpress plugins. Or Shopify integrations. etc. etc.

I've also seen these small businesses totally held back by incompetent programmers, and despite their best efforts and huge amounts (for them!) of investment they can never seem to fix it. These types of enterprises are having AI run circles around their current engineering practices, even if it would make most FAANG engineers gasp in horror.

Either way it will certainly be interesting to watch! I just wish I was closer to retirement.


This has been a debate forever, long before LLMs. On the one hand you have people who don't care, on the other you have people who produce good code.

Doesn't matter how fast you can make the wrong thing.


Don't forget that you can adjust your requirements (either via plan or skill) to ensure the mistakes do not happen. The problem is that neither LLMs, nor humans (that don't work with the domain) will know they made these mistakes. Even coders don't think about everything all the time

> Don't forget that you can adjust your requirements (either via plan or skill) to ensure the mistakes do not happen.

No, you can't. Adjusting prompts ensures absolutely nothing.


I disagree. What I should have added is that with agents (as well as humans) you do need to have tests that verify what was done.

That assumes you can write automated tests that reliably identify the mistakes over an entire codebase. Nice idea in theory. If it were actually possible, we would long since have generalized libraries of tests to catch every significant security and performance gotcha. What we have are static code analysis tools, fuzzers, etc. None of which have come close to eliminating security and performance problems. I don't see how AI somehow changes that.

Ah, I see what you mean now. Yes, my mind went straight to static analysis and testing (unit, feature, uat, mutation). Thanks for expanding on your point!

In my experience, the refactors are just as bad, just in different ways. All you end up doing is treading water with different iterations of shitty code. By the time you get somewhere acceptable, you could've just fixed it up yourself.

My preferred workflow these days is to pair program with an LLM until it gets close-ish and then manually touch it up. Without that, it just produces junk in different forms.


Technical debt remains the same. LLMs are found not to work as well when editing messy codebases - exactly the kind you get after using an LLM for a while. After a few weeks or months you have to either throw it away and start over, or involve a human at exorbitant prices.

> I think these authors are making a much stronger claim that AI is proficient or even an expert at software engineering.

The author specifically says:

> I am sure it is not perfect (I only spent an hour working with the results), but a software engineer would iron out the remaining potential bugs that I could not find quickly (which is one reason we may need more, not less, coders in the future, to help with the explosion of new uses for software)

which acknowledges pretty clearly that engineers bring a level of insight and experience still missing from Mythos. Saying that, I totally disagree with his contention that this will always be true. It's pretty weird that the author of an article stressing the steep improvements in a model's capability can't seem to imagine further improvements in that capability. As if Mythos is where development ends or whatever gap remains between models and experts won't steadily narrow or eventually widen in reverse.


Well, right, but if the real use case for LLMs is "making software that wasn't economical to make before" that's bearish for the labs because it means they're only going to be chasing the low end of the market.

It is, and it's cool that it is, but the calibration is important. Statements like this:

> With Fable the spell has gotten powerful enough that I am no longer sure I am the wizard. I am closer to a patron. I describe what I want, I pay for it, and I judge the result. The conjuring happens somewhere I cannot watch, in hundreds of small choices I never get a vote on. The work has shifted from process to outcome. I no longer steer; I commission.

have a very different meaning coming from a non-technical researcher than they would from someone who builds software for a living.


Making side projects isn't a trillion dollar industry tho, adding to the fact that we are facing another global supply chain crisis due to the Iran War; the US is about to commit the biggest self-own ever in the history of empire.

There are actually quite a few trillion dollar industries that exist thanks to "side projects".

Apple was Woz's side project, once upon a time. Adsense came from Google's 20% time. Social media started as a side project.

Forests grow from trees. Trees grow from seeds. More potential seeds = more potential forests.


All the undiscovered Woz's of the world add up to a trillion dollars? There's $1T of money out there waiting to be spent on side projects?

The question was "are side projects a trillion dollar industry" not "has a side project ever started an industry"

How much of a new $1T software product will Anthropic capture in token costs, anyway?


The US has been on a course of self-owns ever since Trump got into office. That they still are a dominant power on the globe shows how much they were one before Trump, but it seems to be changing. At every self-own they commit, China laughs and inches up a little closer. I think we will see the day, when they are evenly matched in our lifetimes.

But which self-own exactly do you mean, of the many there are?


I’m starting to realize that LLMs are really good at building low-stakes projects. Your questions mostly presume that the stakes are higher. The software will last a long time; the requirements will evolve; we can’t tolerate mistakes; etc.

The trick to getting good at using LLMs for software is to learn how to make _all_ projects low-stakes.


You don't need an LLM for that. You make _all_ projects low-stakes by working on a green field project using (insert buzzword soup of the day) and leaving for a new green field opportunity (that requires experience with buzzword soup of the day) before the project ships.

No, what you’re describing still requires you to do some actual work, and also, while you work there, there is still some level of accountability. A much, much better grift is coaching.

Like, an AI coaching session for executives at the yearly executive retreat. You show up, spend a few hours going through some nonsense slides ChatGPT put together for you, you charge an eye watering fee for it, HR or whoever organizes it will gladly pay for it because it will make them look all cutting edge in front of the CEO, by the next day everyone will forget about it. No accountability at all!


In the LLM world you never get a chance to get paid to work on those greenfield projects because the person with the idea is churning the prototyping and discovery work themselves.

If you want to get paid to work on software, you get involved after it's found success and the stakes get higher.

(Which assumes there are still significant areas where economies of scale reward that vs everybody just having their own DIY version of everything.)


Or economies of liability and buck passing. I suspect managers and businesses will still want to be in the game of "not my fault, supplier is working on it, we can sue them if they don't meet SLA".

You've got to be the person with the idea. I'm currently doing that. I spent the past year working on a frustrating project where everybody else did everything wrong, so now I'm building it on my own, hoping to sell it to them. (No idea if that will work)

If there's a viable way to make all projects low-stakes we'd have done it. Consider this: microservices.

This is really insightful, but I think it also extends to making the project either low stakes or low complexity. I have this lurking feeling that the preferable architecture for software will change as a result of LLMs because they're good at working on low complexity modular components more than they are on high complexity million-line code bases.

You'll just shift complexity to the orchestration of the modular components.

Monoliths vs micro-services.


They aren't necessarily as great at building low-complexity high-modularity components, though. ;)

Unless you know enough to tell them to! And keep them honest about it...


But not all projects can be low stakes. None of the important ones are.

> The trick to getting good at using LLMs for software is to learn how to make _all_ projects low-stakes.

this doesn't really work in the real world. There are many things that actually matter, engineering is fundamentally about handling them.


Welcome to every LLM discussion in the past 2 years or so. When asked for anything of substance, we're faced with a barrage of "but humans aren't good at this too!" Very little quantifiable evidence and lots of pure rhetoric.

I’ve seen this pattern again and again, and I don’t bother replying. There’s also the “strong statement, and when you contradict it, they point out some particular circumstances that no one cares about”.

I think a lot of us have stopped talking to each other about this. I see it the other way round to you. I see constant scepticism and doubt that LLMs can build anything useful, and whenever provided with examples, the goalposts just move.

And at my own firm, I think every developer is generating most of their code using agentic coding. We're still sceptical enough that we are doing the usual heavy handed human review process, so we're not seeing a huge speed up in delivery times, but we are seeing a volume increase. That is because writing the changes and raising the PRs is much faster, but also a lot of boring admin and support work is now mostly done by LLMs. Reports of instability, vague client requests, etc? Throw the LLM at them and it usually figures it out while I continue to engineer.

So I know, first hand, that these things are very good. I also know second and third hand that pretty much every fintech in the industry is as heavily using agentic coding as we are.

And then I come to HN or reddit and I see people telling us that they cannot write decent production code, and this is just wrong. This isn't opinion wrong, it is objectively wrong. Any fintech that wants to keep up will tell you this.

I can't speak for other industries but I can't imagine they're different.

So, I'm not sure what to conclude from this. I don't want to be uncharitable, but when HN/reddit posts just don't match the reality I see for myself, I have no choice but to categorise them as being emotionally driven to stick to a particular narrative, and so I can dismiss them.


It is all the same narratives from around the invention of the power loom if you look into it.

What I take from that time also is that the hand loom weavers were not incorrect. The power loom did not do as good of a job as they did by hand.

You can still buy a hand woven shirt today at a premium price.

There is a category error as if quality is the product as opposed to one input of the product.

You probably don't get to be a master craftsman without that quality mindset so they aren't wrong but missing the forest for the trees.


I use Claude Code at a fintech, and I'm seeing garbage PRs from careless coworkers all the time. I'm having to correct Claude output regularly.

Yes, it does nearly all the typing for me now. But left to its own devices, it'll happily spit out awful code.


> I see constant scepticism and doubt that LLMs can build anything useful, and whenever provided with examples, the goalposts just move.

> I see people telling us that they cannot write decent production code, and this is just wrong.

At least for me, that has never been the counterpoint that I’ve been making. I’ve never cared about code itself, especially with languages like Java and Kotlin, where you basically autocomplete most of the code, and with SDKs like iOS where you can collect snippets for most of the patterns that you need. And with frameworks like Laravel, where most big additions are done with the tooling. And because code is so repetitive, editors like Emacs and Vim have lots of features and plugins to help with copying and pasting (registers, macros, navigation, snippets,…)

And the fact is some code you wrote today will be worthless tomorrow and will be replaced and deleted. So, it’s very rare to care about some particular snippets or patch of code.

What I, and others, have been complaining about is the quality of the codebase and the sustainability of the practice. Especially with the associated claims about increased productivity.

I care about correctness. Simplicity and reduced amount of code increase my confidence that I can achieve it. New features, until tested in production, are more probable to decrease the reliability of the software. And with each fix for a bug, I need to make sure that I’m not adding five more.

To this day, I’ve not seen any compelling argument that is about writing better code reliably. I’ve seen a lot about writing more code. It’s like a manager thinking if you’re not at your computer typing, you’re not working.

> We're still sceptical enough that we are doing the usual heavy handed human review process, so we're not seeing a huge speed up in delivery times, but we are seeing a volume increase

Are you seeing a quality increase? Less customer bugs, less outages, faster resolution? Are you measuring those?


> Are you seeing a quality increase? Less customer bugs, less outages, faster resolution? Are you measuring those?

We're not at the stage to measure yet. We may be behind others, not sure. Actually, this isn't quite true. I was interested, so I created an ad-hoc report (with AI) on PRs landed per week over time. This has gone up over the last 6 months. But it is hard to say why that is. It might just be that people are raising smaller PRs because it becomes easy to have the AI split things up, while before, people were too lazy to do this.
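A minimal sketch of that kind of ad-hoc report, assuming the merge dates have already been exported from the repository host. The sample dates are made up and `prs_per_week` is a hypothetical helper, not the actual report mentioned above:

```python
from collections import Counter
from datetime import date


def prs_per_week(merge_dates):
    """Count merged PRs per ISO (year, week) bucket, sorted by week."""
    counts = Counter()
    for d in merge_dates:
        iso = d.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))


# Made-up sample data: two PRs merged in one week, one in the next.
sample = [date(2025, 1, 6), date(2025, 1, 7), date(2025, 1, 15)]

for (year, week), n in prs_per_week(sample).items():
    print(f"{year}-W{week:02d}: {n} PRs")
```

A raw count like this cannot tell more work apart from the same work split into smaller PRs, which is the caveat raised above.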

Our bottleneck is still that we want humans to review. Sometimes we spot errors, but our pre-existing testing frameworks are very robust already, so if these pass, we're very confident to release to production, and the agent is excellent at understanding the existing testing frameworks and adding to them for new stuff.

So in our team, we don't often see blatant logic errors. It is mostly to do with things like using a pattern that is used elsewhere in the codebase (or not at all) and doesn't belong in our specific section of the code (we have a large monorepo). These become fewer as we enhance our ruleset (AGENTS.md or CLAUDE.md) for our particular developers.


> And then I come to HN or reddit and I see people telling us that they cannot write decent production code, and this is just wrong. This isn't opinion wrong, it is objectively wrong

So how can you justify this comment of yours from your reply if you’re not measuring anything? Mind you, I can easily get good results from AI tools, but I don’t like the experience and the code is often over-engineered and drifts away from my target architecture.

But the worst is quickly losing sight of the tiny technical details that matter when solving bugs or altering features. I don’t like typing code. What I like is to be able to go directly to the code that I need to change, modify it, and then verify that it works. Most of my time is spent deep thinking about the design of the software which is orthogonal to code.

And if there is one thing that is common among people fully on board with LLMs, it is that they can talk about the product, but they can’t argue about its behavior and its correctness. There’s no intrinsic model that they can compare with the real code. They don’t know the edge cases, the technical pitfalls, how the software will react if you modify one component. Any brainstorming session quickly turns into a slog because they cannot contrast approaches anymore. You can see the decay of understanding in realtime.


Ok. I get what you're saying. I definitely agree with losing touch with the code, but I still review everything the agent writes, and I steer it heavily. My perspective is from what I observe. The entire industry (my industry) is embracing agentic coding. And I don't believe developers are doing it because they're stupid.

I think it is going to continue to get better, and I don't think we'll be having this argument in two years' time. Our entire industry will look very different.


Yeah, never concrete examples from these guys.

I am creating a game and I can say that with the coding part the models help a lot, mostly gpt 5.5 high. Tbh to me all the frontier models feel the same and they can all solve the stuff I do quite well with some guidance and prompting. But that kind of makes me appreciate the other stuff more like visual style, sound design, mechanics etc etc. Tons of work still.

For brainstorming I find the models bad nowadays or maybe I am just too critical of the results


> the quality of produced code and the medium

A thought I have been tossing around in my head as the models get better is that it really may not matter what the code looks like.

If the observed behavior of the software is good, then the software is good. If a bug, of whatever kind, can be fixed by a model on a vibe-coded codebase, then that's a fixable bug. If there are no exploitable vulnerabilities, then the code is secure. If the performance is adequate, then the code is performant.

It simply does not matter what the code looks like if, from the outside, it does what it's supposed to, and, from the inside, a model can fix the issue if one is found.

More than ever, software engineering is now really a job about making sure the code is doing what it's supposed to.

And even if it DOES matter what the code looks like, you can have a model fix that too.


Don't forget that LLMs are trained on human code. If they cannot understand what your code does then they cannot make changes to it, or at least - having them understand your codebase becomes expensive (more trips to Anthropic servers)

The thing is that a lot of code relies on multiple layers of abstractions with their own correctness and failure states. And then you overlay the domain correctness and failure cases on top of that.

But all of those notions of correctness are imaginary. The hardware only enforces a few (and it may be buggy). The OS adds some more (and it’s buggy). The compiler/interpreter may have bugs (but that’s rarely a nuisance) and the libraries are often brittle. There are cracks everywhere in the tower of abstractions.

The code has never mattered. What has always mattered is the knowledge of what the model of correctness of the software is (“Programming as Theory Building” by Naur), so that you can discern where a program is wrong.

The thing is a crash or some other immediate error is actually nice to have. You get to react immediately and can have a core dump or a stacktrace that points you to the error. What is truly a terror is silent corruption (wrong order of operations, wrong values for a comparison that has expanded the idea of correctness, security issues that have been backdoored for years,…).

As Hoare said:

> There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

LLMs are very much the second kind. You write a lot of complicated code, and then you can no longer reason about its correctness.

> There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

That is so real. Brilliant !



>What I find fascinating that there is so little substance in this article about the quality of produced code and the medium.

I clicked one of his examples intrigued "a snake game where the snake is self-aware and crazy things happen;". Played for 1-2 minutes, and it's the classic 1980s snake game. Am I missing something? What is "self-aware" about it? Some funny messages at the bottom of the screen? And what are the "crazy things"?


It sounds like you either didn't play enough or you are missing the new mechanics that get added over time. There's definitely more to it than just regular snake.

I had the exact same thought. To me, it feels like they just took the fairly common “sentient video game character” trope and bolted it onto a very conventional snake game.

I will say, the way the act of eating creates a "bulge distortion" that flows down the length of the snake is a nice touch though.


You didn't play long enough. There are layers and layers and layers of features in that game if you play for 10 minutes or more.

Can you spoil it for us?

Being the first to release an article gives you great SEO or whatever. Doing the things you've mentioned takes time.

Less fascinating when you consider that this is a non-coder's perspective.

It's still fascinating, but for a different reason. The "Concord" tool that got created bills itself as "Instrument-grade measurement of qualitative text. Explore in minutes, publish with honest statistics." Instrument-grade! How wonderful! That presumably means its accuracy has been ensured, and it's been carefully calibrated, right? What, nobody's ever measured or even examined the code? Well, no matter, let's go ahead and publish it and advertise it as "honest" "instrument-grade measurements."

Yeah, the README looks like slop to me.

Fair enough, but entrepreneurship should, I guess, ask whether a given Next Big Thing has substance behind it or is just snake oil.

Ah, but billions of dollars depend on those questions not being asked in a genuine manner. Don't you want a slice of that or are you an... AI skeptic *thunder clashes*.

Yeah, this made it basically clickbait for me, in terms of time I wasted with the wrong expectation.

The lack of downvotes on posts on HN has always felt like more of a bug than a feature to me.


So, the perspective of the one that gains the most, that will value this the most, and that will pay the most? ;)

These days it's uneconomical for humans to verify AI-generated code. So we ask the AI to do it. Like when we asked the FBI to audit itself and they found no problems :)

You probably don't care about the ingredients or engineering of asphalt, only if the road does its job well or is filled with potholes. Outside of the software industry, nobody gives a shit about code or databases.

> You probably don't care about the ingredients or engineering of asphalt

Everyone does. You don’t think about it every day because we’ve delegated it to experts who don’t come up with a new composition of asphalt every time you press “generate”. It’s rigorously battle-tested and, short of intentional negligence, it’s consistent. I’m amazed how people are forgetting how the world actually works.


Exactly - the normalization of craft (?) is interesting

You've missed the point.

The point doesn’t seem to have been thought through.

The point is, if road engineers changed their process and materials, and to you it felt like driving on the same road, with the same wear and tear and potholes, you wouldn't even notice.

If AIs can generate code that looks ridiculous to humans but over time has the correct performance, the correct behaviour, no-one outside of software engineers will know or care.


But they don't. LLMs can't understand messy code much better than humans can. Maybe a little, but not enough to compensate for the code they create being messy.

> The point is, if road engineers changed their process and materials,

They do those in labs, and then studies are made to prove that it can replace the current composition. They do not invent those on the spot and let the drivers QA the road.

> If AIs can generate code that looks ridiculous to humans but over time has the correct performance, the correct behaviour

It’s on you to prove that this big “if” can be realized. A -> B only matters when A is true.
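The logic being invoked is ordinary material implication: when A is false, "A -> B" holds whatever B is, so it tells you nothing about B. A minimal Python sketch of the truth table (the `implies` helper is illustrative only):

```python
def implies(a: bool, b: bool) -> bool:
    """Material implication: 'a -> b' is false only when a is true and b is false."""
    return (not a) or b


# Print the full truth table.
for a in (True, False):
    for b in (True, False):
        print(f"A={a!s:5} B={b!s:5} A->B={implies(a, b)}")
```

Both rows with A false come out true, which is why the implication only carries weight once A itself has been established.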


> It’s on you to prove that this big “if” can be realized. A -> B only matters when A is true

Not really. This is a discussion about what code looks like if AI can write applications that are as good, stable, correct as humans.

I think they can, better than most programmers at the moment, with the correct guardrails and supervision. But in time, I think we may not need to review the code at all, but instead verify correctness and performance only. The AI can write the code however it likes.

Obviously I don't have a proof for this, but based on the progress I've seen so far, if someone forced me to bet one way or the other, this is what I'd bet on.


I agree. But if I'm paying for the road (even as a taxpayer) I get angry that after a year it's full of potholes and that there are unnecessary signs warning about penguin crossing, making it cost 2 times more than it should have (and don't get me started on why this road is really a highway leading to my house). I'd want certain qualities. And this article is basically = you will get a road, built quickly

But yes, you are right - I don't build roads and don't know what it costs to build one or how to judge the quality of a correctly built one, nor will I ever care or learn.


> And this article is basically = you will get a road, built quickly

That's not how I am reading it. You will get a road built exactly to your spec, quickly. So no penguin crossings unless you ask for them.

I am also not entirely sure how the pothole argument translates.


The road will be built to some specs, including features nobody asked for. If the corpus was trained for roads built in the Arctic, you will get penguin crossings.

The ingredients and composition of the tarmac make the difference between a road that lasts and a road full of potholes after a week of use

Sure, but if there's a trillion dollar company saying that it's going to replace all our road workers or engineers - I'd want to listen to the opinion of an expert. Some reporter from CNN driving over it like "yeah seems good to me, good this" has approximately zero persuasive power to me.

I care that the engineer followed industry standard best practices and used high quality asphalt. How could I not care about that? How do you think potholes aren't related to the engineering of asphalt?

It still does make errors, yes? Because it is not usable, if we need to verify everything. AI is only interesting if it can do things that humans can not do. If you can verify results because you can do it yourself, then why use AI? It will just bind highly skilled people to do verification work. Instead these people should do the actual work, results will come quicker.

So AI is only interesting to you / your org / humans if it can do things that you can not achieve. But if it still makes errors, how could we ever know that a super-invention by AI is not wrong?

If we can not rely on the correctness of the result, it is not usable at all. AI must create reliable and correct results always. That was a very fundamental requirement for computing. This problem has not been solved.


By that measure, most software developers should be unemployed.

So would you be more comfortable if the user then just prompted the AI to use a specific language, framework and database? Aren't we all just going to reddit and finding out what all goes best with what? But also I don't trust nothing from it, even though I've seen it.

It's an ad.

Does it matter to the people requesting the software if it acts in the way they expect?

We've lived in a software bubble for so long, most software engineers have completely forgotten that the purpose of (most) software is to solve a problem. If that software solves the problem well and reliably, the quality of the code doesn't matter.

In fact, that's the entire reason we care about "quality code", because we assume that quality code is code that does what you expect well and consistently.

I say this as someone who hand writes code pretty much every night for fun, just to experiment with computation. Which, oddly, is more fun than ever because I don't feel like there's any need to connect this type of programming with "real world software", and I can really enjoy code for its own sake, meanwhile my job is mostly just running agent loops (which I quite like as well).


Exactly. Quality of code is a programming invention to make it easier to write and maintain correctly functioning applications.

That is the entire purpose of "quality of code".

If the end user experiences a correctly performing application, now, and in the future, they don't care at all what the code looks like.

AIs could resort to a single global array of primitives and forget all about functions, and just use gotos if it helped them (it probably doesn't).


I haven't forgotten that, I affirmatively think it's false. High quality code is necessary to solve problems reliably. Perhaps some people call things code quality when they don't matter (I really don't care what most variables are named), but there have always been teams who try to increase velocity by disregarding code quality, and from what I've seen AI does not stop them from shipping outages constantly.

True, but you should say that about every thing. Does it matter to you how the car drives, as long as it takes you to your destination? Well, yes, it matters: how will it deal with a crash, and if it's possible to replace a part and if anybody can just open it if you leave it outside. I will be amazed if somebody shows me their home-printed car, but if they'll try to sell it to me like a new one...

Don't harsh my vibes, man.

There also isn't any meaningful articulation of why this is a "leap forward"... literally everything claimed in the article has been claimed in the same breathless tones in articles written a year prior.

I get that there's little sense in arguing with the MBA hivemind, but... c'mon.

I manage two teams of highly motivated, largely pro-AI engineers. Both teams have independently concluded that they needed to ramp down GenAI usage because of code quality / maintainability concerns. Both teams have suffered from protracted outages caused by LLM jank not being sufficiently fenced off and guarded against. Both teams have expressed concern that the code generated by LLMs is far too verbose, full of slop, and rapidly becomes an unmaintainable mess.

These are teams that are building non-trivial LLM solutions (deep agentic data synthesis and multi-modal data tagging). They are using the technology creatively and pro-actively, not just vibe-coding slop and throwing their hands up when it fails. Both teams will continue using GenAI coding agents, don't get me wrong - but the gains are incremental, not transformative, and need careful fencing to make sustainable.

Nothing in these articles resonates as real. People who work in reality don't agree. I don't understand why this shit keeps getting attention (or rather I do, but the reasons aren't good).


I'm becoming more convinced these are questions of the Before Times. Yes, yes—heresy, I know.

Yet, I can't deny the reality that I observe working with LLMs every day. If this truly is a step-function (as some are suggesting), then I have absolutely zero concern for the quality of the code.


Kind of a circular argument, isn't it? "Some people are saying it's very good at coding. If that's true, I don't care if the code is good."

I didn't say I don't care if the code is good.

I said I had zero concern for the quality of the code. That is, I do not have concern that the quality of the code will be a concern in and of itself.

It's a subtle, but IMO important difference. We only care about code quality insofar as it gives us stable, understandable systems. Historically that meant a human had to read and understand it. Suppose a future where that's no longer the case, then we may still end up with stable, understandable systems without understanding every minutia of the substrate. It's the same way I don't really know if my compiler is correct, but the behavioral patterns of my code suggest it is without me understanding anything about its code quality.


You can either adapt or survive man, coping and negation don't help. AI is here to stay and yes, it does require pilots, but this map would have taken you weeks to do, the AI did it in 10 hours, you can still dedicate a week to refactor.

Also this is easily solved by .md spec files, this whole "bad code" cope is just FUD.


> Please write a manual on how to cleanup after AI rockstar managers who think they can code.

Why are you allowing AI rockstar managers to (I assume) push without code review? Why are you cleaning up the fallout? It's not an AI issue, it's a people issue


It's the mandate from top-down. Of course it's a people issue, the problem is that the people creating this issue are exactly the ones paying us.

My manager got the mandate she needs to start coding, she doesn't want that, no one in our team wants that, she's a great manager exactly where she is right now. Nonetheless, we are helping her to code to show something for the higher-ups so she can keep her job, we really don't want to lose one of our best managers because some C-level is anxious about AI...


Even if it's a mandate top-down, you can:

- show increase in errors and outages caused by this approach

- integrate manager changes into your CI pipeline (coding / reviewing / testing / documentation)

- discuss how your manager can do the changes they need to do without sidetracking all other work

Make it indeed about the money: coding by PM + fixing what was coded + dealing with fallout is a greater expense than coding by PM + automated guidelines + reviewing what was coded.

That is - if the environment you're working in is reasonable and it's not a power play by your PM


I don't know why you assumed it was a PM thing, the mandate is for EMs to be more "hands-on".

On your points:

Manager changes are always going to go through the usual pipeline, we are 10k+ people so there's not even a way to go all gung-ho pushing stuff. They need to be reviewed, approved, etc.

But we don't want our manager to code, nor does our manager, so we are just helping to cross the bare minimum expected from higher ups. For my team it's not an issue with our manager's code (those will be at max small fixes, well-defined, the most trivial stuff), the issue is they are mandating managers to do it.

I don't know what size of company you work at, where I am at there is simply no incentive for me to do all the extra work to show execs/higher management the issues cropping up.

I don't even have access to them, I have to pass through other channels, those might compile reports from many people to try to present a case, if they get to present a case then there's a whole other discussion to happen at director/VP/C-level about what they want to do, and since it goes against their big mandate it most likely will just be thrown out.

In this structure I have no motivation at all to go out of my way to perform data gathering/analysis, wrapping it all in a nice concise document explaining what the data means, potential remediation, etc. just to become a footnote in someone else's document that ultimately will not change anything from the VP/C-level mandate.


I assumed that as it was hitting too close to home (both PMs and EMs were expected to use AI, with the PM trying to code in solutions that they had neither the expertise nor the domain knowledge to deal with; the EM was prototyping solutions that had access to the prod DB and were shut down as soon as we found out). I can only sympathize nonetheless and thank you for giving your PoV.

Because most people, in most parts of the world, are not allowed to question whatever their superiors do? And, yes, unfortunately are also expected to clean up after said superiors' messes. Of course it's a people issue. AIs just make people issues worse in new and entertaining ways.

100%. You don't "clean up after them." You make them clean up their own mess. You refuse to let a mess into the system in the first place.

Same as it ever was.

The only difference now is that if you let it happen, it'll happen 100x as fast.

When I was mentoring junior devs, I would start by fully reviewing their code. If they had a ton of mistakes more than a few times, I would only review until the first mistake, and then reject it. Repeat, repeat, repeat, until they got the picture that I wasn't going to let mistakes through, and handing me a ton of mistakes was going to waste more of their time than mine.

I let the pain be their pain, instead of mine.

But good developers, I'd help them by doing a more thorough review and not wasting their time. Good developers were the ones that made an honest effort to follow the requirements to the letter and test their own work.

We further emphasized this by having a very simple coding test during the interview, and the only thing we cared about was whether they followed the requirements to the letter. There wasn't a lot left to the imagination, and the requirements were very clear. Anyone who missed them wasn't someone who would do well with us.

That very same test will help filter out a lot of AI-braindead candidates that don't check the AI's work as well.

Actually, I wish I still had the exact test so I could throw it against an AI and see what happens. I'm a little afraid that it would pass it too easily now. I'm not sure how I'd fix it to prevent them from just using AI.


The code is AI reviewed, and I was ordered to change the repo setting so a single AI review is enough. I've tried suggesting a lot of things, but it is not at my pay grade to allow or disallow something, only recommend.

Sure, producing code has become cheap. Yet again the taste matters and LLMs do not have taste - they will apply patterns that are unnecessary or not extendible, producing unmaintainable systems that nobody understands. Capturing domain knowledge was the crux of the development process, but so was verifying, documenting, ensuring that multiple systems work together, maintaining uniformity. I don't know where the assumption, made by developers, that they only need to produce code that just works or goes brrr fast comes from.

A domain expert can develop working code, but they will not be able to ensure the above.


Depends - using Sonnet here and generally it should be as you say: plan would produce the result.

Still Claude will sneak things in - in my recent plan, for example, I had defined, per acceptance criteria, what colours the statuses should be: green for live, blue for sold, grey for anything else; it changed this to: green for live, orange for in progress, blue for sold, red for in demolition, etc. When pressed on why it did this, it was unable to explain. This is with a plan where the AC were explicitly provided from the task in Given/When/Then format and were to be adhered to strictly. I caught this within planning, but I shouldn't need to be doing this.

Even in standard prompts where I tell it "Change this label from X to Y", it ended up reordering the tabs, unrelated to the ask. Again I was not able to get it to explain why - it was so abrupt. And it was in a fresh context, without any pollution on what I expect it to do.

I also noticed a different behaviour regarding skills; today and yesterday it would not follow skill guidance at all, e.g. the skill-writing skill: I'd have to explicitly tell it to test skills after writing them, when this is the behaviour expected by default. Similarly with other skills: it knows it should have done something per the skill guidelines and does not do it at all. This is new behaviour that I had not seen a week ago.
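One way to stop this kind of drift from slipping through is to pin the acceptance criteria in an executable check. A minimal sketch, assuming statuses are plain strings; the names `STATUS_COLOURS`, `status_colour` and `check_acceptance_criteria` are hypothetical and not from the actual project:

```python
# Colours fixed by the acceptance criteria: green for live, blue for sold,
# grey for anything else. All identifiers here are illustrative assumptions.
STATUS_COLOURS = {"live": "green", "sold": "blue"}
DEFAULT_COLOUR = "grey"


def status_colour(status: str) -> str:
    """Return the colour for a status, falling back to grey."""
    return STATUS_COLOURS.get(status, DEFAULT_COLOUR)


def check_acceptance_criteria() -> None:
    """Fail loudly if anyone (human or model) adds colours the AC never asked for."""
    assert status_colour("live") == "green"
    assert status_colour("sold") == "blue"
    for other in ("in progress", "in demolition", "unknown"):
        assert status_colour(other) == "grey"


check_acceptance_criteria()
```

With a check like this in the test suite, an orange "in progress" or a red "in demolition" fails the build instead of relying on someone spotting it in the plan.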


Can you explain the benefits of running this over rector / eslint? (and to certain degree phpstan / deptrac)

Write a skill outlining your expectations of the code, put that skill into the pipeline, so that it can be included within your workflow.

Webdev here, but currently I have:

- a skill where I outlined how the architecture of the system should look like, with guards (static analysis, architecture tests, linting) confirming that the code it generates adheres to standards

- a skill that tells it how tests should look like (use generators, write both feature / unit tests)

- a skill that tells it to generate docs from the code in a form of acceptance criteria (Given / When / Then)

- a skill that tells it to generate frontend uat tests + accompanying backend seeders given the AC

- a skill that tells it to verify that ticket objectives match what was delivered

At this point I still need to guide it to move task from one stage to the other (coding, testing, verification that indeed what was coded adheres to what was required), but I believe that these dynamic workflows can automate this work as well.
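The manual hand-off between stages could be sketched as a simple gate: run each stage's check in order and stop at the first one that fails. This is only an illustration under stated assumptions; the stage names come from the comment above, while `run_pipeline` and the check callables are hypothetical stand-ins for whatever the skills actually run:

```python
from typing import Callable, Dict

# Stage order as described above: coding, then testing, then verification.
STAGES = ["coding", "testing", "verification"]


def run_pipeline(checks: Dict[str, Callable[[], bool]]) -> str:
    """Run each stage's check in order.

    Returns the name of the first stage whose check fails (so a human can
    step in there), or "done" if every stage passes.
    """
    for stage in STAGES:
        if not checks[stage]():
            return stage
    return "done"


# Hypothetical checks: in practice these would call linters, test runners,
# or a ticket-vs-delivery comparison.
result = run_pipeline({
    "coding": lambda: True,
    "testing": lambda: False,
    "verification": lambda: True,
})
print(result)
```

Here the run stops at "testing", which is the point where a person would currently have to step in.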


While a $500k 90m movie done in two weeks is an accomplishment, looking at the trailer, I find its quality very dubious. Plot, characters, audio - everything screams "I've already seen this somewhere", there's no substance here, at least for me. And while the computer visuals are nice, it's nowhere near "Love, Death & Robots" quality, where they're driven by computer graphics as well.


> While $500k 90m movie done in two weeks is an accomplishment

Is it, though? If all you want is a movie, you can make it for both less money and less time. And if you actually have some modicum of talent, you can make it higher-quality to boot; see Joel Haver, who challenged himself to author, film, edit, and release 12 feature-length films during the course of 2024 on effectively no budget whatsoever (playlist here: https://www.youtube.com/watch?v=C-ZRRTsa5SY&list=PLKtIcOP0Wv... ).


True - your comment reminded me of Cube; that was done in 3 weeks, with budget of $350,000 CAD (according to wikipedia). Another favorite of mine - Primer = 5 weeks with budget of $7k.

edit: looking at others, Pi - 4w and $130k


> Cube; that was done in 3 weeks

The cube was not “done” in 3 weeks. Maybe they shot it in 3 weeks, but there were years of pre-production, and at least months of post-production. (According to wikipedia.)

Saying that it was done in 3 weeks is like saying that windows 11 was done in 45 minutes, because that is how long the compilation lasted.

> with budget of $350,000 CAD

“50% of the budget as C$350,000 to C$375,000 in cash and the other 50% as donated services, for a total of C$700,000. Natali considered the cash figure to be deceptive, because they deferred payment on goods and services, and got the special effects at no cost.”

Direct quote from wikipedia.


Fair enough, I was looking at budget fields on wiki rather than reading the tidbits - thanks for the correction still!


The trailer gave me this weird feeling like I've seen the movie before, even though I obviously haven't. Then it started to dawn on me. Nearly every line in the trailer is a line from another similar action-adventure movie. I bet if you searched a corpus of scripts from all past movies, you'd find each line directly in some other movie. Then I noticed the same thing about the characters. They may look unique at a surface level, but the essence of the characters are all tropes from previous movies. Same for the fight choreography, same for the score. It's as if the movie creator's AI prompt was "Take every movie made in the last 10 years that would have appealed to 14 year old boys and mash them up into another movie with visibly different characters."


This needs to be treated like LLMs, it's obvious that those flaws will be "fixed", we must already assume that this 90m movie will suddenly have the graphics and consistency of a marvel movie, soon enough, it's not like we will not have Kling 7 available in a few years.

Last year many developers were saying that it produces slop and so on, which is genuinely annoying when we know it's months/years away from being GUARANTEED to be solved, as theory already proves we can go way further with models (theory means practice eventually), so we must not talk about "now" as in 1 week near but about what it will be, as if it's already there imo. Even more annoying about the image gen AI, it's OBVIOUS that it will reach perfect accuracy (at least for the human eye), as if we would just throw TRILLIONS of investment out the window and just stop here, nope, this will reach camera level, runtime, instantly rendered.

Else for the job loss, it's like the moment we realize that it can automate 99% of white collar jobs, we would suddenly be surprised when Opus 10 can do it? We shouldn't, we KNOW there will be Opus 10 that reach 99.9% in all benchmarks, like we know we will have Opus equivalent models running on our phones.

I won't be surprised when I see Opus 4.8 equivalent performance running on a 10B model, as this is just logical, I start to kinda hate it that we all act "surprised" with new models every few months as if the science behind it all changed suddenly, no... we just start developing what science is backing up already.

So obviously, music, video, writing... will be produced at a much higher level than humans, soon enough, there is no ceiling with AIs, humans are pretty limited.


Last line of the trailer “That was terrible”… yup.


I'd say it depends - evaluate the vibes. I spent 8y and recently 7y at a company where I genuinely responded with what I thought. But I'd say it's a matter of the audience - some people want to hear certain things and deciding if you can share these thoughts is up to you. It also allowed me to make decisions - if people don't care what I think and want sycophancy, is this the company I want to be working at? I understand though it depends on one's circumstances = you have to grin and bear it


Why can't people own up to their mistakes and reschedule?

