I think there are different niches. AI works extremely well for Web prototyping because a lot of that work is superficial. Back in the 90s we had Delphi, where you could make GUI applications with a few clicks instead of writing tons of things by hand. The only reason we don't have that for the Web is its decentralized nature: every framework vendor has their own vision and their own plan for future updates, so a lot of the work is figuring out how to marry the latest version of component X with the specific version of component Y that component Z requires. LLMs can breeze through that.
But in many other niches (say, embedded), the workflow is different. You add a feature, you get weird readings. You start modelling in your head how the timing would work, doing some combination of tracing and breakpoints to narrow down your hypotheses, then try them out and figure out what works best. I can't see CLI agents doing that kind of work. It depends too much on hunches.
Sort of like autonomous driving: most highway driving is extremely repetitive and easy to automate, so it got automated. But going on a mountain road in heavy rain, while using your judgment to back off when other drivers start doing dangerous stuff, is still purely up to humans.
AI can very efficiently apply common patterns to vast amounts of code, but it has no inherent "idea" of what it's doing.
Here's a fresh example that I stumbled upon just a few hours ago. I needed to refactor some code that first computes the size of a popup and then, separately, its top-left corner.
For brevity, one part used an "if", while the other one had a "switch":
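(Roughly this shape; the real code is longer and every name below is made up, only the if/switch mix matters.)

    enum class Placement { Above, Below, Right };

    struct Size  { int width, height; };
    struct Point { int x, y; };

    struct Popup {
        bool compact = false;
        Placement placement = Placement::Below;

        void Layout(Point anchor) {
            // The size was picked with an "if"...
            Size size;
            if (compact)
                size = { 200, 100 };
            else
                size = { 400, 300 };

            // ...while the top-left corner was picked with a "switch"...
            Point origin;
            switch (placement) {
            case Placement::Above: origin = { anchor.x, anchor.y - size.height }; break;
            case Placement::Below: origin = { anchor.x, anchor.y };               break;
            case Placement::Right: origin = { anchor.x + size.width, anchor.y };  break;
            }

            // ...and both were applied on the spot.
            ApplyGeometry(origin, size);
        }

        void ApplyGeometry(Point origin, Size size);   // moves the actual window
    };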
I wanted the LLM to refactor it to store the position rather than applying it immediately. Turns out, it just could not handle different constructs (if vs. switch) doing a similar thing. I tried several variations of prompts, but it leaned very strongly toward either two ifs or two switches, despite rather explicit instructions not to do so.
It sort of makes sense: once the model has "completed" an if, and then encounters the need for a similar thing, it will pick an "if" again, because, well, it is completing the previous tokens.
Harmless here, but in many slightly less trivial examples, it would just steamroll over nuance and produce code that appears good, but fails in weird ways.
That said, splitting tasks into smaller parts devoid of such ambiguities works really well. Way easier to say "store the size in m_StateStorage and apply it on render" than to manually edit 5 different points in the code. Especially with stuff like Cerebras, which can chew through complex code at several kilobytes per second, expanding simple thoughts faster than you could physically type them.
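For what it's worth, the "store and apply later" shape I was asking for looks roughly like this (continuing the made-up names from the snippet above; m_StateStorage is the only name taken from my actual prompt):

    struct PopupState { Point origin; Size size; };

    struct Popup {
        PopupState m_StateStorage{};   // filled in Layout(), consumed in Render()

        void Layout(Point anchor) {
            // same if/switch logic as before, but the result is only stored...
            m_StateStorage = ComputeGeometry(anchor);
        }

        void Render() {
            // ...and applied here, when we actually draw
            ApplyGeometry(m_StateStorage.origin, m_StateStorage.size);
        }

        PopupState ComputeGeometry(Point anchor);
        void ApplyGeometry(Point origin, Size size);
    };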
Just pointing out that there are limits and there’s no reason to believe that models will improve indefinitely at the rates we’ve seen these last couple of years.
There is reason to believe that humans will keep trying to push the limitations of computation and computer science, and that recent advancements will greatly accelerate our ability to research and develop new paradigms.
Look at how well Deepseek performed with the limited, outdated hardware available to its researchers. And look at what demoscene practitioners have accomplished on much older hardware. Even if physical breakthroughs ceased or slowed down considerably, there is still a ton left on the table in terms of software optimization and theory advancement.
And remember just how young computer science is as a field, compared to other human practices that have been around for hundreds of thousands of years. We have so much to figure out, and as knowledge begets more knowledge, we will continue to figure out more things at an increasing pace, even if it requires increasingly large amounts of energy and human capital to make a discovery.
I am confident that if it is at all possible to reach human-level intelligence at least in specific categories of tasks, we're gonna figure it out. The only real question is whether access to energy and resources becomes a bigger problem in the future, given humanity's currently extraordinarily unsustainable path and the risk of nuclear conflict or sustained supply chain disruption.
> And remember just how young computer science is as a field, compared to other human practices that have been around for hundreds of thousands of years.
How long do you think Homo sapiens have been on Earth and how long has civilization been here?
I’ve been programming since 89. I know what you can squeeze into 100k.
But you can only blast so much electricity into a dense array of transistors before it melts the whole thing and electrons jump rails. We hit that limit a while ago. We’ve done a lot of optimization of instruction caching, loading, and execution. We front loaded a ton of caching in front of the registers. We’ve designed chips specialized to perform linear algebra calculations and scaled them to their limits.
AI is built on scaling the number of chips across the board. Which has the effect of requiring massive amounts of power. And heat dissipation. That’s why we’re building out so many new data centres: each one requiring land, water, and new sources of electricity generation to maintain demand levels for other uses… those sources mostly being methane and coal plants.
Yes, we might find local optimizations in training to lower the capital cost and external costs… but they will be a drop in the bucket at the scale we’re building out this infrastructure. We’re basically brute forcing the scale up here.
And computer science might be older than you think. We just used to call it logic. It took some electrical engineering innovations to make the physical computers happen but we had the theoretical understanding of computation for quite some time before those appeared.
A young field, yes, and a long way to go… perhaps!
But let’s not believe that innovation is magic. There’s hard science and engineering here. Electrons can only travel so fast. Transistor density can only scale so much. Etc.
> How long do you think Homo sapiens have been on Earth and how long has civilization been here?
I already corrected my typo in a child comment.
> We’re basically brute forcing the scale up here
Currently, but even that will eventually hit thermodynamic and socioeconomic limits, just as single chips are.
> And computer science might be older than you think. We just used to call it logic.
In my opinion, two landmark theory developments were type theory and the lambda calculus. Type theory was conceived to get around Russell's paradox and others, which formal logic could not do on its own.
As far as hardware goes, sure, we had mechanical calculators in the 17th century, Babbage's analytical engine in the 19th century, and Ada Lovelace's program, but it wasn't until the mid-20th century that computer science coalesced as its own distinct field. We didn't use to call computer science logic; it's a unification of physical advancements, logic, and several other domains.
> Electrons can only travel so fast.
And we have no reason to believe that current models are at all optimized on a software or theoretical level, especially since, as you say yourself, we are currently just focused on brute-forcing innovation as it's the more cost-effective solution for the time being.
But as I said, once theoretical refinement becomes more cost-effective, we can look at the relatively short history of computer science to see just how much can be done on older hardware with better theory:
>> Even if physical breakthroughs ceased or slowed down considerably, there is still a ton left on the table in terms of software optimization and theory advancement.
I agree. And if human civilization survives, your concerns about energy and resources will be only short term on the scale of civilization, especially as we make models more efficient.
The human brain uses just 20 watts of power, so it seems to me that reaching human-level intelligence is possible in principle by spending much greater power to compensate for the billions of years of evolutionary refinement the brain has had.
I am working on a GUI for delegating coding tasks to LLMs, so I routinely experiment with a bunch of models doing all kinds of things. In this case, Claude Sonnet 3.7 handled it just fine, while Llama-3.3-70B just couldn't get it. But that is literally the simplest example that illustrates the problem.
When I tried giving top-notch LLMs harder tasks (scan an abstract syntax tree coming from a parser in a particular way, and generate nodes for particular things), they completely blew it. The output didn't even compile, to say nothing of the logical errors and missed points. But once I broke the problem down into making lists of relevant parsing contexts and generating one wrapper class at a time, it saved me a whole ton of work. It took me a day to accomplish what would normally take a week.
Maybe they will figure it out eventually, maybe not. The point is, right now the technology has fundamental limitations, and you are better off knowing how to work around them, rather than blindly trusting the black box.
3) autistic rigidity regarding a single hallucination throwing the whole experience off
4) subconscious anxiety over the threat to their jerbs
5) unnecessary guilt over going against the tide; anything pro AI gets heavily downvoted on Reddit and is, at best, controversial as hell here
I, for one, have shipped like literally a product per day for the last month and it's amazing. Literally 2,000,000+ impressions, paying users, almost 100 sign ups across the various products. I am fucking flying. Hit the front page of Reddit and HN countless times in the last month.
Idk if I break down the prompts better or what. But this is production grade shit and I don't even remember the last time I wrote more than two consecutive lines of code.
If you are launching one product per day, you are using LLMs to convert unrefined ideas into proof-of-concept prototypes. That works really well, that's the kind of work that nobody should be doing by hand anymore.
Except, not all work is like that. Fast-forward to product version 2.34, where a particular customer needs a change that could break 5000 other customers because of non-trivial dependencies between different parts of the design, and you will either be rewriting the entire thing with humans or watching it collapse under its own weight.
But out of 100 products launched on the market, only 1 or 2 will ever reach that stage, and having 100 LLM prototypes followed by 2 thoughtful redesigns is way better than seeing 98 human-made products die.
Can you provide links to these 30 products you have shipped?
I keep hearing how people are so god damn productive with LLMs, but whenever I try to use them they cannot reliably produce working code. They usually produce something that looks correct at first but doesn't work, either at all or as intended.
Going over your list:
1. if the problem is that I need to be very specific with how I want LLM to fix the issue, like providing it the solution, why wouldn't I just make the change myself?
2. I don't even know how you can think that not vibe coding means you lack experience
3. Yes. If the model keeps trying to use non-existent language features or completely made-up functions/classes, that is a problem, and it has nothing to do with "autism"
4. This is what all AI maximalists want to think: that the only reason the average software developer isn't knee-deep in the AI swamp with them is that they are luddites who are just scared for their jobs. I personally am not, as I have not seen LLMs actually being useful for anything but replacing google searches.
5. I don't know why you keep bringing up Reddit so much. I also don't quite get who is going against the tide here, are you going against the tide of the downvotes or am I for not using LLMs to "fucking fly"?
>But this is production grade shit
I truly hope it is, because...
>and I don't even remember the last time I wrote more than two consecutive lines of code.
Means if there is a catastrophic error, you probably can't fix it yourself.
> if the problem is that I need to be very specific with how I want LLM to fix the issue, like providing it the solution, why wouldn't I just make the change myself?
I type 105 wpm on a bad day. Try gpt-4.1. It types like 1000 wpm. If you can formally describe your problem in English and the number of characters in the English prompt is less than whatever code you write, gpt-4.1 will make you faster.
Obviously you have to account for gpt-4.1 being wrong sometimes. Even so, if you have to run two or three prompts to get it right, it still is going to be faster.
> I don't even know how you can think that not vibe coding means you lack experience
If you lack experience, you're going to prompt the LLM to do the wrong thing and engineer yourself into a corner and waste time. Or you won't catch the mistakes it makes. Only experience and "knowing more than LLM" allows you to catch its mistakes and fix them. (Which is still faster than writing the code yourself, merely by way of it typing 1000 wpm.)
> If the model keeps trying to use non-existent language feature or completely made up functions/classes that is a problem and nothing to do with "autism"
You know that you can tell it those functions are made up and paste it the latest documentation and then it will work, right? That knee-jerk response makes it sound like you have this rigidity problem, yourself.
> I personally am not as I have not seen LLMs actually being useful for anything but replacing google searches.
Nothing really of substance here. Just because you don't know how to use this tool doesn't mean no one does.
This is the least convincing point for me, because I come along and say "Hey! This thing has let me ship far more working code than before!" and then your response is just "I don't know how to use it." I know that it's made me more productive. You can't say anything to deny that. Do you think I have some need to lie about this? Why would I feel the need to go on the internet and reap a bunch of downvotes while peddling some lie that doesn't stand to get me anything even if I convince people of it?
> I also don't quite get who is going against the tide here, are you going against the tide of the downvotes
Yeah, that's what I'm saying. People will actively shame and harass you for using LLMs. It's mind boggling that a tool, a technology, that works for me and has made me more productive, would be so vehemently criticized. That's why I listed these 5 reasons, the only reasons I have thought of yet.
> Means if there is a catastrophic error, you probably can't fix it yourself.
See my point about lacking experience. If you can't do the surgery yourself every once in a while, you're going to hate these tools.
Really, you've just made a bunch of claims about me that I know are false, so I'm left unconvinced.
I'm trying to have a charitable take. I don't find joy in arguing or leaving discussions with a bitter taste. I genuinely don't know why people are so mad at me claiming that a tool has helped me be more productive. They all just don't believe me, ultimately. They all come up with some excuse as to why my personal anecdotes can be dismissed and ignored: "even though you have X, we should feel bad for you because Y!" But it's never anything of substance. Never anything that has convinced me. Because at the end of the day, I'm shipping faster. My code works. My code has stood the test of time. Insults to my engineering ability I know are demonstrably false. I hope you can see the light one day. These are extraordinary tools that are only getting better, at least by a little bit, in the foreseeable future. Why deny?
Would also love to see those daily shipped products. What I see on reddit is the same quiz done several times just for different categories and the pixel art generator. That does not look like shipping a product per day as you claim.
On my main, not gonna dox myself. Being pro AI is clearly a faux pas for personal branding.
Just a few days ago got flamed for only having 62 users on GameTorch. Now up to 91 and more paying subs. Entire thing written by LLMs and hasn't fallen over once. I'd rather be a builder than an armchair critic.
People would rather drag you down into the hole they're in than climb out.
This is going to be all over the place and possibly hard to follow, I am just going to respond in "real time" as I read your comment, if you think that is too lazy to warrant reading I completely understand. I hope you have a nice day.
WPM is not my limiting factor. Maybe the difference is that I am not working on trivial software, so a lot of thought goes into the work; typing is the least time-consuming part. I still don't see how your 105 wpm of highly descriptive and instructive English can be faster than just fixing the thing. Even if your LLM takes 1 ms to fix the issue after you prompt it, you have probably already spent more time debugging the issue and writing the prompt.
So your "you lack engineering experience" was actually "you don't know LLMs well", maybe use the words you intend and not make them into actual insults.
I am not going to be pasting in any C++ spec into an LLM.
Yet when I checked your profile, you have shipped one sprite image generator website. I find all these claims so hard to believe. Everyone keeps telling me how they are making millions off of LLMs, but no one has the receipts to show. It just makes me feel like you have stock in OpenAI or something and are trying your hardest to pump it up.
I think the shaming and harassing is mostly between your ears; at least, I am not trying to shame or harass you for using LLMs. If anything, I want to have superpowers too. If LLMs really work for you, that is nice and you should keep doing it, I just have not seen the evidence you are talking about. I am willing to admit that it could very well be a skill issue, but I need more proof than "trust me" or "1000 wpm".
I don't think I have made any claims about you, although you have used loaded language like "autism" and "lack of engineering experience" and heavily implied that I am just too dumb to use the tools.
>I'm trying to have a charitable take.
C'mon, nothing about your comments has been charitable in any way. No one is mad at you personally. Do not take criticism of your tools as personal attacks. Maybe the tools will get good, but again, my problem with LLMs and the hype around them is that no one has been able to demonstrate them actually being as good as the hype suggests.
What is everyone working on that takes more than five minutes to think about?
For me, the work is insurmountable and infinite, while coming up with the solution is never too difficult. I'm not saying this to be cocky. I mean this:
In 99.9999999999% of the problems I encounter in software engineering, someone smarter than me has already written the battle tested solution that I should be using. Redis. Nginx. Postgres. etc. Or it's a paradigm like depth first search or breadth first search. Or just use a hash set. Sometimes it's a little crazier like Bloom filters but whatever.
Are you like constantly implementing new data structures and algorithms that only exist in research papers or in your head?
Once you've been engineering for 5 or 10 years, you've seen almost everything there is to see. Most of the solutions should be cached in your brains at that point. And the work just amounts to tedious, unimportant implementation details.
Maybe I'm forgetting that people still get bogged down in polymorphism and all that object oriented nonsense. If you just use flat structs, there's nothing too complicated that could possibly happen.
I worked in HFT, for what it's worth, and that should be considered very intense non-CRUD "true" engineering. That, I agree, LLMs might have a little more trouble with. But it's still nothing insane.
Software engineering is extremely formulaic. That's why it's so easy to statistically model it with LLMs.
I write embedded software in C++ for industrial applications. We have a lot of proprietary protocols and custom hardware. We have some initiatives to train LLMs with our protocols/products/documentation, but I have not been impressed with the results. Same goes with our end-to-end testing framework. I guess it isn't so popular so the results vary a lot.
I have been doing this for 8 years, and while, yes, I have seen a lot, you can't just copy-paste solutions due to flash, memory, and performance constraints.
Again, maybe this is a skill issue and maybe I will be replaced with an LLM, but so far they seem more like cool toys. I have used LLMs to write AddOns for World of Warcraft, since my Lua knowledge is mostly from writing Wireshark plugins for our protocols, and for that it has been nice. But it is nothing someone who actually works with Lua or with the WoW API couldn't produce faster or just as fast, because I have to describe what I want and then check whether the API the LLM provides exists and works the way the LLM assumed it would.
Again, I appreciate the reply. I think my view on LLMs is skewed towards the positive because I've only been building CRUD apps, command line tools, and games with them. I apologize if I came off as incendiary or offensive.
The interesting questions happen when you define X, Y, and Z, and a timeframe. For example, will LLMs be able to solve the P=NP problem in two weeks, 6 months, 5 years, a century? And then exploring why or why not.
> AI can very efficiently apply common patterns to vast amounts of code, but it has no inherent "idea" of what it's doing.
AI stands for Artificial Intelligence. There are no inherent limits around what AI can and can't do or comprehend. What you are specifically critiquing is the capability of today's popular models, specifically transformer models, and the accompanying tooling. This is a rapidly evolving landscape, and your assertions might no longer be relevant in a month, much less a year or five years. In fact, your criticism might not even hold between current models. It's one thing to speak about idiosyncrasies between models, but any broad conclusions drawn outside of a comprehensive multi-model review with strict procedure and controls are to be taken with a massive grain of salt, and one should be careful to avoid authoritative language about capabilities.
It would be useful to be precise in what you are critiquing, so that the critique actually has merit and applicability. Even saying "LLM" is a misnomer, as modern transformer models are multi-modal and trained on much more than just textual language.
I learned neural networks around 2000, and it was old technology then. The last real jump we saw was going from ChatGPT 3.5 to 4, and that is more than 2 years ago.
It seems you don't recollect how much time passed without any big revolutions in AI. Deep learning was a big jump. But when will the next jump come? It might be tomorrow, but looking at history, it might be in 2035.
According to what I see, the curve has already flattened and now only a new revolution could get us to the next big step.
Since I can't seem to add an edit to my post, here's a realization:
My 2035 prediction actually seems pretty optimistic. We went more than 20 years without any big AI revolutions, so 2045 would be more realistic.
And it seems our current AI is also not going to get us there any faster.
Agree, the AI companies aren’t able to improve the base models so they’re pivoting to making add-ons like “agents” which seem to only be instructions atop the base models.
> AI stands for Artificial Intelligence. There are no inherent limits around what AI can and can't do or comprehend.
Artificial, as in artificial sand or artificial grass. Sure, it appears to be sand or grass at first, but upon closer examination it becomes very apparent that it's not real. "Artificial" works a lot like "magic": it offers enough misdirection for people to think there might be intelligence, but upon closer examination, it's found lacking.
It's still impressive that it can do that, going all the way back to gaming AIs, but it's also a veil that is lifted easily.
What a ridiculous response, to scold the GP for criticising today's AI because tomorrow's might be better. Sure, it might! But it ain't here yet buddy.
Lots of us are interested in technology that's actually available, and we can all read date stamps on comments.
You're projecting that I am scolding OP, but I'm not. My language was neutral and precise. I presented no judgment, but gave OP the tools to better clarify their argument and express valid, actionable criticism instead of wholesale criticizing "AI" in a manner so imprecise as to reduce the relevance and effectiveness of their argument.
> But it ain't here yet buddy . . . we can all read date stamps on comments.
That has no bearing on the general trajectory that we are currently on in computer science and informatics. Additionally, your language is patronizing and dismissive, trading substance for insult. This is generally frowned upon in this community.
You failed to actually address my comment, both by not recognizing that it was mainly about using the correct terminology instead of criticizing an entire branch of research that extends far beyond transformers or LLMs, and by not establishing why a rapidly evolving landscape would still allow such generalizations to be made without several constraints and caveats, which include not making temporally invariant claims about capabilities.
I would ask that you reconsider your approach to discourse here, so that we can avoid this thread degenerating into an emotional argument.
> They were obviously not trying to make a sweeping comment about the entire future of the field
OP said “AI can very efficiently apply common patterns to vast amounts of code, but it has no inherent "idea" of what it's doing.”
I'm not going to patronize you by explaining why this is not "very precise", or why its lack of temporal caveats is an issue, as I've already done so in an earlier comment. If you're still confused, you should read the sentence a few times until you understand. OP did not even mention which specific model they tested, and did not provide any specific prompt example.
> Are you using ChatGPT to write your loquacious replies?
If you can't handle a few short paragraphs as a reply, or find it unworthy of your time, you are free to stop arguing. The Hacker News guidelines actually encourage substantive responses.
I also assume that in the future, accusing a user of using ChatGPT will be against site guidelines, so you may as well start phasing that out of your repertoire now.
Here are some highlights from the Hacker News guidelines regarding comments:
- Don't be snarky
- Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.
- Assume good faith
- Please don't post insinuations about astroturfing, shilling, brigading, foreign agents, and the like. It degrades discussion and is usually mistaken.
This is a lot of words, but does any of it contradict this:
> AI can very efficiently apply common patterns to vast amounts of code, but it has no inherent "idea" of what it's doing.
Are you saying that AI does have an inherent idea of what it's doing or is doing more than that? Today?
We're in an informal discussion forum. I don't think the bar we're looking for is some rigorous deductive proof. The above matches my experience as well. It's a handy, applied, interactive version of an Internet search.
If someone has a different experience that would be interesting. But this just seems like navel-gazing over semantics.
> Are you saying that AI does have an inherent idea of what it's doing or is doing more than that?
No. I stated that OP cannot make that kind of blanket, non-temporally-constrained statement about artificial intelligence.
> We're in an informal discussion forum. I don't think the bar we're looking for is some rigorous deductive proof
We're in a technology-oriented discussion forum, the minimum bar to any claim should be that it is supported by evidence, otherwise it should be presented as what it is: opinion.
> this just seems like navel-gazing over semantics.
In my opinion, conversation is much easier when we can agree that words should mean something. Imprecise language matched with an authoritative tone can mislead an audience. This topic in particular is rife with imprecise and uninformed arguments, and so we should take more care to use our words correctly, not less.
Furthermore, my argument goes beyond semantics, as it also deals with the importance of constraints when making broad, unbacked claims.
>The same kind of bias keeps resurfacing in every major system: Claude, Gemini, Llama, clearly this isn’t just an OpenAI problem, it’s an LLM problem.
It's not an LLM problem, it's a problem of how people use it. It feels natural to have a sequential conversation, so people do that, and get frustrated. A much more powerful way is parallel: ask LLM to solve a problem. In a parallel window, repeat your question and the previous answer and ask to outline 10 potential problems. Pick which ones appear valid, ask to elaborate. Pick your shortlist, ask yet another LLM thread to "patch" the original reply with these criticisms, then continue the original conversation with a "patched" reply.
LLMs can't tell legitimate concerns from nonsensical ones. But if you, the user, can, they will pick it up and do all the legwork.
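If you script it instead of clicking through a chat window, the whole loop is only a handful of calls. Here llm_complete is a made-up stand-in for whatever model API you actually use, stubbed out so the sketch is self-contained:

    #include <iostream>
    #include <string>

    // Made-up stand-in for a real LLM API call; stubbed so the sketch compiles and runs.
    std::string llm_complete(const std::string &prompt) {
        return "[model reply to: " + prompt.substr(0, 40) + "...]";
    }

    int main() {
        std::string question = "How should I shard this 2 TB Postgres table?";

        // Thread 1: get a first answer.
        std::string answer = llm_complete(question);

        // Thread 2, fresh context: ask for criticism of that answer.
        std::string critique = llm_complete(
            "Question: " + question + "\nProposed answer: " + answer +
            "\nList 10 potential problems with this answer.");

        // The human step: keep only the concerns that are actually valid.
        std::string keptConcerns = critique;   // in reality, a hand-picked subset

        // Thread 3, fresh context again: patch the original answer with the kept concerns.
        std::string patched = llm_complete(
            "Question: " + question + "\nAnswer: " + answer +
            "\nRevise the answer to address these concerns:\n" + keptConcerns);

        std::cout << patched << "\n";   // then continue the original conversation from here
    }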
I worked on a very early iteration of LMs (they weren't "large" yet) in grad school 20 years ago and we drove it with a Makefile. The "prompt" was an input file and it would produce a response as an artifact. It never even occurred to us to structure it as a sequential "chat" because at that point it was still too slow. But it does make me wonder how much the UX changes the way people think about it.
There is the "classic" text completion interface that OpenAI used before ChatGPT. Basically a text document that you ask the LLM to extend (or insert text at a marker somewhere in the text). Any difference between your text and the AI's text is only visible in text color in the editor and not passed on to the LLM.
That does favor GP's workflow: You start the document with a description of your problem and end with a sentence like: "The following is a proposed solution". Then you let the LLM generate text, which should be a solution. You edit that to your taste, then add the sentence: "These are the 10 biggest flaws with this plan:" and hit generate. The LLM doesn't know that it came up with the idea itself, so it isn't biased towards it.
Of course this style is much less popular with users and much harder to do things like instruction tuning. It's still reasonably popular in creative writing tools and is a viable approach for code completion
ChatGPT is how old again? People are FAR more familiar with other interfaces. For coding, autocomplete is a great already-existing interface; products that use it don't get as much hype, though, as the ones that claim to be independent agents that you're talking to. There's any number of common interfaces attached to that (like the "simplify this" right-click for Copilot) for refactoring, dealing with builds, tests, etc. No shortage of places you could further drop in an LLM instead of pushing things primarily through "chat with me" to type out "refactor this to make these changes".
Or you could make the workflow the person described not just more automatic but more integrated: generate the output, then have labels with hover text or inline overlays along the lines of "this does this", "here are alternative ways to do this", or "this might be an issue with this approach." All of that could be done much better in a rich graphical user interface than by slamming it into a chat log. (This is one of Cursor's biggest edges over ChatGPT: the interactive change highlighting and approval, in my tool, in my repo, versus a chat interface.)
In some other fields:
* email summarization is automatic or available at the press of a button, nobody expects you to open up a chat agent and go "please summarize this email" after opening a message in Gmail
* photo editors let you use the mouse to select an area and then click a button labeled "remove object" or such instead of requiring you to try to describe the edit in a chat box. sometimes they mix and match it too - highlight the area THEN describe a change. But that's approximately a million times better than trying to chat to it to describe the area precisely.
There are other scenarios we haven't figured out the best interface for because they're newer workflows. But the chat interface is just so unimaginative. For instance, I spent a long time trying to craft the right prompt to tweak the output of ChatGPT turning a picture of my cat into a human. I couldn't find the right words to get it to understand and execute what I didn't like about the image. I'm no UX inventor, but one simple thing that would've helped would've been an eye-doctor-style "here's two options, click the one you like more." (Photoshop has something like this, but it's not as directed; it's more just "choose one of these, or re-roll", but at least it avoids polluting the chat context history as much.) Or let me select particular elements and change or refine them individually.
A more structured interface should actually greatly help the model, too. Instead of having just a linear chat history to digest, it would have well-tagged and categorized feedback that it could keep fresh and re-insert into its prompts behind the scenes continually. (You could also try to do this based on the textual feedback, but like I said, it seemed to not be understanding what my words were trying to get at. Giving words as feedback on a picture just seems fundamentally high-loss.)
I find it hard to believe that there is any single field where a chat interface is going to be the gold standard. But: they're relatively easy to make and they let you present your model as a persona. Hard combo to overcome, though we're seeing some good signs!
> It's not an LLM problem, it's a problem of how people use it.
True, but perhaps not for the reasons you might think.
> It feels natural to have a sequential conversation, so people do that, and get frustrated. A much more powerful way is parallel: ask LLM to solve a problem.
LLMs do not "solve a problem." They are statistical text (token) generators whose response is entirely dependent upon the prompt given.
> LLMs can't tell legitimate concerns from nonsensical ones.
Again, because LLM algorithms are very useful general purpose text generators. That's it. They cannot discern "legitimate concerns" because they do not possess the ability to do so.
Right, or at any rate, the problems they do solve are ones of document-construction, which may sometimes resemble a different problem humans are thinking of... but isn't actually being solved.
For example, an LLM might take the string "2+2=" and give you "2+2=4", but it didn't solve a math problem, it solved a "what would usually get written here" problem.
> Right, or at any rate, the problems they do solve are ones of document-construction, which may sometimes resemble a different problem humans are thinking of... but isn't actually being solved.
This is such a great way to express the actuality in a succinct manner.
You're saying roughly "you can't trust the first answer from an LLM but if you run it through enough times, the results will converge on something good". This, plus all the hoo-hah about prompt engineering, seem like clear signals that the "AI" in LLMs is not actually very intelligent (yet). It confirms the criticism.
Not exactly. Let's say you-the-human are trying to fix a crash in a program, knowing just the source location. You would look at the code and start hypothesizing:
* Maybe, it's because this pointer is garbage.
* Maybe, it's because that function doesn't work as the name suggests.
* HANG ON! This code doesn't check the input size, that's very fishy. It's probably the cause.
So, once you get that "Hang on" moment, here comes the boring part of setting breakpoints, verifying values, rechecking observations, and finally fixing that thing.
LLMs won't get the "hang on" part right, but once you point it right in their face, they will cut through the boring routine like there's no tomorrow. And you can also spin up 3 instances to investigate 3 hypotheses and give you some readings on a silver platter. But you-the-human need to be calling the shots.
You can make a better tool by training the service (some of which involves training the model, some of which involves iterating on the prompt(s) behind the scene) to get a lot of the iteration out of the way. Instead of users having to fill in a detailed prompt we now have "reasoning" models which, as their first step, dump out a bunch of probably-relevant background info to try to push the next tokens in the right direction. A logical next step if enough people run into the OP's issue here is to have it run that "criticize this and adjust" loop internally.
But it all makes it very hard to tell how much of the underlying "intelligence" is improving vs how much of the human scaffolding around it is improving.
Yeah given the stochastic nature of LLM outputs this approach and the whole field of prompt engineering feels like a classic case of cargo cult science.
That's not how you do it. Ask it first to create exhaustive tests around the first version. Tell it what to test for. Then, ask to change specific things, one at a time, re-run tests between the steps, and ask it to fix things. Rinse-repeat-review. It is faster than doing by hand, but you still need to be calling the shots.
You need to do it in smaller, incremental steps. Outline the overall architecture in your head, ask the AI to create empty structs/classes. Build it. Ask it to implement one part, leaving others empty. Test it. Ask it to add the next thing, and so on.
Every step should only affect a handful of classes or functions, that you can still keep in your head and easily verify. Basically, same thing as if you were doing it by hand, but at a higher abstraction level, so faster and less mentally tiring.
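For example, the first step's output is nothing more than a compilable skeleton (all the names here are made up), and each later prompt fills in exactly one of the empty bodies:

    #include <cstddef>
    #include <cstdint>

    // Step 1: ask for the empty skeleton only, build it, review it.
    class PacketParser {
    public:
        bool Feed(const std::uint8_t * /*data*/, std::size_t /*size*/) { return false; }  // step 2: framing
        bool HasCompletePacket() const { return false; }                                   // step 3: completeness check
    };

    class PacketRouter {
    public:
        void Route(const PacketParser & /*parser*/) {}                                     // step 4: dispatch
    };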
Shameless plug: I am working on a new cross-platform IDE designed for just this kind of workflow. It has basic C/C++ support already: https://sysprogs.com/CodeVROOM/?features=why
I think there's nuance. If a human can solve a problem without second thoughts or hesitation (hey, stop, this doesn't look right, are there other options?), an LLM will tear through it at breakneck speed.

But if there are things worth hesitating over and weighing, an LLM will fly past them at cruising speed.
The statement you quoted is ambiguous. I'd say "Claude makes the boring parts of coding super fast, tearing through the stuff I didn't want to do and helping me get to the stuff I did want to do."
I'm an experienced dev (this year is my 10 year anniversary of 'officially' being in the industry). I've been using Claude Code over these last two weeks. It's like a magical code generator tool e.g. protoc; if you can describe the rote and boring "do this tedious x->y translation", then it can probably bang that out with tests in a couple minutes, making it easy for you to focus on the logic. It's not that Claude Code is doing the "programming", it's more like Claude is doing the typing (and it types REALLY fast). Literally imagine if you could type as fast as you could think, like just saying "oh I'll need functions covering x/y/z behaviors and cases, similar to the other functions already present but tailored like <way>." All you had to do was type that sentence and it wrote basically what you would have written, but instantly.
> Literally imagine if you could type as fast as you could think
This is exactly how vibe coding has been feeling to me lately. I do read all the code while it writes it, and I direct the AI to write/edit/fix the code, so it's like faster programming, but we're still in the director's seat.

Also, focusing on logic and the problems at hand, rather than syntax, APIs, or function signatures, lets you build at much greater speed and concentrate on architecture and actual UX, not just code.
LLMs are a glorified regex engine with fuzzy input. They are brilliant at doing boring repetitive tasks with known outcome.
- Add a 'flags' argument to constructors of classes inherited from Record.
- BOOM! Here are 25 edits for you to review.
- Now add "IsCaseSensitive" flag and update callers based on the string comparison they use.
- BOOM! Another batch of mind-numbing work done in seconds.
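To make it concrete, that kind of edit looks roughly like this (Record and the derived class are made up for illustration; the "before" state is in the comments):

    #include <string>

    // First prompt: "add a 'flags' argument to constructors of classes inherited from Record".
    enum RecordFlags { NoFlags = 0, IsCaseSensitive = 1 };

    struct Record {
        explicit Record(RecordFlags flags) : m_Flags(flags) {}
        RecordFlags m_Flags;
    };

    struct NameRecord : Record {
        // Before: NameRecord() {} with no flags anywhere.
        explicit NameRecord(RecordFlags flags) : Record(flags) {}
        bool Matches(const std::string &a, const std::string &b) const;
    };

    // Second prompt: update each caller to pass IsCaseSensitive or NoFlags,
    // depending on which string comparison that call site was already using:
    //     NameRecord r(IsCaseSensitive);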
If you get the hang of it and start giving your LLMs small, sizable chunks of work and validating the results, it's just less mentally draining than doing it by hand. You start thinking in much higher-level terms, like interfaces, abstraction layers, and mini-tests, and the AI breezes through the boring work of whether it should be a "for", "while", or "foreach".
But no, don't treat it as another human capable of making decisions. It cannot. It's a fancy machinery for applying known patterns of human knowledge to the locations where you point based on a vague hint, but not a replacement for your judgement.
why is it insulting? It's an incredible piece of machinery for refracting natural language into other language. That itself accounts for a majority of orders people pass on to other people before something actually gets done.
> If you get the hang of it and start giving your LLMs small, sizable chunks of work and validating the results, it's just less mentally draining than doing it by hand. You start thinking in much higher-level terms, like interfaces, abstraction layers, and mini-tests, and the AI breezes through the boring work of whether it should be a "for", "while", or "foreach".
Isn't that the proper programming state of mind? I think about keywords about as much as a pianist thinks about the keys when playing. Especially with vim, where I can edit larger units reliably, so I don't have to follow the cursor with my eyes and can navigate using my mental map.
Ultimately, yes, programming with LLMs is exactly the sort of programming we've always tried to do. It gets rid of the boring stuff and lets you focus on the algorithm at the level you need to - just like we try to do with functions and LSP and IDE tools. People needn't be scared of LLMs: they aren't going to take our jobs or drain the fun out of programming.
But I'm 90% confident that you will gain something from LLM-based coding. You can do a lot with our code editing tools, but there's almost certainly going to be times when you need to do a sequence of seven things to get the outcome you want, and you can ask the computer to prepare that for you.
> If I may ask - how are humans in general different? Very few of us invent new ideas of significance - correct?
Firstly, "very few" still means "a large number of" considering how many of us there are.
Compared to "zero" for LLMs, that's a pretty significant difference.
Secondly, humans have a much larger context window, and it is not clear how LLMs in their current incarnation can catch up.
Thirdly, maybe more of us invent new ideas of significance that the world will just never know. How will you be able to tell if some plumber deep in West Africa comes up with a better way to seal pipes at joins? From what I've seen of people, this sort of "do trivial thing in a new way" happens all the time.
Not only "our context window" is larger but we can add and remove from it on-the-fly, or rely on somebody else who, for that very specific problem, has a far better informed "context window", that BTW they're adding to/removing from on-the-fly as well.
I think if we fully understood this (both what exactly human consciousness is and how LLMs differ, not just experimentally but theoretically), we would then be able to truly create human-level AI.
I think they are trying to push back against generated pages. I faced this exact problem myself. We recently published an interactive source code navigation tool [0] where you can find examples of commonly used functions from some embedded SDKs. Google indexed it immediately, and almost immediately it got a spike of views.
Then an interesting thing happened. Most pages simply disappeared from the results. Search Console shows them as indexed, no problems, no manual actions, but if you google those functions, the results are not there.
It took some statistical analysis to figure out that they appear to be capping the number of pages. Out of all the pages Google crawled, it picked some percentage of the "most important" ones and it's showing those. The importance, by the looks of it, was computed from the number of incoming links, prioritizing pages for common stuff like int32_t that nobody googles.
It's not ideal, but it kinda makes sense. It's 2024. You can use AI to generate plausible content for any search query you can think of. And unless they put some kind of limits, we'll get overrun with completely useless LLM-churned stuff.
You have confirmed my fears: I'm publishing a text heavy webpage with separate articles per page as well as a large single page. Google will not like that.
Well, to be perfectly honest, it's a hard question for an LLM that reasons in tokens and not letters. Reminds me of that classic test that kids easily pass and grownups utterly fail. The test looks like this: continue the sequence:
0 - 1
5 - 0
6 - 1
7 - 0
8 - 2
9 - ?
Grownups try to find a pattern in the numbers, different types of series, progressions, etc. The correct answer is 1 because it's the number of circles in the graphical image of the number "9".
These tests always make me wonder: What qualifies as a valid pattern rule?
For example, why wouldn't "0" be a correct answer here (rule being "every other number on the right should be 0, other numbers do not have a pattern")?
Exactly, it's completely arbitrary. I like to frame it in terms of fitting n points (the existing data in the problem) to a polynomial of degree n+1, where there's an infinite number of ways to pick a solution and still satisfy the initial data.
Maybe the "solution with the lowest Kolmogorov complexity".
In a sibling comment, I replied that usually a repeating pattern can also be applied, but that one usually requires storing the n-sequence, rarely making it the shortest encodable rule.
I think it's better phrased as "find the best rule", with a tacit understanding that people mostly agree on what makes a rule decent vs. terrible (maybe not on what makes one great) and a tacit promise that the sequence presented has at least one decent rule and does not have multiple.
A rule being "good" is largely about simplicity, which is also essentially the trick that deep learning uses to escape no-free-lunch theorems.
Well, I had to try, just for the fun of it. The immediate reply was the following:
[...]
Starting with 0 paired with 1. If I think about functions or some formula that could map these values, but that might be too complicated.
Alternatively, perhaps these numbers represent something specific. For example, maybe they're representing the number of certain features in the number itself, like the number of closed loops in the digit's shape.
Let's think about that:
0 has one closed loop.
5 has no closed loops.
6 has one closed loop.
7 has no closed loops.
8 has two closed loops.
9 has one closed loop.
Wait a minute, that seems to match the sequence:
0 (1 loop) - 1
5 (0 loops) - 0
6 (1 loop) - 1
7 (0 loops) - 0
8 (2 loops) - 2
So, if 9 has one closed loop, then it should be paired with 1.
But let's double-check this theory with the given pairs.
0 has one loop, paired with 1 – matches.
5 has no loops, paired with 0 – matches.
6 has one loop, paired with 1 – matches.
7 has no loops, paired with 0 – matches.
8 has two loops, paired with 2 – matches.
Yes, this seems consistent.
Therefore, 9 should have one closed loop, so it should be paired with 1.
[...]
Then it continued trying another approach related to the number of angles in the digit shape, but it discarded it. Then "Alternatively, perhaps it's related to the number of segments that are 'on' in a seven-segment display." and so on.
A possible lesson to infer from this example of human cognition would be that LLMs that can't solve the strawberry test are not automatically less cognitively capable than another intelligent entity (humans, by default).

An extension of the idea could be that many other similar tests trying to measure and/or evaluate machine cognition, when the LLM fails them, are not precisely measuring or evaluating anything other than a specific edge case in which machine cognition fails (i.e., for the specific LLM / AI system being evaluated).

Maybe the models are actually more intelligent than they seem, like an adult failing to count the circles in the graphical images of the numbers in the problem mentioned above.
I don't know if this is being done already, but couldn't we add some training data to teach the LLM how to spell? We also teach kids what each letter means and how they combine into words. Maybe we can do this with tokens as well? E.g.:
Token 145 (ar) = Token 236 (a) + Token 976 (r)
Repeat many times with different combinations and different words?
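If the tokenizer's vocabulary is available, generating that kind of data is only a few lines. A rough sketch (the token ids below are made up to match the example above):

    #include <cstdio>
    #include <map>
    #include <string>

    int main() {
        // Made-up fragment of a tokenizer vocabulary: id -> surface string.
        std::map<int, std::string> vocab = { {145, "ar"}, {236, "a"}, {976, "r"} };

        // Reverse lookup to find the single-character tokens.
        std::map<std::string, int> byText;
        for (const auto &kv : vocab) byText[kv.second] = kv.first;

        // Emit one spelling lesson per multi-character token.
        for (const auto &kv : vocab) {
            const std::string &text = kv.second;
            if (text.size() < 2) continue;
            std::printf("Token %d (%s) =", kv.first, text.c_str());
            for (size_t i = 0; i < text.size(); i++) {
                std::string ch(1, text[i]);
                std::printf("%s Token %d (%s)", i ? " +" : "", byText[ch], ch.c_str());
            }
            std::printf("\n");
        }
        // Prints: Token 145 (ar) = Token 236 (a) + Token 976 (r)
    }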
> but couldn't we add some training data to teach the LLM how to spell?
Sure, but then we would lose a benchmark to measure progress of emergent behavior.
The goal is not to add one capability at a time by hand - because this doesn’t scale and we would never finish. The goal is that it picks up new capabilities automatically, all on its own.
Training data is already provided by humans and certainly already does include spelling instruction, which the model is blind to because of forced tokenization.
Tokenizing on words is already an arbitrary capability added one at a time. It's just the wrong one. LLMs should be tokenizing by letter, but they don't, because they aren't good enough yet, so they get a massive deus ex machina (human ex machina?) of wordish tokenization.
LLMs are a very good tool for a particular class of problems. They can sift through endless amounts of data and follow reasonably ambiguous instructions to extract relevant parts without getting bored. So, if you use them well, you can dramatically cut down the routine part of your work, and focus on more creative part.
So if you had that great idea that takes a full day to prototype, and hence you never bothered, an LLM can whip out something reasonably usable in under an hour. So it will make idea-driven people more productive. The problem is, you don't become a high-level thinker without doing some monkey work first, and if we delegate it all to LLMs, where will the next generation of big thinkers come from?
I was just messing around with LLMs all day, so had a few test cases open. Asked it to change a few things in a ~6KB C# snippet in a somewhat ambiguous, but reasonable way.
GPT-4 did this job perfectly. Qwen:72b did half of the job, completely missed the other half, and renamed 1 variable that had nothing to do with the question. Llama3.1:70b behaved very similarly to Qwen, which is interesting.
OpenCoder:8b started reasonably well, then randomly replaced "Split('\n')" with "Split(n)" in unrelated code, and then went completely berserk, hallucinating non-existent StackOverflow pages and answers.
My best guess is that you shouldn't train it on mostly code. Natural language conversations used to train other models let them "figure out" human-like reasoning. If your training set is mostly code, it can produce output that looks like code, but it will have little value to humans.
Edit: to be fair, llama3.2:3b also botched the code. But it did not hallucinate complete nonsense at least.
Here is a quite comprehensive LLM coding leaderboard: https://aider.chat/docs/leaderboards/

And they update it quite quickly with new model releases.