No, they don't. Being able to "check one's work" implies that they can be held accountable, that they can tell right from wrong, when in reality they're merely text predictors.
If you think LLMs can check their work, then you are doing a terrible job of writing software. Plain and simple.
They even go as far as "cheating" so that failing tests pass, writing incorrect tests, or outright leaking code (lol) like the latest Claude Code blunder. Is this the tool the original commenter "is using wrong, plain and simple"? Or do you have access to some other model that works in a wildly different way than generating text predictions?
In this case: make a local copy of the DB, fill it with a set of records with a known expected output for the query, then check to see if the query produces what you want.
You could then have it write queries that check the various assumptions that went into that artificial data set. If it can find the assumptions broken, add records like that to the test set.
Same old agentic programming techniques as ever: use your engineering skill to set up feedback loops. Stuff that was painful to do as an engineer when checking your own work is now straightforward.
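A minimal sketch of that feedback loop in Python with sqlite3. The table, columns, fixture records, and query are all hypothetical stand-ins for whatever the real schema looks like:

```python
import sqlite3

# Local throwaway copy of the database, seeded with known fixtures.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")

# Records whose expected query output is known in advance.
fixtures = [(1, "alice", 10.0), (2, "bob", 25.0), (3, "alice", 5.0)]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", fixtures)

# The query under test (e.g. the one the agent produced).
query = (
    "SELECT customer, SUM(total) FROM orders "
    "GROUP BY customer ORDER BY customer"
)

expected = [("alice", 15.0), ("bob", 25.0)]
actual = conn.execute(query).fetchall()
assert actual == expected, f"query regressed: {actual!r}"
```

The point is only that the pass/fail signal is deterministic: the agent can rerun this after every change to the query, and broken assumptions become new fixture rows.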
Feedback loops require a deterministic metric for success. You are doing the equivalent of using a slot machine to decide whether something is right or wrong.
My assumption is the model no longer actually thinks in tokens, but in internal tensors. This is advantageous because it doesn't have to collapse the decision and can simultaneously propagate many concepts per context position.
I would expect to see a significant wall-clock improvement if that were the case - Meta's Coconut paper was ~3x faster than token-space chain of thought because latents contain a lot more information than individual tokens.
Separately, I think Anthropic are probably the least likely of the big 3 to release a model that uses latent-space reasoning, because it's a clear step down in the ability to audit CoT. There has even been some discussion that they accidentally "exposed" the Mythos CoT to RL [0] - I don't see how you would apply a reward function to latent space reasoning tokens.
There’s also a paper [0] from many well known researchers that serves as a kind of informal agreement not to make the CoT unmonitorable via RL or neuralese. I also don’t think Anthropic researchers would break this “contract”.
> If that's true, then we're following the timeline
Literally just a citation of Meta's Coconut paper[1].
Notice the 2027 folks' contribution to the prediction is that this will have been implemented by "thousands of Agent-2 automated researchers...making major algorithmic advances".
So, considering that discussion of latent-space reasoning dates back to 2022[2] (via CoT unfaithfulness, looped transformers, using diffusion to refine latent-space thoughts, etc., all published before AI 2027), to be "following the timeline of ai-2027" we'd actually need to verify not only that this was happening, but that it was implemented through major algorithmic advances made by thousands of automated researchers. Otherwise they don't seem to have made a contribution here.
> For example, perhaps models will be trained to think in artificial languages that are more efficient than natural language but difficult for humans to interpret.
The first 500 or so tokens are raw thinking output; then the summarizer kicks in for longer thinking traces. Sometimes longer thinking traces leak through, or the summarizer model (i.e. Claude Haiku) refuses to summarize them and includes a direct quote of the passage it won't summarize. The summarizer prompt can be viewed [here](https://xcancel.com/lilyofashwood/status/2027812323910353105...), among other places.
Are you sure? It would be great to get official/semi-official validation that thinking is or is not resolved to a token embedding value in the context.
I do this a lot. Start by telling the AI to just listen and only provide feedback when asked. Lay out your current line of thinking conversationally. Periodically ask the AI to summarize/organize your thoughts "so far". Tactically ask for research into a decision or topic you aren't sure about and then make a decision inline.
Then once I feel like I have addressed all the areas, I ask for a "critical" review, which usually pokes holes in something that I need to fix. Finally I have the AI draft up a document (though you generally have to tell it to be as concise and clear as possible).
I usually create a document/folder with my thinking on what I want to do, any background information that is relevant, conversations on the topic, technical manuals, links etc. Then enter a conversation and explore the problem space and do something very similar to what you are doing.
The important point I'm trying to reinforce is that LLMs are not capable of calculation. They can give an answer based on the fact that they have seen lots of calculations and their results, but they cannot actually perform mathematical functions.
Do you know what "LLM" stands for? They are large language models, built on predicting language.
They are not capable of mathematics because mathematics and language are fundamentally separated from each other.
They can give you an answer that looks like a calculation, but they cannot perform a calculation. The most convincing LLMs have even been programmed to recognize that they have been asked to perform a calculation, hand the task off to a calculator, and then receive the calculator's output back as a prompt.
But it is fundamentally impossible for an LLM to perform a calculation entirely on its own, the same way it is fundamentally impossible for an image recognition AI to suddenly write an essay or a calculator to generate a photo of a giraffe in space.
People like to think of "AI" as one thing but it's several things.
What calculations? Do you mean "3+5" or a generic Turing-machine like model?
In either case, this "it's a language model" is a pretty dumb argument to make. You may want to reason about the fundamental architecture, but even that quickly breaks down. A sufficiently large neural network can execute many kinds of calculations. In "one shot" mode it can't be Turing complete, but by the same weird technicality, neither is your computer: it doesn't have an infinite tape. It simply doesn't matter from a practical perspective, unless you actually go "out of bounds" during execution.
50T parameters give plenty of state space to do all kinds of calculations, and you really can't reason about it in a simplistic way like "this is just a DFA".
> What calculations? Do you mean "3+5" or a generic Turing-machine like model?
Either one. An LLM cannot solve 3+5 by adding 3 and 5. It can only "solve" 3+5 by knowing that within its training data, many people have written that 3+5=8, so it will produce 8 as an answer.
An LLM, similarly, cannot simulate a Turing machine. It can produce a text output that resembles a Turing machine based on others' descriptions of one, but it is not actually reading and writing bits to and from a tape.
This is why LLMs still struggle to tell you how many r's are in the word "strawberry". They can't count. They can't do calculations. They can only reproduce text based on having examined the human corpus's mathematical examples.
The reason "strawberry" is hard for LLMs is that the model sees something like $str-$aw-$berry: 3 identifiers it can't see into. Can you write down a random word you just heard in a language you don't speak?
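To illustrate the point, here's a sketch in Python. The three-way split and the token IDs are hypothetical (real tokenizers vary by model), but the asymmetry is the same: counting letters is trivial on the character string, while the model only receives opaque integer IDs:

```python
# Hypothetical BPE-style split of "strawberry"; real tokenizers vary.
tokens = ["str", "aw", "berry"]
token_ids = [302, 1124, 19772]  # illustrative IDs; the model sees only these

# On the raw character string, the count is trivial.
assert "strawberry".count("r") == 3

# If you could see inside each token, you could still count per piece...
per_token_r = [t.count("r") for t in tokens]  # "str"=1, "aw"=0, "berry"=2
assert sum(per_token_r) == 3

# ...but the model's input is just three opaque integers, with no direct
# access to the characters that each ID stands for.
```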
Mathematics is a language. Everything we can express mathematically, we can also express in natural language. The really interesting underlying question is: is there anything worth knowing that cannot be expressed by language? That's the theoretical boundary of LLM capability.
>it is fundamentally impossible for an image recognition AI to suddenly write an essay
You can already do this today with every frontier model. You can give it an image and have it write an essay from it. Both patches (parts of images) and text get turned into tokens in the sequence the LLM learns to model.
This is a really poor take: trying to put a firewall between mathematics and language, and implying that something whose conceptual understanding is rooted in language is incapable of reasoning in mathematical terms.
You're also conflating "mathematics" and "calculation". Who cares about calculation? As you say, we have calculators for that.
Mathematics is all just logical reasoning and exploration using language, just a very specific, dense, concise, low-level language. You can always take any mathematical formula and express it as "language"; it will just take far more "symbols".
This might be the worst take in this entire comment section. And I'm not even an overly hyped vibe coder, just someone who understands mathematics.
What is "passion"? For example, I vibe coded an art display this weekend for myself, for a monitor I have on my wall. I am VERY PROUD of it.. it is in Godot, coincidentally. I think it turned out well. Did I spend weeks on it? Did I even learn Godot? No.. but I did spend my weekend late nights figuring out what I wanted and working with an AI to make it.
In some ways the kind of complaining I see is like complaining about a chef's meal because the chef didn't mine the ore to make his knife.
Look, in the specific case of this post... none of the games are "good".. however.. one-shotting games WITH ASSETS.. seems pretty impressive to me.
Isn't this just disingenuous? No disrespect to you; I just see this kind of sophomoric take so often in response to the very normal reaction of the OP. A year ago it was in vogue to call the OP "ableist" or something. The idea that the OP's concern is like expecting a chef to "mine the ore" for his knife is a bit ridiculous. A better example would be someone feeling ownership of a painting on their wall that they asked an artist friend to paint for them; at least that is more reasonable. Also, since you asked: passion means to struggle, which I think follows more closely the idea of learning the craft. This kind of reductionism would deny that craftsmanship exists, as if sculpting David were the same as buying the finished product on the open market. I think we all know this isn't true, but there is some kind of forcefield on the Internet that means we have to pretend it is.
Really well said. I hate that every time I say I value craftsmanship, skill, and effort in art, people flock to this reductionism: "Well, did the painter make his own dyes? Did the developer make his own processor to run the game on?"
I am a high-quality/craftsmanship person. I like coding and puzzling. I am highly skilled in functional-leaning, object-oriented decomposition and systems design. I'm also pretty risk averse.
I also have always believed that you should always be "sharpening your axe". For things like Java development, or anywhere I couldn't use a concise syntax, I would make extensive use of dynamic templating in my IDE. Want a builder pattern? Bam, auto-generated.
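For anyone unfamiliar with the pattern such a template expands to, here's a hypothetical sketch (in Python for consistency with this thread; the original context was Java, but the boilerplate shape is the same, which is exactly why it's worth auto-generating):

```python
# Hypothetical hand-rolled builder of the kind an IDE template can generate.
class User:
    def __init__(self, name, email, age=None):
        self.name = name
        self.email = email
        self.age = age

class UserBuilder:
    def __init__(self):
        self._name = None
        self._email = None
        self._age = None

    def name(self, name):
        self._name = name
        return self  # returning self enables method chaining

    def email(self, email):
        self._email = email
        return self

    def age(self, age):
        self._age = age
        return self

    def build(self):
        return User(self._name, self._email, self._age)

# Fluent construction via the generated builder.
user = UserBuilder().name("Ada").email("ada@example.com").age(36).build()
```

None of this is interesting to write by hand; one setter per field, `return self`, and a `build()` at the end is purely mechanical, which is the point of templating it.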
Now when LLMs came out they really took this to another level. I'm still working on the problems.. even when I'm not writing the lines of code. I'm decomposing the problems.. I'm looking at (or now debating with the AI) what is the best algorithm for something.
It is incredibly powerful.. and I still care about the structure.. I still care about the "flow" of the code.. how the seams line up. I still care about how extensible and flexible it is for extension (based on where I think the business or problem is going).
At the same time.. I definitely can tell you, I don't like migrating projects from TensorFlow v.X to TensorFlow v.Y.
> I'm looking at (or now debating with the AI) what is the best algorithm for something.
That line always makes me laugh. There are only two points to an algorithm: domain correctness and technical performance. For the first, you need to step out of the code. And for the second, you need proofs. Not sure what there is to debate.
Not true. There is also cost, in money or opportunity. And correctness or performance isn't binary: 4 or 5 nines, 6 or 7 decimal places of precision, just to name a few. That drives a lot of discussion.
There may be other considerations as well -- licensing terms, resources, etc.
I really should spend some time analyzing what I do to get the good output I get..
One thing that is fairly low effort that you could try is find code you really like and ask the model to list the adjectives and attributes that that code exhibits. Then try them in a prompt.
With LLMs generally you want to adjust the behavior at the macro level by setting things like beliefs and values, vs at the micro level by making "rules".
By understanding how the model maps the aspects that you like about the code to language, that should give you some shorthand phrases that give you a lot of behavioral leverage.
Edit:
Better yet.. give a fresh context window the "before" and "after" and have it provide you with contrasting values, adjectives, etc.