Hacker News | new | past | comments | ask | show | jobs | submit | XenophileJKO's comments

You might be surprised, look at Dallas. They have a pretty extensive rail network.

Dallas does not have permissive zoning, even in comparison to a city like Seattle.

If you think a language model can't check its work, then you are using the tools wrong. Plain and simple.

Modern models are quite capable at surfacing and validating their assumptions and checking correctness of solutions.

Oversight helps you build confidence in the solutions. Is it perfect? No.. but way better than most engineers I ask to check things.


No, they can't. Being able to "check one's work" implies being able to tell right from wrong and being held accountable, when in reality they're merely text predictors.

If you think an LLM can check its own work, then you are doing a terrible job at writing software. Plain and simple.

They even go as far as "cheating" so tests pass, writing incorrect tests, or outright leaking code (lol) like the latest Claude Code blunder. Is this the tool the original commenter "is using wrong, plain and simple"? Or do you have access to some other model that works in a wildly different way than generating text predictions?


you can have it write test cases, though.

in this case: make a local copy of the db, fill it with a set of records with an expected output for the query, then check to see if the query produces what you want.

you could then have it write queries that check the various assumptions that went into that artificial data set. if it finds broken assumptions, add records like that to the test set.

same old agentic programming techniques as ever. use your engineering skill to set up feedback loops. stuff that was painful to do as an engineer for checking your work is now straightforward
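a minimal sketch of that loop, assuming a SQLite copy of the db; the table, records, and query are invented for illustration:

```python
import sqlite3

# Throwaway copy of the schema, seeded with records whose
# expected query output is known in advance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "paid", 10.0), (2, "refunded", 5.0), (3, "paid", 7.5)],
)

# The query under test (here: total revenue from paid orders).
query = "SELECT SUM(total) FROM orders WHERE status = 'paid'"

# The feedback loop: does the query produce what we want?
(result,) = conn.execute(query).fetchone()
assert result == 17.5, f"query returned {result}, expected 17.5"
```

the refunded record is exactly the kind of "assumption-breaking" row you'd keep adding to the set.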


The point is that you have to verify it yourself. Like you wrote: "check to see if the query produces what you want"

Otherwise the LLM can just write tests against whatever it wrote and not what is expected. This happens often with the top models too.

Someone needs to check that the tests work, review whether they cover edge cases, etc.
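a hedged illustration of that failure mode: a test derived from the implementation's own behavior is vacuously green. The function name and bug below are invented for the example.

```python
# Hypothetical buggy implementation the model produced.
def paid_total(orders):
    # Bug: includes refunded orders too.
    return sum(total for _, status, total in orders)

orders = [(1, "paid", 10.0), (2, "refunded", 5.0)]

# A "test" written against whatever the code already does
# always passes, even though the function is wrong.
assert paid_total(orders) == paid_total(orders)  # tautology: always green

# The test encoding the *intent* (only paid orders count) exposes the bug:
expected = 10.0
assert paid_total(orders) != expected  # 15.0 != 10.0: the bug slipped through
```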


Feedback loops require a deterministic metric for success. You are doing the equivalent of using a slot machine to decide whether something is right or wrong.

I would argue that as long as you still have failed trials, we have room to improve trial vetting.

Don't or can't.

My assumption is the model no longer actually thinks in tokens, but in internal tensors. This is advantageous because it doesn't have to collapse the decision and can simultaneously propagate many concepts per context position.


I would expect to see a significant wall-clock improvement if that were the case - Meta's Coconut paper was ~3x faster than token-space chain-of-thought because latents contain a lot more information than individual tokens.

Separately, I think Anthropic are probably the least likely of the big 3 to release a model that uses latent-space reasoning, because it's a clear step down in the ability to audit CoT. There has even been some discussion that they accidentally "exposed" the Mythos CoT to RL [0] - I don't see how you would apply a reward function to latent space reasoning tokens.

[0]: https://www.lesswrong.com/posts/K8FxfK9GmJfiAhgcT/anthropic-...


There’s also a paper [0] from many well known researchers that serves as a kind of informal agreement not to make the CoT unmonitorable via RL or neuralese. I also don’t think Anthropic researchers would break this “contract”.

[0] https://arxiv.org/abs/2507.11473


If that's true, then we're following the timeline of https://ai-2027.com/

> If that's true, then we're following the timeline

Literally just a citation of Meta's Coconut paper[1].

Notice that the 2027 folks' contribution to the prediction is that this will have been implemented by "thousands of Agent-2 automated researchers...making major algorithmic advances".

So, considering that discussion of latent-space reasoning dates back to 2022[2] (through CoT unfaithfulness, looped transformers, using diffusion to refine latent-space thoughts, and so on, all published before AI 2027), to be "following the timeline of ai-2027" we'd actually need to verify not only that this was happening, but that it was implemented via major algorithmic advances made by thousands of automated researchers. Otherwise they don't seem to have made a contribution here.

[1] https://ai-2027.com/#:~:text=Figure%20from%20Hao%20et%20al.%...

[2] https://arxiv.org/html/2412.06769v3#S2


Hilariously, I clicked back a bunch and got a client side error. We have a long way to go. I wouldn't worry about it.

Care to expound on that? Maybe a reference to the relevant section?

Ctrl-F "neuralese" on that page.

You should just read the thing, whether or not you believe it, to have an informed opinion on the ongoing debate.

I did read it a while back. Was curious what the parent was referring to specifically.

March 2027 -> Neuralese recurrence and memory

> For example, perhaps models will be trained to think in artificial languages that are more efficient than natural language but difficult for humans to interpret.


That's not supposed to happen til 2027. Ruh roh.

Only if you ignore context and just ctrl-f in the timeline.

What are you, Haiku?

But yeah, in many ways we're at least a year ahead on that timeline.


Don't.

The first 500 or so tokens are raw thinking output; then the summarizer kicks in for longer thinking traces. Sometimes longer thinking traces leak through, or the summarizer model (i.e. Claude Haiku) refuses to summarize them and includes a direct quote of the passage it won't summarize. The summarizer prompt can be viewed here (https://xcancel.com/lilyofashwood/status/2027812323910353105...), among other places.


No, there is research in that direction and it shows some promise but that’s not what’s happening here.

Are you sure? It would be great to get official/semi-official validation of whether thinking is or is not resolved to a token embedding value in the context.

You can read the model cards. Claude thinks in regular text, but the summarizer is to hide its tool use and other things (web searches, coding).

Most likely. Would be cool to see an open source model use diffusion for thinking.

Don't. Thinking right now is just text: chain of thought, but just regular tokens and text being output by the model.

I do this a lot. Start by telling the AI to just listen and only provide feedback when asked. Lay out your current line of thinking conversationally. Periodically ask the AI to summarize/organize your thoughts "so far". Tactically ask for research into a decision or topic you aren't sure about and then make a decision inline.

Then once I feel I have addressed all the areas, I ask for a "critical" review, which usually pokes holes in something I need to fix. Finally, I have the AI draft up a document (though you generally have to tell it to be as concise and clear as possible).


I usually create a document/folder with my thinking on what I want to do, any background information that is relevant, conversations on the topic, technical manuals, links etc. Then enter a conversation and explore the problem space and do something very similar to what you are doing.


So this idea that they replay "text" they saw before is kind of wrong fundamentally. They replay "abstract concepts of varied conceptual levels".


The important point I'm trying to reinforce is that LLMs are not capable of calculation. They can give an answer based on the fact that they have seen lots of calculations and their results, but they cannot actually perform mathematical functions.


That is a pretty bold assertion for a meatball of chemical and electrical potentials to make.


Do you know what "LLM" stands for? They are large language models, built on predicting language.

They are not capable of mathematics because mathematics and language are fundamentally separated from each other.

They can give you an answer that looks like a calculation, but they cannot perform a calculation. The most convincing LLMs have even been programmed to recognize that they have been asked to perform a calculation, hand the task off to a calculator, and then receive the calculator's output back as a prompt.

But it is fundamentally impossible for an LLM to perform a calculation entirely on its own, the same way it is fundamentally impossible for an image recognition AI to suddenly write an essay or a calculator to generate a photo of a giraffe in space.

People like to think of "AI" as one thing but it's several things.


What calculations? Do you mean "3+5" or a generic Turing-machine like model?

In either case, this "it's a language model" is a pretty dumb argument to make. You may want to reason about the fundamental architecture, but even that quickly breaks down. A sufficiently large neural network can execute many kinds of calculations. In "one shot" mode it can't be Turing complete, but in a weird technicality neither does your computer have an infinite tape. It just simply doesn't matter from a practical perspective, unless you actually go "out of bounds" during execution.

50T parameters give plenty of state space to do all kinds of calculations, and you really can't reason about it in a simplistic way like "this is just a DFA".

Let alone when you run it in a loop.


> What calculations? Do you mean "3+5" or a generic Turing-machine like model?

Either one. An LLM cannot solve 3+5 by adding 3 and 5. It can only "solve" 3+5 by knowing that within its training data, many people have written that 3+5=8, so it will produce 8 as an answer.

An LLM, similarly, cannot simulate a Turing machine. It can produce a text output that resembles a Turing machine based on others' descriptions of one, but it is not actually reading and writing bits to and from a tape.

This is why LLMs still struggle at telling you how many r's are in the word "strawberry". They can't count. They can't do calculations. They can only reproduce text based on having examined the human corpus's mathematical examples.


With all due respect, this is just plain false.

The reason "strawberry" is hard for LLMs is that the model sees $str-$aw-$berry, 3 identifiers it can't see into. Can you write down a random word you just heard in a language you don't speak?
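to make that concrete: counting letters is trivial on characters, but not from opaque subword ids. The split below just mirrors the $str-$aw-$berry example above; real tokenizer vocabularies differ.

```python
# On characters, the famous question is trivial:
assert "strawberry".count("r") == 3

# What the model effectively sees: opaque ids standing in for
# subword pieces (hypothetical ids, split taken from the comment).
vocab = {101: "str", 102: "aw", 103: "berry"}
token_ids = [101, 102, 103]

# The ids alone carry no letter-level information; only decoding
# through the vocabulary recovers the characters to count.
decoded = "".join(vocab[t] for t in token_ids)
assert decoded == "strawberry"
```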


> In "one shot" mode it can't be Turing complete, but in a weird technicality neither does your computer have an infinite tape

Nor our brains, in fact.


Mathematics and language really aren't fundamentally separated from one another.

By your definition, humans can't perform calculation either. Only a calculator can.


Mathematics is a language. Everything we can express mathematically, we can also express in natural language. The real interesting, underlying question is: Is there anything worth knowing that cannot be expressed by language? - That's the theoretical boundary of LLM capability.


>it is fundamentally impossible for an image recognition AI to suddenly write an essay

You can already do this today with every frontier modal. You can give it an image and have it write an essay from it. Both patches (parts of images) and text get turned into tokens for the language the LLM is learning.


This is a really poor take, trying to put a firewall between mathematics and language and implying that something whose conceptual understanding is rooted in language is incapable of reasoning in mathematical terms.

You're also conflating "mathematics" and "calculation". Who cares about calculation? As you say, we have calculators for that.

Mathematics is all just logical reasoning and exploration using language, just a very specific, dense, concise, low-level language. But you can always take any mathematical formula and express it as "language"; it will just take far more "symbols".

This might be the worst take in this entire comment section. And I'm not even an overly hyped vibe coder, just someone who understands mathematics.


Don't listen to these people. Work on your "vision".. figure out what gameplay is "fun".. let the LLMs smooth out the resistance.

Things will change rapidly in the next 12-36 months, and people with vision will outlast "craftsmen" 100 to 1.


What is "passion"? For example.. I vibe coded an art display this weekend for myself, for a monitor I have on my wall. I am VERY PROUD of it.. it is in GODOT, coincidentally. I think it turned out well. Did I spend weeks on it? Did I even learn GODOT?.. No.. but I did spend my weekend late nights figuring out what I wanted and working with an AI to make it.

In some ways the kind of complaining I see is like complaining about a chef's meal because the chef didn't mine the ore to make his knife.

Look, in the specific case of this post... none of the games are "good".. however.. one-shotting games WITH ASSETS.. seems pretty impressive to me.


Isn't this just disingenuous? No disrespect to you; I just see this kind of sophomoric take so often in response to the very normal reaction of the OP. A year ago, it was in vogue to call the OP "ableist" or something. The idea that the OP's concern is like expecting a chef to "mine the ore" is a bit ridiculous. A better example would be someone feeling ownership of a painting on their wall after asking an artist friend to paint it; at least that is more reasonable. Also, since you asked, passion means to struggle, which I think follows more from the idea of learning the craft. This kind of reductionism would deny that craftsmanship exists, as if sculpting David were the same as buying the finished product on the open market. I think we all know this isn't true, but there is some kind of forcefield on the Internet that means we have to pretend it is.


By that note, no game producer or designer can have passion.


Really well said. I hate that every time I say I value craftsmanship, skill, and effort in art, people flock to this reductionism: "well, did the painter make his own dyes? Did the developer make his own processor to run the game on?"

There's levels to it, it's not black and white.


I think there is more to it than that.

I am a high-quality/craftsmanship person. I like coding and puzzling. I am highly skilled in functional-leaning object-oriented decomposition and systems design. I'm also pretty risk averse.

I also have always believed that you should always be "sharpening your axe". For things like Java development, or anywhere I couldn't use a concise syntax, I would make extensive use of dynamic templating in my IDE. Want a builder pattern? Bam, auto-generated.

Now when LLMs came out they really took this to another level. I'm still working on the problems.. even when I'm not writing the lines of code. I'm decomposing the problems.. I'm looking at (or now debating with the AI) what is the best algorithm for something.

It is incredibly powerful.. and I still care about the structure.. I still care about the "flow" of the code.. how the seams line up. I still care about how extensible and flexible it is for extension (based on where I think the business or problem is going).

At the same time.. I definitely can tell you, I don't like migrating projects from Tensorflow v.X to Tensorflow v.Y.


> I'm looking at (or now debating with the AI) what is the best algorithm for something.

That line always makes me laugh. There are only 2 points to an algorithm: domain correctness and technical performance. For the first, you need to step out of the code. And for the second, you need proofs. Not sure what there is to debate.


Not true. There is also cost, in money or opportunity. Correctness and performance aren't binary -- 4 or 5 nines, 6 or 7 decimal places of precision, just to name a few. That drives a lot of discussion.

There may be other considerations as well -- licensing terms, resources, etc.


I really should spend some time analyzing what I do to get the good output I get..

One thing that is fairly low effort that you could try is find code you really like and ask the model to list the adjectives and attributes that that code exhibits. Then try them in a prompt.

With LLMs generally you want to adjust the behavior at the macro level by setting things like beliefs and values, vs at the micro level by making "rules".

By understanding how the model maps the aspects that you like about the code to language, that should give you some shorthand phrases that give you a lot of behavioral leverage.

Edit: Better yet.. give a fresh context window the "before" and "after" and have it provide you with contrasting values, adjectives, etc.
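a rough sketch of that before/after prompt; the code snippets and wording are illustrative placeholders, not a tested template:

```python
# Hypothetical before/after pair; in practice these would be real
# code you disliked and the revision you preferred.
before = "def f(x):\n    return [i*2 for i in x if i>0]"
after = "def double_positives(values):\n    return [v * 2 for v in values if v > 0]"

prompt = (
    "Here are two versions of the same code.\n\n"
    f"BEFORE:\n{before}\n\nAFTER:\n{after}\n\n"
    "List the contrasting adjectives, values, and attributes that "
    "distinguish AFTER from BEFORE, so I can reuse them as style "
    "guidance in future prompts."
)
```

the model's answer then gives you the shorthand phrases to set behavior at the macro level, as described above.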

