
also, not really true, right, even though it sounds intellectual and strong to say. these algorithms are trained to generalize as best as they can to unseen text, and most often don't ever see any data point twice, except for data that has accidentally not been filtered. it's totally possible that it gets reasoning abilities that generalize well.


Generalize over their training data—they cannot generalize out of distribution. If they could, they would have already solved most human problems. So no, they do not generalize on unseen text. They will produce what is most statistically probable based on their training data. Things that are still unknown and statistically improbable based on our current knowledge are out of reach for LLMs based on transformers.


You can get them to solve unseen problems just fine. One example: Specify a grammar in BNF notation and tell it to generate or parse sentences for you. You can produce a more than random enough grammar that it can't have derived the parsing of it from past text, but necessarily reasons about BNF notation sufficiently well to be able to use it to deduce the grammar, and use that to parse subsequent sentences. You can have it analyse them and tag them according to the grammar too. And generate sentences.
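A minimal sketch of how such a test could be set up, assuming a small non-recursive grammar with nonsense terminals is "random enough". The helper names (`random_grammar`, `to_bnf`, `generate`, `derives`) are made up for illustration. The idea is to paste the output of `to_bnf` into the prompt and use `derives` as ground truth when checking what the model says about a sentence.

```python
import random

def random_grammar(rng, n_nonterminals=4, n_terminals=8):
    """Build a random, non-recursive grammar as {nonterminal: [alternatives]}."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    terminals = []
    while len(terminals) < n_terminals:
        word = "".join(rng.choice(letters) for _ in range(4))
        if word not in terminals:
            terminals.append(word)
    names = [f"N{i}" for i in range(n_nonterminals)]
    grammar = {}
    for i, name in enumerate(names):
        # Only later nonterminals may be referenced, so expansion always terminates.
        symbols = terminals + names[i + 1:]
        grammar[name] = [
            [rng.choice(symbols) for _ in range(rng.randint(1, 3))]
            for _ in range(rng.randint(1, 3))
        ]
    return grammar

def to_bnf(grammar):
    """Render the grammar as BNF text suitable for pasting into a prompt."""
    lines = []
    for name, alts in grammar.items():
        rhs = " | ".join(
            " ".join(f"<{s}>" if s in grammar else f'"{s}"' for s in alt)
            for alt in alts
        )
        lines.append(f"<{name}> ::= {rhs}")
    return "\n".join(lines)

def generate(grammar, rng, symbol="N0"):
    """Expand `symbol` into a list of terminal tokens by random choice."""
    if symbol not in grammar:
        return [symbol]
    out = []
    for s in rng.choice(grammar[symbol]):
        out.extend(generate(grammar, rng, s))
    return out

def derives(grammar, symbol, tokens):
    """True if `symbol` can derive exactly `tokens` (brute force, tiny grammars only)."""
    if symbol not in grammar:
        return list(tokens) == [symbol]
    return any(_seq(grammar, alt, tokens) for alt in grammar[symbol])

def _seq(grammar, symbols, tokens):
    """True if the symbol sequence can derive exactly `tokens`."""
    if not symbols:
        return not tokens
    head, rest = symbols[0], symbols[1:]
    # Every symbol yields at least one token, so leave one token per remaining symbol.
    return any(
        derives(grammar, head, tokens[:k]) and _seq(grammar, rest, tokens[k:])
        for k in range(1, len(tokens) - len(rest) + 1)
    )

if __name__ == "__main__":
    rng = random.Random()
    g = random_grammar(rng)
    print(to_bnf(g))
    print(" ".join(generate(g, rng)))
```

Drawing a fresh grammar on every run keeps the chance that any particular one appeared in training text very low. Non-recursive grammars are a simplification that guarantees termination, not a requirement of the test.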

My impression, from seeing quite a few people trying to demonstrate they can't handle out of distribution problems, is that people are very predictable about how they go about this, and tend to pick well known problems that are likely to be overrepresented in the training set, and then tweak them a bit.

At least in one instance the other day, what I got from GPT when I tried to replicate it suggests to me it did the same that humans that have seen these problems before did, and carelessly failed to "pay attention" because it fit a well known template it's been exposed to a lot in training. After it answered wrong it was sufficient to ask it to "review the question and answer again" for it to spot the mistake and correct itself.

I'm sure that won't work for every problem of this sort, but the quality of tests people do on LLMs is really awful, at least because people tend to do very narrow tests like that and make broad pronouncements about what LLMs "can't" do based on it.


> You can get them to solve unseen problems just fine

Prove that the problem wasn't seen by them in other form.

> Specify a grammar in BNF notation and tell it to generate or parse sentences for you. You can produce a more than random enough grammar that it can't have derived the parsing of it from past text, but necessarily reasons about BNF notation sufficiently well to be able to use it to deduce the grammar, and use that to parse subsequent sentences. You can have it analyse them and tag them according to the grammar too. And generate sentences.

Oh, come on. It's like rewriting the same program in another programming language with different variables. What it can't do is to create a concept of programming language, I'm not talking about a new programming language, I'm talking about the concepts.

> I'm sure that won't work for every problem of this sort, but the quality of tests people do on LLMs is really awful, at least because people tend to do very narrow tests like that and make broad pronouncements about what LLMs "can't" do based on it.

Here, a few papers that show they can't reason:

https://arxiv.org/abs/2311.00871

https://arxiv.org/abs/2309.13638

https://arxiv.org/abs/2311.09247

https://arxiv.org/abs/2305.18654

https://arxiv.org/abs/2309.01809


>It's like rewriting the same program in another programming language with different variables.

Since when has that not required reasoning? It's really funny seeing people bend over backwards to exclude LLMs from some imaginary "real reasoning" they imagine they are solely privy to. It's really obvious this is happening when they leave well defined criteria and branch into vague, ill-defined statements. What exactly do you mean by concepts? Can you engineer some test to demonstrate what you're talking about?

Also, none of those papers show LLMs can't reason.


You clearly didn't read any of these papers. Quote from one of them

"Our results support the hypothesis that GPT-4, perhaps the most capable “general” LLM currently available, is still not able to robustly form abstractions and reason about basic core concepts in contexts not previously seen in its training data"

Another, recent, good one https://arxiv.org/abs/2407.03321

EDIT: For people who don't want to read the papers, here is a blog post that explains what I'm arguing in more accessible terms https://cacm.acm.org/blogcacm/can-llms-really-reason-and-pla...


"One way of doing this for planning tasks is to reduce the effectiveness of approximate retrieval by obfuscating the names of the actions and objects in the planning problem. When we did this for our test domains, GPT4’s empirical performance plummeted precipitously, despite the fact that none of the standard off-the-shelf AI planners have any trouble with such obfuscation. "

That's a great test. It shows they're matching prior patterns they saw, even down to what words were used, instead of thinking. We can match prior patterns, come up with the equivalences, and then plan that way. People often slow down when they do stuff like that, though. So, the A.I. would have to be able to do it but slowdowns would be acceptable.
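A rough sketch of the obfuscation step the quote describes, assuming the planning problem is plain text and the action and object names are known up front. `obfuscate` and the block-stacking names are hypothetical, not taken from the paper. Each name is swapped for a random token consistently, so the structure of the problem is unchanged and only the familiar words disappear.

```python
import random
import re

def obfuscate(problem_text, names, rng):
    """Replace each name in `names` with a random token, consistently.

    Assumes the random tokens do not collide with anything already in the text.
    """
    alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
    mapping = {}
    used = set()
    for name in names:
        token = None
        while token is None or token in used:
            token = "x" + "".join(rng.choice(alphabet) for _ in range(6))
        used.add(token)
        mapping[name] = token
    # Longest names first, and treat '-' as part of a name (as in "pick-up").
    alternatives = "|".join(
        re.escape(n) for n in sorted(names, key=len, reverse=True)
    )
    pattern = re.compile(r"(?<![\w-])(" + alternatives + r")(?![\w-])")
    obfuscated = pattern.sub(lambda m: mapping[m.group(1)], problem_text)
    return obfuscated, mapping

if __name__ == "__main__":
    text = "(pick-up blockA) (stack blockA blockB)"
    out, mapping = obfuscate(
        text, ["pick-up", "stack", "blockA", "blockB"], random.Random()
    )
    print(out)
    print(mapping)
```

A solver that works from the structure alone should do equally well on `text` and `out`. A drop in accuracy on `out` is what the quoted passage reports for GPT4.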


Oh i've read them. The claim doesn't match up to reality. It's as simple as that. You can claim anything you want to.

https://arxiv.org/abs/2305.18354

All these papers you keep linking do is at best point out the shortcomings of current state of the art LLMs. They do not in any way disprove their ability to reason. I don't know when the word reason started having different standards for humans and machines but i don't care for it. Either your definition of reasoning also allows for the faulty kind humans display or humans don't reason either. You can't have your cake and eat it.


> Oh i've read them.

It's hard to believe that after reading all the papers and the blog I linked, along with the references there, any reasonable person would come to such strong conclusions as you did. This makes it hard for me to believe that you actually read all of them, especially given your previous questions and comments, which are addressed in those papers and someone that actually read them wouldn't make such comments or ask such questions. And the funniest thing, and further proof of this, is that you linked a paper that is addressed in one of the papers I shared. It seems like not only LLMs can fake things.

> All these papers you keep linking do is at best point out the shortcomings of current state of the art LLMs

They clearly show that they fake reasoning, and what they do is an advanced version of retrieval. Their claims are supported by evidence. What you call "shortcomings" are actually proof that they do not reason as humans do. It seems like your version of "reality" doesn't match reality.


The paper i linked is not addressed by the paper you linked. The paper you linked attempts to give LLMs the same benchmarks in a format they aren't best suited for. I don't know how you can call that "addressed".

>They clearly show that they fake reasoning

Sure, and planes are fake flying. The elusive "fake reasoning" that is so apparently obvious and yet does not seem to have a testable definition that excludes humans.

You've still not explained how writing the same program in different languages doesn't require reasoning or how we can test your "correct" version of reasoning which requires "concepts".


> The paper i linked is not addressed by the paper you linked. The paper you linked attempts to give LLMs the same benchmarks in a format they aren't best suited for. I don't know how you can call that "addressed".

What you're writing now is nonsense in the context of what I wrote. Once again, you're showing that you didn't read the papers. Which paper are you even referring to now, the one you think addresses the paper you linked?

> You've still not explained how writing the same program in different languages doesn't require reasoning or how we can test your "correct" version of reasoning which requires "concepts".

"Concepts" are explained in one of the papers I linked, which you would know if you had actually read them. As to programming languages, LLMs learn to identify common structures and idioms across languages. This allows them to map patterns (latent space representations duh!) from one language to another without reasoning about the underlying logic. When translating code, the model doesn't reason about the program's logic but predicts the most likely equivalent constructs in the target language based on the surrounding context. LLMs don't truly "understand" the semantics or purpose of the code they're translating. They operate on a superficial level, matching patterns and structures without grasping the underlying computational logic. The translation process for an LLM is a series of token-level transformations guided by learned probabilities, not a reasoned reinterpretation of the program's logic. They don't have an internal execution model or ability to "run" the code mentally. They perform translations based on learned patterns, not by simulating the program's behavior. The training objective of LLMs is to predict the next token, not to understand or reason about program semantics. This approach doesn't require or develop reasoning capabilities.


You are making a lot of assumptions that are mostly wrong.

Case in point:

https://arxiv.org/abs/2305.11169

I'm asking for something testable, not some post-hoc rationalization you believe to be true.

I'm not asking you to tell me how you think LLMs work. I'm asking you to define "real reasoning" such that i can test people and LLMs for it and distinguish "real reasoning" from "fake reasoning".

This definition should include all humans while excluding all LLMs. If it cannot, then it's just an arbitrary distinction.


It appears that you are the only person in this discussion making many incorrect assumptions. Based on your comments, I would assume you are actually googling those papers based on their abstracts. Your last linked paper has flawed methodology for what it attempts to demonstrate, as shown in this paper: https://arxiv.org/pdf/2307.02477 The tests you're requesting are provided within the previously linked papers. I'm not sure what you want. Do you expect people to copy and paste entire papers here that show methodology and describe experiments? You wrote, "I'm asking you to define 'real reasoning'," which is actually defined in the blog post linked earlier in this discussion. In fact, the entire blog post is about this topic. It appears that you are not thoroughly reading the material. Your replies resemble those of a human stochastic parrot.


>Your last linked paper has flawed methodology for what it attempts to demonstrate, as shown in this paper: https://arxiv.org/pdf/2307.02477

Genuinely, what's wrong with the methodology?

Your paper literally admits humans would also perform worse at counterfactuals. Worse than an LLM? Maybe not, but it never bothers to test this, so...

The problem here is that none of the definitions (those that are testable) so far given actually separate humans from LLMs. They're all tests some humans would also flounder at or that LLMs perform far better than chance at, if below some humans' level.

If you're going to say, "LLMs don't do real reasoning because of x" then x better be something all humans clear if what humans do is "real reasoning".

Humans perform worse at counterfactuals, so saying "Hey, see this paper that shows LLMs doing the same, it means they don't reason" is a logical fallacy if you don't extend that conclusion to humans as well.


In these arguments it's always very notable that not only do people not benchmark LLMs against people, but several I've discussed with have argued very strongly for not doing so unless they're benchmarked against above average people, while arguing that these same tests prove LLMs can't reason. It never seems to land with them that their standards for "reason" would relegate large portions of the human population to some state of lesser being without the ability to reason.


It's hard to believe that after reading them, any reasonable person would think they support the extremely strong claim you made above.

> They clearly show that they fake reasoning

They do nothing of the sort.


That quote does not support your claim, and if you think it does, then I question your ability to reason.


> Prove that the problem wasn't seen by them in other form.

You can reduce that risk to arbitrarily low levels by trying multiple random grammars of some complexity. This is a weak argument.

> Oh, come on. It's like rewriting the same program in another programming language with different variables.

No, it's like following a grammar, but that requires reasoning about a set of rules it has not seen before. I don't think you understood the task I described as well as ChatGPT does.

> What it can't do is to create a concept of programming language, I'm not talking about a new programming language, I'm talking about the concepts.

Neither can most humans.

And have you tried to ask it about these concepts? I've had it infer semantics of code in programming languages that don't exist based on a hypothetical sample several times, and they're pretty good at coming up with semantics that makes sense. In one instance I gave it a sample with an idea about what made sense to me but it inferred a better set of semantics.

None of the papers you linked supports your claim.


"generalize to its dataset" is a contradiction, especially as these models are trained in the one epoch regimen on datasets of the scale of all of the internet. if you think being able to generalize in ways similar to the whole of the internet does not give you meaningful abilities to reason, I'm not sure what I can tell you


> "generalize to its dataset" is a contradiction

Not "to" but "over", for example from code written in one language to the same code in another language.

> if you think being able to generalize in ways similar to the whole of the internet does not give you meaningful abilities to reason, I'm not sure what I can tell you

If after reading papers below that show empirically that they can't reason, you will still think they can reason, then I don't know what I can tell you.

https://arxiv.org/abs/2311.00871

https://arxiv.org/abs/2309.13638

https://arxiv.org/abs/2311.09247

https://arxiv.org/abs/2305.18654

https://arxiv.org/abs/2309.01809


Couldn't they turn up new, as yet unknown things, if those are statistically probable given the training data?


No, none of the Millennium Problems or other math problems (unsolved by humans for decades or centuries) have been solved solely by LLMs, even though they possess all the knowledge in the world.



