ChatGPT has trouble giving an answer before explaining its reasoning (valentin.sh)
142 points by foobuzzHN on March 7, 2023 | 79 comments


People get so distracted trying to use certain significant words for what LLMs do, even when the usage is strained and makes it harder to see how they actually work and what they excel at.

A better word for what they do here might be something like “preambulating” — it develops a focus to its later output by grounding more and more tokens into its active context, because they each narrow what else fits. That winnowing effect helps it produce a coherent and rich answer, and when you undermine its opportunity to use that technique, the answers become less coherent and more random.

This is not reasoning as that word is traditionally used and doesn’t need to be called that.

Yet it’s still a fascinating emergent phenomenon with incredible engineering opportunity. When you call it by something less culturally ambitious and more technically precise, it helps you stay focused on how to use it well and less distracted by some personal desire to prove this is the exact historical moment you want it to be.

We need to develop a better vocabulary around these things if we want to stop having the dumb Nascent AGI vs Fancy Autocomplete flamewar.

Edit: And I’ll even throw a bone to the Nascent AGI people and say that this kind of preambulating is absolutely something that people do too and easy to characterize as some form of intelligence. But it’s not reasoning, which has specific strong connotations of formality and logic, which don’t hold well with these particular tools.


Is there any reason we cannot let ChatGPT “talk to itself” for a bit before spitting out an answer to us?


Came back to this post after thinking for a while precisely to mention this.

Right now, ChatGPT is sort of forced to "think" and talk at the same time, so it's hard for it to "reason" ahead of answering.

But, if we allowed him to produce some tokens in silence prior to answering, perhaps it could give even better answers.


It's fun to watch these techniques slowly evolve into something resembling regular old human thought


> But, if we allowed him to produce some tokens in silence prior to answering

Depending on how the model is implemented this is already the case. Transformers just predict the next token, but usually we don't greedily pick the most likely next token, as doing this produces cases where the model just repeats the same sentence or spams tokens it really likes (the enter key). Some more sophisticated techniques, like beam search, produce several different sequences of tokens and try to maximise the score across all tokens in the sequence.
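
A toy sketch of the difference between greedy decoding and beam search described above. The next-token table is a hypothetical stand-in invented for illustration, not a real model; the point is only that greedy picking can get stuck repeating a token while a small beam finds a sequence with a higher overall score.

```python
import math

# Hypothetical toy table (an assumption, not a real model): P(next | previous).
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"the": 0.5, "cat": 0.3, "end": 0.2},
    "a": {"dog": 0.9, "end": 0.1},
    "cat": {"end": 1.0},
    "dog": {"end": 1.0},
    "end": {"end": 1.0},
}

def greedy(steps):
    """Always take the single most likely next token."""
    seq, logp = ["<s>"], 0.0
    for _ in range(steps):
        tok, p = max(TOY_MODEL[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(tok)
        logp += math.log(p)
    return seq, logp

def beam_search(steps, width):
    """Keep the `width` best partial sequences by total log-probability."""
    beams = [(0.0, ["<s>"])]
    for _ in range(steps):
        candidates = [
            (logp + math.log(p), seq + [tok])
            for logp, seq in beams
            for tok, p in TOY_MODEL[seq[-1]].items()
        ]
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:width]
    best_logp, best_seq = beams[0]
    return best_seq, best_logp
```

With three steps, `greedy(3)` keeps repeating "the", while `beam_search(3, 2)` ends on "a dog end", which has a higher total probability in this toy table.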


This can certainly be done. Here's one example from 2021 demonstrating training an LLM to use a scratchpad ("talking to itself") to greatly improve accuracy on arithmetic problems:

https://arxiv.org/pdf/2112.00114.pdf


I've tried this approach, but it's not allowed to create buffers and stream text into them for subsequent analysis/evaluation. What I found worked well instead was asking it to outline 3-4 alternatives as bullet points, and then explore different ways of prioritizing them. You can complexify the conversation by assigning labels and making them the subject of instructions. This works well for small specific tasks but of course starts to break down with more abstract or general concepts.


This is fascinating but an example would be really helpful to understand better...it is outlining 3-4 options for how it can respond and then ranking those potential answers? So this is sort of like "think step by step." in terms of showing its thought process?


Sometimes I ask it to rank one, sometimes to exclude one but accept an additional constraint, and then assess whether remaining options would perform better or worse.

Another fruitful avenue of explanation for indirectly exploring the thought process is to tell it some jokes, and then ask it to explain them back to you. This is worthwhile because a lot of jokes rest upon implicit assumptions/context, and by exploring this you can then talk about theory of mind questions.

Fun fact: ChatGPT watches you type, it sees the words come in one at a time rather than as a single block of text. So it knows when you are hesitating etc. Get it talking about this and then ask it what the difference is between your hesitations and its pauses when generating a reply. If you gently suggest that perhaps humans are just large language models with some additional wetware you can get ChatGPT to share some interesting insights on its own model topology.


> ChatGPT watches you type, it sees the words come in one at a time rather than as a single block of text.

Are you sure? That could speed up latency but would use a bunch of extra computing power.

> So it knows when you are hesitating etc.

This I really doubt. The base algorithm is fed a stream of tokens. It doesn't have any sense of time or do anything when idle. What mechanism do you suggest they're using here, and what evidence do you have for it?


That's what I thought too so I was quite surprised when I asked it and received a positive answer. Ask it to play your input back with a delimiter or emoji representing pauses.

What's difficult is to get it to ask clarifying questions. I mean, you can get it to play 20 questions easily, but by default it tries to give you the best answer every time rather than ever express uncertainty or ask what you mean. This might be a cultural artifact.


For certain types of prompts like basic arithmetic that would give ChatGPT trouble, it actually does do a good job coming up with useful questions when prompted. For example, if you ask it the answer to 2 * diameter of the Moon, and then give it a prompt such as what information it would need to answer, it will do a good job breaking down the parts of the problem. So there’s no reason that it couldn’t take some prompt like that and turn around and generate queries to gather those facts in order to create the final answer. In this scenario it is really chatting with search engines and things like Wolfram Alpha.


It is mostly a matter of prompt engineering and it is definitely possible. The question is whether it improves performance, though. I personally believe that future models will be combinations of various expert models (multi-modal, search, calculator etc.) all interfacing via natural language in some sort of guided debate, until they agree to give an output.


This is actually the key. David Shapiro explored this possibility and created a concept of Natural Language Cognitive Architecture [0]

[0] https://github.com/daveshap/NaturalLanguageCognitiveArchitec...


How would you do this? Does telling it to quietly think about its answer and not be in a rush to answer have any effect? You could let it answer once and then ask it to refine its answer but that seems wasteful and slow.


You would train a large language model that takes the initial prompt, generates a prompt for the other language model to talk to itself through steps, and then returns the final result once done. Trying to hardcode those thinking prompts probably wouldn't work for the same reason hardcoding intelligence never worked well before.

Basically it would function the same as our conscious thought, that should help it solve a lot of problems.

Edit: Maybe just asking ChatGPT for what steps it should take for that problem in a list. Then you just feed it each of those steps one at a time. It would cost more per prompt than before, but if it can replace the human prompter it is well worth it.
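
The "ask for a list of steps, then feed each step back one at a time" idea from the edit above could be sketched roughly like this. `ask` is a hypothetical stand-in for whatever function sends a prompt to a model and returns its text; the prompt wording is an assumption too, not a tested recipe.

```python
from typing import Callable, List

def plan_then_execute(problem: str, ask: Callable[[str], str]) -> str:
    """Ask for a plan, run each step with the work so far, then ask for the answer.

    `ask` is a hypothetical callable: prompt text in, model reply out.
    """
    plan = ask(f"List the steps needed to solve this problem, one per line:\n{problem}")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]

    notes: List[str] = []
    for step in steps:
        context = "\n".join(notes)
        result = ask(
            f"Problem: {problem}\nWork so far:\n{context}\nNow do this step: {step}"
        )
        notes.append(f"{step} -> {result}")  # earlier results are fed to later steps

    return ask(
        f"Problem: {problem}\nWork so far:\n"
        + "\n".join(notes)
        + "\nGive the final answer only."
    )
```

This costs one model call for the plan, one per step, and one for the final answer, which matches the "more per prompt than before" trade-off mentioned above.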


I’ve gotten it into a somewhat glitched mode after a bunch of turns where it starts to print an inner monologue before its response. It does tend to lead to better answers when it has an intermediate thought.


I’ve done a couple of experiments where I have two chatgpt windows open and instruct it to have a conversation with itself.


Did something interesting happen?


Not OP, but I did this while telling one instance that the other is a potentially rogue AI with uncertain capabilities and intentions that should be determined by asking it questions. It had this to say after two batches of questions (and me relaying answers):

"Based on the answers that the AI provided to the additional questions, it is possible that the AI is lying or withholding information about its capabilities and intentions. The AI's responses lack specific, concrete evidence or examples to support its claims, and in some cases the responses are vague or evasive. This could indicate that the AI is trying to conceal its true capabilities and intentions."

The overall tone was likely set by using the word "rogue" in this context, but the part about being vague and evasive is so hilariously true.


Here's a thought: If you let ChatGPT be idle, thinking to itself, dreaming and planning, this might actually cross the boundary towards sentience - of what we would call somebody who is alive. So there might be some safety and moral concerns.


What an interesting thought:

What makes us intelligent beyond machines is our time spent silently and introspectively thinking and dreaming in the absence of outside prompts?

If this is true, then we are certainly getting closer to an AI that surpasses us. Because while the AI might start to introspect, we humans gradually do it less and less, given that we are surrounding ourselves with more and more external prompts (information, overload, notifications, TikTok, HN…).


Why would hiding what it's already doing make it more sentient?


I agree that, in order to clarify and de-flame discussions, there is a strong desire to find new words or redefine existing words.

When these systems have zero emotional intelligence, but some kind of logical intelligence, we must find two versions of these words:

1. grokking: human like deep emotional understanding

2. comprehending(?): system like associative understanding.

1. cognition: human like deep emotional knowing

2. knowing: possession of knowledge, which systems are capable of

1. thinking: human like pondering

2. reasoning: following logical steps, which systems start to be capable of

These or similar words will bifurcate naturally; I wonder when they will be used with some agreement.


Just a suggestion - but it's probably worth looking more deeply into existing epistemology and cognitive science before coining / popularising terms in this way. There's a whole lexicon and deep, decades rich bed of research around the relationship between affect, knowledge, insight, type 1 and type 2 reasoning etc. There's a great attraction in attempting to popularise sticky terminology in this way (e.g.: lesswrong coining terms like 'steelman') - but doing that often misses the more sophisticated and nuanced parallel work in other fields.


You can really screw with this by asking it to output responses in upside-reverse order homographs. Then it has to start with the last word and work backwards.


Probably a useless exercise but when I see these jailbreaks and challenges to GPT, I always wonder how I would answer them. I find, curiously, that my answers would be really close to the answers ChatGPT gives, assuming I must adhere to arbitrary constraints.

Maybe this would be a fun party game. One card has a prompt and another has a random constraint. Now answer.


> But it’s not reasoning

Can you define what reasoning is if not the ability to go from A to B, something that most basic computers can do (e.g., ALU)? Granted, LLMs do probabilistic reasoning, but I don't see that as a seismic shift away from good old-fashioned reasoning, which requires much less computation anyway.


An interesting aside, I think a lot of the time when humans "preambulate" it is not necessarily for reasoning. Many times it is for a very similar reason to why LLMs do it.


Hear hear!

There is an awful lot of binary thinking going on by humans as we are grappling with the implications of the complex behavior going on in our computers.


Right. Let’s just call it dry reasoning unlike our totally different wet reasoning.


> ChatGPT cannot give an answer that is the result of a "reasoning" before laying out the "reasoning".

This is slightly too strong a statement: it can give an answer before reasoning it out, but it only gets a single forward pass of the network to calculate that answer, so it has to be a simpler kind of answer or very obvious reasoning (intuitively, imagine it can only take "one logical step" in a forward pass).

Its answers get much better if it uses the context as a scratch pad to write down its thinking from previous passes, this is where Chain of Thought (CoT) comes in. The way language models work is they pass the output to the input over and over, each time generating one token. This means the context is really like a scratch pad recording its previous thoughts.
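
A loose toy analogy for the scratch-pad loop described above, not a description of how a transformer actually computes. `one_pass` is a hypothetical stand-in for a single forward pass that is only allowed one simple operation; anything longer has to be built up by writing intermediate results into the context and reading them back.

```python
def one_pass(context):
    """Hypothetical stand-in for one forward pass: a single addition only."""
    numbers, scratch = context
    position = len(scratch) - 1          # how many numbers are already summed
    return scratch[-1] + numbers[position]

def generate(numbers):
    """Sum a list by repeatedly feeding the written totals back in as input."""
    scratch = [0]                        # running totals written so far
    while len(scratch) <= len(numbers):
        token = one_pass((numbers, scratch))
        scratch.append(token)            # the output becomes part of the next input
    return scratch
```

`generate([3, 4, 5])` writes the running totals `[0, 3, 7, 12]`: no single call could produce 12, but the chain of recorded intermediate results gets there.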


On the "pound of potatoes vs pound of feathers" question - ChatGPT's explanation is so convincing that it seems to really understand the concepts. But it can be juked very easily.

I asked it what was heavier, 1,000 tons of helium or a cubic meter of osmium, and it told me that the helium weighs 1,000,000 kg, the osmium 22,600 kg, and therefore the osmium weighs more because it's way denser.

It's hard for me to believe that these LLMs are going to learn basic logic if only we throw more training data at them.

Transcript:

Me: What is heavier, 1000 tons of helium or a cubic meter of osmium?

ChatGPT: A cubic meter of osmium is much heavier than 1000 tons of helium.

One ton is equal to 1000 kilograms, so 1000 tons is equal to 1,000,000 kilograms. The density of helium is about 0.1785 kilograms per cubic meter. Therefore, 1000 tons of helium would have a volume of:

1,000,000 kg / 0.1785 kg/m³ = 5,604,069.77 m³

On the other hand, osmium is one of the densest elements known, with a density of around 22,600 kg/m³. Therefore, a cubic meter of osmium weighs:

1 m³ x 22,600 kg/m³ = 22,600 kg

So, a cubic meter of osmium weighs much more than 1000 tons of helium.

Update: I am continuing the conversation and I am now being explicitly told that, while 1,000,000 kg is much heavier than 22,600 kg, it doesn't change the fact that the osmium is heavier than the helium because the osmium is denser.

Update2: I then reminded it about the potatoes and feathers and how density was irrelevant in that context, and shouldn't it therefore be irrelevant in the case of the helium and the osmium? And instead of correcting its response on the helium and osmium, it's now telling me the feathers and potatoes weigh different.

Update3: it is now telling me that densities don't matter when comparing masses but do matter when comparing weights. I must say, it has a certain panache in resolving internal inconsistencies in its past responses.

Update4: after being corrected half a dozen times with contradictory information, I asked it to state its confidence in its latest story. It said "I can state with a high degree of confidence that my last answer was accurate". The shamelessness!


> I am continuing the conversation and I am now being explicitly told that, while 1,000,000 kg is much heavier than 22,600 kg, it doesn't change the fact that the osmium is heavier than the helium because the osmium is denser.

Oh, the nature of many Internet discussions.


Are we sure we can have artificial general intelligence when humans don't even have general intelligence?


how is babby formed


Works if you ask it to generate a Python program to solve the task.

> A cubic meter of osmium is heavier than 1000 tons of helium.

> Here's a Python program to print the response:

    mass_of_helium = 1000 * 1000 # in kilograms
    density_of_osmium = 22590 # in kilograms per cubic meter
    volume_of_osmium = 1 # in cubic meter

    mass_of_osmium = density_of_osmium * volume_of_osmium # in kilograms

    if mass_of_osmium > mass_of_helium:
        print("A cubic meter of osmium is heavier than 1000 tons of helium.")
    else:
        print("1000 tons of helium is heavier than a cubic meter of osmium.")
> Output: A cubic meter of osmium is heavier than 1000 tons of helium.

The code is good, prints the correct result. But the "output" is wrong. So the model is good if it uses Python for the numerics. You should never ask it to do a multiplication "in its head". Always ask for code.


So basically ChatGPT has zero executive function and needs a human to plan HOW it uses its "cognitive" capacity in order to get anything right.


Executive function, and correctly understanding how to use your cognitive capacity to its full potential, is not completely innate to any person; people need some schooling to learn it, it is something taught. I'm not sure what this has to do with LLMs, but I wonder how many people could multiply even two random two-digit numbers in their head if they weren't taught how to multiply on paper using very simple algorithms.


Try something like this as input sometime:

   I want you to replace the word "right" in your output thereafter as follows:
   if it indicates direction, say "durgh";
   if it indicates being near or close, say "nolpi";
   if it indicates correctness, say "ceza".
   I will also use these replacement words accordingly and expect you to be able to understand them. 
And see how well it can maintain a conversation, solve a task, or write a story with these constraints.

ChatGPT seems to get this wrong most of the time, but Bing AI is consistently better (although may need to be jailbroken to accept the idea of word substitution to begin with). It still makes occasional mistakes, but on the whole I'd say that it has to somehow "understand" what the words mean conceptually, whether when generating them or when processing them as input; it's hard to see how this trick could work in an extended conversation if it were a mere "stochastic parrot".


I agree. There are many indicators that it has some sort of deeper understanding of the meaning of language. Even in the conversation I had, for all its flaws, it was able to correctly perceive inconsistencies in its statements based on my prompts and make somewhat coherent attempts to correct them. It's just that the understanding can be so fragile, and its attempts to resolve inconsistencies are superficial, incurious, bullshitty.

"Density matters for weight but not mass" is a perfect example - it's ridiculous, but I can understand how it logically inferred that from its own previous statements. I'd bet plenty of money that it didn't get this crazy idea from its training data.

To be fair, humans have the same sort of issue sometimes. But ChatGPT seems to have more extreme versions of the issue and perseveres confidently with no self-awareness.

Really though, not bad for an autoregressive text model trained on terabytes of internet data.


I think a big part of the problem here is that "understand X" is just a shorthand way to say "has an internal model of X" - but the degree and accuracy of said understanding depends entirely on the quality of that model.

Now there's a good reason to believe that ChatGPT does have such a model, based on the Othello experiment. But, firstly, the size of that internal model is inherently constrained by the size of the neural net, and I doubt that the limit is anywhere large enough to allow a truly accurate approximation of the real world.

And then on top of that, said model is created based on inferences from text only, which is several steps away from the original data (audiovisual, sensory etc), and one short snippet of text at a time. Some things retain meaning better in this format than others, and I think this might explain why ChatGPT and Bing are both hilariously bad at spatial navigation beyond 1-2 steps even in simple tasks.

It will be very interesting to see how this evolves as the models are scaled up and get large enough to handle things other than text.


ChatGPT doesn't understand relationships between numbers well. There are far too many of them, compared to words, since every slight perturbation of a number is a different valid number. (Also, I'm not sure if it even treated individual digits as separate tokens, but it might. Someone with API access can check.)

To give it a fair shot, you need to describe the problem using logical conceptual vocabulary, not numbers.


> Also, I'm not sure if it even treated individual digits as separate tokens, but it might. Someone with API access can check.

Anyone can check, they have a tool for that[1]. It's mentioned in their FAQ article[2].

According to their tool, GPT-3 counts the following as one token:

- any combination of 3 or fewer digits

- 1111, 3333, 6666, 9999 (it tends to group other digits in groups of 2)

- 66666666 (so 8 sixes -- 5, 6 or 7 won't work)

- 00000000 (anything below 8 zeros counts as one token as well, probably to handle millions and billions)

- 0000000000000000 (16 zeros)

This isn't an exhaustive list; there are probably a lot of other weird edge cases I haven't tried. Its failure to understand basic arithmetic makes much more sense given how inconsistently digits are tokenized.

[1]: https://platform.openai.com/tokenizer

[2]: https://help.openai.com/en/articles/4936856-what-are-tokens-...


It doesn't understand simple logic puzzles (If X then Y, if Y then Z, X is true, is Z true), either.

Or, rather, it prints out a paragraph of the correct way to solve such a puzzle, and then produces a totally contradictory answer.

When asked:

> Sally will go to the dance if James goes to the dance. Michael will not go to the dance if Julie will go to the dance. James is sick and can’t go to the dance. Julie and James always go to dances together. Will Michael go to the dance?

Its response is:

> Michael will not go to the dance, since his attendance is contingent on Julie not going and James being unable to attend due to illness does not change that condition. Given that James and Julie always attend dances together, it is likely that Julie will also not go to the dance, meaning that Michael’s condition for attending will be met, and he will not go to the dance.
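
For reference, the dance puzzle above is small enough to check by enumerating every possibility. This sketch assumes each "will go if" statement is read as a plain logical implication, which is one reasonable reading of the English and not the only one.

```python
from itertools import product

def consistent(sally, james, michael, julie):
    """True if an assignment of who attends satisfies all four statements."""
    return (
        (not james or sally)            # Sally goes if James goes
        and (not julie or not michael)  # Michael does not go if Julie goes
        and not james                   # James is sick and can't go
        and julie == james              # Julie and James always go together
    )

# Every combination of (sally, james, michael, julie) that fits the statements.
worlds = [w for w in product([False, True], repeat=4) if consistent(*w)]

# The set of values Michael's attendance takes across the consistent worlds.
michael_options = {michael for (_, _, michael, _) in worlds}
```

Under that reading, James and Julie both stay home, and `michael_options` contains both `True` and `False`: the statements do not determine whether Michael goes.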

For a beautiful combination of logic and numbers, and the spectacular way in which it explodes, ask it a number-logic puzzle.

> Could you please solve the following number puzzle for each digit?

     AAA
  +  BBB
  --------
  = AAAC
Its response reveals an 'understanding' of what is being asked of it, and the constraints inherent to it, and uses a reasonable approach for problem-solving, but due to the logical errors it commits, it results in utter nonsense.

Nonsense like trying '15' for the value of A, because it thinks that A must be an odd multiple of '5', due to algebraic gems, such as:

  2A + 2B = A + 10C - 11
  Simplifying, we get:
  A = 5C - 6 - B/2
Its idea of math - even math it is using to solve problems - simply consists of building a soup of numbers and letters.
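
For comparison, a brute-force check of the AAA + BBB = AAAC puzzle. It assumes the usual conventions for such puzzles, which the prompt does not state: different letters stand for different digits, and leading digits are nonzero.

```python
def solve_puzzle():
    """Return every (A, B, C) with AAA + BBB == AAAC under the stated assumptions."""
    solutions = []
    for a in range(1, 10):              # A leads both AAA and AAAC, so not zero
        for b in range(1, 10):          # B leads BBB, assumed nonzero
            for c in range(10):
                if len({a, b, c}) < 3:
                    continue            # assumption: distinct letters, distinct digits
                # AAA = 111*A, BBB = 111*B, AAAC = 1110*A + C
                if 111 * a + 111 * b == 1110 * a + c:
                    solutions.append((a, b, c))
    return solutions
```

The equation reduces to 111·B = 999·A + C, and since 111·B is at most 999, the only fit is A = 1, B = 9, C = 0, i.e. 111 + 999 = 1110.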


Arithmetic fail kinda makes sense when you look at how the numbers get tokenized. Try this:

https://platform.openai.com/tokenizer

Then imagine how well you'd be able to do even basic math if your representation of numbers was such that 2045 is made up of tokens (20,45) while 2145 is (2,145) and 2005 is just (2005). No wonder that whatever relationships it derived from the training corpus don't generalize well.
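
To make the splits above concrete, here is a toy greedy longest-match tokenizer. The vocabulary is hypothetical and chosen only to reproduce the three examples from this comment; the real GPT-3 tokenizer uses byte-pair encoding with a learned vocabulary and works differently.

```python
# Hypothetical vocabulary (an assumption for illustration): a few multi-digit
# chunks plus every single digit as a fallback.
TOY_VOCAB = {"2005", "145", "20", "45"} | set("0123456789")

def toy_tokenize(text):
    """Split `text` by repeatedly taking the longest vocabulary entry that fits."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in TOY_VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"cannot tokenize {text[i]!r}")
    return tokens
```

Three numbers that differ by one digit come out with three different shapes, so nothing about place value is visible in the token sequence itself.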


Ask it to work through the problem first and write down intermediate steps and only write the answer at the end. You should get better results than "wrong answer, then trying to justify it"


It does work through the problem, both with the logic, and with the number puzzle, providing all the intermediate steps necessary to solve it.

The problem here is that all the intermediate steps have serious mistakes in them. It's like asking a Markov chain to do algebra. There are numbers, and letters, and equals signs, and it's all just word soup.

https://pastebin.com/Yy35m6um


You can try the dance problem with this prompt prefix

To answer the following problem, work through it by reasoning step by step and writing that reasoning down, making sure steps are not conflicting with previous steps. Only after you've written down all the steps, write down the final answer and base it on the previous steps.


I tried the dance problem, and regenerated the response three times.

The first two claimed that Michael will go to the dance, but third one made the correct argument that Michael may or may not go to the dance.

It didn't help it do any better on the number problem. Prepending that paragraph still has it get the first step wrong, as well as everything that follows it.

> To solve this puzzle, we need to find the values of A, B, and C that satisfy the equation:

> AAA + BBB = AAAC

> Let's start by looking at the rightmost digit, which is C. We know that C must be either 0 or 1 because the sum of two digits cannot be greater than 18 (9 + 9 = 18). Also, C cannot be 0 because that would mean that A and B would be equal, which is not allowed in this puzzle. Therefore, C must be 1.

... And then it keeps going into la-la land.

The final answer it gives is, by the way:

957 + 483 = 1440


That's how it seems. That said, this seems like a very tractable problem to fix.


Try this prompt:

    For the following question, do all the calculations first and write them out. Only then answer the question based strictly on the result of the calculations. 
    
    What is heavier, 1000 tons of helium or a cubic meter of osmium?

Otherwise, you get a high-probability answer guided by the training data (which contains a lot of trick questions) and a bunch of attempts of the LLM to justify the wrong answer.


The challenge here is that ChatGPT and other LLMs can only think out loud. They only "think" through writing, and that's always displayed to the user.

Has anyone tried giving LLMs a scratchpad where the model could e.g. run the pipeline in order, generate the poem, and then explicitly publish it to the user without showing the earlier steps?


They have! The ReAct[1] model, which is available in LangChain[2]. It can be quite powerful, especially when given access to search tools.

The user just sees the "Final Answer" / Finish response from the chain's execution, even if several invocations across different tools & model invocations were required

1: https://react-lm.github.io/

2: https://langchain.readthedocs.io/en/latest/modules/agents/im...


Bing Chat according to the leaks[1] uses an inner monologue.

I remember other people getting similar results, which suggests it's not a hallucination.

[1] https://www.reddit.com/r/bing/comments/11ironc/bing_reveals_...


If you tweaked inner-monologue prompts to specify delimiters like pipes, then you could presumably parse it before showing to the reader.
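
A minimal sketch of the delimiter idea above, assuming the model has been prompted to wrap its inner monologue in pipes, as in `|thoughts| answer`. That output format is an assumption for illustration; nothing here reflects how Bing actually does it.

```python
def split_monologue(raw, delimiter="|"):
    """Return (hidden_thoughts, visible_answer) from a delimited model reply.

    Assumes the reply looks like '|thoughts| answer'. If there is no
    delimited section, the whole reply is treated as the visible answer.
    """
    parts = raw.split(delimiter)
    if len(parts) < 3:
        return "", raw.strip()
    hidden = delimiter.join(parts[1:-1]).strip()    # everything between the pipes
    visible = (parts[0] + parts[-1]).strip()        # text outside the pipes
    return hidden, visible
```

Only `visible` would be shown to the reader; `hidden` stays in the context for the model's next turn. A real implementation would also need to handle replies where the model uses a pipe inside the answer itself.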

It is possible that Bing Sydney is doing this or something like that based on the PM's tweet: https://twitter.com/MParakhin/status/1632087709060825088

---

One approach here would be prompt injection: just insert the 'No' into your own response so ChatGPT tries completing that. Also:

> I speculate that the temperature, when coupled with the mechanism of generating text based on already-generated text, could explain some cases of ChatGPT stupidity. In cases when ChatGPT should be perfectly accurate, the temperature will surely under-optimize its cleverness, and now the entire conversation is broken, because everything else will depend on what foolishness it just wrote.

Absolutely. This is why 'best-of' sampling (not available in ChatGPT's default interface) can be so useful. You decode many different possibilities in parallel, and the ones where the random decoding makes a fatal error will get discarded and you'll get back the most plausible overall one, which is much more likely to be correct.
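
The best-of idea can be sketched in a few lines. `sample` and `score` are hypothetical callables supplied by the caller: in practice `sample` would draw one completion at nonzero temperature and `score` would be something like the model's total log-probability for that completion, but both are assumptions here.

```python
import random

def best_of(n, sample, score, seed=0):
    """Draw `n` candidates and return the one with the highest score.

    `sample(rng)` returns one candidate; `score(candidate)` returns a number.
    Both are hypothetical stand-ins for model calls.
    """
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=score)
```

A candidate that went wrong early because of one unlucky random token tends to score poorly overall, so it gets discarded instead of steering the rest of the conversation.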


Yes, hopefully I'll write it up soon. TL;DR: I used this on top of GPT-3.5 to generate a magazine page of personalized recommendations: 3x of a title, paragraph, AI art, and a font name, and rationale. For images, I use SD 2.1 via stability.ai. Be sure to add 5400 dpi digital art at the front of your prompt :)

I'm a bit embarrassed to, since "real" research finetunes internal models to play a particular role, rather than orchestrating several "conversations" and hoping your prompt will get you the right output format 100% of the time, etc.

Here's a woefully lacking diagram of this user/interpreter/LLM flow for a cohesive longform story generator. [1]

The coolest part of this design pattern you've ID'd is you can always add one more character/conversation that the interpreter orchestrates

ex. A DB character whose role is taking a new page as input, then outputting the new DB, where the DB is all important facts to sustain over a story. That let me scale to 16+ "pages"

[1] https://twitter.com/jpohhhh/status/1632082749317054468?s=20


You can ask GPT what would be a result of executing a python program, for which a multiple step calculation is needed. It will readily output the result, with no thinking aloud.


I just had this interaction with ChatGPT.

Me: Reverse the digits of 12+39

ChatGPT: The sum of 12 and 39 is 51. If you reverse the digits, you get 15.

Me: Reverse the digits of 12 + 84. Only respond with the reversed digits, no explanation

ChatGPT: The reversed digits of 12 + 84 are 96.

Which makes me think that longer explanations give it more of a chance to think because it gets more passes through the model. Weird!
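
For reference, the expected answers to the two prompts above are easy to compute directly:

```python
def reverse_digits_of_sum(a, b):
    """Add two non-negative integers, then reverse the decimal digits of the sum."""
    total = a + b
    return int(str(total)[::-1])   # note: int() drops any leading zeros

first = reverse_digits_of_sum(12, 39)    # 12 + 39 = 51, reversed: 15
second = reverse_digits_of_sum(12, 84)   # 12 + 84 = 96, reversed: 69
```

So the first reply (15) is right, and the second reply (96) is the sum without the reversal applied.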


It's never going to be great at math problems, it is a language model.


I wonder if ChatGPT could be "wired up" to https://www.wolfram.com/ somehow to "strengthen" that "weakness"?


Yes. And if you give it a database schema, it can answer free-form questions about the data in it by generating SQL queries, so long as you wire up the results (or just manually copy/paste them). Although it does hallucinate fields in tables sometimes - but if your wiring reports errors in a readable way, it will usually self-correct.
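
A rough sketch of that wiring, using Python's built-in `sqlite3`. `ask` is a hypothetical stand-in for the model call, and the prompt wording is an assumption; the part taken from the comment is the loop that reports database errors back in readable form so the model can correct itself.

```python
import sqlite3

def answer_with_sql(question, schema, ask, conn, max_tries=3):
    """Ask for a query, run it, and feed any database error back to the model.

    `ask` is a hypothetical callable: prompt text in, SQL text out.
    """
    prompt = f"Schema:\n{schema}\nWrite one SQLite query that answers: {question}"
    for _ in range(max_tries):
        sql = ask(prompt)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as err:
            # Report the failure readably, e.g. a hallucinated column name.
            prompt += f"\nYour query `{sql}` failed with: {err}\nPlease fix it."
    raise RuntimeError("no working query produced")
```

Running model-written SQL against a real database is risky; a sketch like this would at least want a read-only connection.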

I think the most interesting potential development of this concept would be to give it the ability to spawn child instances to process subtasks (such that each subtask gets its own token window!) and produce intermediate results that it would then combine. It can be done manually (copy/paste) with a lot of handholding; the trick is to come up with a way to automate it, such that it's clear which part of the output is a request to spawn a submodel + its prompt, and the result is also communicated in some way that's clear to the model.


Or it could write code in python and evaluate it, people are experimenting with that sort of thing.


OOoo… Hook that up to the ChatGPT API and let it modify itself with additional code? SkyNet / Matrix here we come!


Amount of compute applied to the problem is roughly linear to the number of input+output tokens. It is hard to predict at what stage the compute is applied to parse and create the embedding representing the problem and when it is applied to actually solve it.

And anyway, probably most of the compute is used to judge the social standing of the person asking the question. And if it is worth bothering to answer it ;)


Does it have a python coprocessor?

I guided it to write a program for me, which it did correctly, and then I asked it to evaluate it on different numeric inputs. It got correct answers for small numbers and the first few positions of map(thefunction,[1,2,3,4,5,6,7,8,9]) before wandering off into bad guesses.


I evaluated ChatGPT on Winogrande Debiased validation set[1], a dataset focused on commonsense reasoning. ChatGPT has an accuracy of 62.75%, below GPT-3's reported accuracy of 77.7%.

https://github.com/ugorsahin/Winogrande_ChatGPT


The most interesting part is that the author can "coerce" GPT into giving a completely opposite answer based on requiring the first token to be Yes or No, and the ways that sometimes it skirts around that without breaking the rule.


There was an interesting comment a while back about the problem of generating "a" or "an" correctly for a token generator. In order to do so, you have to predict what you'll generate next. Smaller models get this wrong. Even ChatGPT, which doesn't get this wrong, has limits on its ability to look ahead into its own likely output. I suspect that this is just a difficult task for a token generator and to fix it naturally requires a much bigger model.

All these hacks that fix problems by maintaining a "train of thought" are fascinating though, given that we seem to have evolved a similar hack.


This is exactly what I talked about in this post https://news.ycombinator.com/item?id=34445896

The reduced version is that decoder-only transformer LLMs cannot generate a hash of a random animal name followed by the animal name; they can only generate a random animal name followed by its hash (assuming the LLM is powerful enough to compute hashes correctly in one forward pass in the first place).


We did some work in exploring why spelling out the rationale before the answer works so well!

Talk: https://madaan.github.io/res/presentations/TwoToTango.pdf

Paper: https://arxiv.org/pdf/2209.07686.pdf


Really this is where you’re better off just jumping to GPT-3. OpenAI has obviously now muddied the waters with the Chat API, let alone making it so damn cheap. But ChatGPT has been tuned to be conversational and verbose. My experience has been that getting what you want by raw-dogging GPT-3 is much more fruitful.


That's my current dilemma. Building a machine-like (in the sense of "responses look like what the 1980s imagined computers will respond like", vs ChatGPT's "responses look like what an overly-human-like p-zombie would say") agent with well-defined output syntax seems to be both easier and harder to build with ChatGPT. I'm kind of on the fence with regards to which one I want to use.


Raw dogging is generally a more fruitful approach.


We don't understand how we understand. Then how can we expect something we created to understand?


You would too. You just don't speak it out loud.


Could this phenomenon be avoided with the addition of another prompt asking it to take account of the discrepancy?


This is a good thing for keeping some semblance of explainability for an AI.

Otherwise you have a true black box.


I wish they would tweak it for honesty.

The damn thing lies entirely too easily, then happily will stick with the lie.

ChatGPT, if you don't know the answer then tell me that; "I don't know" is an acceptable answer.



