More

krackers · 2026-03-30T18:15:57 1774894557

>as if I were no longer able to judge, to decide, without consulting the AI

"the Whispering Earring" – https://gwern.net/doc/fiction/science-fiction/2012-10-03-yva...

krackers · 2026-03-30T04:41:33 1774845693

This needs a story. What did you say to him?

harikb · 2026-03-30T04:45:49 1774845949

He was told, he had a call from Pope

not_your_vase · 2026-03-30T05:08:58 1774847338

Heh, classic Mike (not the Pope one)

MikeNotThePope · 2026-03-30T07:45:17 1774856717

This was maybe 20 years ago. I was looking for a job as a recruiter and just called him. He referred me to an HR rep and I did get an interview from it. Didn’t get the job, but hey, I got a shot!

krackers · 2026-03-29T06:04:51 1774764291

You must be a shape rotator.

krackers · 2026-03-29T00:11:49 1774743109

Beautiful explanation, thanks!

krackers · 2026-03-28T23:59:41 1774742381

Yeah the ERD result was known before I think, see "The Remarkable Robustness of LLMs: Stages of Inference?" https://arxiv.org/abs/2406.19384

But the fact that the intermediary circuits are generic and robust enough that you can just loop them is unexpected. I mean maybe it sort of makes sense in retrospect, the above and other papers showed the middle layers of an LLM behave more like "iterative refinement", so to use a signal processing analogy maybe you just keep applying filters and suppress the noise.

But by that same analogy, I'd predict that you can't just keep repeating layers, at some point you'll suppress the signal as well. Not sure if there was an experiment conducted with how many times you can repeat RYS layers before performance goes back down.

krackers · 2026-03-28T23:50:01 1774741801

Platonic representation hypothesis? https://arxiv.org/abs/2405.07987

Maybe in the same way fourier/wavelet basis is just the most natural way to work with certain signals or images, there's a certain representation for language and a representation for "thinking" that's natural. Maths could itself be said to be an abstract representation of the latter, and even if the idea of a "universal grammar" is dead, maybe there's still some natural space to work with all human languages owing to the shared biological priors we have as humans.

krackers · 2026-03-28T18:32:19 1774722739

On intel macs there used to be single user mode, but even then I don't think you ever had control over the framebuffer.

krackers · 2026-03-27T03:14:52 1774581292

> this uses a harness

This seems like an arbitrary restriction. Tool-use requires a harness, and their whitepaper never defines exactly what counts as valid.

fermentation · 2026-03-27T05:43:41 1774590221

Right, fair, but look at the prompt. For the purpose of testing general intelligence, this seems kind of pointless.

UltraSane · 2026-03-27T06:13:56 1774592036

It isn't arbitrary. They want measure the capability of the general LLM

fc417fc802 · 2026-03-27T08:56:09 1774601769

So if I say "I want to measure your capability as a mechanic" but then also "to ensure an accurate score you're forbidden to use any tools" how are you the human mechanic planning to diagnose and fix the engine problem without wrenches and jack stands and the like? It makes no sense.

That said their harness isn't generic. It includes a ridiculously detailed prompt for how to play this specific game. Forbidding tool use is arbitrary and above all pointless hoop jumping but that doesn't make the linked "achievement" any less fraudulent.

UltraSane · 2026-03-27T10:00:49 1774605649

It is more like restricting the mechanic to only using commercially available tools and not allow them to create CUSTOM tools.

fc417fc802 · 2026-03-27T10:48:37 1774608517

No, that would be analogous to disallowing customized harnesses, ie tooling specially crafted by someone else for the specific task at hand. Insisting that an LLM solve something without the ability to make use of any external tooling whatsoever is almost perfectly analogous to insisting that a human mechanic work on a car with nothing but his own bare hands.

The wrench is to the mechanic as the stock python repl is to the LLM.

UltraSane · 2026-03-27T10:51:27 1774608687

They want the LLM that does the ARC-AGI-3 to be the same LLM that everyone uses.

fc417fc802 · 2026-03-27T11:14:55 1774610095

Rephrase that in terms of the human mechanic and hopefully you can see the error of that reasoning. LLMs that perform tasks (as opposed to merely holding conversations) use tools just like we do. That's literally how we design them to operate.

In fact the LLMs that everyone uses today typically have access to specialized task specific tooling. Obviously specialized tools aren't appropriate for a test that measures the ability to generalize but generic tools are par for the course. Writing a bot to play a game for you would certainly serve to demonstrate an understanding of the task.

UltraSane · 2026-03-27T17:24:35 1774632275

I'm pretty sure the LLM can use tools while doing arc-agi-3 but it has to the same tools available all the time not an incredibly elaborate custom harness.

fc417fc802 · 2026-03-27T21:15:00 1774646100

To quote someone else from upthread, tool use requires a harness. Without one an LLM as commonly understood is a bare model that receives inputs and directly produces outputs the same as talking to an unaided person.

UltraSane · 2026-03-28T08:51:40 1774687900

Then the LLM has to write the harness.

fc417fc802 · 2026-03-28T11:01:33 1774695693

I'd like to suggest that prior to expressing disagreement you really ought to reread the comment you're replying to and make sure your understanding is correct.

Quoting this for the second time now - tool use requires a harness.

Without a harness the LLM has no ability to interact with the world. It has no agency. It's just spitting out text (or whatever else) into the void. There's no programming tools, no filesystem, no shell, nothing.

UltraSane · 2026-03-28T16:52:36 1774716756

And by the rules of arc-agi-3 the LLM will have to write any harness it needs. I'm not sure what we are even arguing about this point.

krackers · 2026-03-26T05:56:24 1774504584

Isn't this basically what javascript went through with Promise chaining "callback hell" that was cleaned up with async/await (and esbuild can still desugar the latter down to the former)

mgaunard · 2026-03-26T19:03:21 1774551801

This is literally what coroutines are, syntactic sugar to generate nested lambdas.

Except in C++ this removes a fair amount of control given how low-level it is.

krackers · 2026-03-26T05:37:39 1774503459

LLMs already do this and have a system role token. As I understand in the past this was mostly just used to set up the format of the conversation for instruction tuning, but now during SFT+RL they probably also try to enforce that the model learns to prioritize system prompt against user prompts to defend against jailbreaks/injections. It's not perfect though, given that the separation between the two is just what the model learns while the attention mechanism fundamentally doesn't see any difference. And models are also trained to be helpful, so with user prompts crafted just right you can "convince" the model it's worth ignoring the system prompt.

marcus_holmes · 2026-03-26T06:08:31 1774505311

Thanks that's useful.

So it's still one stream of tokens as far as the LLM is concerned, but there is some emphasis in training on "trust the system prompt", have I got that right?

veganmosfet · 2026-03-26T05:55:37 1774504537

This! And even more, the role model extends beyond system and user: system > user > tool > assistant. This reflects "authority" and is one of the best "countermeasure": never inject untrusted content in "user" messages, always use "tool".