Since people are not reading the paper, here are its conclusions:
> Is it really easy to answer semantic reasoning questions over symbolic graphics programs? We provide an intriguing experiment to demonstrate that SVG programs can be quite difficult for LLMs to understand such that even if the corresponding rendered images are fairly easy for humans to recognize, all these powerful LLMs still fail dramatically, only reaching a chance-level accuracy.
They then fine-tune the model for this problem, and it achieves better results.
It’s difficult to tell what they’re proposing from the title or abstract, but asking questions about an SVG image, given only the source code, seems like an interesting challenge.
There are many ways to generate the same image, and slight changes to the source code can result in a very different image. This could be used to train a deeper understanding of program-like text.
In the title and abstract there is a proposal that LLMs "understand", "reason", and "imagine." It's off-putting to me, since LLMs are not capable of any of these.
Hard to prove a negative. On the other hand, relying on a poorly defined idea of "emergent behavior" when there is no architectural rationale is rather poor form IMO. It's wordier but safer to qualify these terms appropriately. That is, to test the "appearance" or "simulation" of human-like abilities.
That's all I'm hoping for, an ML model that appears to be as good as any human in existence, for all human skills.
To be fair, this is what we do with humans as well: we test their skills. We (rarely!) take humans apart, inspect their brains, and declare that they have reasoning or understanding capabilities.
Sounds like you are looking for AGI, along with many other folks. Although, be careful of the usage of "as good as", which is hard to quantify (think of artistic endeavors).
The absence of an architectural rationale does not preclude its formation either. Not understanding how something works doesn't mean that something isn't there.
It simply means we don't yet understand, so you should be agnostic in your position. That's why these studies are trying to find answers to these questions, not pretending to know the answer like you are.
Strange you don’t recognize you’re making the very mistake you criticize others for.
So I'll just say it: you lack the knowledge foundations for an informed opinion on this topic.
We know that DNA is "just" a sequence of nucleotides, yet life emerges. How does that happen, going from the world of the nonliving to the living? We don't know. There are theories, but abiogenesis remains one of the most profound mysteries of science. But at the core of these theories is the idea that the process of abiogenesis is informational.
Computer code is information. So from an information-theoretic perspective, it's quite incoherent to say something can't happen because "computer code". In fact, information may well be the entire reason we can reason, or that life has its complexities.
Consider:
1) we know LLMs are universal function approximators
2) suppose the ability to reason, and the degree of reasoning ability, is a functional process that can be represented algorithmically, and can be learned
3) then might LLMs have learned some degree of reasoning ability?
If you insist 3) is impossible, you must also insist 2) is impossible. And since we really don't have a definitive answer for 2) much like we don't have one for abiogenesis, your position is not really defensible. At least not to the degree of conviction you seem to hold.
This is just silly. You can trace precisely how an LLM produces its output. You have access to all of the weights. We know precisely how the model is trained. There is no mystery here.
Two things. First, you are framing this as proving a negative, which assumes something either thinks or does not, understands or does not, imagines or does not.
That's wrong; it's all a spectrum.
Pigeons are stupid relative to a human, yet a pigeon still has some intelligence.
The same goes for worms. Do you think a worm has more intelligence than an LLM?
Second, you say it's "hard to prove a negative", but that fallacy hides the main point: you don't know how to prove that something can understand.
Prove that you understood a concept, any one of your choosing, or prove that you are not merely showing the appearance of intelligence or ability. Good luck.
You avoided the questions as a whole.
LLMs can theoretically emulate any function, so the mechanism is there as well.
If I understand correctly, you are saying that worms are smarter than LLMs?
Yeah, exactly. A network of NN functions is a model for AGI.
My point is that to say an NN can approximate a mathematical function is correct. But to then equate such a function with a generalized procedure (like a simulation of a brain) is a mistake.
Yeah at this point the guy is either trolling or very confused.
It's some crazy stuff, framing your personal opinion on a controversial subject like that as if it's an established fact, when the subject itself is under active study and hardly "solved".
There are some big issues with the ML community today. One is brazen assumptions. If the subject is under study, why write your title and abstract in such a fashion that you assume that LLMs can "understand"? That is the source of my original (fairly mild) objection. If the problem isn't "solved", why assume it is?
> Can Large Language Models Understand Symbolic Graphics Programs?
The title is framed as a question. And even throughout the study, the authors are careful to refer to this apparent reasoning ability with the appropriate qualifiers.
> To understand symbolic programs, LLMs may need to possess the ability to imagine how the corresponding graphics content would look without directly accessing the rendered visual content
> We find that this task distinguishes existing LLMs and models considered good at reasoning perform better
In short, no such assumption is made.
You are so overwhelmed by your personal biases you're not even comprehending what you're reading.
I was recently playing with some LLMs regarding a relay driver circuit. ChatGPT generated a circuit diagram for me in ASCII art style, but it contained some errors. For fun I tried to ask some others if they could identify the errors.
Here's the diagram (formatting errors were from ChatGPT):
After reviewing the diagram, I've found a few potential issues:
- The diode is not connected to the relay coil in a way that would prevent back-EMF from damaging the MOSFET. Typically, the diode is connected in parallel with the relay coil, with the anode connected to the positive terminal of the coil and the cathode connected to the negative terminal. In this diagram, the diode appears to be connected in series with the relay coil, which is not correct.
- The MOSFET is not properly biased. The gate of the MOSFET is connected to the GPIO pin through a 220Ω resistor, but there is no pull-down resistor to ensure the gate is at a known state when the GPIO pin is not driving it. The 10kΩ resistor connected to GND is likely intended to be a pull-down resistor, but it's not clear if it's connected to the gate or not.
- The relay coil is not specified, but it's likely that it requires a higher current than the MOSFET can handle. Typically, a relay driver circuit would include a current-limiting resistor or a transistor to handle the high current required by the relay coil.
So it correctly identified one of the main issues (the flyback diode is wrong) but failed to correctly identify the problem of the MOSFET gate being tied directly to ground. I admit it's not a typical way to draw things, so fair enough; after all, it did mention there was something fishy with the pull-down.
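For reference, what I was aiming for is the usual low-side driver, roughly like this (a sketch, not a verified schematic; I've written the supply as +V since it depends on the relay):

  +V ---+-----------+
        |           |
   [relay coil]    D1
        |           |
        +-----------+
                    |
                    D
  GPIO --[220R]--+--G   Q1: N-channel MOSFET (ideally logic-level)
                 |  S
              [10k] |
                 |  |
  GND -----------+--+

Flyback diode in parallel with the coil (cathode to +V, anode to the drain side), 220Ω into the gate, 10kΩ pull-down from the gate node to GND, source to GND.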
Claude 3 Haiku provided a very similar answer. It was a bit more explicit and included more details. For instance, it thought there might be an issue driving the MOSFET from 3.3V, though it failed to suggest a logic-level MOSFET, instead suggesting a level shifter. It also suggested the 220Ω gate resistor might not be enough to protect the microcontroller. It also tried to correct the diagram but failed: it changed the diode to ->|<- and didn't put it in parallel with the coil.
All in all, somewhat impressed.
However, they get easily confused. I tried to improve the diagram and asked again. Here's the updated diagram:
After examining the diagram, I notice a few potential issues:
- The diode is not connected to the relay coil in a way that would prevent back-EMF damage to the MOSFET. Typically, a flyback diode is connected in parallel with the relay coil, with the anode connected to the coil's positive terminal and the cathode connected to the coil's negative terminal. In this diagram, the diode appears to be connected in series with the relay coil, which is not correct.
- The MOSFET's source (S) is not connected to GND. In a typical relay driver circuit, the source of the MOSFET is connected to GND to provide a return path for the current.
- The 10kΩ resistor is not connected to anything. It's possible that it's intended to be a pull-down resistor for the MOSFET's gate, but it's not clear where it's supposed to be connected.
So it totally fails to "see" that the 10k is now correctly connected to ground, and that it shares the ground point (in the diagram) with the source of the MOSFET. At least it figures it's meant to be a pull-down.
However it did notice the diode is now correctly placed in parallel with the coil, and did again correctly see the diode was the wrong way around. So that's something.
edit: I changed the diode direction and asked if that fixed it. Llama 3.1 did not "see" that and suggested a change.
Claude 3 did much better and noticed the direction change and was happy.
As I said, was just fooling around with this and thought I'd share. They're confident, but sometimes oh-so-wrong. Still, using a service where you can ask the same question to multiple models is nice. That way it can be easier to see when they're confabulating.
What an awful paper title, saying "Symbolic Graphics Programs" when they just mean "vector graphics". I don't understand why they cannot just use the established term instead. Also, there is no "program" here, in the same way that coding HTML is not programming, as vector graphics are not supposed to be Turing complete. And where they pulled the "symbolic" from is completely beyond me.
I'm more curious how they think LLMs can imagine things:
> To understand symbolic programs, LLMs may need to possess the ability to imagine how the corresponding graphics content would look without directly accessing the rendered visual content
To my understanding, LLMs are predictive engines based upon their tokens and embeddings without any ability to "imagine" things.
As such, an LLM might be able to tell you that the following SVG is a black circle because it is in Mozilla documentation[0]:
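For example, something along these lines (a representative snippet, not necessarily the exact one from the docs), where the default fill renders it black:

  <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
    <circle cx="50" cy="50" r="40" />
  </svg>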
I never understand why people don't just test this?
Claude 3.5 Sonnet, first test, prompt: Any idea what this SVG is describing?:
This SVG describes a simple geometric composition consisting of three circles:
1. The largest circle is centered at coordinates (100, 100) with a radius of 50 units.
2. A smaller circle is positioned at the top-left, centered at (50, 50) with a radius of 40 units.
3. Another smaller circle is positioned at the top-right, centered at (150, 50), also with a radius of 40 units.
The overall effect, if rendered, would resemble a basic face-like structure or a simplified Mickey Mouse silhouette, with two smaller circles (potentially representing ears or eyes) above a larger central circle (potentially representing a face or head).
The SVG uses a viewBox of "0 0 175 175", which defines the coordinate system and aspect ratio for the image.
Would you like me to create a visual representation of this SVG for you?
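For reference, the SVG in question was along these lines (going by the description above, three circles):

  <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 175 175">
    <circle cx="100" cy="100" r="50" />
    <circle cx="50" cy="50" r="40" />
    <circle cx="150" cy="50" r="40" />
  </svg>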
> I never understand why people don't just test this?
I never understand why people take things out of context instead of responding to the totality of a message, since I started off by saying:
> To my understanding,
You have responded not by providing your understanding but by belittling me.
Nonetheless, I appreciate you taking the time to query Claude-3.5 and sharing the results.
According to "Amazon Bedrock adds Claude 3.5 Sonnet"[0], it is a multimodal model with state-of-the-art vision processing, which tells me that Claude can "cheat" by rendering the SVG and then passing it to the image model for the results. That is perhaps one way of describing "imagining" things, but it does not describe a strict Large Language Model which predicts the next token.
> The multimodal Claude 3.5 Sonnet also excels at processing images with state-of-the-art vision, particularly when interpreting charts and graphs—helping to get faster, deeper insights from data. It can accurately decipher text from imperfect images—for example, poorly scanned documents—and in doing so, glean more insights than from text alone.
GPT4o certainly gave it a good go (prompt: "What cartoon character would you say this SVG code renders most closely to?"). That's maybe a little too "prompting" tbf:
"""
The SVG code you provided creates an image with three circles:
A large circle at the center (at coordinates 100, 100) with a radius of 50.
Two smaller circles at the top left (at coordinates 50, 50) and top right (at coordinates 150, 50) with a radius of 40 each.
This configuration closely resembles the head and ears of Mickey Mouse. The larger circle represents the head, and the two smaller circles represent the ears.
"""
Slightly more generic prompt: "What is this code drawing? Can you suggest what it might be representing?"
"""
What It Might Represent:
Mickey Mouse Head: The arrangement resembles a simplified "Mickey Mouse" head, where the large central circle represents the head, and the two smaller circles represent the ears.
Abstract Art: Alternatively, it could be an abstract design with three overlapping circles of different sizes, suggesting some form of simple geometric art.
The circles are positioned such that the smaller ones are symmetrically placed above the larger central one, which is a typical representation of a head with ears.
"""
And more generic "I've been sent this code, what is it drawing?" ends with:
"""
You will see three overlapping circles. The largest circle is near the center, while two smaller circles are positioned above and slightly to the left and right of the center. The arrangement may resemble a basic representation of a Mickey Mouse head shape, with the large circle as the face and the two smaller circles as ears.
"""
I appreciate you giving multiple examples, including an extremely surprising failure given the other successes, without assuming malice on my part in asking.
What's imagining, then? The way LLMs explore different predictive branches in order to find an optimal solution doesn't seem all that different from what I consider imagining: thinking about what could be and considering different variations on that idea.
An LLM isn't a brain, so there is no implication of it being said in the truest human sense, but it seems like a decent analogy to me.
As I mentioned elsewhere, all the modern models are multimodal, so I think it would be fair to say rendering the SVG and then having the dedicated image model classify it is similar to "imagining", but it is definitely not strictly part of a Large Language Model which only predicts the next token based on previous tokens.
If the LLM saves the SVG vector graphic to a raster image like a PNG and prompts with that instead, it will have no trouble labeling what's depicted in the SVG.
So, the task is "describe what an SVG depicts without saving it to a raster image and prompting with that"?
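Pretty much, and the rasterize-and-ask path is only a few lines in a browser. A minimal sketch, assuming the resulting data URL then goes to a vision model (that call is omitted, and svgToPngDataUrl is just my name for it):

  // Rasterize an SVG string to a PNG data URL via an offscreen <canvas>.
  async function svgToPngDataUrl(svg: string, w: number, h: number): Promise<string> {
    const url = URL.createObjectURL(new Blob([svg], { type: "image/svg+xml" }));
    const img = new Image();
    await new Promise((ok, err) => { img.onload = ok; img.onerror = err; img.src = url; });
    const canvas = document.createElement("canvas");
    canvas.width = w;
    canvas.height = h;
    canvas.getContext("2d")!.drawImage(img, 0, 0, w, h); // scale to the target size
    URL.revokeObjectURL(url);
    return canvas.toDataURL("image/png"); // this is what you'd hand to the vision model
  }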
> Given the arrangement of three overlapping circles, it resembles the classic depiction of a *Mickey Mouse* head silhouette:
> The two smaller circles represent Mickey's ears.
> The larger circle represents his head.
> This is a stylized version of the iconic Mickey Mouse logo.
Imo: in order to predict the next token for non-trivial tokens (of which there are many in the training data), you do have to do some more complex thinking/reasoning than just a lookup of past training data.
Can you challenge it and e.g. double the radius of the last circle? That should make the rendered output unintelligible. Let's see if it still thinks it represents the iconic logo.
It passes the test you've proposed. For the below results, the prompt is the same for A and B, except for in B the radius of the last circle is twice as big to make the rendered output unintelligible.
A. Valid Mickey is detected by the model. "...This arrangement might resemble a basic version of a Mickey Mouse shape, where the two smaller circles represent the ears and the larger circle represents the head...", https://chatgpt.com/share/3999859a-b6db-4671-8b69-0ec6a5bac3...
B. Invalid Mickey is not misclassified as Mickey by the model and is correctly described. "...these circles will overlap, creating a pattern where the largest circle (Circle 3) dominates the right side of the canvas, with the other two smaller circles overlapping it and each other in the middle...", https://chatgpt.com/share/df3c57ac-495b-4e4c-b00c-bae31781c4...
Changing the last one to r=80 instead of r=40, GPT4o ends up saying:
> The circles overlap in certain areas, depending on their size and position, creating a layered visual effect where the largest circle (third one) dominates most of the canvas space.
Though tbf to it, I'm not sure I'd say this looks like MM either: https://i.imgur.com/0VHdocf.png (unless I knew this was the intent prior).
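For reference, the variant I tested was along these lines (same three circles as above, last radius doubled to 80):

  <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 175 175">
    <circle cx="100" cy="100" r="50" />
    <circle cx="50" cy="50" r="40" />
    <circle cx="150" cy="50" r="80" />
  </svg>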
> Also, there is no "program" here, in the same way that coding HTML is not programming, as vector graphics are not supposed to be Turing complete.
I think the reason we don't view HTML as a programming language is that it is explicitly designed to be a markup language that declares content, rather than a series of instructions interpreted as a program. A program needn't demonstrate Turing completeness to be a "computer program"; it just needs to be a sequence of instructions that a computer executes. To me, that suggests there's a degree of abstractness and subjectivity involved. For example, any SVG document could be rewritten 1:1, with no loss in fidelity, as a series of commands that has the same effect, as can pretty much any declarative markup language; what actually happens during parsing is hard to distinguish from an interpreter. Humans can "know it when they see it", but I doubt there's an exact criterion to go along with the human "feel" of what makes a program a program.
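To make that concrete (a sketch of mine, assuming a <canvas> element is on the page): the declarative <circle cx="100" cy="100" r="50"/> rewrites one-to-one into imperative drawing calls:

  // Imperative rewrite of <circle cx="100" cy="100" r="50"/> via the Canvas API.
  const canvas = document.querySelector("canvas") as HTMLCanvasElement;
  const ctx = canvas.getContext("2d")!;
  ctx.beginPath();
  ctx.arc(100, 100, 50, 0, 2 * Math.PI); // same cx, cy, r as the SVG attributes
  ctx.fill();                            // both default to a black fill

A parser walking the SVG and emitting calls like these is hard to distinguish from an interpreter, which is the point.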
> Also, there is no "program" here, in the same way that coding HTML is not programming, as vector graphics are not supposed to be Turing complete. And where they pulled the "symbolic" from is completely beyond me.
Aren't HTML and vector graphics descriptions both data structures that could be interpreted via a Turing-complete interpreter? I don't see the difference between HTML and a C AST in this regard.