Since people are not reading the paper, here are its conclusions:
> Is it really easy to answer semantic reasoning questions over symbolic graphics programs? We provide an intriguing experiment to demonstrate that SVG programs can be quite difficult for LLMs to understand such that even if the corresponding rendered images are fairly easy for humans to recognize, all these powerful LLMs still fail dramatically, only reaching a chance-level accuracy.
They then fine-tune the model for this problem, and it achieves better results.
It’s difficult to tell what they’re proposing from the title or abstract, but asking questions about an SVG image, given only the source code, seems like an interesting challenge.
There are many ways to generate the same image, and slight changes to the source code can result in a very different image. This could be used to train a deeper understanding of program-like text.
In the title and abstract there is a proposal that LLMs "understand", "reason", and "imagine." It's off-putting to me, since LLMs are not capable of any of these.
Hard to prove a negative. On the other hand, relying on a poorly defined idea of "emergent behavior" when there is no architectural rationale is rather poor form IMO. It's wordier but safer to qualify these terms appropriately. That is, to test the "appearance" or "simulation" of human-like abilities.
That's all I'm hoping for, an ML model that appears to be as good as any human in existence, for all human skills.
To be fair, this is what we do with humans as well: we test their skills. We (rarely!) take humans apart, inspect their brains, and declare that they have reasoning or understanding capabilities.
Sounds like you are looking for AGI, along with many other folks. Although, be careful of the usage of "as good as", which is hard to quantify (think of artistic endeavors).
The absence of an architectural rationale does not preclude its formation either. Not understanding how something works doesn't mean that something isn't there.
It simply means we don't yet understand, so you should be agnostic in your position. That's why these studies are trying to find answers to these questions, not pretending to know the answer like you are.
Strange you don’t recognize you’re making the very mistake you criticize others for.
So I'll just say it: you lack the knowledge foundations for an informed opinion on this topic.
We know that DNA is "just" a sequence of nucleotides, yet life emerges. How does that happen, going from the world of the nonliving to the living? We don't know. There are theories, but abiogenesis remains one of the most profound mysteries of science. But at the core of these theories is the idea that the process of abiogenesis is informational.
Computer code is information. So from an information-theoretic perspective, it's quite incoherent to say something can't happen because "computer code". In fact, information may well be the entire reason we can reason, or that life has its complexities.
Consider:
1) we know LLMs are universal function approximators
2) suppose the ability to reason, and the degree of reasoning ability, is a functional process that can be represented algorithmically, and can be learned
3) then might LLMs have learned some degree of reasoning ability?
If you insist 3) is impossible, you must also insist 2) is impossible. And since we really don't have a definitive answer for 2) much like we don't have one for abiogenesis, your position is not really defensible. At least not to the degree of conviction you seem to hold.
This is just silly. You can trace precisely how an LLM produces its output. You have access to all of the weights. We know precisely how the model is trained. There is no mystery here.
Two things. First, you are framing this as proving a negative, which assumes something either thinks or does not, understands or does not, imagines or does not.
That's wrong; it's all a spectrum.
Pigeons are stupid relative to a human, yet a pigeon still has some intelligence.
The same goes for worms. Do you think a worm has more intelligence than an LLM?
Second, you say it's "hard to prove a negative", but that fallacy hides the main point: you don't know how to prove that something can understand.
Prove that you understood a concept, any one of your choosing, or prove that you are not merely showing the appearance of intelligence or ability. Good luck.
You avoided the questions as a whole.
LLMs can theoretically emulate any function, so the mechanism is there as well.
If I understand correctly, you are saying that worms are smarter than LLMs?
Yeah, exactly. A network of NN functions is a model for AGI.
My point is that to say an NN can approximate a mathematical function is correct. But to then equate such a function with a generalized procedure (like a simulation of a brain) is a mistake.
Yeah at this point the guy is either trolling or very confused.
It's some crazy stuff, framing your personal opinion on a controversial subject like that as if it's an established fact, when the subject itself is under active study and hardly "solved".
There are some big issues with the ML community today. One is brazen assumptions. If the subject is under study, why write your title and abstract in such a fashion that you assume that LLMs can "understand"? That is the source of my original (fairly mild) objection. If the problem isn't "solved", why assume it is?
> Can Large Language Models Understand Symbolic Graphics Programs?
The title is framed as a question. And even throughout the study, the authors are careful to refer to this apparent reasoning ability with the appropriate qualifiers.
> To understand symbolic programs, LLMs may need to possess the ability to imagine how the corresponding graphics content would look without directly accessing the rendered visual content
> We find that this task distinguishes existing LLMs and models considered good at reasoning perform better
In short, no such assumption is made.
You are so overwhelmed by your personal biases you're not even comprehending what you're reading.
I was recently playing with some LLMs regarding a relay driver circuit. ChatGPT generated a circuit diagram for me in ASCII art style, but it contained some errors. For fun I tried to ask some others if they could identify the errors.
Here's the diagram (formatting errors were from ChatGPT):
After reviewing the diagram, I've found a few potential issues:
- The diode is not connected to the relay coil in a way that would prevent back-EMF from damaging the MOSFET. Typically, the diode is connected in parallel with the relay coil, with the anode connected to the positive terminal of the coil and the cathode connected to the negative terminal. In this diagram, the diode appears to be connected in series with the relay coil, which is not correct.
- The MOSFET is not properly biased. The gate of the MOSFET is connected to the GPIO pin through a 220Ω resistor, but there is no pull-down resistor to ensure the gate is at a known state when the GPIO pin is not driving it. The 10kΩ resistor connected to GND is likely intended to be a pull-down resistor, but it's not clear if it's connected to the gate or not.
- The relay coil is not specified, but it's likely that it requires a higher current than the MOSFET can handle. Typically, a relay driver circuit would include a current-limiting resistor or a transistor to handle the high current required by the relay coil.
So it correctly identified one of the main issues (the flyback diode is wrong) but failed to correctly identify the problem of the MOSFET gate being tied directly to ground. I admit it's not a typical way to draw things, so fair enough; after all, it did mention there was something fishy with the pull-down.
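For reference, what I was aiming for is the usual low-side driver, roughly like this (a sketch, not a verified schematic; I've written the supply as +V since it depends on the relay):

  +V ---+-----------+
        |           |
   [relay coil]    D1
        |           |
        +-----------+
                    |
                    D
  GPIO --[220R]--+--G   Q1: N-channel MOSFET (ideally logic-level)
                 |  S
              [10k] |
                 |  |
  GND -----------+--+

Flyback diode in parallel with the coil (cathode to +V, anode to the drain side), 220Ω into the gate, 10kΩ pull-down from the gate node to GND, source to GND.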
Claude 3 Haiku provided a very similar answer. It was a bit more explicit and included more details. For instance, it thought there might be an issue driving the MOSFET from 3.3V, though it failed to suggest a logic-level MOSFET, instead suggesting a level shifter. It also suggested the 220Ω gate resistor might not be enough to protect the microcontroller. It also tried to correct the diagram but failed: it changed the diode to ->|<- and didn't put it in parallel with the coil.
All in all, somewhat impressed.
However, they get easily confused. I tried to improve the diagram and asked again. Here's the updated diagram:
After examining the diagram, I notice a few potential issues:
- The diode is not connected to the relay coil in a way that would prevent back-EMF damage to the MOSFET. Typically, a flyback diode is connected in parallel with the relay coil, with the anode connected to the coil's positive terminal and the cathode connected to the coil's negative terminal. In this diagram, the diode appears to be connected in series with the relay coil, which is not correct.
- The MOSFET's source (S) is not connected to GND. In a typical relay driver circuit, the source of the MOSFET is connected to GND to provide a return path for the current.
- The 10kΩ resistor is not connected to anything. It's possible that it's intended to be a pull-down resistor for the MOSFET's gate, but it's not clear where it's supposed to be connected.
So it totally fails to "see" that the 10k is now correctly connected to ground, and that it shares the ground point (in the diagram) with the source of the MOSFET. At least it figures it's meant to be a pull-down.
However it did notice the diode is now correctly placed in parallel with the coil, and did again correctly see the diode was the wrong way around. So that's something.
edit: I changed the diode direction and asked if that fixed it. Llama 3.1 did not "see" that and suggested a change.
Claude 3 did much better and noticed the direction change and was happy.
As I said, was just fooling around with this and thought I'd share. They're confident, but sometimes oh-so-wrong. Still, using a service where you can ask the same question to multiple models is nice. That way it can be easier to see when they're confabulating.
What an awful paper title, saying "Symbolic Graphics Programs" when they just mean "vector graphics". I don't understand why they cannot just use the established term instead. Also, there is no "program" here, in the same way that coding HTML is not programming, as vector graphics are not supposed to be Turing complete. And where they pulled the "symbolic" from is completely beyond me.
I'm more curious how they think LLMs can imagine things:
> To understand symbolic programs, LLMs may need to possess the ability to imagine how the corresponding graphics content would look without directly accessing the rendered visual content
To my understanding, LLMs are predictive engines based upon their tokens and embeddings without any ability to "imagine" things.
As such, an LLM might be able to tell you that the following SVG is a black circle because it is in Mozilla documentation[0]:
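For example, something along these lines (a representative snippet, not necessarily the exact one from the docs), where the default fill renders it black:

  <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
    <circle cx="50" cy="50" r="40" />
  </svg>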
I never understand why people don't just test this?
Claude 3.5 Sonnet, first test, prompt: Any idea what this SVG is describing?:
This SVG describes a simple geometric composition consisting of three circles:
1. The largest circle is centered at coordinates (100, 100) with a radius of 50 units.
2. A smaller circle is positioned at the top-left, centered at (50, 50) with a radius of 40 units.
3. Another smaller circle is positioned at the top-right, centered at (150, 50), also with a radius of 40 units.
The overall effect, if rendered, would resemble a basic face-like structure or a simplified Mickey Mouse silhouette, with two smaller circles (potentially representing ears or eyes) above a larger central circle (potentially representing a face or head).
The SVG uses a viewBox of "0 0 175 175", which defines the coordinate system and aspect ratio for the image.
Would you like me to create a visual representation of this SVG for you?
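For reference, the SVG in question was along these lines (going by the description above, three circles):

  <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 175 175">
    <circle cx="100" cy="100" r="50" />
    <circle cx="50" cy="50" r="40" />
    <circle cx="150" cy="50" r="40" />
  </svg>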
> I never understand why people don't just test this?
I never understand why people take things out of context instead of responding to the totality of a message, since I started off by saying:
> To my understanding,
You have responded not by providing your understanding but by belittling me.
Nonetheless, I appreciate you taking the time to query Claude-3.5 and sharing the results.
According to "Amazon Bedrock adds Claude 3.5 Sonnet"[0], it is a multimodal model with state-of-the-art vision processing, which tells me that Claude can "cheat" by rendering the SVG and then passing it to the image model for the results. That is perhaps one way of describing "imagining" things, but it does not describe a strict Large Language Model which predicts the next token.
> The multimodal Claude 3.5 Sonnet also excels at processing images with state-of-the-art vision, particularly when interpreting charts and graphs—helping to get faster, deeper insights from data. It can accurately decipher text from imperfect images—for example, poorly scanned documents—and in doing so, glean more insights than from text alone.
GPT4o certainly gave it a good go (prompt: "What cartoon character would you say this SVG code renders most closely to?"). That's maybe a little too "prompting" tbf:
"""
The SVG code you provided creates an image with three circles:
A large circle at the center (at coordinates 100, 100) with a radius of 50.
Two smaller circles at the top left (at coordinates 50, 50) and top right (at coordinates 150, 50) with a radius of 40 each.
This configuration closely resembles the head and ears of Mickey Mouse. The larger circle represents the head, and the two smaller circles represent the ears.
"""
Slightly more generic prompt: "What is this code drawing? Can you suggest what it might be representing?"
"""
What It Might Represent:
Mickey Mouse Head: The arrangement resembles a simplified "Mickey Mouse" head, where the large central circle represents the head, and the two smaller circles represent the ears.
Abstract Art: Alternatively, it could be an abstract design with three overlapping circles of different sizes, suggesting some form of simple geometric art.
The circles are positioned such that the smaller ones are symmetrically placed above the larger central one, which is a typical representation of a head with ears.
"""
And more generic "I've been sent this code, what is it drawing?" ends with:
"""
You will see three overlapping circles. The largest circle is near the center, while two smaller circles are positioned above and slightly to the left and right of the center. The arrangement may resemble a basic representation of a Mickey Mouse head shape, with the large circle as the face and the two smaller circles as ears.
"""
I appreciate you giving multiple examples, including an extremely surprising failure given the other successes, without assuming malice on my part in asking.
What's imagining, then? The way LLMs explore different predictive branches in order to find an optimal solution doesn't seem all that different from what I consider imagining: thinking about what could be and considering different variations on that idea.
An LLM isn't a brain, so there is no implication of it being said in the truest human sense, but it seems like a decent analogy to me.
As I mentioned elsewhere, all the modern models are multimodal, so I think it would be fair to say rendering the SVG and then having the dedicated image model classify it is similar to "imagining", but it is definitely not strictly part of a Large Language Model which only predicts the next token based on previous tokens.
If the LLM saves the SVG vector graphic to a raster image like a PNG and prompts with that instead, it will have no trouble labeling what's depicted in the SVG.
So, the task is "describe what an SVG depicts without saving it to a raster image and prompting with that"?
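Pretty much, and the rasterize-and-ask path is only a few lines in a browser. A minimal sketch, assuming the resulting data URL then goes to a vision model (that call is omitted, and svgToPngDataUrl is just my name for it):

  // Rasterize an SVG string to a PNG data URL via an offscreen <canvas>.
  async function svgToPngDataUrl(svg: string, w: number, h: number): Promise<string> {
    const url = URL.createObjectURL(new Blob([svg], { type: "image/svg+xml" }));
    const img = new Image();
    await new Promise((ok, err) => { img.onload = ok; img.onerror = err; img.src = url; });
    const canvas = document.createElement("canvas");
    canvas.width = w;
    canvas.height = h;
    canvas.getContext("2d")!.drawImage(img, 0, 0, w, h); // scale to the target size
    URL.revokeObjectURL(url);
    return canvas.toDataURL("image/png"); // this is what you'd hand to the vision model
  }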
> Given the arrangement of three overlapping circles, it resembles the classic depiction of a *Mickey Mouse* head silhouette:
> The two smaller circles represent Mickey's ears.
> The larger circle represents his head.
> This is a stylized version of the iconic Mickey Mouse logo.
Imo: in order to predict the next token for non-trivial tokens (of which there are many in the training data), you do have to do some more complex thinking/reasoning than just a lookup of past training data.
Can you challenge it and e.g. double the radius of the last circle? That should make the rendered output unintelligible. Let's see if it still thinks it represents the iconic logo.
It passes the test you've proposed. For the below results, the prompt is the same for A and B, except for in B the radius of the last circle is twice as big to make the rendered output unintelligible.
A. Valid Mickey is detected by the model. "...This arrangement might resemble a basic version of a Mickey Mouse shape, where the two smaller circles represent the ears and the larger circle represents the head...", https://chatgpt.com/share/3999859a-b6db-4671-8b69-0ec6a5bac3...
B. Invalid Mickey is not misclassified as Mickey by the model and is correctly described. "...these circles will overlap, creating a pattern where the largest circle (Circle 3) dominates the right side of the canvas, with the other two smaller circles overlapping it and each other in the middle...", https://chatgpt.com/share/df3c57ac-495b-4e4c-b00c-bae31781c4...
Changing the last one to r=80 instead of r=40, GPT4o ends up saying:
> The circles overlap in certain areas, depending on their size and position, creating a layered visual effect where the largest circle (third one) dominates most of the canvas space.
Though tbf to it, I'm not sure I'd say this looks like MM either: https://i.imgur.com/0VHdocf.png (unless I knew this was the intent prior).
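For reference, the variant I tested was along these lines (same three circles as above, last radius doubled to 80):

  <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 175 175">
    <circle cx="100" cy="100" r="50" />
    <circle cx="50" cy="50" r="40" />
    <circle cx="150" cy="50" r="80" />
  </svg>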
> Also, there is no "program" here, in the same way that coding HTML is not programming, as vector graphics are not supposed to be Turing complete.
I think the reason we don't view HTML as a programming language is that it is explicitly designed to be a markup language that declares content, rather than a series of instructions interpreted as a program. A program needn't demonstrate Turing completeness to be a "computer program"; it just needs to be a sequence of instructions that a computer executes. To me, that suggests there's a degree of abstractness and subjectivity involved. For example, any SVG document could be rewritten 1:1, with no loss in fidelity, as a series of commands that has the same effect, as can pretty much any declarative markup language; what actually happens during parsing is hard to distinguish from an interpreter. Humans can "know it when they see it", but I doubt there's an exact criterion to go along with the human "feel" of what makes a program a program.
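To make that concrete (a sketch of mine, assuming a <canvas> element is on the page): the declarative <circle cx="100" cy="100" r="50"/> rewrites one-to-one into imperative drawing calls:

  // Imperative rewrite of <circle cx="100" cy="100" r="50"/> via the Canvas API.
  const canvas = document.querySelector("canvas") as HTMLCanvasElement;
  const ctx = canvas.getContext("2d")!;
  ctx.beginPath();
  ctx.arc(100, 100, 50, 0, 2 * Math.PI); // same cx, cy, r as the SVG attributes
  ctx.fill();                            // both default to a black fill

A parser walking the SVG and emitting calls like these is hard to distinguish from an interpreter, which is the point.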
> Also, there is no "program" here, in the same way that coding HTML is not programming, as vector graphics are not supposed to be Turing complete. And where they pulled the "symbolic" from is completely beyond me.
Aren't HTML and vector graphics descriptions both data structures that could be interpreted via a Turing-complete interpreter? I don't see the difference between HTML and a C AST in this regard.