How much of this is expectation-setting by the heights models reach? I.e., if we could assess a consistent floor of model performance in a vacuum, would we say it's better at "AGI" than the bottom 0.1% of humans?
Not sure how to answer because we were off on a tangent there about mental models.
I think AGI is two things. Intelligence at a given task, which can be scored versus humans or otherwise. And generalization which is entirely separate. We already have superhuman non-general models in a few domains.
So I don't think that "better than AGI at % of humans" is a sensible statement, at least not initially.
Right now humans generalize to all integers while AI companies keep manually adding additional integers to a finite list and bystanders make claims of generality. If you've still got a finite list you aren't general regardless of how long the list is.
If at some point a model shows up that works on all even integers but not odd ones then I guess you could reasonably claim you had AGI that was 50% of what humans achieve. If a model that generalizes to all the reals shows up then it will have exceeded human generality by an infinite degree. We'll cross those bridges when we come to them - I don't think we're there yet.
Interestingly, I find that the models generalize decently well as long as the "training" (more analogous to that for humans) fits in (small enough) context. That's to say, "in-context learning" seems good enough for real use.
Given that models don't currently learn as they go isn't that exactly what this benchmark is testing? If the model needs to either have been explicitly trained in a similar environment or else to have a human manually input a carefully crafted prompt then it isn't general. The latter case is a human tuning a powerful tool.
If it can add the necessary bits to its own prompt while working on the benchmark then it's generalizing.
(no idea but) I feel like changing the first number has a psychological issue, but the 2nd number feels more important than just "minor" sometimes. So may as well let the schema set the mind free?
Interesting. I've felt like it's never been easier to learn things, but I suppose that's not quite the same as "acquiring new skills". I don't know if it applies, but it's always been easy to take the easy way out?
I feel like AI has made it a bit easier to do harder things too.
I don't think lived experience matters too much to me.
In some sense, AI has very unique "lived" experience, which is what creates the voice it uses ("doesn't have a voice" seems like an impossibility to me by definition).
I find AI very "human-esque", and its "self-reported" phenomenology is very entertaining to me, at least.
I also think AI writing might feel trashy because most human writing is trashy.
Really interesting. I was thinking about something similar regarding the shape of code. I have no qualms recommending my agents take static analysis to the extreme, though it would be cumbersome for most people.
That's an interesting question ... how should a less experienced developer use AI productively, and learn while developing? Certainly using it as a magic genie and vibe coding something you are in no position to evaluate is not the way to go, nor is that a good way for anyone to use AI if you care about the quality or specifics of the end result!
There's always going to be some overlap, wanting to use a new skill/library in a production system, but maybe in general it's best to think of learning and writing/generating production code as two separate things. AI is great for learning and exploration, but you don't want to be submitting your experiments as PRs!
A good rule of thumb might be: can you explain any AI-generated design and code as well as if you had written it yourself? If you don't fully understand it, then you are not in a good position to own it and take responsibility for it (bugs, performance, edge-case behavior, ease of debugging, flexibility for future enhancement, etc.).
Linear walkthrough: I ask my agents to give me a numbered tree. Controlling tree size specifies granularity. Numbering means it's simple to refer to points for discussion.
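To make the numbered-tree idea concrete, here is a rough sketch of the shape I mean. The `Node` type, the `numbered_tree` helper, and the `max_depth` knob are all made up for illustration; in practice the agent just writes the outline as text. The point is only that a depth limit controls granularity and the hierarchical numbers give you stable handles like "1.2" to refer to.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One step in a walkthrough (hypothetical structure, for illustration only)."""
    title: str
    children: list["Node"] = field(default_factory=list)


def numbered_tree(nodes, max_depth, prefix="", depth=1):
    """Render nodes as indented lines numbered 1, 1.1, 1.2, 2, ...

    max_depth caps how deep the outline goes, which is the granularity control.
    """
    lines = []
    if depth > max_depth:
        return lines
    for i, node in enumerate(nodes, start=1):
        number = f"{prefix}{i}"
        indent = "  " * (depth - 1)
        lines.append(f"{indent}{number} {node.title}")
        # Children inherit the parent's number as a prefix, e.g. "1." -> "1.1".
        lines.extend(
            numbered_tree(node.children, max_depth, prefix=f"{number}.", depth=depth + 1)
        )
    return lines


walkthrough = [
    Node("Parse request", [Node("Validate headers"), Node("Decode body")]),
    Node("Write response"),
]

print("\n".join(numbered_tree(walkthrough, max_depth=2)))
```

With `max_depth=1` you get only the two top-level steps; with `max_depth=2` the sub-steps appear as 1.1 and 1.2, so you can say "expand 1.2" or "I disagree with 2" without quoting anything.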
Other things that I feel are useful:
- Very strict typing/static analysis
- Denying tool usage with a hook telling the agent why+what they should do (instead of simple denial, or dangerously accepting everything)
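For the hook point, a minimal sketch of what I mean, not tied to any particular agent framework. The `Decision` type, the `RULES` table, and `pre_tool_hook` are hypothetical names; a real setup would use whatever hook mechanism your agent tooling provides. What matters is that a denial carries both the reason and the alternative back to the agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """Result of checking one proposed tool call (hypothetical type)."""
    allowed: bool
    reason: str = ""
    suggestion: str = ""


# Hypothetical rules: (substring to block, why it is blocked, what to do instead).
RULES = [
    (
        "git commit --no-verify",
        "Skipping commit hooks bypasses the type checker and linters.",
        "Fix the reported errors and commit without --no-verify.",
    ),
    (
        "rm -rf",
        "Recursive deletes are hard to undo.",
        "Delete specific files by path, or ask the user first.",
    ),
]


def pre_tool_hook(command: str) -> Decision:
    """Deny a command if it matches a rule, explaining why and what to do instead."""
    for pattern, reason, suggestion in RULES:
        if pattern in command:
            return Decision(False, reason, suggestion)
    return Decision(True)


def render(decision: Decision) -> str:
    """Format the decision as the message the agent would see."""
    if decision.allowed:
        return "allowed"
    return f"DENIED: {decision.reason} Instead: {decision.suggestion}"


print(render(pre_tool_hook("git commit --no-verify -m 'wip'")))
print(render(pre_tool_hook("git status")))
```

Substring matching is deliberately crude here. The design choice I care about is the message: a bare "denied" tends to make the agent retry variations, while a reason plus a suggested next step gives it somewhere to go.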