I've seen this claimed, but I'm not sure it's been true for my use cases? I should try a more involved analysis, but so far open models seem much less even in their skills. I think this makes sense if a lot of them are built from distillations of larger models. It seems likely that with task-specific fine-tuning this is true?
> I've seen this claimed, but I'm not sure it's been true for my use cases?
I'd be surprised if it isn't true for your use cases. If you give GLM-5.1 and Opus 4.6 the same coding task, they will both produce code that passes all the tests. In both cases the code will be crap, as no model I've seen produces good code. GLM-5.1 is actually slightly better at following instructions exactly than Opus 4.6 (but maybe not 4.7, as that's an area they addressed).
I've asked GLM-5.1 and Opus 4.6 to find a bug caused by a subtle race condition (the race condition leads to a number being 15172580 instead of 15172579 after about 3 months of CPU time). Both found it, in a similar amount of time. Several senior engineers had stared at the code for literally days and didn't find it.
There is no doubt the models do vary in performance at various tasks, but we are talking about the difference between Ferrari and Mercedes in F1. While the differences are undeniable, this isn't F1. Things take a year to change there. The performance of the models from Anthropic and OpenAI literally changes day by day, often not due to the model itself but because of the horsepower those companies choose to give them on the day, or because they tweak their own system prompts. You can find no end of posts here from people screaming in frustration that the thing that worked yesterday doesn't work today, or that they suddenly find themselves running out of tokens, or that their favoured tool is blocked. It's not at all obvious the differences between the open-source models and the proprietary ones are worse than those day-to-day ones the proprietary companies inflict on us.
If you don't know C, in older versions that can be a catastrophic failure. (The issue is so serious in modern C `free(NULL)` is a no-op.) If it's difficult to get a `FOO == NULL` without extensive mocking (this is often the case) most programmers won't do it, so it won't be caught by unit tests. The LLMs almost never get unit test coverage up high enough to catch issues like this without heavy prompting.
But that's the least of it. The models (all of them) are absolutely hopeless at DRY'ing out the code, and when they do, they turn it into spaghetti, because they seem almost oblivious to isolation boundaries, even when these are spelt out to them.
None of this is a problem if you are vibe coding, but you can only do that when you're targeting a pretty low quality level. That's entirely appropriate in some cases of course, but when it isn't you need heavy reviews from skilled programmers. No senior engineer is going to stomach the repeated stretches of almost the "same but not quite" code they churn out.
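A toy illustration of the "same but not quite" pattern, with invented function names. The model tends to emit the first two near-identical loops; a reviewer would ask for the third, where the one varying piece (the predicate) is factored out:

```c
/* What the model tends to write: two near-identical copies. */
static int sum_evens(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        if (a[i] % 2 == 0) s += a[i];
    return s;
}

static int sum_positives(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        if (a[i] > 0) s += a[i];
    return s;
}

/* The DRY'd version: one traversal, the varying test passed in. */
static int sum_if(const int *a, int n, int (*pred)(int))
{
    int s = 0;
    for (int i = 0; i < n; i++)
        if (pred(a[i])) s += a[i];
    return s;
}

static int is_even(int x)     { return x % 2 == 0; }
static int is_positive(int x) { return x > 0; }
```

Two copies are tolerable; the problem is that the models happily produce the tenth copy, each subtly different, instead of reaching for `sum_if`.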
You don't have to take my word for it. Try asking Google "do llm's produce verbose code".
`free(NULL)` is harmless in C89 onwards. As I said, programmers freeing NULL caused so many issues they changed the API. It doesn't help that `malloc(0)` returns NULL on some platforms.
If you are writing code for an embedded platform with some random C compiler, all bets on what `free(NULL)` does are off. That means a cautious C programmer who doesn't know who will be using their code never allows NULL to be passed to `free()`.
In general, most good C programmers are good because they suffer a sort of PTSD from the injuries the language has inflicted on them in the past. If they aren't avoiding passing NULL to `free()`, they haven't suffered long enough to be good.
> That means a cautious C programmer who doesn't know who will be using their code never allows NULL to be passed to `free()`.
If your compiler chokes on `free(NULL)` you have bigger problems that no LLM (or human) can solve for you: you are using a compiler that was last maintained in the 80s!
If your C compiler doesn't adhere to the very first C standard published, the problem is not the quality of the code that is written.
> If they aren't avoiding passing NULL to `free()`, they haven't suffered long enough to be good.
I dunno; I've "suffered" since the mid-90s, and I will free NULL, because it is legal in the standard, and because I have not come across a compiler that does the wrong thing on `free(NULL)`.
So what would be the best practice in a situation like that? I would (naively?) imagine that a null pointer would mostly result from a malloc() or some other parts of the program failing, in which case would you not expect to see errors elsewhere?
> imagine that a null pointer would mostly result from a malloc() or some other parts of the program failing, in which case would you not expect to see errors elsewhere?
Oh yes, you probably will see errors elsewhere. If you are lucky it will happen immediately. But often enough millions of executed instructions later, in some unrelated routine that had its memory smashed. It's not "fun" figuring out what happened. It could be nothing - bit flips are a thing, and once you get the error rate low enough the frequency of bit flips and bugs starts to converge. You could waste days of your time chasing an alpha particle.
I saw the author of curl post some of this code here a while back. I immediately recognised the symptoms. Things like:
if (NULL == foo) { ... }
Every 2nd line was code like that. If you are wondering, he wrote `(NULL == foo)` in case he dropped an `=`, so it became `(NULL = foo)`. The second version is a syntax error, whereas `(foo = NULL)` is a runtime disaster. Most of it was unjustified, but he could not help himself. After years of dealing with C, he wrote code defensively - even if it wasn't needed. C is so fast and the compilers so good the coding style imposes little overhead.
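A small self-contained sketch of why the constant goes first (the function name is invented for the example):

```c
#include <stddef.h>

/* The defect the "Yoda condition" guards against:
 *
 *   if (foo = NULL)   -- compiles; assigns NULL to foo, branch never taken
 *   if (NULL = foo)   -- syntax error; the dropped '=' is caught at compile time
 */
int is_missing(const char *foo)
{
    if (NULL == foo) {   /* typo-proof form */
        return 1;
    }
    return 0;
}
```

Modern compilers warn about assignment-in-condition anyway (GCC/Clang with `-Wparentheses`, which `-Wall` enables), which is part of why the habit reads as a scar from an earlier era.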
Rust is popular because it gives you a similar result to C, but you don't need to have been beaten by 10 years of pain in order to produce safe Rust code. Sadly, it has other issues. Despite them, it's still the best C we have right now.
C is fundamentally a bad target for LLMs. Humans get C wrong all the time, so we cannot expect a nascent LLM, trained mostly on code that does automatic memory management, to excel here.
I always found myself writing verbose copypasta code first, then compressing it down based on the emerging commonalities. I think doing it the other way around is likely to lead to a worse design. Can you not tell the LLM to do the same? Honest question.
> I always found myself writing verbose copypasta code first, then compress it down based on the emerging commonalities. I think doing it the other way around is likely to lead to a worse design.
I do pretty much the same thing, which is to say I "write code using a brain dump", "look for commonalities that tickle the neurons", then "refactor". Lather, rinse, and repeat until I'm happy.
> Can you not tell the LLM to do the same?
You can tell them until you're blue in the face. They ignore you.
I'm sure this is a temporary phase. Once they solve the problem, coding will suffer the same fate as blacksmiths making nails. [0] To solve it they need to satisfy two conflicting goals - DRY the code out, while keeping interconnections between modules to a minimum. That isn't easy. In fact it's so hard people who do it well and can do it across scales are called senior software engineers. Once models master that trick, they won't be needed any more.
By "they" I mean "me".
[0] Blacksmiths could produce 1,000 or so a day, but it must have been a mind-numbing day even if it paid the bills. Then automation came along, and produced them at over a nail per second.
a) The agent doesn't need to read the implementation of anything - you can stuff the entire project's headers into the context, and the LLM can have a better bird's-eye view of what is there and what is not, and what goes where, etc.
and
b) Enforcing "Parse, don't Validate" using opaque types - the LLM writing a function that uses a user-defined composite datatype has no knowledge of the implementation, because it has read only the headers.
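A minimal sketch of (b) in a single file, with a hypothetical `Email` type; the two commented sections show what would normally be split between the header and the `.c` file:

```c
#include <stdlib.h>
#include <string.h>

/* --- header section: all a caller (or an LLM fed only headers) sees --- */
typedef struct Email Email;            /* opaque: layout hidden        */
Email *email_parse(const char *raw);   /* NULL means "did not parse"   */
const char *email_str(const Email *e);
void email_free(Email *e);

/* --- implementation section: invisible behind the opaque type --- */
struct Email {
    char text[64];
};

Email *email_parse(const char *raw)
{
    /* "Parse, don't validate": an Email only exists if parsing
       succeeded, so downstream code never re-checks the string. */
    if (raw == NULL || strchr(raw, '@') == NULL)
        return NULL;
    Email *e = malloc(sizeof *e);
    if (e == NULL || strlen(raw) >= sizeof e->text) {
        free(e);   /* free(NULL) is fine in C89+, as discussed above */
        return NULL;
    }
    strcpy(e->text, raw);
    return e;
}

const char *email_str(const Email *e) { return e->text; }
void email_free(Email *e)             { free(e); }
```

Because `struct Email` is an incomplete type in the header, neither a human caller nor an LLM can poke at `e->text` directly; the only way to obtain one is through the parser.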
Write code? No. Use frontier models. They are subsidized and amazing, and they get noticeably better every few months.
Literally anything else? Smaller models are fine. Classifiers, sentiment analysis, editing blog posts, tool calling, whatever. They can go through documents and extract information, summarize, etc. When making a voice chat system a while back I used a cheap open-weight model and just asked it "is the user done speaking yet?" by passing in transcripts of what had been spoken so far - and this was 2 years ago, with a crappy, cheap open-weight model. Be creative.
I wouldn't trust them to do math, but you can tool call out to a calculator for that.
They are perfectly fine at holding conversations. Their weights aren't large enough to have every book ever written contained in them, or the details of every movie ever made, but unless you need that depth and breadth of knowledge, you'll be fine.
I just mean: is the claim that the open-source models are where the closed models were 6 to 12 months ago true? They do seem to be for some specific tasks, which is cool, but they seem even more uneven in skills than the frontier models. They're definitely useful tools, but I'm not sure they're a match for frontier models from a year ago?
Frontier models from a year ago had issues with consistent tool calling, instruction following was pretty good but could still go off the rails from time to time.
Open weight models have those same issues. They are otherwise fine.
You can hook them up to a vector DB and build a RAG system. They can answer simple questions and converse back and forth. They have thinking modes that solve more complex problems.
They aren't going to discover new math theorems but they'll control a smart home and manage your calendar.
I don't think this person belongs in prison, but the internet also isn't the place it was in 2004? You do bear responsibility for what you do online, and this was irresponsible. We should encourage kids and others to experiment and make mistakes, but the kid shouldn't have put up this website, and as a responsible member of the community should have taken it down.
I've tried to use langchain. It seemed to force code into its way of doing things, and was deeply opinionated about things that didn't matter, like prompt templating. Maybe it's improved since then, but I've sort of used "people who think langchain is good" as a proxy for people who haven't used much AI?
I think not many people are arguing that we shouldn't exclude people based on testosterone in elite events. But none of these were trans women; these were all women who had lived their entire lives as women from the moment they were born.
I'd argue about testosterone. High testosterone occurs in some women naturally, so why exclude them? They are still women; they should have a right to participate.
Height is also an advantage in sports, and women are statistically much shorter than men. Should we ban tall women from sports? Should we say "she exhibits a male amount of height, it isn't fair to let her compete with 'normal' women"?
The more "fair" we make women's competition, the narrower our definition of a woman gets.
If you want to make it fair, let's pick a random chemical in men and exclude people from competition based on their readings. That would surely make a sporting career look more fun for everyone: training all your life only to find out that some committee doesn't consider you a man. And then we can celebrate equality by noticing that the men-to-women sport participation ratio got closer to 50-50.
My view is that testosterone is a reasonable thing to discriminate on because:
1. It is causally connected to primary and secondary sex characteristics
2. It has a large impact on performance in many sports
3. It's easy to explain to most people and somewhat matches people's intuitions around fairness
But, yes, it is true that there are cis women with high T levels and it is somewhat unfair and arbitrary to include them when not excluding other random advantages that people have. I'm just not sure if I have a better solution
It's dumb because there are two types of hyper/hypo-gonadism. "Primary" hypergonadism is where you have far more of the hormone in your bloodstream. You're advocating testing only for "primary hypergonadism" in women.
Secondary hypergonadism is where someone has a normal concentration of the hormone in their blood, but they have an unusual abundance of hormone receptors.
The effects are the same, but currently we can only measure secondary hypergonadism during an autopsy/dissection.
It's interesting how the evidence-based analysis switched as soon as the Republicans came into power. Maybe this is less about evidence and more about opinion, actually?
When I've researched this, it turned out that among elite athletes it tended to be a bit higher, since some of these intersex conditions can confer benefits.
> There is a category called woman, it’s defined by something that’s identify related.
But that’s not how it’s defined. People have been using that word in every language humans ever invented for thousands of years to mean biological female. If you want to argue that there is something else that isn’t biological sex and you want to invent a word for it, go nuts, but “woman” is already defined. Words can and do change definitions over time, of course. If it’s your contention that the definition by consensus has already changed, say so, but there are billions of people on this earth who haven’t got the message, which seems odd for something determined by consensus of the people who use language.
Putting that aside, since sports are about physicality and accomplishing things in the real world, it makes no sense to base them on “identity” - something that cannot be detected or defined by anyone but the self identifier - rather they should be based on physical aspects of reality.
I’m not defending this definition, but I will point out that gender has never been about the chromosomes you were born with. It has been about how people around you perceived you and people often have overly simplistic ideas about exactly what that meant.
Plus it's totally normal for words to carry more technical detail than they first appear to. The idea of a clean sex binary doesn't fully hold up, so we'd need something to deal with that anyway.
I personally support segregation based on hormones as the fairest option available. Otherwise, if you use a purely genetic test, there are plenty of women with high T levels and no SRY gene, and no one disputes that high T levels confer a biological advantage in many sports.
Going even further back, "gender" originally denoted a linguistic construct associated with sex but not strictly dependent on it, as seen in Romance languages like Spanish, Portuguese, etc. [1] There, words have their own gender, and sometimes the gender of the word and the sex/social gender of the subject may disagree. E.g., "ant" in Spanish is "hormiga", but this noun is exclusively feminine, with no masculine form.
> It has been about how people around you perceived you and people often have overly simplistic ideas about exactly what that meant.
I don't know of any culture that defined gender by how you dress and how long your hair is, rather than by what is between your legs. You would be called a girly boy or a boyish girl.

So girly and boyish is how you are perceived; girl and boy is your sex. That is how almost every culture has defined it through all time.
>except that to remove perverse incentives it’s reasonable to require hrt
"I took a drug, therefore I am now a woman" is not a reasonable position to hold. The debate starts out with one based on an identity, and then in the very next formulation reduces that identity to which medicines you take.
No, but that’s not what the statement is saying. It’s arguing that we should add the minimum restrictions we can to the women’s sports category and that hormones might be a reasonable one
This started out with a claim that “trans women are women full stop”, which implies that there’s no difference in the categories, and has since retreated to “in order for trans women to compete as women, they have to take these medicines”.
This implies that males who identify as women but do not undergo HRT are not women in the context of sports (and their gender in other contexts remains ill defined, especially in the absence of perverse incentive). This is a form of misgendering, which is what we were trying to avoid in the first place.
This is a position that one could take up, but it comes at a steep cost. It holds the societal acceptance of transgenderism hostage to a biological account of sex-gender. This is problematic for several reasons. Moreover, it is worth highlighting the problems with suggesting that sex, as biologically based, determines the gender with which one psychologically identifies [...] Second, whatever criterion is offered to ground this similarity would inevitably disqualify many women, for not all women share the same hormone levels, reproductive capacity, gonadal structure, genital makeup, and so on. (Tuvel 2017)
Again, I don't take it to be saying that. It's saying that pushing trans women into emotional distress in order to succeed at sport is problematic, so we should require hrt, so that elite sport doesn't require trans women to skip hrt.
Such a common pattern, I'm tired of seeing it. "That's not what it's saying, those words actually mean..." again and again, ad infinitum. A perverse form of moving the goalposts. Your reply has no relation whatsoever to what was previously stated, it is a new argument entirely.
> It’s saying that encouraging women to be forced to be in emotional distress to succeed at sport is problematic
This was never said by anyone until you came along with that comment, which is a totally different idea (effectively a non sequitur). Can you quote who echoed the same argument?
I said "Sports should only be segregated by this <gender identity> category, except that to remove perverse incentives it’s reasonable to require hrt"
That was trying to elaborate on citruscomputing's argument where they said "Otherwise you have trans women having to choose between being more competitive and receiving necessary medical care."
I'm rephrasing those two points. Apologies if I initially described that badly, but I'm just restating the perverse incentive they were talking about.
Because then trans men would dominate the "women's" category. What's frustrating about this entire subject is that many of these things have been tried. After finding that too many cis athletes were being disqualified, they switched to the current rules, which in most cases split things based on testosterone levels. You can choose to do it some other way, but every option comes with problems that people won't like.
You're making scientific claims, but the only evidence I'm aware of contradicts them. The usual approach with puberty blockers is to prescribe them around the onset of natural puberty and, one way or another, stop them around the age of 16. While there are sadly some cases of people who started hormone therapies and later regretted it, I'm aware of no cases of long-term health impacts attributed to delaying puberty until 16. If you do know of some reports, please let me know.
I asked Claude to see if it could find anything and the only reports it could find was some long term bone density issues, but only in trans women and it seemed potentially related to estrogen dosing
> You're making scientific claims, but with the only evidence that I'm aware of contradicting the claim.
> I asked Claude...
There are no double-blind studies, RCTs, or otherwise on this topic, because it's not a situation that lends itself to that type of study. Please don't ask an AI to summarize the situation, because its training set is guaranteed to contain far more discussion from Reddit and news articles than from the limited scientific research.
Of the papers out there, many are either case reports or they're studies that look into the case where people go from puberty blocker therapy into gender-affirming care, not the cases where they change their mind and discontinue with hope of returning to their baseline state.
Above I was addressing the implication that puberty blockers are a safe way to press pause on puberty until much later without consequence. That's simply not true.
Those studies you found about bone density also note that they can reduce height, and along with it other growth changes that occur during those ages in conjunction with puberty. Someone who takes puberty blockers until 16-18 will have a different physical anatomy than someone who does not. You cannot resume growth in adulthood after discontinuing the medications.
So the studies you found are consistent with what I'm saying: You cannot delay puberty without also impacting the growth that happens during that phase. That's one of the main reasons why people take the puberty blockers! As someone gets older, the window for that growth does not stay open forever.
I'm not asking for a double-blind study. I'm asking for examples of someone who took puberty blockers, regretted it and stopped, and then went on to not be able to live the life they wanted to live. I'm not aware of any such stories, and I'm pretty familiar with the population of people who regret taking hormones. When I double-checked with Claude, it also failed to find anything except the issue around bone density I mentioned.
There are plenty of studies that point to strong evidence that this protocol results in better mental health outcomes because for whatever potential consequence there is for delaying natural puberty, there are plenty of known irreversible impacts of allowing it to progress.
If you have other evidence, even just observational studies it would be good to share that.
And again the recommendation is to continue until 15 or 16, not until 18