A) Release the data, and if it ends up causing a privacy scandal, at least you can actually call it open this time.
B) Neuter the dataset, and the model
All I ever see in these threads is a lot of whining and no viable alternative solutions. (I'm fine with the idea of it being a hard problem, but when I see this attitude from "researchers" it makes me less optimistic about the future.)
> and the credulity that otherwise smart people have been exhibiting in this space has been baffling to me
Remove the “otherwise” and you’re halfway to understanding your error.
This isn't a dilemma at all. If Facebook can't release the data it trains on because doing so would compromise user privacy, that is already a significant privacy violation that should be a scandal. And if releasing the data would prompt regulatory or legislative remedies against Facebook, then releasing the trained model, even through an API, should do the same. The only reason people don't think about it this way is that public awareness of how these technologies work isn't pervasive enough for the general public to think it through, and it's hard to prove definitively. Basically, if this is Facebook's position, it's saying that the release of the model already constitutes a violation of user privacy, but it's betting no one will catch it.
If the company wants to help research, it should full-throatedly endorse the position that training on this data is not a violation of privacy, and release the data so that it can be useful for research. If the company thinks it's safeguarding user privacy, it shouldn't be training models on data it considers private and then using them in public-facing ways at all.
As it stands, Facebook seems to take the position that it wants to help the development of software built on models like Llama, but not, to the same degree, the fundamental research that goes into building those models.
> If Facebook can't release data it trains on because it would compromise user privacy, it is already a significant privacy violation that should be a scandal
Thousands of entities would scramble to sue Facebook over any released dataset no matter what the privacy implications of the dataset are.
It's just not worth it in any world. I believe you are not thinking of this problem from the view of the PMs or VPs who would actually have to approve this: if I were a VP and were 99% confident that the dataset had no privacy implications, I still wouldn't release it. It's just not worth the inevitable long, drawn-out lawsuits from people and regulators trying to get their pound of flesh.
I feel the world is too hostile to big tech and AI to enable something like this. So, unless we want to kill AGI development in the cradle, this is what we get - and we can thank modern populist techno-pessimism for cultivating this environment.
Translation: "we train our models on private user data and copyrighted material, so of course we cannot disclose any of our datasets or we'll be sued into oblivion"
There's no AGI development in the cradle. And the world isn't "hostile". The world is increasingly tired of predatory behavior by supranational corporations.
Lmao what? If the world were sane and hostile to big tech, we would've nuked them all years ago for all the bullshit they pulled and continue to pull. Big tech has politicians in its pockets, but thankfully the "populist techno-pessimists" (read: normal people who are sick of billionaires exploiting the entire planet) are finally starting to turn opinion, albeit slowly.
If we lived in a sane world Cambridge Analytica would've been the death knell of Facebook and all of the people involved with it. But we instead live in a world where psychopathic pieces of shit like Zucc get away with it, because they can just buy off any politician who knocks on their doors.
> normal people who are sick of billionaires exploiting the entire planet
They don't understand what big tech does for humanity or how much they rely on it day to day. Literally all of their modern conveniences are enabled by big tech.
CrowdStrike merely showed how much people depend on big tech without even realizing it.
I think you have too much faith in the average person. They scarcely understand how nearly everything in their life has been manufactured or designed on something powered by big tech.
This post demonstrates a willful ignorance of the factors driving so-called "populist techno-pessimism", and I'm sure that every time a member of the public is exposed to someone talking like this, their "techno-pessimism" is galvanized.
The ire people have toward tech companies right now is, like most ire, perhaps overreaching in places. But it is mostly justified by the real actions of tech companies, and Facebook has done more to deserve it than most. The thought process you just described sounds like an accurate prediction of the mindset and culture of a VP within Facebook, and I'd like you to reflect on it for a sec. You rightly point out that the org releasing what data it has would likely invite lawsuits, and then you proceed through some kind of insane offscreen mental gymnastics by which this reality means nothing to you except that the unwashed masses irrationally hate the company for some unknowable reason.
You're talking about a company that has spent the last decade buying competitors to maintain an insane amount of control over billions of users' access to their friends; feeding them an increasingly degraded and invasive channel of information that also, from time to time, runs nonconsensual social experiments on them; and following even people who never opted in around the internet through shady analytics plugins, in order to sell dossiers of information on them to whoever will pay. What do you think it is? Are people just jealous of their success, or might they have some legitimate grievances that cause them to distrust and maybe even loathe such an entity? It's hard for me to believe Facebook has a dataset large enough to train a current-gen LLM that wouldn't also feel, viscerally, to many, like a privacy violation. Whether any party that felt this way could actually win a lawsuit is questionable, though, as the US doesn't really have significant privacy laws, and that is partially due to extensive collaboration with, and lobbying by, Facebook and other tech companies that do mass surveillance of this kind.
I remember a movie called Das Leben der Anderen (2006) (officially translated as "The Lives of Others"), which got accolades for how it made people who hadn't experienced it feel how unsettling the surveillance state of East Germany was. Now your average American is more comprehensively surveilled than the Stasi could have imagined, in large part because of companies like Facebook.
Frankly, I'm not an AGI doomer. But if the capabilities of near-future AI systems are even in the vague ballpark of the (fairly unfounded) claims the American tech monopolies make about them, it would be an unprecedented disaster on a global scale if those companies got there first. So inasmuch as we view "AGI research" as something that will inevitably hit milestones in corporate labs with secretive datasets, I think we should absolutely kill it to whatever degree is possible. And that's coming from someone who truly, deeply believes that AI research has been beneficial to humanity and could continue to become more so.
> Release the data, and if it ends up causing a privacy scandal...
We can't prove that a model like Llama will never reproduce a segment of its training data set verbatim.
Any potential privacy scandal is already in motion.
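To see why "never reproduces training data verbatim" is essentially unprovable, consider what checking for it even looks like. A minimal sketch (hypothetical, not Meta's or anyone's actual audit method; the function names and the toy corpus are invented for illustration): flag a model output as a verbatim leak if any sufficiently long n-gram from it appears in the training corpus. Real extraction studies are far more sophisticated, but even this toy version makes the asymmetry obvious: a hit proves memorization, while the absence of hits across finitely many prompts proves nothing about all the prompts you didn't try.

```python
# Toy verbatim-leak check (illustrative only): an output "leaks" if it
# shares any n-token span with a training document. Names and data are
# hypothetical, not any real company's audit procedure.

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def leaks_verbatim(output_text, corpus_texts, n=8):
    """True if any n-token span of output_text appears verbatim in any
    training document. A False result says nothing about other prompts,
    which is exactly the 'can't prove a negative' problem."""
    out = ngrams(output_text.split(), n)
    return any(out & ngrams(doc.split(), n) for doc in corpus_texts)

# Invented toy "training corpus" containing private-looking text.
corpus = ["alice emailed her social security number 123 45 6789 to bob"]

memorized = "her social security number 123 45 6789 to bob was leaked"
print(leaks_verbatim(memorized, corpus))                        # memorized span found
print(leaks_verbatim("a sentence sharing no long span with the corpus at all", corpus))
```

One prompt that elicits a memorized span is enough to make the scandal real; no number of prompts that don't is enough to rule it out.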
My cynical assumption is that Meta knows competitors like OpenAI have PR bombs baked into their trained models and would therefore never open-source the weights.