You assume that Reddit forms the core of OpenAI's data mining. While it may be large, I suspect that OpenAI has read a lot more of the internet than that.
What would be great is an LLM trained on that pirate library we all use, Z Lib I think, with all the books of the world, not just forum opinions.
To me, the data cat is out of the bag, and no single corp will ever put it back again.
> You assume that Reddit forms the core of OpenAI's data mining
Not at all; it's the only source mentioned simply because no others bear any relevance to the story.
Frustrating other parties' ability to access as much source material as possible makes sense if you forget about common moral values for a second and reduce everything to a zero-sum game: their loss equals your gain.
For additional evil comedic value, regarding your mention of ZLib: I just recalled it was taken down (for a while anyway) by US authorities not too long ago, and it would be extremely sadfunny if that takedown ever turns out to have coincided (taking into account government/bureaucratic slowness) with OpenAI having finished downloading or processing all of the library's content.
They'd have to hack it, or pay/donate a lot to get all those books, though. Z Lib only allows you to download 10 books a day as a free user. The problem now is that the only way to donate seems to be crypto, or some Chinese gift cards. I'm not sure how much of this is because of US authorities directly vs. how many "high risk processors" were taken down by disconnecting Russia from SWIFT, but either way, it's not convenient to support them anymore. Not that people don't, of course.
> To me, the data cat is out of the bag, and no single corp will ever put it back again.
laughs evilly for sweet ideas nonetheless