Technology @lemmy.world return2ozma @lemmy.world 6mo ago

OpenAI strikes Reddit deal to train its AI on your posts

www.theverge.com Reddit’s deal with OpenAI will plug its posts into “ChatGPT and new products”

Reddit’s signed AI licensing deals with Google and OpenAI.

125 comments

So they filled Reddit with bot-generated content, and now they're selling the same stuff back, likely to the company that generated most of it.

At what point can we call an AI inbred?
- This is actually a thing. It's called "Model Collapse". You can read about it here.
  
  "Model collapse" can be easily avoided by keeping old human data with new synthetic data in the training set. The old archives of Reddit content from before there was AI are still around.
  
  I prefer "Habsburg AI".
- I wonder if OpenAI or any of the other firms have thought to put in any kind of stipulations about monitoring and moderating Reddit content to reduce AI-generated posts and reduce the risk of model collapse.
  
  Anybody who's looked at Reddit, especially in the past 2 years, has seen the impact of AI pretty clearly. If I were running OpenAI I wouldn't want that crap contaminating my models.
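For anyone curious what "keeping old human data with new synthetic data" could look like in practice, here is a minimal sketch. It assumes you already have two lists of documents, one known to be human-written (say, a pre-AI archive) and one synthetic. The function name `build_training_mix` and the `synthetic_fraction` parameter are made up for illustration, and nothing here reflects how OpenAI or anyone else actually assembles training sets.

```python
import random


def build_training_mix(human_docs, synthetic_docs, synthetic_fraction=0.3, seed=0):
    """Keep every human document and cap the synthetic share of the final mix.

    synthetic_fraction is the maximum share of the returned list that may be
    synthetic (hypothetical knob, chosen for illustration only).
    """
    if not 0 <= synthetic_fraction < 1:
        raise ValueError("synthetic_fraction must be in [0, 1)")

    rng = random.Random(seed)

    # Solve s / (h + s) <= f for s, given h human documents.
    max_synthetic = int(len(human_docs) * synthetic_fraction / (1 - synthetic_fraction))

    # Sample without replacement; never ask for more than is available.
    chosen = rng.sample(list(synthetic_docs), min(max_synthetic, len(synthetic_docs)))

    mix = list(human_docs) + chosen
    rng.shuffle(mix)
    return mix
```

The point of the design is that the human data is never dropped; only the synthetic side is subsampled, so the human share cannot shrink below `1 - synthetic_fraction` no matter how much generated text exists.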
They always were.

Only now they've agreed to pay Reddit for it. This is what their third party lockdown was really all about.

They're helping themselves to your Lemmy comments for free, as that's just how it's designed. If you post anything publicly anywhere, it's getting slurped up by a bot somewhere.
- I'm not a lawyer. But isn't the reason they had to go to Reddit for permission that users hand over ownership to Reddit the moment they post? And since there's no such clause on Lemmy, they'd have to ask the actual authors of the comments for permission instead?
  
  Mind you, I understand there's no technical limitation that prevents bots from harvesting the data, I'm talking about the legality. After all, public does not equate public domain.
  
  "users hand over ownership to reddit the moment you post"
  
  Not ownership. Just permission to copy and distribute freely. Which basically is necessary to run a service like this, where user-submitted content is displayed.
  
  "And since there's no such clause on Lemmy, they'd have to ask the actual authors of the comments for permission instead?"
  
  It's more of a fuzzy area, but simply by posting on a federated service you're agreeing to let that service copy and display your comments, and sync with other servers/instances to copy and display your comments to their users. It's baked into the protocol, that your content will be copied automatically all over the internet.
  
  Does that imply a license to let software be run on that text? Does it matter what the software does with it, like displaying the content in a third-party mobile app? What about when it performs text-to-speech or braille conversion for accessibility? Or indexes the page for a search engine? Does AI training make any difference at that point?
  
  The fact is, these services have APIs, and the APIs allow for the efficient copying and ingest of the user-created information, with metadata about it, at scale. From a technical perspective obviously scraping is easy. But from a copyright perspective submitting your content into that technical reality is implicit permission to copy, maybe even for things like AI training.
  
  Well the legality seems to be something you can ignore when you have billions of dollars in VC money to fritter around.
  
  It certainly didn't stop them hoovering up music and movies, and the owners of those have a lot more power than any of us do.
  
  Tech is fast, the law is slow, and you can make many times the cost of lawyers and fines by the time anybody gets around to telling you to stop it.
  
  Well, even if there were a legal argument, they wouldn't care. Like Facebook and all the rest. They say they don't share your data, but we all know that's a lie.
- What if I say the word gasp fuck?
  
  Well, they've probably got filters that remove all that before it teaches their AI to swear. So you need to be more subtle, for 𝑓ucks sake.
  
  These fuckers see it as well. Fuckity fuckity fuck.
BRB - changing my entire 15 year reddit comment history to "Fuck Spez". LOL.
- Know any bots or ways to perma delete all Reddit comments?
  
  Reddit has backups; deleting anything permanently isn’t an option.
  
  I used redact.dev to mass edit all my comments, worked pretty well. Problem is that if you mass delete, they'll restore them pretty quick, but so far they haven't reverted my edits.
  
  https://github.com/j0be/PowerDeleteSuite
  
  Back when I deleted all my comments, I was told I could claim to be in Europe and make a request citing the European law that Reddit has to follow. I think Reddit had a page where you could make the request, but of course it was hard to find.
- Realistically, when you're operating at Reddit's scale, you're probably keeping a history of each comment for analytics purposes.
- That was really my thought - future iterations of ChatGPT won't like spez very much.
Some day historians will be able to look back at this moment and be able to determine it was what caused ChatGPT to become horny and weird.
- Only an idiot would decide to mindlessly trawl Reddit to train an LLM. They'll be confused when their model suddenly is confidently wrong about everything and have no clue.
  
  You are a hundred percent right, but how many idiots are there out there?
- My comment history was like 50% shitposting about the beauty industry and 50% hating on Christian fundamentalists. There's honestly no way it won't make AI at least a little bit worse, and I'm not mad about it.
  
  That AI is going to be super anti-Christian fundamentalist (or possibly just anti-Christian), so maybe there is an upside.
LLMs have been training on Reddit posts since at least 2012. Nothing really new here.
- Now they get to train on all the "deleted" comments/posts as well.
  
  Probably not, I'm sure they're training on Reddit's internal data set which likely includes all deleted posts.
- It's ground zero for bots training on other bots.
What makes you think that they are not scraping Lemmy too? The only reason they might not be is probably how niche Lemmy and the fediverse are, but I am sure there have been people already doing it.
- Fediverse is designed to do exactly that. It's free flow of information which is a good thing. Don't let corporations hijack this beautiful concept. We all want information to be free.
- I’m not mad about the scraping. The LinkedIn scraping case pretty much cemented that there was nothing that could be done to stop it. I’m just mad that I can no longer use the app of my choice. No such problem with Lemmy.
- Lemmy is even easier to scrape. Just set up your own instance, then read the database after ActivityPub pushes everything to you.
- I'm sure they are, but Reddit probably provides these companies with lots of personalized metadata they collect just for them which they may not get from Lemmy.
They now are paying Reddit? I thought they could just scrape for free.

Also, you cannot delete anything on the internet. Once something is public there will always be a copy somewhere.
- Scraping through a website at the scale they are talking about isn't really viable. You need access to the API so that you can have very targeted requests.
  
  This is why reddit changed their API pricing and screwed over everyone using third party apps. They can make more money selling access to LLM trainers than they could from having millions of people using apps that rely on the API.
  
  Scraping at scale is actually cheaper than buying API access. It's a massive, growing market: try googling "web scraping service" and you'll find hundreds of services that provide an API to scrape any public web page, bypass the blocks for you, and render all of the JavaScript.
- There's actually legal precedent against scraping a website through unofficial channels, even if the information is public. Basically, if you scrape a website and hinder its ability to operate, it falls under "virtual trespassing".
  
  I'm assuming it would be even worse now that everyone is using the cloud, since scraping their site would cause a noticeable increase in resource usage (and thus directly cost them more money in cloud usage fees).
  
  It's why APIs are such a big deal. They provide you with an official, controlled, entry point to a platform's data.
  
  It's the opposite! There's legal precedent that scraping public data is 100% legal in the US.
  
  There are a few countries where scraping is illegal, though, like Japan and China. European countries often also have "database protection" laws that forbid replicating public databases through scraping or any other means, but that has to involve a big chunk of the overall database. There are also personally identifiable information (PII) protection laws, like GDPR, that restrict storing people's data without their consent.
  
  Source: I work with anti-bot tech, and we have to explain this to almost every customer who wants to "sue the web scrapers": lol, if LinkedIn couldn't do it, you're not suing anyone.
- My guess is Reddit was cheap enough that it made sense to pay them as a sort of insurance that they don't get sued in the future.
Reddit banned me by IP address or something. Any new account I create gets banned within 24 hours, even if I don't upvote a single post or comment. I tried 10 new accounts, each with a new email address, and all of them were banned. So I gave up and randomly changed all my good comments. I've shifted permanently to Lemmy. I miss some of the most niche communities, but not enough to return to Reddit.

Edit: I didn't even commit any rule violation. I took too long to switch away from a modded Reddit app. I only logged in once. That doesn't warrant blocking me from ever using Reddit.
- If you use a VPN and a disposable email you can get about a week out of an account if you need to comment, though it'll get quietly shadowbanned.
Meh, good luck with that.

All my Reddit comments have just said “Comment redacted in protest against Reddit's deranged attacks against third party apps, the community, and common sense. See y'all on Lemmy or Kbin once this embarrassment of a site is done enshittifying itself out of existence. Monetize this, u/spez, you greedy little pigboy. 🖕” since I edited them before moving here. 🤷‍♂️
- You better double check. I just found out that only my comments with few upvotes are still that way, the others have been restored.
  
  A script replacing them with random words might do the trick.
  
  That's assuming the old comments are actually overwritten instead of just marked as 'old'
  
  I replaced all my comments with the same phrase before deleting them with PowerDeleteSuite. The comments were fully restored and visible through a Google search (but not visible through the user page). My posts were not restored, AFAIK.
  
  This was during the whole 3rd party API thing. Maybe it was just something done during that time, but they certainly got around the edit replacement trick before.
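To make the "overwritten vs. just marked as old" point concrete, here is a toy sketch of an append-only comment store, plus the random-word replacement script suggested above. Everything in it (`Comment`, `random_words`, the word list, the soft-delete flag) is hypothetical and only illustrates the idea. Nobody outside Reddit knows how their storage actually works.

```python
import random
from dataclasses import dataclass, field

# Tiny hypothetical vocabulary for the replacement text.
WORDS = ["apple", "river", "lantern", "gravel", "violet", "harbor", "mitten", "quartz"]


def random_words(n, rng):
    """Build a replacement body out of n random words."""
    return " ".join(rng.choice(WORDS) for _ in range(n))


@dataclass
class Comment:
    """Toy model: edits append a revision, deletes only flip a flag."""
    revisions: list = field(default_factory=list)  # oldest first, never truncated
    deleted: bool = False

    @property
    def visible_body(self):
        # What the site would show: the latest revision, unless soft-deleted.
        if self.deleted or not self.revisions:
            return None
        return self.revisions[-1]

    def edit(self, new_body):
        self.revisions.append(new_body)

    def delete(self):
        self.deleted = True

    def restore_original(self):
        # The operator can always bring back the first revision.
        self.deleted = False
        self.revisions.append(self.revisions[0])
```

If a site stores comments anything like this, editing to random words and then deleting changes only what is displayed. The first revision is still there, which would match the reports above of comments coming back.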
No wonder AI is crazy AF.
- All future AI will have autocorrect errors and will look like no one read it before hitting enter. You're welcome.
  
  No one says thank you, we already have that. WAIT JUST A GOT DAMN MINUTE!! YOU ARE ONE OF THEMS!!
This form of propaganda is my pet peeve. They're not "your posts": as soon as you put something out in public, you don't get to have your cake and eat it too. It's out there; you shared it. Don't share it if you don't want humanity to ingest and use it.
- You're technically right, but nobody anticipated and therefore agreed on their posts being used for training LLMs.
  
  Public information is public information.
- It's not about it being used to train AI. It's about the AI either not being open source/I don't get access to it (i.e. not benefitting me) or Reddit being paid for my comments (i.e. also not benefitting me).
  
  If this AI training would get me or the public access to the AI, or I would be paid for my comments instead of Reddit, I'd be fine with it.
  
  Yeah, but you don't get to choose that. You give away that right as soon as you participate in public discourse. It's a zero-sum game - either it's public for everyone or for no one.
  
  Don't get me wrong, Reddit is a bitch but I think people want to cut their noses off to spite their faces here. It's much more important to have free information flow than to fuck reddit.
  
  My fear is that people will vote in some really dumb rules to spite AI and restrict free information flow accidentally.
Isn’t this news like every month?
Finally found a use for MS Edge, loaded up Nuke Reddit History and removed all comments and posts: https://microsoftedge.microsoft.com/addons/detail/nuke-reddit-history/bklbcgohenjegdibgmppligaapohkgip
- Hate to break it to you, but the time to do that was over a year ago, and even then it wasn’t ever really a sure thing - we don’t really know what their backup policies are around that stuff.
  
  This is what the former power user community that made an exodus from Reddit roughly a year ago has been trying to communicate, but a ton of people here seem to enjoy keeping their toes in the water over there, with rather predictable consequences (literally, the post we’re commenting on).
  
  All that said: I am very much looking forward to the absolutely titanic lawsuit around GDPR I’m sure is in the works over this.
  
  Not even a year ago. Reddit has been used for training data for well over a decade. We used it in 2012 in an AI class.
- Worth doing, but I suspect they’re sending OpenAI snapshots of the database from before you did that.
- Wish I had known this beforehand in like several accounts I've had with that shit-ass place.
  
  Then again, it's likely that Reddit has shit archived because Spez is one of them data-farmers like Mark is. Nothing is truly deleted from their sites. It's just archived.
  
  There's been lots of evidence that proves this, because people have dug up old comments, even down to who posted it originally. Then, even if your account is deleted, your comment body is still there, I know because I've deleted an account and checked back where I was before.
Does this mean I can stop prefacing my AI requests with “According to Reddit…”?
I didn't delete my comments before nuking my account, but I'm pretty sure the grand majority were shitposts containing ample amounts of smut, gore and other ridiculous over the top shit. So I consider this a win.
Everyone wants a piece of the AI pie....
- Little did they know, the pie was just a hallucination.
Not my posts. Go ahead, look at what remains. The rest was edited and then deleted.

Fuck you, Steve. Right in the ass.
- If only snapshots and backups were a thing...
  
  It's theoretically possible, but the issue that anyone trying to do that would run into is consistency.
  
  How do you restore the snapshots of a database to recover deleted comments but also preserve other comments newer than the snapshot date?
  
  The answer is that it's nearly impossible. Not impossible, but not worth the massive monumental effort when you can just focus on existing comments which greatly outweigh any deleted ones.
  
  Yeah, that’s the problem, isn’t it? I had a great idea: bullshit-ifying my comments by slowly and repeatedly editing them with an LLM, via a long-running script, over months.
  
  I realised that they probably don’t delete the original text on edit anyway, which, as you say, is probably buried in a backup someplace.
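A rough sketch of the merge being argued about here, under a big assumption: the goal is an offline export rather than restoring a live database, and both the snapshot and the current data can be read as simple `comment_id -> body` mappings. The function name `merge_snapshot_with_live` is made up, and real systems with foreign keys, votes and threading are far messier, which is the consistency problem described above.

```python
def merge_snapshot_with_live(snapshot, live):
    """Combine an old snapshot with current data, keyed by comment id.

    snapshot, live: dicts mapping comment_id -> body (hypothetical shape).
    Returns (merged, recovered_ids).
    """
    # Start from the live data so anything newer than the snapshot is kept.
    merged = dict(live)

    recovered = []
    for comment_id, body in snapshot.items():
        # Present in the snapshot but gone from live: treat as deleted since.
        if comment_id not in merged:
            merged[comment_id] = body
            recovered.append(comment_id)

    return merged, sorted(recovered)
```

Note that this keeps the live version of any comment that was edited after the snapshot. Recovering pre-edit text would mean preferring the snapshot body for those ids instead, which is a policy choice rather than a technical obstacle.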
Those poor silicon atoms...
But Reddit is full of NSFW content.
- And the problem is?
- Not through the API.
Then they will learn that Spez doesn’t get to profit from me anymore.
- You think they don't have the originals archived?
"Strikes" made me think they were cancelling the deal. Like strike-through, crossed it out, etc. Too bad.
This is the best summary I could come up with:

OpenAI has signed a deal for access to real-time content from Reddit’s data API, which means it can surface discussions from the site within ChatGPT and other new products.

It’s an agreement similar to the one Reddit signed with Google earlier this year that was reportedly worth $60 million.

The deal will also “enable Reddit to bring new AI-powered features to Redditors and mods” and use OpenAI’s large language models to build applications.

Recently, following news of a partnership between OpenAI and the programming messaging board Stack Overflow, people were suspended after trying to delete their posts.

No financial terms were revealed in the blog post announcing the arrangement, and neither company mentioned training data, either.

That last detail is different from the deal with Google, where Reddit explicitly stated it would give Google “more efficient ways to train models.” There is, however, a disclosure mentioning that OpenAI CEO Sam Altman is also a shareholder in Reddit but that “This partnership was led by OpenAI’s COO and approved by its independent Board of Directors.”

The original article contains 334 words, the summary contains 174 words. Saved 48%. I'm a bot and I'm open source!
Gonk
So it's going to be a libtarded libtard AI that doesn't represent the majority of the people, got it.
- The beauty of being here on Lemmy is that I genuinely can't tell whether you said this because you're far right or because you're far left.
  
  Stupid opinion either way. That AI is going to catch its share of r/conservative idiots and be a nice blend of ignorance.