Reddit sues Perplexity and other AI engines

Interested in the board’s take on this story. It sounds like Perplexity was using Reddit Answers or other scraping techniques to get data in a way Reddit says is illegal.

From my viewpoint, it’s not a great look that Reddit cannot control how its data gets used. This lawsuit may succeed against Perplexity or others, but there will likely be other scrapers out there, since potentially all of the data can be accessed through a registered user account.

I’ve had a somewhat negative take on Reddit’s data licensing business, since growth slowed dramatically after they introduced that product. I’m still holding a small position in the company, as it seems their advertising business can carry them for the foreseeable future.

23 Likes

RDDT’s operational structure is designed to promote biased postures on many, if not most, topics, because to become an influential member you must have a history of producing posts other members like. The operative word here is responses members “like”, not responses members “confirm are accurate”. I am no longer a RDDT member because of this fundamental flaw in their operational structure.

In considering AI alternatives, it seems that using RDDT as a training source inherits this fundamental flaw. At this juncture, isn’t it a positive if all AI applications abandon RDDT as a source?

Gray

8 Likes

For that matter, don’t all social media applications have this flaw?

4 Likes

Yes, virtually all social media apps operate this way, but Reddit in particular is being accessed for training data. The theory is that, thanks to the moderation of the subs (subject threads), Reddit’s data is considered to be of higher quality than that of other social media apps.

That theory is being challenged. The counterargument is that comments are promoted more on the basis of popularity than vetted for accuracy. It pretty much boils down to whether or not one can accept popularity on Reddit as a proxy for validity.

Personally, I think there’s some merit to the theory. So far as I am aware, though, there aren’t any scientific studies that verify the quality of Reddit commentary.

However, IMO Reddit’s data is by and large more reliably accurate than the data found on sites like Facebook, X, etc., where there are no constraints with respect to accuracy. Those sites are in fact havens of biased comments and conspiracy theories.

The moderation of Reddit’s subs does tend to keep the discussions more focused and fact-based. In fact, the very structure of the site, with commentary channeled into various subjects, in and of itself tends to keep participants’ contributions reasonably trustworthy.

As I said, this is a personal opinion; I’ve not done any analysis to verify its validity. It’s more akin to an anecdotal observation based on rather limited experience with Reddit.

5 Likes

Note: I used AI in preparing this response. Basically, I told Gemini that I wanted to talk about robots.txt and the Meta copyright case, and that I agree Reddit’s data-selling business has limited growth. Gemini then provided the complete answer, in particular pulling records and references from the Meta case. I briefly checked the references and they seem to work. However, Motley Fool does not allow me to put more than two links in a post now. Please let me know if you need the references.

You’ve hit on a key issue with your first point. Reddit has established a legal framework for data usage, but technically they can’t physically stop every crawler. This is the same for almost all websites, and it doesn’t indicate a lack of control on Reddit’s part.

Websites commonly use a file called robots.txt to communicate their crawling preferences to bots. This file can specify which parts of a site should not be crawled. While not legally binding in itself, it acts as a clear directive. Ignoring these directives can be a factor in legal cases, especially when combined with a website’s terms of service. So, it’s less of a “gentleman’s agreement” and more of a stated legal position. Reddit is essentially drawing a line in the sand, and this lawsuit is the enforcement of that line. For example, it’s common for a website to say that they allow Google to crawl it for search, but not allow LLM companies to crawl it for model training.
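As a small illustration, Python’s standard library can evaluate these directives. The bot names and paths below are made up for the example; a real site’s robots.txt (Reddit’s included) is of course more elaborate:

```python
# Sketch: how robots.txt directives are read by a compliant crawler.
# "ExampleLLMBot" is a hypothetical user agent, not a real crawler.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
# Instead of fetching over the network, feed the parser an example file
# that allows search crawling but disallows an LLM training bot.
robots.parse("""
User-agent: Googlebot
Allow: /

User-agent: ExampleLLMBot
Disallow: /
""".splitlines())

print(robots.can_fetch("Googlebot", "https://example.com/r/investing"))      # True
print(robots.can_fetch("ExampleLLMBot", "https://example.com/r/investing"))  # False
```

The catch, as noted above, is that this check is voluntary: a well-behaved crawler calls `can_fetch()` before requesting a URL, while a bad actor simply ignores the file, which is why the terms of service and lawsuits like this one are the actual enforcement mechanism.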

One similar case is the lawsuits against Meta (formerly Facebook). Several authors, including Sarah Silverman, sued Meta, alleging that their copyrighted books were used without permission to train Meta’s LLaMA AI models. The authors claimed these books were part of a dataset sourced from “shadow libraries” known for pirated content. Meta used those books without the appropriate copyright agreement.

Recent developments in these cases have seen some mixed results. In one instance, a judge ruled that Meta’s use of the books for training could be considered “fair use”. However, the judge also noted that the ruling was based on the specific arguments presented and that future claims with stronger evidence might succeed. Internal Meta communications have also surfaced, allegedly showing that the company was aware of the legal risks associated with using these datasets. These cases are ongoing and will likely have significant implications for the AI industry.

Finally, I share your skepticism about data licensing as a sustainable, long-term growth driver. It can certainly provide a significant short-term revenue boost, as we’ve seen with Reddit’s recent earnings. They are projecting substantial revenue from these deals in the coming years. However, a company’s historical data is a finite resource. Once it’s been sold and used for training, its value diminishes for subsequent sales to the same or similar clients. While new data is constantly being generated, the initial “gold rush” of selling a massive backlog of data is a one-time event. Future growth in this segment will depend on the continuous creation of unique and valuable data, which may not grow at the same explosive rate as their initial offerings. While some analysts see potential for durable growth, others have expressed concern that the market may be overestimating the long-term value of this revenue stream. Reddit’s core advertising business will likely remain its primary engine for the foreseeable future.

9 Likes

I think it’s hard to stop this kind of scraping. It’s unacceptable for a prominent company like Perplexity to blatantly do it, and Reddit is right to sue them. There’s a high chance they will prevail.

In the AI era, it’ll be extremely important to have human data to train on. A future where models train on model-generated data is pretty much useless. Reddit has built a huge moat, and I believe they are the go-to place for conversation these days. I like their mission to be the most human place on the Internet. A small example: for all the sports teams that I follow, most discussions are increasingly moving to Reddit (from the myriad team-specific sites).

I bought when their market cap was $25B and think they really have the opportunity to become a $250B company some day. That said, their advertising business will have to continue to grow to justify owning them. AI licensing deals will eventually come and will be icing on the cake.

11 Likes

I think the questions are actually a bit trickier:

• If a human reads Reddit and then starts a business using what they learned there, is that theft? I’m guessing not, so then the question becomes: what’s the difference between human intelligence and artificial intelligence?

• There’s an argument that if Reddit wants to protect its IP, it should not give it away for free. I think there’s a difference between reading unprotected, non-paywalled, non-subscriber content versus content one has to pay for. In other words, it’s Reddit’s commercial business model based on showing ads that’s broken, not any violation of content rights. They willingly gave it away as long as they could show ads.

• Sure, you can’t repurpose that specific content, but Reddit can’t claim ownership of the knowledge gained by reading it any more than colleges can claim ownership of what their students learned, or a previous employer owns the knowledge you gained while working there. And in that last regard, there is a bunch of case law as to what you can take with you from previous employment and what you can’t. The bottom line, I believe, is that knowledge isn’t protected, but specific designs, for instance, are.

• There already exists a “fair use” doctrine, which enables, for instance, YouTubers to show brief portions of otherwise copyrighted content. Some content owners exploit the discrepancy in financial resources to go after YouTubers, but some YouTubers are fighting back now.

• OTOH, there is protection for derivative works remaining with the copyright holder. This would affect things like image generation based on human-created images that are still under copyright, but may not protect “in the style of” image generation any more than humans doing the same.

For this board, I think the question is whether the settling of law and protection around this kind of thing in the coming years matters to some companies’ business models, both for the AI companies and for the content-owning companies. Which way the law settles could affect many financial outcomes.

10 Likes

Smorgasbord,

I don’t know if I agree with your extrapolation. No one is saying Reddit claims ownership of the knowledge people gain. I think of it more like a movie theater: Reddit shows the information to those that “pay” by watching ads. They can’t stop every idiot with a camcorder from recording the IP, but they can go after those that make money by using the camcorder recordings.

I understand the fair use doctrine, but to invoke it you nominally have to have lawful access to the material. If I broke into Apple’s computer systems because of poor engineering and released a YouTube video on the upcoming iPhone, I would lose the fair use case. Likewise, if I develop a website and use methods to prevent people from scraping the data, someone finding a way around my defenses doesn’t mean I lose my rights to protect my material.

Recent rulings have gone Anthropic’s way for the books they purchased, but regarding the books they pirated, the judge wrote the following:

“[t]his order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use.” (id., 18:25-27)

Perplexity was buying the data from scrapers at cheaper rates than what they could buy it for from Reddit. The scrapers were specifically circumventing the defenses Reddit had implemented to stop scraping. Training AI with data can definitely fall under fair use, but it’s hard to argue that stolen information is fair use. So I would be surprised if the law doesn’t side with Reddit.

Drew

7 Likes

With that extrapolation, all non-paywalled content on the Internet is free to be exploited by AI, even if the creator or owner of those sites specifically says don’t use it. It might work for a little while (assuming courts allow it), but our society will end up the big loser. The natural extrapolation is that most free content would disappear in a few years.

On human intelligence vs. AI, there is a big difference. It comes down to distribution. Using books as an example: why does a publisher charge $10 for a book? They are not expecting the human buyer to go make copies and distribute them to a billion others. I hope we can all agree that would be a clear case of copyright infringement. If that were allowed, the publisher would charge much more for their books, not $10.

The simpler answer is: Reddit controls access and has broad license rights to user content on its site. It’s only free to use as long as they allow it (it’s also well within their rights to allow you to use it but block me). Once they say don’t use it, companies have to follow, as Google and others did. Perplexity cannot steal that information using underhanded methods just because they feel like it.

3 Likes

Reddit as a platform doesn’t produce original analyses or opinions. It’s a host for user-generated content — discussions, posts, and comments written by individuals in subreddits.

RDDT is doing nothing novel or intellectual beyond its structure, which happens to entice involvement by millions of RDDT users, EXCEPT collecting and categorizing these personal member responses.

If you make a personal post on RDDT (whether correct or incorrect) that is supported by hordes of RDDT users, your personally authored posture may show up as an accurate response in AI, if the AI agent has approved the use of popular but potentially biased/inaccurate sources to author responses.

What is the basis of the RDDT suit?

3 Likes