Note: I used AI in preparing this response. I told Gemini that I wanted to discuss robots.txt and the Meta copyright case, and that I agree Reddit's data-selling business has limited growth. Gemini then provided the full answer, including details from the Meta case and the references. I briefly checked the references and they appear to work. However, Motley Fool currently does not allow me to include more than two links in a post, so please let me know if you need the references.
You’ve hit on a key issue with your first point. Reddit has established a legal framework for data usage, but technically it can’t physically stop every crawler. The same is true for almost all websites, so it doesn’t indicate a lack of control on Reddit’s part.
Websites commonly use a file called robots.txt to communicate their crawling preferences to bots. This file can specify which parts of a site should not be crawled, and by which crawlers. While not legally binding in itself, it acts as a clear directive, and ignoring it can be a factor in legal cases, especially when combined with a website’s terms of service. So it’s less of a “gentleman’s agreement” and more of a stated legal position. Reddit is essentially drawing a line in the sand, and this lawsuit is the enforcement of that line. For example, it’s common for a website to allow Google to crawl it for search while disallowing LLM companies from crawling it for model training.
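To make this concrete, here is a small sketch using Python’s standard-library robots.txt parser. The policy shown is a hypothetical example I wrote for illustration (GPTBot is a real AI-crawler user agent, but the rules and the URL are made up); it allows a search crawler everywhere while blocking the AI crawler entirely:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow Google's search crawler everywhere,
# disallow an AI-training crawler from the whole site.
robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks before fetching; the file itself
# enforces nothing, which is exactly the gap lawsuits try to close.
print(parser.can_fetch("Googlebot", "https://example.com/r/some-post"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/r/some-post"))     # False
```

The key point the code illustrates: `can_fetch` only tells a crawler what the site *asks*; compliance is voluntary, which is why the terms of service and litigation matter.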
One similar case is the set of lawsuits against Meta (formerly Facebook). Several authors, including Sarah Silverman, sued Meta, alleging that their copyrighted books were used without permission to train Meta’s LLaMA AI models. The authors claimed the books were part of a dataset sourced from “shadow libraries” known for pirated content, meaning Meta allegedly used them without any copyright agreement.
These cases have produced mixed results so far. In one instance, a judge ruled that Meta’s use of the books for training could qualify as “fair use”, while noting that the ruling rested on the specific arguments presented and that future claims with stronger evidence might succeed. Internal Meta communications have also surfaced, allegedly showing the company was aware of the legal risks of using these datasets. The cases are ongoing and will likely have significant implications for the AI industry.
Finally, I share your skepticism about data licensing as a sustainable, long-term growth driver. It can certainly provide a significant short-term revenue boost, as we’ve seen in Reddit’s recent earnings, and the company is projecting substantial revenue from these deals in the coming years.

However, a company’s historical data is a finite resource. Once it has been sold and used for training, its value diminishes for subsequent sales to the same or similar clients. While new data is constantly being generated, the initial “gold rush” of selling a massive backlog is a one-time event. Future growth in this segment will depend on the continuous creation of unique and valuable data, which may not grow at the same explosive rate as the initial offering. Some analysts see potential for durable growth, but others worry the market is overestimating the long-term value of this revenue stream. Reddit’s core advertising business will likely remain its primary engine for the foreseeable future.