Reddit has filed to go public in a closely watched initial public offering, and one of its potential revenue streams is data licensing. The big question is whether enterprises want that data in a large language model. Perhaps the bigger question is whether enterprises will have a choice.

The social media company has struck a deal with Google to provide it model training data, and Reddit indicated that it'll provide training data more broadly. Reddit has sentiment, conversational data and plenty of useful material to go along with the ridiculous. In some ways, Reddit data is likely to make LLMs a bit more serendipitous.

Reddit's data licensing plan highlights how valuable training data from media-ish companies can be. The plan also highlights the Hobson's choice facing these companies: Reddit gets revenue by providing training data, but it could also get squashed by an LLM in the future.

According to Reddit's S-1 filing with the SEC:

"Redditors may choose to find information using LLMs, which in some cases may have been trained using Reddit data, instead of visiting Reddit directly. We believe that our ability to compete effectively for users depends upon many factors both within and beyond our control."

Not that Reddit has much of a choice. Data licensing is a high-margin business, and Reddit lost $90.8 million in 2023 on revenue of $804.03 million. The company did pare its losses from 2022 and may be headed in the right direction if data licensing works out. Advertising is a rough business.

Reddit specifically names ChatGPT (OpenAI CEO Sam Altman is a Reddit shareholder, by the way), Gemini (Google paid Reddit for data) and Anthropic as competition. Reddit's list of competitors for your attention is also extensive: Google, Meta, Wikipedia, X, Snap, Roblox, Discord and any other company that can give you an answer. Marketplaces are also cited as rivals.

In other words, every Reddit competitor will also likely have a data licensing business. It's clear that users have become both the product and the data licensing model.

As these data licensing models proliferate, data lineage is going to become critical. Ultimately, I'd want to know what data an LLM was trained on and be able to back out some sources. That ability is especially critical for media sources. Media isn't objective, and you need a few outlets (and preferably source documents) to even figure out what the actual truth is. Following the media has become like navigating a divorce. There's one spouse's truth. There's the other spouse's truth. And then there's THE truth.
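To make "backing out" a source concrete, here's a minimal sketch of what per-example provenance could look like. This is purely illustrative, not any vendor's actual pipeline; the record fields (`source`, `license_id`, `retrieved_at`) and the `exclude_sources` helper are invented for the example.

```python
from dataclasses import dataclass

# Hypothetical provenance record attached to each training example,
# so sources can later be identified, audited or filtered out.
@dataclass(frozen=True)
class TrainingRecord:
    text: str
    source: str        # e.g. "reddit", "wire_service", "court_filing"
    license_id: str    # identifier of the licensing agreement it came under
    retrieved_at: str  # ISO-8601 date the content was collected

def exclude_sources(records, banned):
    """Return only the records whose source is not in the banned set."""
    return [r for r in records if r.source not in banned]

corpus = [
    TrainingRecord("Great phone, terrible battery.", "reddit", "lic-001", "2024-02-22"),
    TrainingRecord("The court ruled on Tuesday...", "court_filing", "lic-002", "2024-02-22"),
]

kept = exclude_sources(corpus, banned={"reddit"})
print([r.source for r in kept])  # ['court_filing']
```

Filtering records out before training is the easy part; the harder question the article raises is reversing what an already-trained model has learned, which lineage metadata alone doesn't solve.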

For companies like Shutterstock, which plans to provide training data for models on data marketplaces, the business looks more straightforward. Other providers can supply data, but in the end the LLM buyer may not actually want that data set. How do you reverse learning? How do you back out a data set? Do I want my customer service LLM dishing out a snarky comment it learned on Reddit, X or Meta properties? And is Reddit's training data really generated by just a small subset of the company's more than 500 million visitors and 73 million daily active users, the ones who actually comment?

We should be asking these questions now, because the data licensing train is just about to leave the station. In its IPO filing, Reddit said it has "one of the internet’s largest corpuses of authentic and constantly updated human-generated experience" and is in the early stages of data licensing. Reddit said its platform is good for real-time perspective on products, market sentiment and signals. It will provide data API access and model training based on its real-time content.

Ideally, I'd want to see a nutrition label on LLMs with a list of the data sources used for training. I'd also want to be able to check off sources. It's unlikely we'll get those options, so enterprises are likely to find small language models aimed at specific use cases and industries more valuable.
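A "nutrition label" could be as simple as a per-source breakdown of the training corpus. A minimal sketch under assumed inputs; the `nutrition_label` function and the record format are hypothetical, not an existing standard:

```python
from collections import Counter

# Hypothetical nutrition label: summarize a training corpus by source
# as percentages, so a buyer can see roughly what went into the model.
def nutrition_label(records):
    counts = Counter(r["source"] for r in records)
    total = sum(counts.values())
    return {src: round(100 * n / total, 1) for src, n in counts.most_common()}

corpus = [
    {"source": "reddit"}, {"source": "reddit"},
    {"source": "wikipedia"}, {"source": "news"},
]

print(nutrition_label(corpus))  # {'reddit': 50.0, 'wikipedia': 25.0, 'news': 25.0}
```

Even a coarse label like this would let an enterprise buyer ask the question the article poses: how much of this model's training diet came from sources I'd rather check off the list?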