In the fast-evolving landscape of AI, data has emerged as the new currency (alongside access to Nvidia H100 GPUs). Data serves as the fuel that drives AI.
AI systems solving complex problems require an immense amount of data to deliver high-quality services. This is especially true in use cases that don’t have a human in the loop (e.g. Level 5 autonomous driving), use cases delivering partial or full automation with a high degree of trust and accuracy in a consumer-facing scenario (e.g. tier 1 customer support chatbots), or systems automatically executing transactional API calls to other services.
Proprietary data is not a technical topic but a business one. It serves as a moat that helps companies differentiate and justify the (often significant) investments associated with building products based on AI models. By training AI models on proprietary data, companies can develop unique capabilities that others can’t (simply because others don’t have the data), deliver high-quality predictions (typically measured with performance metrics like recall: the percentage of data samples correctly identified as belonging to a class of interest out of the total samples for that class), or do a better job of fine-tuning a foundation model for a given set of use cases and verticals.
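For readers who want to see the recall metric mentioned above in concrete terms, here is a minimal Python sketch. The function name `recall`, the labels, and the sample data are all made up for illustration; in practice most teams would use a library implementation rather than writing their own.

```python
def recall(y_true, y_pred, positive_class):
    """Share of samples that truly belong to positive_class
    and were also predicted as positive_class."""
    true_positives = sum(
        1 for truth, pred in zip(y_true, y_pred)
        if truth == positive_class and pred == positive_class
    )
    actual_positives = sum(1 for truth in y_true if truth == positive_class)
    if actual_positives == 0:
        raise ValueError("no samples of the class of interest in y_true")
    return true_positives / actual_positives


# Hypothetical labels: three samples are truly "fraud", the model catches two.
y_true = ["fraud", "fraud", "ok", "fraud", "ok"]
y_pred = ["fraud", "ok", "ok", "fraud", "fraud"]

print(f"recall for 'fraud': {recall(y_true, y_pred, 'fraud'):.2f}")
```

Note that the false alarm on the last sample does not lower recall; it would show up in precision instead.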
Most people think about proprietary data simply as unique, exclusive information that a company collects or generates. Often that is indeed the case, but there are other types of “proprietary” advantages and data strategies that can deliver a significant moat. Here are a few more examples to consider:
- Leveraging customers’ data sources – Some companies excel at accessing their customers’ proprietary datasets and obtaining rights from those customers to leverage data derivatives for machine learning purposes. This helps both the vendor and the customer by delivering higher-quality services. One example is Cherre, which helps customers connect all their real estate data (first-party and third-party) and better understand data quality.
- Partnerships and data consortiums – Business development partnerships can help obtain and scale proprietary data sources. This method has been used extensively in online advertising, transactional data, and location datasets. Other companies deploy data consortiums in which every additional partner benefits from a network effect. Deduce is one example of a data consortium that helps derive more signals from a network of participants, benefiting all participants. Another great example is Placer, which has an exclusive data acquisition agreement with Life360, locking out a significant part of the market.
- Customer-led labeling – Many AI solutions sit at the human-machine interface. Collecting customer feedback through actual use of the system, in continuous and smart ways, can generate data to “debug” models and better understand undersampling, data distribution issues, and mislabeling. Designing the right user experience can lead to customers (including experts in those companies) doing quite a bit of the labeling heavy lifting, in turn resulting in higher-quality labeled data.
- Intelligent expert labeling – Having raw data is the first step, but labeling data for training purposes can range from a simple repetitive task to a Herculean one requiring specialists and experts. Some companies build tools that use experts’ time very efficiently, or tools that combine a limited amount of expert-labeled data with various deep learning and transfer learning methods to build models. Watchful.io is an example of a company that helps other companies with expert labeling techniques.
- Unique data mapping – Products built to serve specific verticals (e.g. law, cybersecurity) can benefit from mapping data inputs and model outputs to purpose-built data models (typically built and maintained by humans), or from leveraging knowledge graphs as a way to transform data and include relevant tokens in a prompt to an LLM. In specific verticals, this can help minimize model hallucination by adding context and producing model outputs that are more in line with customer expectations.
- Data collection through devices and hardware – Some companies deploy hardware devices to collect real-world data, or are given access to datasets derived from devices that others deploy. Any connected device can generate “real world” data that would be proprietary, including IoT devices, sensors, and smartphones.
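To make the “unique data mapping” idea from the list above a bit more tangible, here is a toy Python sketch of pulling facts from a small knowledge graph into an LLM prompt. Everything in it is an assumption for illustration: the triples, the entity names, and the helpers `facts_for` and `build_prompt` are hypothetical, and the simple substring matching stands in for whatever entity linking a real product would use.

```python
# Hypothetical knowledge graph as (subject, relation, object) triples.
TRIPLES = [
    ("Acme MSA", "governed_by", "Delaware law"),
    ("Acme MSA", "signed_by", "Acme Corp"),
    ("Globex NDA", "governed_by", "New York law"),
]


def facts_for(question, triples):
    """Return triples whose subject or object is mentioned in the question."""
    q = question.lower()
    return [t for t in triples if t[0].lower() in q or t[2].lower() in q]


def build_prompt(question, triples):
    """Prepend the matching facts to the question as plain-text context."""
    facts = facts_for(question, triples)
    context = "\n".join(
        f"- {subj} {rel.replace('_', ' ')} {obj}" for subj, rel, obj in facts
    )
    return (
        "Use only the facts below to answer.\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}"
    )


print(build_prompt("Which law governs the Acme MSA?", TRIPLES))
```

The resulting prompt carries only the facts about the contract the question mentions, which is the kind of added context that can keep a model’s answer grounded in a vertical’s own data.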
To summarize, possessing proprietary data serves as a business moat, offering protection against rivals and fostering long-term sustainability. Proprietary data and proprietary labeled datasets can come in various shapes and forms.
A key question to consider is whether a company has a hard-to-replicate approach to obtaining data at scale, or to labeling it, in a way that would make it harder for a new entrant (or even an incumbent that has existing data) to enter the market and deliver AI systems that perform as well. At Recursive Ventures we call this “AI Moat”, and it’s inherent to how we think about long-term value creation in the budding AI ecosystem.