Products and services powered by machine-learning models need training data, which is often obtained from customers. This creates a frustrating cycle for innovation: a good product needs a good model, which in turn needs lots of data from customers, who in turn need a good product. There are, of course, creative ways to bootstrap a product into the market: buying data elsewhere, using heuristics, starting with pre-trained models, etc. The data problem is, nevertheless, all too familiar to any machine-learning team and is one of the key deterrents to using machine-learning in a commercial setting. What if we could have “no-data ML”, where training data is obtained externally, in a scalable way, and machine-learning models are deployed to production from day one, without requiring any in-house labels?
In our previous post, we outlined our master plan for a data network. At scale, it will allow machine-learning models to query collections of datasets distributed across organizations, providing a way to grow the amount of data available to commercial machine-learning models over time and allowing anyone to deploy a model without bringing their own data. Below, we will discuss some of the components of this network and introduce our first product built on top of it.
Why a data network?
One of the key goals of machine-learning is to automate and often improve human decisions. Humans can compute a set of inferred features by combining reasoning with an observed state, common sense, and specialized knowledge. This information can then be transferred to a dataset in the form of labels, and used to train a model that interpolates that discrete label space. However, the process of obtaining such labels is slow, expensive, and unreliable, and it typically involves privacy risks.
Several techniques have been developed to combat these challenges. Active learning, unsupervised learning, transfer learning and data augmentation are all capable of increasing the amount of information that a model can extract from a given set of data. One could argue that transfer learning doesn’t belong in that list, as it draws upon information from outside the dataset. However, in doing so, it restricts the search space for the model, resulting in a more effective use of each label.
If each label only lives in the dataset it is applied to, then no matter how sophisticated a model is, the total amount of “knowledge” at its disposal will be limited by the number of observations in that specific dataset. To accelerate scientific progress in machine-learning and grow the pool of available data over time, it is essential to have access to labels from other datasets.
This problem would be solved if we could construct a single data-store with
A more realistic solution is to enable anyone to use and consume data in a scalable way, by providing an information transfer protocol that allows data to travel across organizational barriers. Note that such a protocol does not need world-wide adoption to provide value, and even local networks, specialized for just a single type of data, can be very powerful.
There are a number of challenges to overcome when getting a data network off the ground. Below, we list three of the main ones (privacy, data encoding, and bootstrapping) and share how we address them in our first product built on top of such a network.
The key challenges to solve for a data network
Privacy
Machine-learning models do not typically rely on personally identifiable information (PII). Instead, models use non-identifying, statistically relevant features that are often more important for pattern extraction. However, even for PII-less models, inference attacks are feasible given a sufficient number of queries, and privacy must be taken seriously.
The holy grail of privacy is algorithmic privacy, which has seen immense progress in the last few years, in the form of homomorphic encryption, zero-knowledge proofs, functional encryption, secure multi-party computation and differential privacy.
Implementing one of these solutions at scale would require all parties to trust the protocol, but SSL-level standardization is still years away. In the meantime, traditional information-security certifications like SOC2, ISO27001 and PCI-DSS are the norm for any sort of commercial data exchange, especially with larger organizations that are themselves certified. Each of these certifications imposes strict standards around data handling and protection, and is enforced through regular audits. Once obtained, however, they are a reliable way to establish trust around data at a commercial level.
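To make one of the techniques named above more concrete, here is a minimal sketch of differential privacy applied to a single counting query, using the standard Laplace mechanism. It is an illustration only, not a description of how any particular network implements privacy. The function names, the example amounts, and the choice of epsilon are all assumptions made for this example.

```python
import random


def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale).

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the values matching `predicate`.

    Adding or removing one record changes a count by at most 1, so the
    query has sensitivity 1 and Laplace noise with scale 1 / epsilon
    gives epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: transaction amounts held by one organization.
amounts = [12.5, 250.0, 99.9, 430.0, 18.0, 1200.0]
noisy = private_count(amounts, lambda a: a > 100, epsilon=1.0)
print(f"noisy count of amounts over 100: {noisy:.2f}")
```

A smaller epsilon adds more noise and gives stronger protection. Each additional query spends more of the privacy budget, which is one way to reason about the risk of inference attacks that grow with the number of queries.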
Data encoding
An important part of any communication protocol is data encoding. For human-to-human communication, it is a “natural” language; for human-to-computer communication, it is a programming language; for computer-to-computer communication, it is a data protocol; and for dataset-to-dataset communication, it is a set of features.
Historically, machine-learning models only used a single dataset to train on, with a feature encoding often specific to that particular source of data. However, attempting to aggregate datasets by enforcing a cross-organizational schema would bring more problems than possibilities, as well as constrain the representational power of the data by reducing the features to the lowest common denominator and introducing bias inherent to a human-defined schema.
A viable solution is to describe each feature using natural language. This way, a map can be built between any two features (and hence the data they describe) for which there exists a map between their definitions and the natural language domain. Of course, there are exceptions to this and some features are so specific to a particular product that it is not feasible to use them in other domains. However, that can be addressed over time in local versions of data networks. We will expand more on this approach in our next post.
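As a rough illustration of the idea of mapping features through their natural-language descriptions, the sketch below matches the columns of two hypothetical datasets by comparing their descriptions. The schemas, feature names, and descriptions are invented for this example, and the bag-of-words `embed` function is only a toy stand-in for embeddings from a pre-trained language model. It is not a description of our actual approach.

```python
import math
import re
from collections import Counter


def embed(description):
    """Toy embedding: a bag-of-words count vector over lowercase tokens."""
    return Counter(re.findall(r"[a-z]+", description.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_features(source, target):
    """Map each source feature to the target feature whose description is closest."""
    target_vecs = {name: embed(desc) for name, desc in target.items()}
    mapping = {}
    for name, desc in source.items():
        vec = embed(desc)
        best = max(target_vecs, key=lambda t: cosine(vec, target_vecs[t]))
        mapping[name] = (best, cosine(vec, target_vecs[best]))
    return mapping


# Two hypothetical organizations describing their own features in plain English.
dataset_a = {
    "amt": "amount of the transaction in US dollars",
    "mcc": "merchant category code of the seller",
}
dataset_b = {
    "value_usd": "transaction amount in dollars",
    "merchant_type": "category code for the merchant",
    "ts": "timestamp when the payment was made",
}

for feature, (match, score) in match_features(dataset_a, dataset_b).items():
    print(f"{feature} -> {match} (similarity {score:.2f})")
```

Neither organization has to adopt the other's schema here: each keeps its own column names and only supplies a description. A low best-match score could be used to flag features that are too product-specific to map at all.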
Bootstrapping the network
For a data network to provide value to its first users, it needs to have relevant data in it. If the network is open to receiving data from different industries with no immediate connection to each other, building it requires bootstrapping multiple two-sided marketplaces simultaneously. Such a flywheel is extremely difficult to start.
Furthermore, multiple levels of uncertainty are involved in deciding whether a dataset is useful to solve a high-level problem. One must weigh the costs of compute and developer time, explore the counter-party risk of relying on a particular service and estimate how much the model will improve from the additional data. When bootstrapping the network, it is key to reduce the number of steps in this decision-making process and minimize perceived risk for the first users.
The Ntropy network
In this post, we introduced the idea of “no-data” ML. If widely adopted, we believe it will significantly accelerate the development and adoption of machine-learning in the industry. It will allow powerful models to be bootstrapped with progressively less local data, while giving more complex models access to a rapidly growing pool of data.
At Ntropy, we are starting our data network with just one type of data: financial transactions. To maximize the density of relevant data from day one, we are also using other information sources, including external APIs, public databases, and embeddings transferred from pre-trained language models. In the next post, we will dive deeper into our first product and how it is used to enrich financial transactions for traditional banks, fintechs, insurance companies and others to enable a new generation of products and services.
Thanks to my fellow ntropians (in alphabetical order) David Buchmann, Chady Dimachkie, Jonathan Kernes, and Nare Vardanyan for their input on this post. If you are an ML or backend engineer excited by our mission, we are hiring and would love to hear from you.