Technical
24 Apr 2020
The unreasonable effectiveness of combining datasets
Author
Ilia Zintchenko
Co-founder and CTO
A machine learning model generally gets more and more accurate with more data. However, each new observation also adds less and less new information. The model starts to overtrain on our specific dataset, becoming increasingly blind to outliers. This is an especially dangerous game to play in a rapidly changing macroeconomic environment.
To increase the generalization ability of our model and make it robust against outliers, a simple idea would be to train on multiple datasets at once. The more datasets a single model can perform well on, the less of a “surprise” an outlier observation would be. Let’s look at a toy example.
Imagine two different datasets of pictures of dogs playing with different color balls. The goal is to train a model to locate the ball in each picture. In the first dataset, dogs are playing with blue balls
A model trained on this dataset will falsely learn that all balls must be blue. In the second dataset, all dogs are playing with yellow balls.
Equivalently, this model will falsely learn that all balls must be yellow. When a dog playing with an orange ball comes along, both of these models fail to locate it, as neither model has seen a ball of a different color.
If we, however, train a model on both of these datasets simultaneously,
it will learn that balls can come in a variety of colors, and be able to locate a ball the orange ball correctly.
In a commercial setting, different companies tend to have different distributions of customers and corresponding datasets. Variations in the distributions depend on multiple factors, including geography, market sector, positioning, etc. By training across datasets of multiple organizations, a model can be much better prepared to handle customers it has not seen before.
Ntropy is building a network to allow data consumers to train machine learning models on multi-organization data with minimal engineering overhead and data producers to seamlessly monetise their data pool at no additional privacy risk.