Over the last decade, with advances in open banking and the rapid growth of data aggregators, transaction data has turned from a plain record of financial activity into a rich enabling layer for products and services offered by fintech challengers and even tech companies outside of finance.
Financial transactions are one of the richest sources of information on both consumer and business behavior. Yet, understanding this data at scale is still an unsolved problem. Here is why.
Let’s look at an example:
Here we show only some of the main features of a single transaction: the raw description, direction, and amount. Although cryptic at first sight, it is technically possible for a human to read this, given enough expertise, access to the internet, merchant databases, and various lookup tables. However, a machine learning model can be many orders of magnitude faster, cheaper, and more reliable, and, if the training data does not contain systematic noise, even more accurate.
For a model to parse a transaction and extract meaningful information from it, in addition to knowing about various merchants, payment processors, people's names, addresses, service words, etc., it needs to be able to:

- adapt to arbitrary changes in the format;
- resolve between entities with similar-looking names, depending on the context;
- assign the relevant labels from a given list of candidates;
- guess the meaning of abbreviations it has not seen before, and much more.
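To see why a fixed set of rules falls short, consider a minimal rule-based parser. The descriptions, patterns, and merchant names below are illustrative assumptions, not real data: any description that deviates from a known pattern simply falls through.

```python
import re
from typing import Optional

# A naive lookup table of known merchant patterns (illustrative only).
# Real descriptions vary arbitrarily in format, so rules like these
# break quickly outside the most common merchants.
MERCHANT_PATTERNS = {
    r"AMZN\s*MKTP": "Amazon Marketplace",
    r"UBER\s*\*?TRIP": "Uber",
}

def parse_merchant(description: str) -> Optional[str]:
    """Return a merchant name if any known pattern matches, else None."""
    for pattern, merchant in MERCHANT_PATTERNS.items():
        if re.search(pattern, description, re.IGNORECASE):
            return merchant
    return None  # unknown: the common outcome for long-tail merchants

print(parse_merchant("POS PURCHASE AMZN Mktp US*2A34B WA"))  # Amazon Marketplace
print(parse_merchant("SQ *JOES CORNER STORE"))               # None
```

The second lookup fails not because the transaction is unusual, but because the corner store never made it into the table, which is exactly the long-tail problem described above.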
Although most solutions work for the roughly 75% of cases involving one of the top few thousand merchants, none has so far been successful for the mom-and-pop shops, the corner stores, the newly opened businesses, and other smaller, less popular institutions. Indeed, with access to only a single source of data inside a fintech company, it is near impossible to avoid brittleness in the long tail. Covering only the main merchants and transaction types might seem workable, but the costs of the misses are much bigger than the gains from the hits. Losing money on a misallocated business loan, showing an individual the wrong spending breakdown, or miscalculating a quarterly tax return can all result in significant setbacks for financial companies and their customers. To be reliable, model outputs need consistent >95% accuracy, not only in the common cases but also in the long tail.
To achieve this level of performance at Ntropy, we are addressing the problem by combining data from across our network of customers with pre-trained natural language embeddings, data from search engines, contextual user features, and other auxiliary information. Much of this has only very recently become possible, in parallel with advances in task generalization, active learning, scalable data labeling services, and privacy-preserving techniques.
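One reason embeddings help where lookup tables fail is that similarity in a vector space degrades gracefully under formatting noise. The sketch below is a toy stand-in: it uses character-trigram counts and cosine similarity instead of learned language-model embeddings, and the merchant list and descriptions are made up for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: character trigram counts of the lowercased text.
    A production system would use pre-trained language-model embeddings."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

KNOWN_MERCHANTS = ["Amazon Marketplace", "Starbucks", "Uber"]

def nearest_merchant(description: str) -> str:
    """Return the known merchant whose embedding is closest to the description."""
    query = embed(description)
    return max(KNOWN_MERCHANTS, key=lambda m: cosine(query, embed(m)))

print(nearest_merchant("STARBUCKS #1234 SEATTLE WA"))  # Starbucks
```

Unlike an exact pattern match, the nearest-neighbor lookup still resolves the merchant despite the store number and location suffix, because most of the description's trigrams overlap with the merchant name.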
Below is an overview of our current architecture:
The pipeline converts raw streams of transactions into contextualized, structured information that is directly parseable by both humans and machines. It provides a much wider contextual understanding of each transaction than is possible with a single dataset or heuristic, and on some benchmarks it has been shown to outperform even human verifiers.
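To make "contextualized, structured information" concrete, here is one way such an output record could look. The field names and example values are assumptions for illustration, not Ntropy's actual API schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EnrichedTransaction:
    """Illustrative structured output for a single enriched transaction."""
    raw_description: str
    merchant: Optional[str] = None
    labels: List[str] = field(default_factory=list)
    website: Optional[str] = None
    location: Optional[str] = None

# Hypothetical enrichment of a raw card transaction description.
tx = EnrichedTransaction(
    raw_description="SQ *BLUE BOTTLE COFFEE OAKLAND CA",
    merchant="Blue Bottle Coffee",
    labels=["food and drink", "coffee shop"],
    location="Oakland, CA",
)
print(tx.merchant)  # Blue Bottle Coffee
```

A record like this is equally consumable by a human reviewing a statement and by a downstream model computing a spending breakdown or underwriting a loan.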
In our previous post, we introduced the natural language domain as a reliable carrier of information between machine-learning models across organizations and privacy barriers. At scale, it can enable models to access relevant data in nearly every industry, from agriculture to healthcare. As the model matures, we aim to abstract away the rigid schema for all input and output features to make it trivial to generalize to other domains.
To accelerate the adoption of this approach, we will be open-sourcing parts of our stack over time, including the active learning framework, privacy tools, trained models, approaches for prompt engineering and more.
Thanks to my fellow ntropians (in alphabetical order) David Buchmann, Chady Dimachkie, Jonathan Kernes, and Nare Vardanyan for their input on this post. If you are an ML or backend engineer excited by our mission, we are hiring and would love to hear from you. We recently launched our API to the public and a key can be requested here.