For a notebook version, please check out our Colab tutorial: https://colab.research.google.com/drive/1j2BtkDm75xROSZ9wbJ5acyp3eis9kb6f?usp=sharing
For an example of the absolute minimum code needed to create a model, please see our quickstart tutorial: https://colab.research.google.com/drive/1xRPV2aXFTli0SFuk2o--VWO3Tzt_uno3?usp=sharing
This post was authored by Ntropy’s Head of Product.
In this post, we will walk through how to build custom models using Ntropy. In doing so, we will also present our own findings on the effectiveness of customization, measured across a variety of benchmarks and conditions. Finally, we will end with a real-world case study, in which one of our customers was able to achieve a +12% 🚀 increase in accuracy over using our core models alone!
Let’s square a few things away. Ntropy customization is NOT a direct mapping from the labels in our core models, but rather a separate model built on top of our core model that can be customized for individual sets of labels and transactions. In a future post we will explain why a label mapping is insufficient, but for now we hope the numbers will convince you of this fact. Customized models are also not trained from absolute scratch, but instead adapted from our core models; this combines the generality and robustness of Ntropy categorization with the accuracy of user specification.
Here’s what the Ntropy customization endpoint is:
The other details and nuances we will handle during the tutorial. Let’s dive in.
Broadly, the process for creating models will look something like this:
If you already know the set of categories you want, and what transactions for those categories look like, you can skip the first two steps. Step 3 is required if you have no labeled data. One option for gathering data is to use the Ntropy labeling team; this is handled on a per-request basis for customers. We have a vetted internal team of financial experts who can provide this service free of charge as part of the deployment process.
Before starting, we’ve made public a small, synthetic dataset to use for testing. We will use this throughout the tutorial, and it can be downloaded from S3 at
🔎 Alternatively (and more conveniently) we’ve made the synthetic data publicly available on Google sheets here:
Quite frequently, users come to us and don’t yet have a clear picture of what they want, other than wanting a deeper understanding of transactions. In these cases, the default is to use the outputs of our core classification model.
Instead, we’ll go one step further, and use the core model as an exploratory tool. We can use the outputs of the core model to get a rough idea of the distribution of labels in our dataset, and from there, narrow down where to look. For this part, we can use the existing Ntropy API to “enrich” a list of transactions (see the Colab for code). Enrichment is our term for a transaction that has passed through all of our services, which includes merchant-extraction and categorization. Upon enrichment, your data will look something like this:
The labels field presents the ground-truth categories for each transaction. Since we are still in the exploratory phase, pretend that column doesn’t exist yet.
The model_predictions field is the output of our core transaction classifier. This is the main field we will use for exploration, however, there are four other outputs from Ntropy enrichment (merchant, website, person, and location) that will be important as well.
The first thing we will do is get a feel for the data distribution. Let’s plot the top 10 most common categories.
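This exploration step can be sketched in a few lines of pandas. The inline rows here are a toy stand-in; in practice, the `model_predictions` column comes from the enrichment call shown in the Colab:

```python
import pandas as pd

# Toy stand-in for enriched output; in practice, model_predictions
# comes from the Ntropy enrichment step described above.
df = pd.DataFrame({
    "model_predictions": [
        "food and drink", "software", "food and drink", "inventory",
        "vendor payment", "software", "food and drink", "cybersecurity",
    ]
})

# Top 10 most common core-model categories.
top10 = df["model_predictions"].value_counts().head(10)
print(top10)
```

Passing `top10` to your plotting library of choice (e.g. `top10.plot(kind="barh")`) produces the distribution chart.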
We see that, for the most part, these fall into about three categories: food and drink, software, and inventory & vendor payments. Broadly, the two most common topics in the data are software and food and drink.
Next, let’s take a look at the tail end of the distribution.
In this dataset, it’s assumed that we will see a bunch of software and food and drink transactions, for whatever reason (maybe the transactions belong to startups or restaurant owners). The next most common categories were vendor payments and inventory, which are both semantically related in the Ntropy hierarchy.
Looking through the vendor payment and inventory transactions, we start to notice a clear trend. These transactions all seem to be food distributors! Likewise, there appears to be a clear pattern amongst them. The amounts are all debits between about $5000–50,000. This is a perfect candidate for a new category!
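The pattern above can be turned into a filter directly against the enriched frame. A minimal sketch, with toy rows and column names assumed to match the enriched table shown earlier:

```python
import pandas as pd

# Toy enriched rows; real ones come from the enrichment step above.
df = pd.DataFrame({
    "description": ["SYSCO CORP PAYMENT", "AWS", "US FOODS INV 123"],
    "entry_type": ["debit", "debit", "debit"],
    "amount": [12000.0, 79.99, 8500.0],
    "model_predictions": ["vendor payment", "software", "inventory"],
})

# Candidate "food distributors": debits of $5k-$50k whose core-model
# label is inventory or vendor payment.
mask = (
    df["model_predictions"].isin(["inventory", "vendor payment"])
    & (df["entry_type"] == "debit")
    & df["amount"].between(5_000, 50_000)
)
candidates = df[mask]
print(candidates["description"].tolist())
```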
At this point, we can solidify our 4 categories — food distributors, food and drink, software, and cybersecurity — and proceed to dataset construction. However, it’s worth pointing out one more thing. We can continue to look at the tail end of the distribution, and hunt for edge cases. For example, consider the following transaction:
Description: RYPE, INC. Entry type: outgoing Amount: 79.99 USD.
RYPE, INC. is a language learning software product, which seems to fall equally under both education and software. For any given schema, ambiguous transactions like these will always appear. Finding borderline cases like these is critical to tuning our model correctly. When we construct our schema, we have to decide whether the software label should or should not include these types of transactions.
In this case, we will make the interpretation that, yes, this qualifies as software. However, you could imagine another scenario where we may be interested only in software that aids on the developer side, but not software that aids on the business side. The choice is yours, and that’s the beauty of custom models!
It says optional, but the reality is that you will likely need to alter (or at the very least inspect) your current datasets if you want to get great performance.
What makes a good dataset? Let’s continue building the one in the previous section and we will see how the thought process works.
1. After exploration, create a temporary label mapping from Ntropy’s core model to the categories that you are interested in. In this case, for food distributors, we would look at transactions that were labeled as inventory, vendor payment, or food and drink. Note that the Ntropy label food and drink can map to more than one thing at this stage (food distributors plus food and drink).
2. Find clean, unambiguous, informative transactions first. Such examples give the model full signal to learn from. A good transaction will have a
3. Find as many edge cases as you can. The previous example of Rype, Inc. is an excellent demonstration. We need to feed the model as many borderline cases as we can, in order to tilt it into the direction that we want.
4. Make a test set! You can skip this step if you like, but the process is the same as for the training set. This is a separate set of data used to gauge how good your model is. Just make sure there is no overlap between the train and test sets.
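Step 4 can be as simple as a seeded random split. A minimal sketch on hypothetical data (in practice, a stratified split per category is safer):

```python
import random

# Toy labeled examples: (description, custom_label) pairs.
data = [(f"TXN {i}", "food distributors" if i % 2 else "software")
        for i in range(100)]

random.seed(0)
random.shuffle(data)

# Hold out 20% as a test set; the slicing guarantees no overlap by construction.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

assert not set(d for d, _ in train) & set(d for d, _ in test)
print(len(train), len(test))
```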
And that’s it. You should be able to make your categories. In our running example, this means we would have chosen food distributors, food and drink, software, and cybersecurity. Finally, you can repeat the same procedure for constructing a test set, to gauge your performance.
Customization is incredibly flexible, but there are two special cases of categories that you need to be aware of (beyond categories that are obviously poor, like special transactions). These are the Other and Not Enough Information categories.
The first can be considered an actual category. When labeling a transaction as Other, the assumption is that there IS enough information for the model to label the transaction, but that it is not one of the things we care about. These categories can be challenging, as the model will need more data to learn them. In future versions we will expose models optimized for this situation; for now, to keep things simple, just supply examples of the Other category.
The second category, Not Enough Information, is a default category that triggers when… there is not enough information in the transaction. Currently, this is not yet supported by the API, but will be available in future versions.
The training step is pretty easy, and is best shown in the quickstart Colab and the docs. For demonstration purposes, we show just how little code is required.
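For the actual API calls, see the quickstart Colab and the docs. As a conceptual stand-in only (this is NOT the Ntropy SDK), the fit/predict shape looks like the toy bag-of-words nearest-centroid classifier below:

```python
from collections import Counter, defaultdict

# Toy stand-in for a fit/predict workflow. This is NOT the Ntropy API --
# see the quickstart Colab for the real calls -- just a minimal
# bag-of-words nearest-centroid classifier to illustrate the shape.

def tokens(text):
    return text.lower().split()

class ToyClassifier:
    def fit(self, descriptions, labels):
        # One token-count "centroid" per label.
        self.centroids = defaultdict(Counter)
        for d, y in zip(descriptions, labels):
            self.centroids[y].update(tokens(d))
        return self

    def predict(self, description):
        t = Counter(tokens(description))
        # Pick the label whose token counts overlap most with the input.
        return max(self.centroids, key=lambda y: sum((t & self.centroids[y]).values()))

clf = ToyClassifier().fit(
    ["SYSCO CORP PAYMENT", "US FOODS INVOICE", "AWS SUBSCRIPTION", "GITHUB INC"],
    ["food distributors", "food distributors", "software", "software"],
)
print(clf.predict("US FOODS PAYMENT"))
```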
Until now we’ve only described how to use the API. In this section, we will take the time to discuss how the models work behind the scenes, as well as demonstrate performance across a number of tests and benchmarks that we’ve run. As a result, we hope that you will gain a greater understanding of the performance you can expect, under what circumstances, and what steps you can take to make it better.
There’s a famous quote by the physicist Isaac Newton: “If I have seen further it is by standing on the shoulders of giants”, which is taken to mean that no progress occurs in isolation; it builds on what came before it. Our customization model is no different. When you train a customized model, you are not training anything from scratch, but instead training on top of our core model. This means there are several things you don’t need to worry about:
In terms of how we make this happen, we are bringing to production advances in the Machine Learning subfield of Meta-Learning, which can cutely be described as “learning to learn”. There are essentially two parts to this. The first part is a model architecture choice, where we teach our models to adapt quickly. This part is handled on our end, and for all intents and purposes can be thought of as black magic. The second part involves massive multi-task learning, whereby we train our models on a wide variety of tasks, each of which can contribute some amount of information to the others. It’s a realization of something we’ve worked towards at Ntropy since founding: a Data Network. More importantly for the user, the takeaway is this:
Each custom model increases our core model’s understanding of transactions, which in turn raises the performance of all custom models.
Given that we understand a bit more about how the models work, let’s take a look at some actual experiments that quantify the previous claims.
Before diving into the figures, we need to first decide what, exactly, we want to test. In previous blog posts (and more so in a future post), we spent a great deal of time explaining why transaction classification is difficult, when it can succeed, and when it fails. However, there is one crucial piece that we did not discuss, and that renders those discussions partially academic.
How much data does a classifier need to learn from?
That’s the million-dollar question. As it turns out, understanding this problem is closely related to understanding which categories are good, which are ambiguous, and which are devoid of information. The next point is so important, I’m going to section it off and toss in emojis.
💡💡💡 There are generically two types of categories that exist, and it is absolutely critical to understand which type of category you have when thinking about performance.
The first are lower-level categories: well-defined, simple things like gas station purchases and bowling alleys.
The second are higher-level categories, which carry human bias and may require logic to understand. An example would be calling dinner with a client “customer acquisition costs”, as opposed to “restaurant spend”.
It shouldn’t be too hard to guess that lower-level categories learn quicker than higher-level ones. Indeed, that’s exactly what our results show. Before discussing our experiments, there’s one final point to clarify. When thinking about performance, we also need to think about groups of labels. Let’s say you have 4 well-defined labels like revenue, operating costs, loans, and rent. If you artificially condense these into two categories (revenue + operating costs) and (loans + rent), you’ve suddenly made the model much more difficult to train, as you now need to associate two very different things (revenues and operating costs) as belonging to the same category. We will discuss how learning performs below.
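Such a condensation is just a label mapping applied before training. A toy sketch (the group names here are hypothetical):

```python
# Artificially condensing well-defined labels into coarser groups,
# fusing semantically different labels into the same category.
condensed = {
    "revenue": "group_a", "operating costs": "group_a",
    "loans": "group_b", "rent": "group_b",
}

train_labels = ["revenue", "rent", "loans", "operating costs"]
coarse = [condensed[y] for y in train_labels]
print(coarse)
```

The model now has to learn that e.g. revenue and operating costs belong together, even though nothing in the transactions themselves suggests it.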
We test our Customization endpoint across 6 different tasks, each testing a different type of problem to learn. In Table 1, we present our results across each of the 6 datasets, and display the mean accuracy averaged across 20 random seeds, the total number of training samples, and the number of classes (categories) in the dataset.
Keyword dataset: this consists of 59 transactions that each contain one of three keywords, Direct deposit, Online, or Recurring, somewhere within the transaction. We chose the keywords to contain potentially relevant information for categorization, but also to be sufficiently nondescript as to require learning. Results show that this performs most poorly, for several reasons. First, the pattern is not semantic but syntactic, whereas we have optimized our models to perform best on semantic tasks. Second, learning this pattern requires rewiring the attention patterns, which, with only 59 examples, is not sufficient data.
Person dataset: this consists of 764 transactions across all categories, where the goal is to detect whether or not a person exists in the transaction. Note that this model is trained independently of our named-entity extraction model, and so there is no leakage.
Similar dataset: this was constructed by gathering ~5000 transactions whose labels are covered by our hierarchy, and then assigning them to one of 6 categories: debt, revenue, operating expenses, financial services, tax, and other. We are testing how well our customized models can condense the 188 labels in our general hierarchy, into a digestible 6 categories. We will have more to say when we look at the training curves. For now, we just note that this (and the following dataset) are “higher-level” tasks.
Dissimilar dataset: this is very similar to the Similar dataset, but instead of creating condensed, consistent categories, we group disparate categories into unnatural chunks. The chunks are (1) not_enough_information, (2) customer acquisition cost + government + insurance + revenues and inflows, (3) fees + employee spend + facilities + investment + gifts, (4) cost of goods sold + financial services + personnel, and (5) tools + professional services + insurance + intellectual property + infrastructure. For this type of dataset, a clustering approach (where each new label is assigned to the cluster whose representative is most similar) would fail, whereas our approach succeeds, albeit with a slightly lower score than the Similar dataset.
Novel dataset: part of this dataset is publicly available online, and is what we used in the first half of this post. Here we created new categories outside of our hierarchy, and tested how quickly the model could adapt to something it’s never seen before. This falls into the “lower-level” learning category, and we confirm our intuition.
Case Study: here we present the results of a real-world case study with one of our beta users. The dataset consists of 21 classes spread over 12,558 data points. Of particular note, this dataset is extremely noisy: 40% of entries in the training set are labeled incorrectly.
Let’s now take a look at training dynamics; in particular, we will focus on 3 of these datasets.
Of course, more data is (usually) better. But time is finite, and we all want to know: how much do we really need? The answer depends critically on whether you are learning lower- or higher-level categories. To investigate, let’s look at performance as a function of two variables: amount of data and amount of noise.
In Figure 2, we show the results of customization training for one lower-level dataset (Novel) and two higher level ones (Similar and Dissimilar).
Immediately, we notice a few things:
Another, perhaps less interesting, point is that performance expectedly decreases as noise levels rise. 10% noise is relatively tolerable, but beyond that things become quite difficult to learn.
As a side note, we remark that noise was introduced by first sampling latent confusion matrices for each class, and then randomly flipping classes. This way, the model cannot as easily detect noise, since the error rates have correlated patterns. This models actual worker errors much better than white noise.
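Our own reconstruction of that noise procedure, sketched with hypothetical classes and parameters (the exact confusion matrices used in the experiments are not public):

```python
import random

random.seed(0)
classes = ["food distributors", "food and drink", "software", "cybersecurity"]
noise_rate = 0.4  # fraction of labels to corrupt

# Sample a latent "confusion row" per class: a fixed preference over which
# wrong classes it flips to. Errors are therefore correlated per class,
# unlike uniform white noise.
confusion = {}
for c in classes:
    others = [k for k in classes if k != c]
    confusion[c] = (others, [random.random() for _ in others])

def corrupt(label):
    if random.random() < noise_rate:
        others, weights = confusion[label]
        return random.choices(others, weights=weights)[0]
    return label

noisy = [corrupt("software") for _ in range(1000)]
flip_fraction = sum(y != "software" for y in noisy) / len(noisy)
print(round(flip_fraction, 2))
```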
Ok, so how is it that our models are able to learn so quickly, and so well? We’ve spoken a lot about how the customization models are built on top of our core model, but we can actually show what this looks like. Below are t-SNE-produced 2-dimensional embeddings of the transactions in our training data, before customization begins.
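A sketch of how such a plot is produced, with random vectors standing in for the model’s hidden states (which we obviously can’t ship in a blog post):

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in: random "transaction embeddings" in place of the
# core model's hidden states.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))

# Project to 2D; in the real plot, each point is then colored by its
# ground-truth class.
coords = TSNE(n_components=2, random_state=0, perplexity=30).fit_transform(embeddings)
print(coords.shape)
```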
Different colors correspond to the different ground-truth classes. What we see is that our core categorization models are already exceedingly good at understanding transactions and bucketing them. This makes the classifier’s job much easier, and also explains why borderline cases are so important. Our models know enough about what’s going on to know when they should be confused. For a fun visual aid, we’ve also attached a GIF showing how these hidden states adapt over the course of training, ultimately becoming more clustered (note this is just for fun, and is not created using t-SNE like above).
Figure 4. Evolution of a 2-dimensional bottleneck state over continuous training. Every 50 epochs we add more data to the training set, which is what results in the jitters.
Finally, let’s discuss a real-world case study. The biggest challenge when deploying in the wild is that we can no longer guarantee quality controls on the data; the main thing we must contend with is noise. For a practitioner, the natural question to ask is:
At what point does Customization make more sense than just making a direct label mapping from Ntropy labels?
Below, we present our results from the Case Study dataset, which consists of 12,558 transactions, 21 categories, and a noise rate of 40% of training data points mislabeled.
For reference, we’ve provided what we call an optimized label mapping, whereby we use the training data to construct a mapping from Ntropy labels to user labels that maximizes accuracy. Note that this sometimes produces some really questionable mappings, like payroll mapping to investment equity.
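A sketch of such an optimized mapping: assign each core label to the user label it most often co-occurs with in the training data (the co-occurrence pairs below are toy values):

```python
from collections import Counter, defaultdict

# Toy (core_label, user_label) co-occurrence pairs; real ones come from
# the labeled training set.
pairs = [
    ("food and drink", "meals"),
    ("food and drink", "meals"),
    ("food and drink", "food distributors"),
    ("software", "tools"),
    ("payroll", "investment equity"),  # the kind of questionable mapping noted above
]

cooccur = defaultdict(Counter)
for core, user in pairs:
    cooccur[core][user] += 1

# Each core label maps to its most frequent user label, which maximizes
# training accuracy for a fixed mapping.
mapping = {core: counts.most_common(1)[0][0] for core, counts in cooccur.items()}
print(mapping["food and drink"])
```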
The main thing to note is the crossover point between the dashed line and the black curve. At just 6 samples per class, even with 40% noise, it becomes more advantageous to use the customization endpoint over a direct label mapping.
With the full training set, we find a +12% 🚀 increase in accuracy overall, which is quite a significant improvement given what is otherwise unusable data.
If you’ve made it this far, the first thing you should do is head over to https://ntropy.com/company/about and check out the job board, because we’re hiring and would love to have you on the team!
Besides that, if you haven’t already, you should check out our Colab, and actually try this out for yourself.
Though we’re stoked to launch our customization endpoint, the work is not done. We’ve touched on a few of these already, but there are several areas we are actively working on for future releases, in order of priority:
Thanks for reading, we can’t wait to see what you will build 😄!
P.S.: the number of Kanye references in the publicly available data is 2; try to find them! 😝