December 6, 2021

Transaction Categorization

Dr. Jonathan Kernes
Machine Learning Engineer


The term “transaction categorization” is widely used in financial services and in the broader software industry and is well-known in fintech circles.


Here we will unpack what transaction categorization means and why it is important, and share what is available today and what we are making possible going forward.


Transaction data is an uncut gem


Transactional data is generated every time money is transferred between two entities, one initiating the transaction order and the other receiving it.

A financial transaction consists of a date, an amount, a description or memo, and often, but not always, the explicit identity of the sending financial institution. Depending on the parties involved, the pieces contained in the description string may vary.

Although financial transactions are standardized by ISO 8583 and other formats, there are approximately 26,895 active ABA routing transit numbers (RTNs) in use in the US alone, and each one may have its own way of formatting the transaction string. Every financial institution (FI) in the United States has at least one RTN and can be assigned up to five.

Once you get different payment processors involved, the strings become even less legible. A PayPal or a Bill.com transaction often looks like the processor is the merchant and is very hard to decipher as a result.

Here are a few examples of transaction descriptions:

PAYONEER PAYONEE BUSBILLPAY TRAN#78

BP#9538547FLEET AVE CLEVELAND OHUSA

TST* SUBPAR MINIATURE SAN FRANCISCOCA USA

These inconsistencies and the lack of a common standard make it almost impossible to interpret transactions programmatically.
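To make the difficulty concrete, here is a minimal sketch that applies one naive heuristic to the three example descriptions above: treat the first whitespace-separated token as the merchant. The function name `naive_merchant` and the heuristic itself are assumptions made for illustration, not a description of any production system.

```python
# The three example descriptions quoted above.
EXAMPLES = [
    "PAYONEER PAYONEE BUSBILLPAY TRAN#78",
    "BP#9538547FLEET AVE CLEVELAND OHUSA",
    "TST* SUBPAR MINIATURE SAN FRANCISCOCA USA",
]


def naive_merchant(description: str) -> str:
    """Guess the merchant as the first whitespace-separated token.

    This is a deliberately simplistic, hypothetical heuristic.
    """
    return description.split()[0]


guesses = [naive_merchant(d) for d in EXAMPLES]
for description, guess in zip(EXAMPLES, guesses):
    print(f"{guess!r:20} <- {description}")
```

The guesses come out as `PAYONEER`, `BP#9538547FLEET` and `TST*`. The first names a payment platform rather than the underlying counterparty, the second fuses an identifier with what looks like a street name, and the third is a prefix rather than a merchant, which is why a single rule rarely generalizes across formats.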

Even for humans, transactions are non-trivial to understand. Obtaining data to train machine-learning models to parse transactions is even harder. When we trained our first models a year ago, we found that we needed crowd-sourced human labels per transaction for the model to achieve reasonable accuracy. Even with some of the best off-the-shelf solutions, such as Scale or Amazon SageMaker, you end up paying a high cost and still not getting the desired output.

There is effectively no ground truth about a transaction. The same transaction means different things to the sender and the receiver of the payment.

For instance, a bill payment for a corporate dinner party is sales for the restaurant, entertainment for the consumer, and a corporate employee expense for the business organizing the party. Similarly, an AWS transaction can be tracked as cloud infrastructure by one company and as a sales cost by another, depending on the internal classification logic each one follows.

Although a variety of solutions have been built to optimize moving money from point A to point B, the movement of information is still a TODO, despite the abundance of counterfactual effects and real business needs. This is a very apparent, yet very hard, technical problem to solve.
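The dinner-party example can be sketched as a label that depends on who is looking at the payment, not only on the payment itself. The table and the `categorize` function below are hypothetical names used purely for illustration.

```python
# Hypothetical label table keyed by (what was paid for, who is looking).
# The labels mirror the dinner-party example in the text.
LABELS = {
    ("corporate dinner", "restaurant"): "sales",
    ("corporate dinner", "consumer"): "entertainment",
    ("corporate dinner", "organizing business"): "corporate employee expense",
}


def categorize(payment: str, viewer: str) -> str:
    """Return the label for a payment from one party's point of view."""
    return LABELS.get((payment, viewer), "uncategorized")


viewers = ["restaurant", "consumer", "organizing business"]
labels = [categorize("corporate dinner", v) for v in viewers]
print(dict(zip(viewers, labels)))
```

The point of the sketch is the function signature: a categorizer that takes only the payment as input cannot return all three labels, so the account holder has to be part of the input.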

The reason this problem exists in the first place is antiquated technology and a lack of incentives for the five parties to a transaction (the issuing bank, the acquiring bank, the card network, the merchant and the cardholder) to effectively transport and reveal information.

Given the above, transaction categorization has become an area with lots of activity, both from internal data and engineering teams at certain companies and from external vendors competing to own the intelligence ecosystem on top of money movement. These include the likes of pave.dev for consumer transaction insights, Heron Data for more general categorization, Spade.dev for merchant name cleansing, as well as in-house solutions of the incumbent aggregators like Flinks, Plaid, Finicity, Yodlee and MX.

In-house workarounds are combinations of rules, lookup tables, internally labelled datasets and rudimentary ML models.
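A minimal sketch of such a workaround, combining a lookup table with ordered regex rules, might look like the following. The table entries, rule patterns and category names are all hypothetical and chosen only to show the shape of the approach.

```python
import re

# Hypothetical lookup table: leading merchant token -> category.
MERCHANT_CATEGORIES = {
    "BP": "fuel",
    "PAYONEER": "payment processor",
}

# Hypothetical regex rules, tried in order when the lookup misses.
RULES = [
    (re.compile(r"RESTAURANT|CAFE"), "food and drink"),
    (re.compile(r"PAYROLL"), "payroll"),
]


def categorize(description: str) -> str:
    """Categorize a description with a lookup table, then regex rules."""
    text = description.upper()
    # 1. Lookup on the leading run of letters in the description.
    leading = re.match(r"[A-Z]+", text)
    if leading and leading.group() in MERCHANT_CATEGORIES:
        return MERCHANT_CATEGORIES[leading.group()]
    # 2. Ordered rules over the whole string.
    for pattern, category in RULES:
        if pattern.search(text):
            return category
    # 3. Anything else falls into the long tail.
    return "unknown"


results = [
    categorize("PAYONEER PAYONEE BUSBILLPAY TRAN#78"),
    categorize("BP#9538547FLEET AVE CLEVELAND OHUSA"),
    categorize("TST* SUBPAR MINIATURE SAN FRANCISCOCA USA"),
]
print(results)
```

Two of the three examples hit the lookup table, and the third falls through to `unknown`. Every new merchant or format means another table entry or rule, which is where the maintenance burden of this approach comes from.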

All of these vendors and workarounds have different approaches with advantages and disadvantages.

An optimal solution has to meet the following criteria:

  • scalability, while keeping its flexibility

    More than 10k new businesses are started every day in the US alone. Each of these businesses has a specific cohort of customers and hence types of transactions. It is essential that new transaction labels and merchants can be added on-the-fly without any system downtime or engineering overhead.
  • interoperability

    As it is becoming increasingly easy for fintechs to serve customers globally, the API that parses transaction data has to handle new transaction patterns that it has never seen before without any drop in reliability. It should be able to understand and distinguish between the transactions of consumers, freelancers and small and large businesses. It should also be able to interpret transactions in multiple languages and currency codes. We are seeing a trend of increasingly global, multi-currency, multi-lingual, multi-account-holder-type transactions, even within single batches, across many of our customers.

    Whether it is a Mastercard payment, a post-processed string from a Plaid or a Finicity or one you are getting from an issuer like Marqeta, your categorization engine needs to be able to parse and infer information equally well and be accessible within all fintech playgrounds.
  • accuracy

    For the companies building products and services driven by information extracted from transaction data, accuracy is critical. It is the difference between a seamless and powerful user experience and a product that doesn’t work.

    Whether you are building a personal finance manager or a savings tool, getting the basics wrong will result in poor analytics and is a handicap for fintech developers.

    The cost of mistakes is high, and it grows with the amount of the transaction. For example, just a single wrongly categorized equity investment in a startup can result in a badly priced loan for a bank or significant errors in VAT returns and tax credits.

    The counterfactual power of this information is massive too: the potential of things you could build if you got it right.

Build vs buy: the need for an API

Transaction categorization is a clear need for anyone who is shipping financial products or services, whether those are banks, standalone fintechs or embedded finance use cases, such as Uber offering cards and salary advances to its drivers, ServiceTitan offering payroll and cards to their CRM customers and more.

If an engineer inside a company has touched or seen payments, they know how bad the data is and have worked on solving this problem.

Here are the core issues with in-house categorization engines:

  1. Cold start

    To solve the categorization problem efficiently, some fintech teams spend multiple years and 10M+ USD. Along with the time and monetary cost, the variance of the expected outcome is high. As is especially true for machine learning, an approach that seems to work well can turn out to plateau fast and become a nightmare to maintain and improve past a fixed threshold.
  1. Diminishing returns

    More in-house data does not always mean better results. To keep accuracy from plateauing in the long tail, a model needs to learn from diverse information.

    In a world where software experiences are increasingly supported by APIs, the lego bricks of the modern economy, spending resources on transaction categorization in-house does not make sense.

    Firstly, it will involve lots of manual labelling and re-training, which your engineers are most definitely going to hate.

    Secondly, it will divert focus from core features.

    Thirdly, building a standardized source of truth about transactions can only be done across the industry, training on a variety of diverse datasets and use cases. We have covered how we do this here.

    Once you start getting clean and enhanced feeds from your customers, the opportunities and use cases on top are endless.

Below we will describe the ones we love most.

Use cases: sky's the limit

  1. Enabling climate-positive purchases

    Where and how you spend your money directly affects the planet. To be able to change, we need to keep track and get an understanding of our spend, as well as the merchants we spend with.

    High-resolution transaction data is going to play an important role in building a carbon-negative future.

  1. Revenue based financing for businesses

    Ages ago, if you were an entrepreneur about to start something or needing to grow what you had built, you needed great “friends and family” to access capital.

    Later on you had to look and act the part to gain the trust of your bank manager.

    Most recently you need to have years of history, a credit score and fill out a bunch of paperwork. With a bit of luck involved, you will get alright terms and the capital you need.

    With great transaction categorization, luck is overrated. Cash flows are read and interpreted by machines, providing an optimal view of the present and the future of a business. This means cutting the time to access capital from weeks to seconds.
  1. Embedded money

    Imagine you are the operating system (the CRM, the payment processor, the employee management system, the customer comms layer) for the majority of car repair shops in the US. You have saved them from pen-and-paper operations and the pains associated with them. Your software environment is effectively the home of their business. However, when they want to grow, make changes, hire more, invest in inventory or get started, they need to go to a bank: an entity that knows nothing about them or their business and has to start from scratch in understanding them. Banks spend time and money on this assessment and later charge it back to customers through onboarding costs, communication and interest rates.

    Instead, being where these businesses live, you can allow them to connect their cash flows and financial information without leaving your premises. You get very happy customers and 3-5x LTV. They get time and resources to run their business instead of messing about with banks.

    In order to make this happen, banking data needs to be machine-readable at scale and easily married to other metadata about the business, for instance the CRM information, customer service and reviews. Enter transaction categorization.

  1. Hyper-personal rewards

    As much as we hate the word hyper, getting things that you actually care for vs generic discounts and points is fun. To be able to do this, one needs to know what you spend your money, and hence your time, on.

    A well-tuned categorization engine serves as the backbone for building a rewards system for your customers.


  1. Making money last

    Business finance management and personal finance management tooling have so far been the core consumers of transaction histories and spend data. In order to allow your users, whether a business or an individual, to have a great overview of how to make their money last, you need a granular understanding of where it is coming from and where it is going.


Merchant recognition, understanding recurrence of payments, as well as the individual descriptions of transactions are key to this.
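As a rough illustration of the recurrence part, the sketch below flags a merchant's payments as recurring when the gaps between consecutive dates stay close to their median. It assumes the merchant has already been resolved, and the function name, thresholds and sample dates are all hypothetical.

```python
from datetime import date
from statistics import median


def is_recurring(dates, tolerance_days=3, min_occurrences=3):
    """Return True when payment dates are roughly evenly spaced.

    Hypothetical rule: every gap between consecutive payments must be
    within `tolerance_days` of the median gap.
    """
    if len(dates) < min_occurrences:
        return False
    ordered = sorted(dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    typical = median(gaps)
    return all(abs(gap - typical) <= tolerance_days for gap in gaps)


# Made-up sample dates for one merchant each.
subscription = [date(2021, 9, 1), date(2021, 10, 1), date(2021, 11, 1), date(2021, 12, 1)]
one_offs = [date(2021, 9, 1), date(2021, 9, 4), date(2021, 11, 20)]

print(is_recurring(subscription), is_recurring(one_offs))
```

A rule like this only works once descriptions have been mapped to a consistent merchant, which is why merchant recognition comes first.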

Integration

There are a few ways to get integrated with our transaction categorization API and it usually takes less than 10 minutes.

The no-sign-in, no-code version is here: try.ntropy.network. You can start uploading individual transactions and get a feel for the product.

A more comprehensive version of the API lives here: api.ntropy.network. All you need is a key to start sending queries; you can get a key here.

In the middle ground, we have a Python SDK here and a Postman collection here.

In case you need any help, we are always here to answer any of your questions on Slack or via Twitter DMs!
