Quickly finding recurring events in massive sequences

Author

Ntropy Team

Team

Here at Ntropy [1] (we’re hiring!), we’re focused on building powerful Machine Learning models. But, every once in while, there’s an opportunity to solve problems with an efficient algorithm. Let’s get started.

Real world motivation

Our primary models deal with financial transactions [2], the same objects that are seen as snippets of text on a bank statement. There’s more to it [3] than that, but for the purposes of this post, we can think of a transaction as just a single datapoint containing a single integer value.

Schematically, we’re assuming that we have a sequence of “similar” transactions (the notion of similarity is some black box that a smart engineer down the hall designed and grouped together for us), and the date, given by an integer representing number of days from some start point, on which they occurred. For those that like numbers, our data looks something like this:

Data = [0, 1, 4, 5, 9, 11, 13, 14, 15, 17, 40, …. (continue until heat death of the universe or the Simpsons go off air [4])]

One thing we’re very interested in understanding, is how many transactions are recurring, meaning they occur at regularly spaced intervals of time. The reason that this is a difficult problem, is because we inherently have incomplete information. That black-box cluster algorithm? Doesn’t exist. In general, we don’t always know which transactions are truly the same.

To give an example, imagine that you pay for two Netflix accounts: one for yourself that you pay on the 2nd of every month, and one for your parents that you pay for on the 15th of every month. The transactions would look identical on paper, but in reality there are two overlapping, repeating sequences.

What we want, is a way to disentangle these two sequences.

Algorithms and Code

Problem Statement

The problem we are interested in solving is the following:

Given a sequence of integers in strictly increasing order, find all longest subsequences, that are regularly separated with intervals ≤ K, for K some integer ≤ N, the length of the sequence.

To make this understandable, consider the following graphic:

Disentangling overlapping sequences. Source: current author.

To construct this GIF, we took three sequences

[9, 20, 31], [11, 16, 21, 26], and [15, 22, 28]

(separated by intervals of 11, 5, and 7 respectively) and placed them all in the same array [9, 11, 15, 16, 20, 21, 22, 26, 28, 31]. The above algorithm statement is an attempt to reverse this join operation, and find the original patterns.

Solution

The inspiration for this solution comes from the well-known Knapsack Problem [5]. If we think about the problem statement, we explicitly set a value K, which is likely much smaller (K << N) than the sequence length. This suggests that an O(N*K) approach is computationally acceptable, and we should optimize for solving the fixed width problem as fast as possible.

If you want speed, you look for a hashmap solution (if you said GPUs, you might be an ML engineer), and that’s exactly what we’re going to do. We make the key observation that for all possible sequences, there are only N possible starting points. Thus, if we scan from left to right, we need only keep track of the longest sequence starting at any given point. Since sequences must not skip any values (can’t have 3, 9, 12, without 3, 6, 9. never skip 3, 6 , 9), as soon as we’ve reached an integer in our sequence that’s larger than the next needed value, we can pop off the accumulated sequence and move on.

We’ll set up one dictionary “nexts” to keep track of this max value, and one dictionary “seqs” to hold the longest sequences beginning at each integer in the array. The rest is just iteration. In code, this becomes:

To complete the algorithm, we must scan over all widths in the region we are looking for (we set a default of 31 for the max 31 days in a month. See following section for possible limitations). This adds an additional multiplicative factor of K, bringing the total runtime to:

O(N*K + N*Log N),

where we’ve added the original sorting time back in. Here’s the code for the final piece:

Note that we’ve elected to return sequences of indices in the original array, as the representatives of our sequences. This is slightly more general, and it’s trivial to get the actual arrays from this (just index the input array!).

To grab the full code, checkout our repository. For a look at other neat problems that we’ve open-sourced 📝 checkout our main repository at:

https://github.com/ntropy-network/ML-tools/

Limitations and Future Work

This isn’t the whole story though.

As with most theories, the proposed detangling algorithm is a bit of a Spherical Cow (physics jokes), in practice, things don’t have regularly repeating intervals. Months have varying amounts of days, payments can be late, and holidays can interrupt work weeks.

The production algorithms we use at Ntropy to detect recurrence are more powerful, and capable of including these fuzzy sequence matches. Furthermore, if we’re going to get fuzzy, we may as well get fully fuzzy. We also remove the black-box clustering step, and take into account transaction-to-transaction distance when determining which elements are part of a recurring sequence.

If this sounds interesting and you want to learn more, feel free to shoot us a message. We’re hiring!