07 May 2024
Evals Part Two: Preparing for an evaluation
Now that you are ready to evaluate, what are the main considerations to start thinking about?
Data Preparation
Typically, the first thing a customer needs to do once they have decided to run an evaluation process is to assemble a dataset that is representative of the banking data they expect to receive.
None of the data we require contains personally identifiable information, but there can still be legal and security hurdles to overcome before data can be shared with third parties, and these need to be factored in. If you have a legal or data security team, it is never too early to loop them into your discussions. In fact, the sooner the better.
Someone also needs to create the representative dataset and extract it from internal systems/databases. This can be a time-consuming process, so it is never too early to start thinking about what types of transactions to include in your dataset, and how many. A prospective vendor should be a collaborative partner here, providing guidance and best practices as well as the required input formats and examples.
Using real data will always give you the most representative results, and it is what we suggest to customers wherever possible. Our models use all elements of the input transaction object, so omitting fields or using placeholder values will reduce accuracy.
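To make the "all elements populated" point concrete, here is a minimal sketch contrasting a fully populated transaction with a placeholder-heavy one. The field names below are hypothetical illustrations, not Ntropy's actual input schema; check the vendor's API reference for the real format.

```python
# Hypothetical transaction object with every field populated with a real value.
complete_txn = {
    "description": "AMZN Mktp US*2K4Z19401 Amzn.com/bill WA",
    "amount": 84.23,
    "entry_type": "outgoing",      # direction of the transaction
    "currency": "USD",
    "date": "2024-04-18",
    "country": "US",
    "account_holder_type": "business",
}

# The same transaction with stubbed/missing values. Models that use every
# field lose signal from each placeholder, which hurts enrichment accuracy.
degraded_txn = {
    "description": "PAYMENT",      # raw bank description replaced with a stub
    "amount": 0.0,                 # placeholder amount
    "entry_type": "outgoing",
    "currency": "USD",
    "date": None,                  # missing date removes time-based signal
    "country": None,
    "account_holder_type": None,
}

def placeholder_ratio(txn: dict) -> float:
    """Fraction of fields that are missing or obviously stubbed."""
    stubs = {None, "", 0.0, "PAYMENT", "UNKNOWN"}
    return sum(v in stubs for v in txn.values()) / len(txn)
```

A quick check like `placeholder_ratio` can be a useful pre-flight step on your side before sending a dataset for enrichment, to catch columns that were accidentally nulled out during extraction.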
No data, no problem
Startups that have yet to launch often have no real data to benchmark with. We have two solutions here.
Synthetic data
At Ntropy we have built a transaction generator based on a small transformer that allows us to create synthetic datasets designed to replicate what a customer expects to receive. We can then use these synthetic datasets to run through our enrichment pipelines to help customers evaluate accuracy.
However, the accuracy of our models is always higher when real data is used, because that is what they have been trained on, so we strongly recommend using real data. So far we have not seen accuracy uplifts from training models on synthetic sets, though this may change in the future.
High level metrics from similar evaluations
The other option is that we can provide high level accuracy metrics from previous evaluations that we have run on data sets that are similar to what the customer expects to have.
For example, for a B2B lender focused on lending to US ecommerce companies, we can share accuracy metrics from previous processes we have run for customers, without sharing the actual underlying data. This gives enough confidence to start running the first tests in production and then return to the evaluation process with real data. The infrastructure to monitor and capture this data needs to be in place on the customer side; however, we also offer continuous QA and monitoring via our dashboard.
Small, medium or large
Evaluation processes exist on a spectrum, from fast and lightweight to very deep and thorough, and it is ultimately up to the customer to decide what fits their needs.
We essentially have two evaluation options for customers.
Lightweight
Anyone can sign up to test our capabilities and the accuracy of our products for free on our website with our self-serve set up option.
Each company can create an organization by clicking “Get Started”. Each dashboard account comes with a 14-day free trial and the ability to enrich 2,000 transactions and 50 PDF pages.
Once we have enriched the required transactions, the data and all of the output fields we produce can easily be exported and benchmarked against an existing solution.
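The benchmarking step above can be sketched as a simple join-and-compare between the exported enrichment output and an existing solution's output. This is an illustrative sketch, not an official tool; the `transaction_id` and `category` column names are assumptions you would adapt to your own exports.

```python
def compare_enrichment(ours, theirs, key="transaction_id", field="category"):
    """Join two lists of enrichment rows on `key` and report how often
    the chosen output `field` agrees between the two solutions."""
    theirs_by_id = {row[key]: row for row in theirs}
    matched = mismatched = missing = 0
    disagreements = []  # (id, our value, their value) for manual review
    for row in ours:
        other = theirs_by_id.get(row[key])
        if other is None:
            missing += 1
        elif row[field] == other[field]:
            matched += 1
        else:
            mismatched += 1
            disagreements.append((row[key], row[field], other[field]))
    total = matched + mismatched
    return {
        "agreement_rate": matched / total if total else 0.0,
        "matched": matched,
        "mismatched": mismatched,
        "missing_from_other": missing,
        "disagreements": disagreements,
    }
```

The `disagreements` list is usually the most valuable output: rather than trusting a single headline rate, reviewing a sample of disagreements by hand tells you which solution is actually right in the cases where they differ.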
This is generally suitable for customers who want a fast, lightweight way to test our enrichment output without engaging a customer success or sales team. Many engineers prefer this as it is quick and representative.
Extensive
If customers need a more bespoke evaluation process, we can run a more in-depth evaluation with custom requirements taken into account. This can involve larger datasets and gathering metrics on specific aspects, such as the quality of identifying intermediaries or long-tail merchants, or the quality and granularity of business categories.
Customers requiring a full benchmark usually have more internal resources and time to spend creating a representative dataset and evaluating our output and metrics thoroughly. Aligning on expectations upfront is key here. As long as you are looking for the same outcome and can agree on how it is measured, this process is worth investing in and sets both sides up for success.
Evaluation process constraints
How each customer wants to go about testing is somewhat unique to their use case. The main considerations and constraints of running an evaluation process are time, engineering resources, and data.
Data gathering
Data gathering is a crucial aspect of enriching customer data for comparison purposes. However, not all customers possess extensive datasets to send for enrichment. To address this challenge, two approaches can be employed:
- Assisting customers in collecting and curating representative datasets
- Providing alternative methods for data enrichment when customer data is limited
Even when customers have sufficient data, legal and logistical hurdles may arise. Large customers often need to obtain approval from their legal team before sharing any customer data. Once approved, resources must be allocated to construct a representative dataset and extract it from internal systems or databases.
The volume of data to be benchmarked directly impacts the timeline on both ends. While we can efficiently enrich and quality-check large datasets, customers may require additional time to evaluate the data themselves. Larger datasets inherently extend this process.
To streamline data gathering, a high-quality partner and vendor should:
- Offer guidance and tools to help customers collect and prepare datasets.
- Develop alternative enrichment methods for customers with limited data.
- Collaborate closely with customers' legal teams to expedite data sharing approvals.
- Optimize data extraction and transfer processes to minimize resource requirements.
- Provide clear expectations on timelines based on dataset size and complexity.
- Continuously refine their enrichment and quality control processes for efficiency.
Time and resources
Time and resources are two critical constraints that are closely intertwined throughout the evaluation process. Although we have the capability to complete a comprehensive evaluation within 48 hours, our pace is ultimately determined by our customers' readiness and availability. Preparing an appropriate and representative dataset for evaluation requires time and effort from the customer's side, including the extraction and transmission of the data.
Moreover, analyzing the evaluation results is a time-consuming task, the duration of which is directly proportional to the size of the dataset and the level of thoroughness desired by the customer. This analysis can range from a quick spot-check of the results based on intuition to a meticulous manual review of the dataset to establish a ground truth. While this process demands time and resources, it is an essential component of a robust evaluation.
As part of our comprehensive evaluation service, our human QA team manually labels up to 500 transactions to create our own ground truth. However, customers often prefer to conduct their own independent evaluation in parallel.
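As a rough illustration of what a ground-truth sample of a few hundred transactions buys you statistically, here is a minimal sketch using the standard normal approximation for a proportion's confidence interval. This is a generic statistical formula, not part of any vendor tooling, and the 500-transaction figure is just the example size mentioned above.

```python
import math

def accuracy_with_ci(correct: int, total: int, z: float = 1.96):
    """Point estimate of accuracy plus an approximate 95% confidence
    interval (normal approximation). Larger labeled samples shrink the
    interval, which is why ground-truth size matters."""
    p = correct / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - margin), min(1.0, p + margin)
```

For example, 450 correct labels out of 500 gives a point estimate of 90% accuracy with a confidence interval of roughly plus or minus 2.6 percentage points, which is usually tight enough for a go/no-go decision without labeling the entire dataset.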
To ensure a successful evaluation, we typically need to gain a deep understanding of the customer's specific use case and objectives. Additionally, establishing key success metrics for benchmarking is crucial. This collaborative process allows us to tailor our evaluation to the customer's unique requirements.
For highly motivated customers, the entire process from initial contact to the completion of the evaluation can be accomplished within a week. However, when resources are limited or competing priorities exist, the timeline may extend beyond that.
Next Up - Part Three
Part three will go into the specifics of the Ntropy evaluation process and what it looks like, to help you prepare.