What is acceptance testing for optimization models?

You’ve got a production decision model and an updated decision model. Should you ship the new model to prod? Will the new model meet your KPI thresholds? Acceptance testing answers these questions by providing a documented, repeatable decision-making process.

This is the first in a series of blog posts describing our thinking about different types of optimization model testing. We welcome your feedback in our community forum, and be sure to watch our tech talk on this topic!

Acceptance tests for optimization models verify whether business or operational requirements (e.g., KPIs and OKRs) are being met. Acceptance tests involve running an existing production model and a new, updated model against the same set of test data. You then look at the results and determine whether the new model is acceptable based on criteria identified beforehand. For a new route optimization model, acceptable may mean that all route durations are less than 4 hours, that there are no unassigned stops, or that route duration increased by 2% or less. The key is that you define what is acceptable.
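As a minimal sketch, criteria like these can be expressed as programmatic checks against a run's KPIs. The KPI names, thresholds, and values below are hypothetical stand-ins for whatever your model actually reports:

```python
# A minimal sketch of acceptance criteria for a route optimization model.
# The KPI names and thresholds are hypothetical; you define what is acceptable.

def is_acceptable(candidate: dict, production: dict) -> bool:
    """Return True if a candidate run's KPIs meet the predefined criteria."""
    return all([
        candidate["max_route_duration_hours"] < 4,   # every route under 4 hours
        candidate["unassigned_stops"] == 0,          # no unassigned stops
        # total route duration may grow by at most 2% relative to production
        candidate["total_duration_hours"] <= production["total_duration_hours"] * 1.02,
    ])

candidate_kpis = {"max_route_duration_hours": 3.6, "unassigned_stops": 0, "total_duration_hours": 17.8}
production_kpis = {"total_duration_hours": 17.5}
print("acceptable" if is_acceptable(candidate_kpis, production_kpis) else "unacceptable")
```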

If the new model passes the test (i.e., it’s acceptable), you can roll it out to production. Or you may choose to invest in more intensive real-time testing such as A/B, switchback, or shadow tests. If the new model doesn’t meet your criteria (i.e., it’s unacceptable), then it’s back to the drawing board, and you likely saved your drivers from a bad day and customers from melted ice cream.

Now, some of you may know some form of acceptance tests by other names: compliance testing, conformance testing, regression testing, visual regression testing, etc. Depending on your definition, there may be some similarities (e.g., regression testing is related, but typically part of local development) or it may just come down to word choice preference. 

All tests have a treatment and a control, and acceptance testing is no exception. However, acceptance testing goes beyond ad hoc exploration. It is a dedicated flavor of testing that provides a user interface for defining, managing, and making decisions based on predefined acceptance criteria. Acceptance testing enables you to understand the tradeoffs inherent in your system. For example, let’s say you change a model’s value function to prioritize balancing stops across drivers, but doing so results in longer routes. Acceptance testing helps surface these types of side effects. It’s up to you to decide whether that’s an acceptable tradeoff. We think this is what makes it a dedicated and defined testing experience within a larger testing ecosystem, and it’s what we’ll explore further in this post.
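As a quick illustration of that stop-balancing tradeoff, here's a minimal sketch comparing two hypothetical plans. All of the numbers are made up:

```python
# Two hypothetical plans: stops per route and per-route duration in hours.
# The numbers are made up to illustrate the tradeoff.
before = {"stops_per_route": [12, 4, 8], "route_hours": [3.0, 1.2, 2.1]}
after = {"stops_per_route": [8, 8, 8], "route_hours": [2.9, 2.8, 3.4]}  # balanced, but longer

for name, plan in [("before", before), ("after", after)]:
    spread = max(plan["stops_per_route"]) - min(plan["stops_per_route"])
    longest = max(plan["route_hours"])
    print(f"{name}: stop spread={spread}, longest route={longest}h")

# The change balances stops (spread 8 -> 0) at the cost of a longer longest
# route (3.0h -> 3.4h). Whether that tradeoff is acceptable is up to you.
```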

What’s an example of acceptance testing?

Imagine you work at a farm share delivery company. You’re in the business of picking up produce boxes from local farmers and delivering them to customers’ homes using a fleet of 5 vans. The vans have limited capacity and shift availability. You have a decision model that automatically optimizes the van delivery routes because it’s more efficient than manually creating the routes. 

One day, you realize that the drivers are consistently working overtime because your production model doesn’t account for the amount of time needed to service a stop. The solution, you think: add in service times and, voila, overtime problem solved. But if you do that, will there be other unacceptable consequences? 

In this case, there will be one: unassigned stops, a KPI your business cares about. While the new model is a better representation of reality, the planned routes are longer, which means drivers have less time to get to all the stops, resulting in some stops not getting assigned to a route. The new model cannot ship as-is or without other operational adjustments.
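To make the farm share example concrete, here's a hedged sketch of comparing the two models' KPIs from runs on the same test data. The KPI values and criteria are illustrative, not real output:

```python
# Hypothetical KPI summaries from running both models on the same test data.
prod_kpis = {"overtime_minutes": 45, "unassigned_stops": 0}
candidate_kpis = {"overtime_minutes": 0, "unassigned_stops": 3}

# Acceptance criteria defined ahead of time by the business.
criteria = {
    "overtime_minutes": lambda v: v == 0,  # drivers should not work overtime
    "unassigned_stops": lambda v: v == 0,  # every stop must land on a route
}

for name, check in criteria.items():
    before, after = prod_kpis[name], candidate_kpis[name]
    status = "PASS" if check(after) else "FAIL"
    print(f"{name}: {before} -> {after} [{status}]")

# The candidate fixes overtime but fails on unassigned stops, so it
# cannot ship as-is.
```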

Acceptance testing allows you to play updates out before you commit to any changes in an operational environment. For example, in the same way that you wouldn’t want to ship code that doesn’t pass all of your GitHub Actions, you don’t ship a model that doesn’t pass acceptance testing. Any changes to the model are evaluated against the same predefined business rules (e.g., no unassigned stops), so you can make decisions quickly and consistently. In practice, product managers set acceptance criteria based on business priorities and developers iterate as they see fit with those criteria as the end goal.

Why do acceptance testing?

Acceptance testing is worth doing when you want to build quality assurance into shipping a decision model to production and accelerate model development. In other words, you want to ensure you don’t negatively impact operations or waste development cycles going down unnecessary rabbit holes.

When you look at the farm share example above, you want to make sure that you don’t deploy a new model that solves the overtime problem, but results in unassigned stops. Acceptance testing is a methodical way of catching unacceptable outcomes like these earlier in the process and allowing for iteration to arrive at a result that is, well, acceptable. 

When done well, acceptance testing offers a constructive collaboration point between technical and non-technical stakeholders, in addition to a documented, repeatable decision-making process. 

When do you need acceptance testing?

Acceptance testing is appropriate when you have defined KPIs for a given business area, multiple stakeholders are involved in the decision-making process, and you plan to run these types of tests with some regularity.

Defined metrics or KPIs are core to acceptance testing: they’re the go/no-go criteria for your model. As we saw in the farm share example, the new model wasn’t acceptable as-is. This meant the team had to explore and propose alternatives: allow drivers to work overtime or add vehicles to the fleet.

This leads us to multiple stakeholders needing to weigh in on the tradeoffs that go along with how to proceed. Operators gather feedback from the drivers (e.g., overtime is doable, but not preferred), product owners work with devs to drive business efficiency (e.g., develop new models that improve operational outcomes), finance wants to understand expenditure changes (e.g., overtime vs. hiring costs), and so on.

These types of decision flows are likely to repeat. There may be changes to shift lengths or fleet composition that warrant going through the same process, using a standard input set to make sure a new model is acceptable to run in production. 

How is acceptance testing usually done?

There are several ways to perform acceptance testing. The most common way we’ve seen people do it is with some level of bespoke, manual analysis that summarizes sometimes hard-to-parse output and compares (often inconsistent sets of) KPIs between a current decision model and a new one. This involves considering how the data is gathered and how it is analyzed. All of these approaches have their challenges.

Another way to think about this is through the lens of some other form of analysis, such as a visual diff or diff-in-diff regression, that’s used to understand and verify results. This usually involves comparing a set of metrics between current and proposed runs; running a script to generate visuals (e.g., maps in routing use cases); evaluating the significance of the treatment effect; generating what is essentially a mini research paper with methods, results, and recommendations; and navigating hurdles to get the recommended changes or model updates into the pipeline.
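The metric-comparison step often looks something like the following sketch. The metric names and values are hypothetical:

```python
# Sketch of a typical bespoke comparison: compute deltas between the current
# and proposed runs for each KPI. Metric names and values are hypothetical.
current = {"total_duration_hours": 17.5, "stops_assigned": 120, "vehicles_used": 5}
proposed = {"total_duration_hours": 18.2, "stops_assigned": 117, "vehicles_used": 5}

print(f"{'metric':<22}{'current':>10}{'proposed':>10}{'delta %':>9}")
for metric in current:
    before, after = current[metric], proposed[metric]
    delta = (after - before) / before * 100
    print(f"{metric:<22}{before:>10}{after:>10}{delta:>8.1f}%")
```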

It is also possible to leverage GitHub, GitLab, or Bitbucket functionality to create a series of scripted actions that add some automation to the process. The actions could, for example, include scripts to run a candidate model (in a branch) against the current production model (in a stable or main branch), summarize and compare the results of each, and produce an artifact (a report) that can be downloaded and inspected before deciding to merge/accept the changes.
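Such an action might invoke a script along these lines. The commands, file names, and the location of KPIs in the model output are all assumptions for illustration, not a prescribed setup:

```python
# Hypothetical CI script: run the candidate model (from a branch) and the
# production model (from main) on the same input, then write a comparison
# report that CI uploads as an artifact. The commands, file names, and the
# location of KPIs in the output are assumptions for illustration.
import json
import subprocess

def run(model_dir: str, input_file: str, output_file: str) -> dict:
    # Stand-in for however your decision model is actually executed.
    subprocess.run(
        ["python", f"{model_dir}/main.py", "--input", input_file, "--output", output_file],
        check=True,
    )
    with open(output_file) as f:
        return json.load(f)["statistics"]  # assumed KPI location in the output

prod_stats = run("production-model", "test_data.json", "prod_out.json")
cand_stats = run("candidate-model", "test_data.json", "cand_out.json")

report = {k: {"production": v, "candidate": cand_stats.get(k)} for k, v in prod_stats.items()}
with open("acceptance_report.json", "w") as f:
    json.dump(report, f, indent=2)  # downloaded and reviewed before merging
```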

There are also online versus offline considerations for each of these paths that we’ll save for another post. The bottom line is that, in general, acceptance testing is a fairly custom and manual process that’s hard to repeat and complicated to manage.

What’s next for acceptance testing?

Our first cut at a robust acceptance testing experience in Nextmv is shipping in the near future. We’re excited about it because it minimizes the amount of manual setup, maximizes the repeatability of the testing workflow, and lets you define your KPIs and visualize the results.

To learn more about our testing framework, register for this tech talk. In the meantime, if you’d like to get hands-on with our batch experimentation capabilities, create a free Nextmv account to get started, leave questions and feedback in our community forum, or reach out to us directly to talk with our team.
