Have you ever unintentionally pushed a change to production that subsequently broke downstream operations? I have. I pushed a fix, operations recovered, and I learned from it. It’s an experience that’s stayed with me. What would I have done differently? Test, then test some more to de-risk the production rollout, and then keep testing to fine-tune outcomes even further.
But the infrastructure and tools required for testing decision models are often bespoke. My former self would have had to build and maintain that tooling, which would not have been realistic for an individual and would have been time-consuming for the team as a whole.
At Nextmv, we’re looking to change that by providing DecisionOps tooling that reduces risk by introducing different kinds of tests at particular points in the model development cycle. Inspired by Gitflow, we enable a CI/CD workflow that brings together different testing types, GitHub Actions, and model version management capabilities that make rolling out new changes (and improving outcomes) a less precarious challenge.
In this blog post, I’ll walk through a capacitated vehicle routing problem (CVRP) example using a Google OR-Tools model. (Note: our platform has an extensible model/solver layer, so when you go to try this yourself, you can apply this approach to other models and solvers as well.)
I have a decision model. I want to change it.
I’m going to switch hats. I’m not a decision scientist at Nextmv anymore. I’m an algorithm developer at The Farm Share Company, a delivery business that picks up produce and dairy items from farms and delivers them to customers’ homes while minimizing time on road. I am part of a team responsible for a number of decision algorithms — routing, scheduling, order fulfillment — and I was tasked with making a change to the routing algorithm.
This routing model is built using Google’s OR-Tools and accounts for inputs such as vehicles and stops (pickup and dropoff locations) and various constraints (e.g., vehicle capacity). Operators notified my team that they constantly (and manually) have to adjust assignments because routes are too long, and that it’s confusing when some vehicles in our fleet go unused. Silly me! We never modified the default maximum route duration. So the change I want to introduce is a new parameter, max.travel.duration, that controls the maximum duration of a route.
How can I do this with Nextmv? Here’s the flow:
- Open a pull request (PR) and change the code.
- Kick off a CI/CD process with an acceptance test.
- Run a shadow test using production data.
- Tune the constraint further using historical data and testing.
- Flip the switch and push our new model to production.
This might feel like a lot, but we’ll fly right through it as we look to gain confidence that our changes won’t negatively impact business operations before we push them live.
Pull request → acceptance test
Let’s begin by opening up a PR that sets a maximum duration a vehicle can travel and a penalty for not visiting a stop (in case the problem would otherwise turn out to be infeasible).
Next, we’ll prepare the Nextmv console to handle the PR merge by creating configuration options within the development instance of our model: a duration (i.e., how long the solver will run for) of 5 seconds, a maximum travel duration of 2,700 seconds (45 minutes), and an unplanned penalty of 50,000 seconds (which gets added to our value function whenever a stop goes unassigned).
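As a sketch, the development instance’s options could be represented like this — the option keys are hypothetical stand-ins modeled on the values above, not necessarily the exact names your model exposes:

```python
# Hypothetical instance configuration for the development instance.
# Option names are illustrative; check your own model's option names.
development_options = {
    "duration": 5,                # solver run time, in seconds
    "max.travel.duration": 2700,  # route duration cap, in seconds (45 minutes)
    "unplanned.penalty": 50000,   # objective penalty per unassigned stop, in seconds
}

print(development_options["max.travel.duration"] / 60)  # cap in minutes
```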
With that configured and looking good, we’re ready to merge our PR (given I have approvals from my teammates, of course).
This merge kicks off a CI/CD workflow using a GitHub Action we’ve prepared. All checks pass.
This workflow includes running an acceptance test on our updated decision model using historical data, comparing its output to our existing decision model in production.
The results tell us that when we set the maximum route duration in our candidate model to 45 minutes, several of our business metrics change. The value of the objective function decreases, meaning the overall route duration of the solution improves. More importantly, given that vehicles now have a realistic duration limit, we are using more of them, each with, of course, a shorter travel duration.
There are a couple of red check marks, indicating that the model change is inducing unexpected behavior. By enforcing shorter routes, the minimum travel duration decreases, along with the number of stops. This is reasonable; it just means that we have to adjust the expectations of the acceptance tests we run to reflect the new model behavior. Or, as we would say in software development: “there is a bug in the test config.”
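Conceptually, each acceptance check compares a candidate metric against the baseline and a stated expectation of which direction it should move. Here’s a sketch of that logic — this is not Nextmv’s implementation, and the metric values are made up for illustration:

```python
# Sketch of acceptance-check logic: compare candidate vs. baseline metrics
# against a per-metric expectation ("increase" or "decrease").
def check(direction, baseline, candidate):
    """Return True (green check) if the change matches the expectation."""
    if direction == "increase":
        return candidate >= baseline
    return candidate <= baseline

# metric name: (expected direction, baseline value, candidate value)
expectations = {
    "value": ("decrease", 41000, 35500),             # objective should improve
    "vehicles_activated": ("increase", 3, 5),        # more vehicles in use
    "min_travel_duration": ("increase", 1200, 900),  # red check: routes got shorter
}

results = {
    name: check(direction, baseline, candidate)
    for name, (direction, baseline, candidate) in expectations.items()
}
print(results)
```

A red check here means the stated expectation no longer matches intended behavior — flipping `min_travel_duration` to `"decrease"` is the “fix the bug in the test config” step.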
After manually auditing the checks, so far, so good. But I’m not quite ready to call it production-ready. I want to run some tests using production data and conditions, but not impact my real-world operations. It’s time for a shadow test.
Acceptance test → shadow test
We ran an acceptance test using representative historical data from an input set we created. Now we’ll run a shadow test using online production data by configuring it in the Nextmv console. We’ll compare two instances — development versus production — and set our end criteria to be 14 runs.
So far so good. Tests are running for both my development and production instances.
Once all 14 runs have completed, I can review the results across the key metrics I care about, such as vehicles activated, maximum stops in vehicle, max travel duration, and minimum stops per vehicle.
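These per-run metrics are straightforward to derive from a run’s route assignments. A small sketch with illustrative data (each tuple is one vehicle’s stop count and travel duration in seconds):

```python
# Compute the review metrics from one run's routes (illustrative data).
# Each entry is (stop_count, travel_duration_seconds) for one vehicle.
routes = [(6, 2400), (4, 2650), (5, 2100), (0, 0)]  # last vehicle unused

active = [r for r in routes if r[0] > 0]  # only vehicles that got stops
metrics = {
    "vehicles_activated": len(active),
    "max_stops_in_vehicle": max(stops for stops, _ in active),
    "min_stops_in_vehicle": min(stops for stops, _ in active),
    "max_travel_duration": max(duration for _, duration in active),
}
print(metrics)
```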
Everything is looking good here. As expected, more vehicles are being used when the constraint is enforced in the development instance. More importantly, there are no error messages and the runs are succeeding. I feel pretty confident about deploying these changes to production without impacting our business metrics or encountering stability issues.
Shadow test → deploy to production
Right now, our production instance is running off of our v0 version (which doesn’t have the max.travel.duration constraint changes).
To deploy our updated model to production, all we need to do is point to the new version that does include our constraint changes.
And just like that: our new model is now running in production — and we don’t have heartburn or concerns about negatively impacting business operations.
If you want to double check, you can see both our development and production instances reference the same model version.
Excellent. That’s all there is to it. But what if we wanted to refine our model just a bit more…
Refinement and tuning
Up to this point, I settled on a max.travel.duration of 2,700 seconds (or 45 minutes) and have felt good about this change. Did I have a specific reason to use that number? No. Would a slight adjustment yield even better results? Maybe! Let’s test it out using a batch experiment!
In this case, I can create two more instances of my CVRP model, each with a slightly different max.travel.duration: one with 30 minutes and another with 60 minutes. Remember, our Production instance has this field set to 45 minutes.
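The experiment is essentially a parameter sweep over the three duration caps. To build intuition for what to expect, here’s a toy sweep with a stand-in solver — the `solve` function below is a deliberately simplistic fake (fixed fleet, uniform stop cost), not the production model, but it reproduces the trade-off we care about: tighter caps drop stops and incur the unplanned penalty.

```python
# Toy parameter sweep over the three max.travel.duration values from the
# experiment. The solver below is a stand-in, not the real CVRP model.
def solve(max_travel_duration, stops=20, vehicles=2, seconds_per_stop=300):
    """Pack stops into routes no longer than the cap; drop the rest."""
    per_route = max_travel_duration // seconds_per_stop
    planned = min(stops, per_route * vehicles)
    unplanned = stops - planned
    # Value function: travel time plus 50,000s per unplanned stop.
    value = planned * seconds_per_stop + unplanned * 50000
    return {"planned": planned, "unplanned": unplanned, "value": value}

for cap in (1800, 2700, 3600):  # 30, 45, and 60 minutes
    print(cap, solve(cap))
```

Even in this toy version, the 60-minute cap leaves no stops unplanned while the 30-minute cap pays the penalty repeatedly, which is the same pattern the real experiment surfaces below.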
I’ll use a different input set to run against these three instances, then configure and start the experiment using the CLI.
And voilà: our batch experiment is started and viewable in the Nextmv console.
Once completed, I can review the results across several types of custom metrics.
In just a few minutes, we got insight into how to fine-tune our decision model even further. The large numbers on the y axis tell me that there are unplanned stops incurring the unplanned penalty, caused by short routes that cannot accommodate all stops in the input. Setting the parameter to 1 hour appears to yield a more compact distribution of the value function with lower values, given that it is the option that allows the longest route duration for servicing stops.
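Those big y-axis numbers follow directly from how the penalty enters the value function. A rough back-of-the-envelope with illustrative numbers:

```python
# Each unplanned stop adds the 50,000-second penalty to the value function,
# which dwarfs the travel-duration term (numbers are illustrative).
travel_seconds = 9500      # total on-road time across all routes
unplanned_stops = 3        # stops dropped under a tight duration cap
unplanned_penalty = 50000  # seconds per dropped stop

value = travel_seconds + unplanned_stops * unplanned_penalty
print(value)  # the penalty term dominates the objective
```

A handful of dropped stops inflates the value function by orders of magnitude more than any routing improvement could recover, which is why the tighter caps look so much worse in the plot.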
The next step would be to share these results with Ops and discuss sensible max travel duration values we can set to improve the operation, suggesting we start at 1 hour and take it from there.
Switching hats again, back to my usual self! In this post, you’ve seen how to go from pull request to production rollout using a CI/CD workflow made possible with DecisionOps tooling. At each stage of the process, we used different testing techniques to reduce risk and gain confidence in our model changes. As a result, we could iterate and improve upon our CVRP model — and spend more time on building and tuning decision models rather than building decision tools.
May your solutions be ever improving 🖖