Create optimized PagerDuty on-call rotation schedules with decision automation

I retired our on-call scheduling spreadsheet — and you can too. Here’s how I built a custom decision model that generates and sends optimized schedules to the PagerDuty API.

If you use an incident response platform such as PagerDuty, you know that building on-call schedules is challenging. Everyone’s got preferences and hard constraints. For example, on our team, Chris likes two-week stretches of on-call shifts whereas Ronessa can’t be on-call during her daughter’s swim class during the week. It’s hard and time-consuming to find good on-call shift solutions that maximize team happiness. And it only gets harder at scale, especially if you’re doing it manually.

Whether your team is big or small, I suspect that many of us have *that* on-call spreadsheet that we want to retire. So I decided to use Nextmv to create a custom decision model for creating optimized on-call schedules to make that possible. After all, Nextmv exists to make modeling and solving problems like these easier and more efficient in your everyday operations. 

In this blog post, I’ll show you my process from Nextmv installation to sending my optimized schedule to PagerDuty. At a high level, it looks like this: 

  • Set up my local environment
  • Define my schema with input/output
  • Define the solver function
  • Run the model to make scheduling decisions
  • POST schedule updates through PagerDuty API 

You can follow along by creating a free Nextmv account and checking out our documentation. (You can also just jump right into using the PagerDuty scheduling template by initializing it in the Nextmv CLI.) Here’s how I did it.

Step 1: Set up my dev environment

To get started, I logged into my Nextmv account and got the latest versions of the Nextmv CLI and SDK with a few commands. I then initialized the new app template, which provided a bare bones setup with placeholders and descriptions for how to create a brand new custom decision model using Nextmv’s Store API. 

At this point, I had everything I needed to start building my new model in my local dev environment. This workflow only took a few minutes from start to finish. 

Step 2: Define my schema

The first step in defining a decision model is to represent the input and output schema. We think of a schema as a contract the decision model has with upstream and downstream systems. Nextmv decision models are written in Go and are JSON in, JSON out. But non-Go developers fear not! You can pick this up very easily with zero Go programming knowledge (trust me, I speak from personal experience here).

In order to define a schema, I started out the way I recommend every customer begin: creating an input JSON. This forced me to think about how this decision model fits into my larger planning process. It’s OK to start small at this step: add 1 or 2 users, with demo values for the properties you eventually want to populate for real users. From there you can test and then grow your dataset. This was my process:

1. Set the schedule. I wanted to update the schedule every 2 weeks, so I added a schedule_start and schedule_end in my input JSON.

2. Construct the user data. PagerDuty has its own API for updating or overriding schedules. I wanted to make sure I was passing in the data I’d need in order to send this directly to the API. (I am running this manually for now, so I didn’t want any post-processing scripts to cobble that information together.) This meant I needed to pass in the PagerDuty id and type fields on each user as well.

3. Define my business logic. I specified the array of unavailable days and the array of preferences, both in RFC3339 format for ease of passing into PagerDuty. Here’s an example of that input JSON for a couple of users:

{
 "schedule_start": "2022-10-10T09:00:00-04:00",
 "schedule_end": "2022-10-23T09:00:00-04:00",
 "users": [
   {
     "name": "chris",
     "id": "P5K3K4K",
     "type": "user_reference",
     "unavailable": ["2022-10-11T09:00:00-04:00", "2022-10-14T09:00:00-04:00"],
     "preferences": {
       "days": [
         "2022-10-10T09:00:00-04:00",
         "2022-10-17T09:00:00-04:00",
         "2022-10-18T09:00:00-04:00",
         "2022-10-19T09:00:00-04:00"
       ]
     }
   },
   {
     "name": "david",
     "id": "P6H94Y5",
     "type": "user_reference",
     "unavailable": [
       "2022-10-14T09:00:00-04:00",
       "2022-10-15T09:00:00-04:00",
       "2022-10-16T09:00:00-04:00",
       "2022-10-17T09:00:00-04:00",
       "2022-10-18T09:00:00-04:00",
       "2022-10-19T09:00:00-04:00",
       "2022-10-20T09:00:00-04:00",
       "2022-10-21T09:00:00-04:00",
       "2022-10-22T09:00:00-04:00",
       "2022-10-23T09:00:00-04:00"
     ]
   }
  ]
}

Once I had an input file, I could move on to representing the schema of my Nextmv application. This was just a matter of defining the Go structs, field names, and types corresponding to the JSON file. Here’s what that looked like:

// Input for the pager duty scheduling problem. We have
// pager duty users that need to be assigned to days between the schedule start
// date and the schedule end date.
type input struct {
   ScheduleStart time.Time `json:"schedule_start"`
   ScheduleEnd   time.Time `json:"schedule_end"`
   Users         []user    `json:"users"`
}
 
// Users have a name, id, type, unavailable dates, and preferences.
type user struct {
   Name        string      `json:"name,omitempty"`
   Id          string      `json:"id,omitempty"`
   Type        string      `json:"type,omitempty"`
   Unavailable []time.Time `json:"unavailable,omitempty"`
   Preferences preference  `json:"preferences,omitempty"`
}
 
// Preference of days.
type preference struct {
   Days []time.Time `json:"days"`
}
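
As a quick sanity check on the schema, here is a minimal standalone sketch (standard library only, outside the Nextmv runner) that decodes a trimmed copy of the input JSON into these structs. The decodeInput helper and the sampleInput constant are illustrative names I made up for this sketch; they are not part of the template.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// These types mirror the structs above so the sketch compiles on its own.
type input struct {
	ScheduleStart time.Time `json:"schedule_start"`
	ScheduleEnd   time.Time `json:"schedule_end"`
	Users         []user    `json:"users"`
}

type user struct {
	Name        string      `json:"name,omitempty"`
	Id          string      `json:"id,omitempty"`
	Type        string      `json:"type,omitempty"`
	Unavailable []time.Time `json:"unavailable,omitempty"`
	Preferences preference  `json:"preferences,omitempty"`
}

type preference struct {
	Days []time.Time `json:"days"`
}

// sampleInput is a trimmed version of the input JSON shown earlier.
const sampleInput = `{
  "schedule_start": "2022-10-10T09:00:00-04:00",
  "schedule_end": "2022-10-23T09:00:00-04:00",
  "users": [
    {
      "name": "chris",
      "id": "P5K3K4K",
      "type": "user_reference",
      "unavailable": ["2022-10-11T09:00:00-04:00"],
      "preferences": {"days": ["2022-10-10T09:00:00-04:00"]}
    },
    {"name": "david", "id": "P6H94Y5", "type": "user_reference"}
  ]
}`

// decodeInput (hypothetical helper) parses raw JSON into the input struct.
// encoding/json parses time.Time fields from RFC 3339 strings.
func decodeInput(raw []byte) (input, error) {
	var in input
	err := json.Unmarshal(raw, &in)
	return in, err
}

func main() {
	in, err := decodeInput([]byte(sampleInput))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(in.Users), in.ScheduleStart.Format(time.RFC3339))
}
```

If a field name in the JSON doesn't match its struct tag, the field is silently left at its zero value, so a check like this is a cheap way to catch typos early.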

Next, I moved on to the output representation. For this, I went through the same input JSON exercise described above by starting with an idea of what I wanted my output JSON to look like. I knew I wanted my output to be a copy / paste into a curl command for POSTing to the PagerDuty schedules API. Since some people prefer uninterrupted two-week on-call schedules, I have a two-week rotation set up already in PagerDuty. This meant I wouldn’t be creating a new schedule, but rather, overriding an existing schedule during rotations for individuals who prefer a more flexible schedule.

Because I’m hitting the overrides endpoint in PagerDuty, I structured my output as an array of overrides, with each override including a start and end time, a user (which includes the id and type PagerDuty requires) and a time zone that I set to UTC for everyone.

"overrides": [
     {
       "start": "2022-10-10T09:00:00-04:00",
       "end": "2022-10-11T09:00:00-04:00",
       "user": { "name": "david", "id": "P6H94Y5", "type": "user_reference" },
       "time_zone": "UTC"
     },
     {
       "start": "2022-10-11T09:00:00-04:00",
       "end": "2022-10-12T09:00:00-04:00",
       "user": { "name": "lars", "id": "P7R5EY5", "type": "user_reference" },
       "time_zone": "UTC"
     },
     {
       "start": "2022-10-12T09:00:00-04:00",
       "end": "2022-10-13T09:00:00-04:00",
       "user": { "name": "marius", "id": "PMN3512", "type": "user_reference" },
       "time_zone": "UTC"
     },
     {
       "start": "2022-10-13T09:00:00-04:00",
       "end": "2022-10-14T09:00:00-04:00",
       "user": { "name": "david", "id": "P6H94Y5", "type": "user_reference" },
       "time_zone": "UTC"
     },
     {
       "start": "2022-10-14T09:00:00-04:00",
       "end": "2022-10-15T09:00:00-04:00",
       "user": {
         "name": "ronessa",
         "id": "P0IYDYV",
         "type": "user_reference"
       },
       "time_zone": "UTC"
     },

Once I had the output JSON structure in mind, I was able to define my struct and the corresponding fields and types in code.

// Provide the start, end, user, and timezone of the override to work
// with the PagerDuty API.
type override struct {
   Start    time.Time    `json:"start"`
   End      time.Time    `json:"end"`
   User     assignedUser `json:"user"`
   TimeZone string       `json:"time_zone"`
}
 
// An assignedUser has a name, id, and type for PagerDuty override.
type assignedUser struct {
   Name string `json:"name,omitempty"`
   Id   string `json:"id,omitempty"`
   Type string `json:"type,omitempty"`
}
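
To see how these structs map onto the JSON shape, here is a small standalone sketch that round-trips one override through encoding/json. The payload wrapper and the encodeOverrides, decodeOverrides, and sampleOverride helpers are hypothetical names used only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// These types mirror the structs above so the sketch compiles on its own.
type override struct {
	Start    time.Time    `json:"start"`
	End      time.Time    `json:"end"`
	User     assignedUser `json:"user"`
	TimeZone string       `json:"time_zone"`
}

type assignedUser struct {
	Name string `json:"name,omitempty"`
	Id   string `json:"id,omitempty"`
	Type string `json:"type,omitempty"`
}

// payload (hypothetical) is the {"overrides": [...]} envelope.
type payload struct {
	Overrides []override `json:"overrides"`
}

// encodeOverrides renders overrides as indented JSON.
func encodeOverrides(overrides []override) ([]byte, error) {
	return json.MarshalIndent(payload{Overrides: overrides}, "", "  ")
}

// decodeOverrides parses the envelope back into a slice of overrides.
func decodeOverrides(raw []byte) ([]override, error) {
	var p payload
	err := json.Unmarshal(raw, &p)
	return p.Overrides, err
}

// sampleOverride builds the first override from the example output.
func sampleOverride() override {
	start := time.Date(2022, 10, 10, 9, 0, 0, 0, time.FixedZone("", -4*60*60))
	return override{
		Start:    start,
		End:      start.AddDate(0, 0, 1),
		User:     assignedUser{Name: "david", Id: "P6H94Y5", Type: "user_reference"},
		TimeZone: "UTC",
	}
}

func main() {
	raw, err := encodeOverrides([]override{sampleOverride()})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```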

To recap so far: we’ve defined our schema by representing the data that we expect will be coming in and out of our model. Now we’re ready to dive into the magic. 

Step 3: Define the solver function

Welcome to the juiciest part: the solver definition. This is where we define our model and business logic. Solvers start from the current state of the world (the root store) and generate possible plans for the operation. We tell the solver which plans are operationally valid (Validate), how to compare plans (Value), how to generate new plans (Generate), and how to express plans in the output (Format). We call this part of our API a store. A store is a collection of variables and their values that defines a plan. In this model, our stores hold the days of the schedule and their potential assignees.

All of this magic is nested underneath what might appear as one little unassuming function:

func solver(input input, opts store.Options) (store.Solver, error) {
… 
}

We use all the functions on solver to help us generate and search through plans.

Root state

In the solver function, I started by initializing an empty root store, the starting point of our model, where no users have been assigned to any days in our schedule.

   // We start with an empty root pagerDuty store.
   pagerDuty := store.New()

Then, I added a starting state to it. Ultimately, I needed to get to a point where I had a single assigned user for each day in the schedule period. I accomplished this in our store API by first creating a domain called users representing the indices in the users input array. Domains are a useful modeling paradigm Nextmv provides. I like to think of domains as a much more performant mechanism for referencing slices of things, but of course, if I wanted to, I could’ve just referenced the slice of users itself.

   // Next, we add a starting state to store.
   // We create a domain for each day and initialize each day to all users.
   ndays := int(input.ScheduleEnd.Sub(input.ScheduleStart).Hours()/24) + 1
   users := model.NewDomain(model.NewRange(0, len(input.Users)-1))

Then, I created a variable days on the store which copies that domain of users 14 times (ndays) to create a slice of domains (or user indices) for each day I’m trying to assign. This days slice is what I’m ultimately trying to manipulate in the solver to get down to a single user assigned per day. In other words, my root state begins by assuming every user is available on every day and I trim down from here.

   days := store.Repeat(pagerDuty, ndays, users)
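
To make the root state concrete without the SDK, here is a plain-slice sketch of the same idea: count the days in the schedule, then give every day a candidate list holding every user index. This is only a stand-in for illustration (numDays, rootDays, and sampleSchedule are names I invented here), not how domains are implemented.

```go
package main

import (
	"fmt"
	"time"
)

// numDays counts the days covered by the schedule, including both the
// start day and the end day.
func numDays(start, end time.Time) int {
	return int(end.Sub(start).Hours()/24) + 1
}

// rootDays returns one candidate list per day, each holding every user
// index. It is a plain-slice stand-in for repeating a domain once per day.
func rootDays(ndays, nusers int) [][]int {
	days := make([][]int, ndays)
	for d := range days {
		candidates := make([]int, nusers)
		for u := range candidates {
			candidates[u] = u
		}
		days[d] = candidates
	}
	return days
}

// sampleSchedule returns the start and end from the example input.
func sampleSchedule() (time.Time, time.Time) {
	zone := time.FixedZone("", -4*60*60)
	start := time.Date(2022, 10, 10, 9, 0, 0, 0, zone)
	end := time.Date(2022, 10, 23, 9, 0, 0, 0, zone)
	return start, end
}

func main() {
	start, end := sampleSchedule()
	ndays := numDays(start, end)
	days := rootDays(ndays, 5)
	fmt.Println(ndays, len(days), days[0]) // prints: 14 14 [0 1 2 3 4]
}
```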

Next, I created some variables for use later in the value function. I wanted to accomplish two things: 1) balance assigned days across users to make a more fair schedule and 2) maximize happiness by trying to fulfill preferences as much as possible. I’ll cover this more later, but for now, all we do is add these variables to the store with starting values of 0 for each user and we will manipulate them later.

   // Next, we create an `assignedDays` variable to keep track of number of days per user.
   // This is so we can balance assignments across users later.
   // We initialize day length to 0 for all users.
   assignedDays := store.NewSlice[int](pagerDuty)
   for range input.Users {
       pagerDuty = pagerDuty.Apply(assignedDays.Append(0))
   }
   // We also want to maximize worker happiness by fulfilling as many
   // preferences as possible.
   happiness := store.NewSlice[int](pagerDuty)
   for range input.Users {
       pagerDuty = pagerDuty.Apply(happiness.Append(0))
   }

Then, I made some modifications to the root schedule. I looped through each day and then through each user and if the user is unavailable on that day, I removed that user’s index from the domain of available users on that day using the days.Remove(..) logic below. Any operations on store variables such as .Remove will return a slice of changes. We must then apply those changes to the root store in order to update it. 

You’ll also see that while looping through days and then users, I took the opportunity to also populate a preferenceMap which essentially creates a lookup table of date index to a slice of user indices for the users who prefer to be on call that day. 

While looping through days, I also checked whether we got ourselves into an infeasible situation in which there is no one available to work on a particular day. If that were the case, I’d want to just exit with an error because I’d need to handle that one outside of this app.  

Lastly, I checked whether there are any days with exactly 1 user available. If that’s the case, the decision on who to assign is simply to assign the only person available. So I did just that. I assigned that person and then incremented their assignedDays by 1 and incremented their happiness by 1 if that was a day they preferred to be on call.

   // As a first step, we can now remove the users that are unavailable for
   // certain full days and apply the changes. This results in a new (partial)
   // schedule.
 
   // We also create maps of date -> preferred users and update store with
   // unavailable dates removed from users.
   preferenceMap := map[int][]int{}
 
   date := input.ScheduleStart
   dateIndex := 0
   for !date.After(input.ScheduleEnd) {
       for userIndex, user := range input.Users {
           for _, unavailable := range user.Unavailable {
               if date.Equal(unavailable) {
                   pagerDuty = pagerDuty.Apply(days.Remove(dateIndex, []int{userIndex}))
               }
           }
           for _, preference := range user.Preferences.Days {
                if date.Equal(preference) {
                   preferenceMap[dateIndex] = append(preferenceMap[dateIndex], userIndex)
               }
           }
       }
 
       // If no one is available on a day, the problem is infeasible.
       if days.Domain(pagerDuty, dateIndex).Empty() {
            return nil, fmt.Errorf("problem is infeasible: no user is available on day index %d", dateIndex)
       }
       assignedUser, dayAssigned := days.Domain(pagerDuty, dateIndex).Value()
       // If there is only 1 person available, assign that person to the day.
       if dayAssigned {
           userAssignedDays := assignedDays.Get(pagerDuty, assignedUser)
 
           // Add 1 to their day length.
           pagerDuty = pagerDuty.Apply(assignedDays.Set(assignedUser, userAssignedDays+1))
 
           // Add 1 to their happiness score if they preferred to work this day.
           if preferredUsers, ok := preferenceMap[dateIndex]; ok {
               for _, p := range preferredUsers {
                   if p == assignedUser {
                       pagerDuty = pagerDuty.Apply(happiness.Set(p, happiness.Get(pagerDuty, p)+1))
                       break
                   }
               }
           }
       }
 
       // Add 1 day to the date.
       date = date.AddDate(0, 0, 1)
       dateIndex++
   }
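
A note on the date comparisons in that loop: Go's time documentation recommends comparing time.Time values with Equal rather than ==, because == also compares the attached location. The standalone sketch below shows the difference using the same instant written in two zones; sameInstant and mustParse are hypothetical helper names.

```go
package main

import (
	"fmt"
	"time"
)

// sameInstant reports whether two timestamps refer to the same moment,
// regardless of the time zone each one was written in.
func sameInstant(a, b time.Time) bool {
	return a.Equal(b)
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(value string) time.Time {
	t, err := time.Parse(time.RFC3339, value)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	eastern := mustParse("2022-10-17T09:00:00-04:00")
	utc := mustParse("2022-10-17T13:00:00Z")

	fmt.Println(sameInstant(eastern, utc)) // true: same moment in time
	fmt.Println(eastern == utc)            // false: the locations differ
}
```

In this model every timestamp uses the same offset, but Equal keeps the matching correct even if someone's preferences arrive in a different zone.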

Validate Method

In order to satisfy the store API, I needed to define a Validate method for the store. This function tells the solver when we’ve reached a state that is operationally feasible. For example, if there are still some days without an on-call user assigned, the schedule is not something I can send through to PagerDuty and operate on. Since I start with all available user indices for each day in the days slice and trim down from there, the store is valid once exactly one user is assigned to each day. This is another reason domains are so powerful: they come with all kinds of helpful methods. Here, I can use the .Singleton() method, which checks whether every day’s domain has been narrowed down to a single index, meaning exactly one user is assigned to every day.

 
   pagerDuty = pagerDuty.Validate(func(s store.Store) bool {
        // Next, we define operational validity on the store. Our plan is
        // operationally valid if all days are assigned exactly to one person.
        return days.Singleton(s)
   })
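
In plain-slice terms, the validity check boils down to something like the sketch below. It is only an analogy for what the singleton check expresses (allSingleton is a name I made up), not the SDK's implementation.

```go
package main

import "fmt"

// allSingleton reports whether every day has exactly one candidate user left.
func allSingleton(days [][]int) bool {
	for _, candidates := range days {
		if len(candidates) != 1 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(allSingleton([][]int{{2}, {0}, {1}}))    // true: every day is decided
	fmt.Println(allSingleton([][]int{{2}, {0, 1}, {1}})) // false: day 1 is still open
}
```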
 

Generate Method

Next, I needed to define the Generate method, which ties this application together. This is the method that gets summoned from within the solver to generate child states from the parent. I already defined a root state above. That’s the starting point. From there, the solver calls the Generate method, which does this: 

  • Finds the first day with 2 or more users available (i.e., the first day we still need to make an assignment for)
  • Loops through each available user for that day and creates a child store in which I assign that user to the day and increment assignedDays and happiness (if the day is a preferred day for the user) 
  • Returns an eager generator with all the child stores I created
 
   pagerDuty = pagerDuty.Generate(func(s store.Store) store.Generator {
        // We define the method for generating child states from each parent
       // state. Each time a new store is generated - we attempt to generate more
       // stores.
 
       // We find the first day with 2 or more users available.
       dayIndex, ok := days.First(s)
 
       if !ok {
           return nil
       }
 
       // Create a slice of stores where we'll attempt to assign each of those
       // available users to the day.
       stores := make([]store.Store, days.Domain(s, dayIndex).Len())
       for i, user := range days.Domain(s, dayIndex).Slice() {
 
           // We increment the day length for the user we want to assign.
           userAssignedDays := assignedDays.Get(s, user)
           userAssignedDays = userAssignedDays + 1
 
           // We assign the user and apply changes for that assignment and the
           // day length increment.
           newStore := s.Apply(days.Assign(dayIndex, user), assignedDays.Set(user, userAssignedDays))
 
           // If we were able to assign user to preferred day, increment
           // happiness score and apply changes.
           if preferredUsers, ok := preferenceMap[dayIndex]; ok {
               for _, p := range preferredUsers {
                   if p == user {
                        newStore = newStore.Apply(happiness.Set(p, happiness.Get(s, p)+1))
                   }
               }
           }
 
           stores[i] = newStore
       }
 
       // Use an Eager generator to return all child stores we just created for
       // all users available on this day.
       return store.Eager(stores...)
   })
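
Stripped of the store API, the branching step looks roughly like this plain-slice sketch: find the first undecided day, then create one child schedule per candidate user for that day. The firstUndecided and branch functions are hypothetical names, and the sketch leaves out the assignedDays and happiness bookkeeping.

```go
package main

import "fmt"

// firstUndecided returns the index of the first day that still has two or
// more candidate users, and false if every day is already decided.
func firstUndecided(days [][]int) (int, bool) {
	for d, candidates := range days {
		if len(candidates) > 1 {
			return d, true
		}
	}
	return 0, false
}

// branch returns one child schedule per candidate on the first undecided
// day. Each child is a copy of the parent with that day fixed to one user.
func branch(days [][]int) [][][]int {
	d, ok := firstUndecided(days)
	if !ok {
		return nil
	}
	children := make([][][]int, 0, len(days[d]))
	for _, user := range days[d] {
		child := make([][]int, len(days))
		copy(child, days) // untouched days share the parent's candidate lists
		child[d] = []int{user}
		children = append(children, child)
	}
	return children
}

func main() {
	parent := [][]int{{2}, {0, 1}, {0, 1, 2}}
	for _, child := range branch(parent) {
		fmt.Println(child)
	}
	// Prints:
	// [[2] [0] [0 1 2]]
	// [[2] [1] [0 1 2]]
}
```

Each child then goes through the same step again, which is how the search works its way down to schedules with one user per day.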

Value Method

Once I had a way to create a bunch of stores (Generate) and a way to determine whether they’re operationally valid (Validate), I needed to define a Value function to “score” these different schedule options and identify the “best” one in the allotted time. As I mentioned before, I wanted to create a fair schedule by balancing days assigned across users and maximizing happiness.

In order to balance days assigned, I minimized the sum of the squares of each user’s assignedDays. Because the total number of days in a complete schedule is fixed, that sum is smallest when the days are spread as evenly as possible, which keeps the gap between the busiest and the least busy person small.

In order to maximize happiness, I started by maximizing a single happiness score - but this didn’t work well. It led to assignments where one user with the most preferences had all of their preferences met and other users didn’t have any met. Instead, I opted for maximizing the minimum happiness score which meant I was trying to bump up the happiness for each user.

 
   pagerDuty = pagerDuty.Value(func(s store.Store) int {
       // Now we define our objective value. This is the quantity we try to
       // minimize (or maximize or satisfy). This balances days assigned to each
       // user.
 
       // Calculate sum of squared assigned days.
       sumSquares := sumSquare(assignedDays.Slice(s))
 
       // Calculate minimum happiness across users.
       minHappiness := min(happiness.Slice(s))
 
       // Balance days between users and maximize minimum happiness.
       return sumSquares - minHappiness
   })
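
The value function calls two small helpers, sumSquare and min, that aren't shown in this post. Here is one way they might look, as a sketch rather than the template's actual code, together with a hypothetical score function that reproduces the objective on plain slices so you can see how it ranks a few schedules.

```go
package main

import "fmt"

// sumSquare returns the sum of the squares of the given values.
func sumSquare(values []int) int {
	total := 0
	for _, v := range values {
		total += v * v
	}
	return total
}

// min returns the smallest value in the slice, or 0 for an empty slice.
// Declaring it at package level shadows Go's built-in min (Go 1.21+).
func min(values []int) int {
	if len(values) == 0 {
		return 0
	}
	smallest := values[0]
	for _, v := range values[1:] {
		if v < smallest {
			smallest = v
		}
	}
	return smallest
}

// score mirrors the objective above on plain slices: lower is better.
func score(assignedDays, happiness []int) int {
	return sumSquare(assignedDays) - min(happiness)
}

func main() {
	fmt.Println(score([]int{5, 5, 4}, []int{1, 1, 1}))  // 65: balanced, everyone gets a preference
	fmt.Println(score([]int{7, 7, 0}, []int{2, 2, 0}))  // 98: one user never on call
	fmt.Println(score([]int{14, 0, 0}, []int{4, 0, 0})) // 196: one user covers every day
}
```

All three schedules cover 14 days, but the balanced one scores lowest, which is what the minimizer looks for.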

Format Method

Now that I had satisfied the store API, it was time to format my output so I could easily paste it into a POST request to PagerDuty. We already said we wanted the output to be an array of overrides, so all we need to do is format it as such. We loop through each day in our operationally valid schedule in days and create an override for each day. We also add the minimum number of assigned days to the output, so we can check that our value function is working as we expect.

 
  pagerDuty = pagerDuty.Format(func(s store.Store) any {
       // Next, we define the output format for our schedule.
       // We want to structure our output in a way that the PagerDuty API
       // understands.
 
       values, ok := days.Values(s)
 
       if !ok {
           return "No schedule found"
       }
       overrides := []override{}
       for v, nameIndex := range values {
           assignedUser := assignedUser{
               Name: input.Users[nameIndex].Name,
               Id:   input.Users[nameIndex].Id,
               Type: input.Users[nameIndex].Type,
           }
           overrides = append(overrides, override{
               Start:    input.ScheduleStart.AddDate(0, 0, v),
               End:      input.ScheduleStart.AddDate(0, 0, v+1),
               User:     assignedUser,
               TimeZone: "UTC",
           })
       }
 
        return map[string]any{
            "overrides":       overrides,
            "min_days_worked": min(assignedDays.Slice(s)),
        }
   })

Return a minimizer

Last but not least, we close out our solver function by returning the root store with a minimizer and passing in our options (like run duration). Note that we return a minimizer because lower values are better: we want to minimize the imbalance in assigned days (the sum of squares) while encouraging solutions with larger minHappiness values, which is why minHappiness is subtracted.

   return pagerDuty.Minimizer(opts), nil

Step 4: Use the model for real on-call schedule decisions 

Now that I’ve shown my work, let me get to the fun part: running our decision model to produce optimized on-call schedule plans that get sent directly to PagerDuty.

Running the application

From the directory where my main.go lives, I can now run the following command which uses the Nextmv CLI to run my PagerDuty decision application. It gets my input file and passes that into the solver along with passing in a 1 second duration limit for solving the problem. The resulting best schedule found will be dumped to an output.json file.

nextmv run local main.go -- -hop.runner.input.path input.json \
 -hop.runner.output.path output.json -hop.solver.limits.duration 1s

Here are the first few lines of that output file:

{
 "version": { "sdk": "v0.20.2" },
 "options": {
   "diagram": { "expansion": { "limit": 0 }, "width": 10 },
   "limits": { "duration": "1s" },
   "search": { "buffer": 100 },
   "sense": "minimize"
 },
 "store": {
 
   "min_days_worked": 1,
    "overrides": [
     {
       "start": "2022-10-10T09:00:00-04:00",
       "end": "2022-10-11T09:00:00-04:00",
       "user": { "name": "david", "id": "P6H94Y5", "type": "user_reference" },
       "time_zone": "UTC"
     },
     {
       "start": "2022-10-11T09:00:00-04:00",
       "end": "2022-10-12T09:00:00-04:00",
       "user": { "name": "lars", "id": "P7R5EY5", "type": "user_reference" },
       "time_zone": "UTC"
     },
     {
       "start": "2022-10-12T09:00:00-04:00",
       "end": "2022-10-13T09:00:00-04:00",
       "user": { "name": "marius", "id": "PMN3512", "type": "user_reference" },
       "time_zone": "UTC"
     },
     {
       "start": "2022-10-13T09:00:00-04:00",
       "end": "2022-10-14T09:00:00-04:00",
       "user": { "name": "david", "id": "P6H94Y5", "type": "user_reference" },
       "time_zone": "UTC"
     },
     {
       "start": "2022-10-14T09:00:00-04:00",
       "end": "2022-10-15T09:00:00-04:00",
       "user": {
         "name": "ronessa",
         "id": "P0IYDYV",
         "type": "user_reference"
       },
       "time_zone": "UTC"
     },
     {
       "start": "2022-10-15T09:00:00-04:00",
       "end": "2022-10-16T09:00:00-04:00",
       "user": { "name": "chris", "id": "P5K3K4K", "type": "user_reference" },
       "time_zone": "UTC"
     },

You can see that it shows the options used for the solver, so we can reproduce this output file later if needed, and it provides the store output formatted as we specified in our Format method.

Sending the plan to PagerDuty

With this output file, I can now send the overrides to PagerDuty to update the on-call schedule for my team. I didn’t include the full POST request in here because I don’t want anyone playing jokes on us and paging us in the middle of the night, but this is what that POST request looks like (of course, when I run this, I drop in the Nextmv overrides array from the output.json and fill in my PagerDuty token and schedule-id). 

curl --request POST \
 --url 'https://api.pagerduty.com/schedules/<schedule-id>/overrides' \
 --header 'Accept: application/vnd.pagerduty+json;version=2' \
 --header 'Authorization: Token token=<token>' \
 --header 'Content-Type: application/json' \
 --data '{
   "overrides": []
 }'

After sending that through, I can immediately see the resulting schedule in the UI. That rainbow of colors from 10/10 - 10/23 is due to the overrides I passed through for that timeframe. You can see I gave Chris his 2-week stretch previously and did not include him in this latest scheduling plan since he opted for continuous on-call versus flexible as the others did.
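
If you’d rather script this step than paste into curl, the same request can be sketched in Go with the standard library. This is a hedged example, not part of the template: the solverOutput type and newOverrideRequest function are names I invented, the schedule ID and token are placeholders, and the sketch only builds the request without sending it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// solverOutput captures only the part of output.json needed here. The
// overrides are kept as raw JSON so they are forwarded unchanged.
type solverOutput struct {
	Store struct {
		Overrides json.RawMessage `json:"overrides"`
	} `json:"store"`
}

// newOverrideRequest builds (but does not send) the POST request shown in
// the curl command. scheduleID and token are supplied by the caller.
func newOverrideRequest(outputJSON []byte, scheduleID, token string) (*http.Request, error) {
	var out solverOutput
	if err := json.Unmarshal(outputJSON, &out); err != nil {
		return nil, err
	}
	body, err := json.Marshal(map[string]json.RawMessage{"overrides": out.Store.Overrides})
	if err != nil {
		return nil, err
	}
	url := "https://api.pagerduty.com/schedules/" + scheduleID + "/overrides"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.pagerduty+json;version=2")
	req.Header.Set("Authorization", "Token token="+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// A stand-in for the contents of output.json.
	sample := []byte(`{"store": {"min_days_worked": 1, "overrides": []}}`)
	req, err := newOverrideRequest(sample, "SCHEDULE_ID", "YOUR_TOKEN")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

To actually send it, you would pass the request to http.DefaultClient.Do and check the response status.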

Conclusion

This walkthrough shows you how quickly you can become operational with a completely custom decision optimization app on the Nextmv platform. 

If you have a similar problem to this, you can run through these steps and initialize the pagerduty-scheduling template which we’ve made available through our CLI. You can customize it as needed, run locally, rinse and replan! 

If you have a different problem - be sure to check out our other templates or build your own. Never will you need to deal with complicated spreadsheet gymnastics again! I’ve officially retired my on-call scheduling Google Doc and you can too.
