Steven Maglio's Homework: What’s useful in a deployment summary?

There’s a larger question that the title question stems from: What information is useful when troubleshooting problematic code in production? If a deployment goes out and things aren’t working as expected, what information would be useful for the development team to track down the problem?

Some things that would be great to have:

Complete observability into the functions that were having the problem, including their inputs and the bad outputs.
Knowledge of what exactly was changed in the deployment compared to a previously known good deployment.
Confidence that the last update to the deployment could be the only reason that things aren’t functioning correctly.

All three of those things would be fantastic to have, but I’m only going to focus on the middle one. And, I’m only going to focus on the image above.

How do we gain quick and insightful knowledge into what changed in a deployment compared to a previously known ‘good’ deployment?

I guess a good starting place is to write down which elements of a deployment are important for the development team to know when troubleshooting:

You’ll need to look up the deployment history information, so you’ll need a unique identifier that can be used to find it. (I’m always hopeful that this identifier is readily known or very easy to find out. That’s not always the case, but it’s something to shoot for.)
When the deployment occurred, date and time?

This would be useful to know when deciding whether the problem definitely started after the deployment went out.
Links to all of the work items that were part of the deployment?

Sometimes you can guess which work item is most likely associated with an issue by the use of a key term or reference in its description. This can help narrow down where in logs or source control you may need to look.

If they are described in a way that is easily understood by the development team (and with luck, members outside the development team) that would be great.
Links to the build that went into the production deployment? Or, all previous builds since the last production deployment?

Knowing the dates and the details of the previous builds can help track the issue back to code commits or similar behavior in testing environments.

Of course, if you can get to full CI/CD (one commit per build/deployment), then tracking down which work item / commit had the problem becomes a whole lot easier.
Links to the source control commit history diffs?

If you want to answer the question “What exactly changed?”, a commit diff can answer it effectively.
Links directly to sql change diffs?

In the same vein as source control commit history diffs, how about diffs of what changed in the databases?
Statistics on prior build testing results? If a previous build didn’t make it to production, why didn’t it make it there? Were there failing unit tests? How about failing integration tests or health checks?

Would quick statistics on the number of tests run on each build (and passed) help pinpoint a production issue? How about code coverage percentages? Hmmm … I don’t know if those statistics would lead to more effective troubleshooting.
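To make the list above a bit more concrete, here is a minimal sketch of how those elements might be gathered into one record. It is only an illustration: the class names, field names, and the two helper functions are all made up for this post and don't come from any particular deployment tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class BuildInfo:
    """One build that fed into a deployment (hypothetical shape)."""
    build_url: str                    # link to the build details
    tests_run: int = 0                # quick statistics on test results
    tests_passed: int = 0
    reached_production: bool = False  # did this build actually ship?


@dataclass
class DeploymentSummary:
    """The elements from the list above, collected in one place."""
    deployment_id: str                # unique identifier used for lookup
    deployed_at: datetime             # when the deployment occurred
    work_item_urls: List[str] = field(default_factory=list)
    builds: List[BuildInfo] = field(default_factory=list)
    commit_diff_urls: List[str] = field(default_factory=list)
    sql_diff_urls: List[str] = field(default_factory=list)


def started_after(summary: DeploymentSummary, first_seen: datetime) -> bool:
    """True if the problem was first seen at or after the deployment time.

    Assumes both timestamps use the same time zone convention.
    """
    return first_seen >= summary.deployed_at


def builds_that_never_shipped(summary: DeploymentSummary) -> List[BuildInfo]:
    """Builds since the last production deployment that did not go out."""
    return [b for b in summary.builds if not b.reached_production]
```

The two helpers map to two of the questions above: did the problem start after the deployment, and which earlier builds didn't make it to production (and so might be worth a look at their test results).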

Another thing that seems obvious from the picture, but might not always be obvious, is “linkability”. A deployment summary can give you an at-a-glance view into the deployment, but when you need to find out more information about a particular aspect, having links to drill down is incredibly useful.
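As a rough sketch of what “linkability” could look like in practice, the snippet below renders a plain-text summary where each section shows a count for the at-a-glance view and every entry is a link to drill into. The function name, the section titles, and the example.com URLs are placeholders I made up, not the output of any real tool.

```python
from typing import Dict, List


def render_summary(deployment_id: str, sections: Dict[str, List[str]]) -> str:
    """Render an at-a-glance text summary where every entry is a link."""
    lines = [f"Deployment {deployment_id}"]
    for title, urls in sections.items():
        # Put the count next to the title so the summary stays scannable.
        lines.append(f"{title} ({len(urls)})")
        for url in urls:
            # Each entry is a drill-down link rather than a bare description.
            lines.append(f"  - {url}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical identifier and URLs, for illustration only.
    print(render_summary(
        "deploy-123",
        {
            "Work items": ["https://example.com/workitems/42"],
            "Builds": ["https://example.com/builds/7"],
            "Commit diffs": ["https://example.com/diffs/abc..def"],
            "SQL diffs": [],
        },
    ))
```

Even an empty section is printed with its count, so you can tell at a glance that, say, no database changes went out with the deployment.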

But, there has to be more. What other elements are good to have in a deployment summary?
