A somewhat scientific analysis of a quality issue

on Monday, August 26, 2019

I was recently working through a concern over the quality of manually entered data, which was causing a number of reprocessing requests. The person who performed the actual task of processing the incoming requests noticed that a number of requests were near duplicates of previous work, and they had to remove or clean up the original work in addition to processing the new work. This was classic muda (waste), and I wanted to figure out why.

The person who had reported the issue was convinced that the solution was to add a manual review step to the process: the original request would be routed to another group for review before continuing to their station for execution. The description they gave of the problem made sense, but I had had some direct involvement with the people submitting the requests, and something didn’t feel quite right. I wanted to dig into the data.

Background: The process was the addition of a database role onto a service account. The input data for the process were: Database Name, Service Account Name, Role Name, CreateRole (true/false), and Environment (dev/test/prod/all).

After getting ahold of the data, this is what it looked like:

44584 - Db1, ISVC_Acct1, Role1 (all)
44582 - Db1, IUSR_Acct2, Role1 (all)
44536 - Db2, ISVC_Acct3, Role2 (all)
44504 - Db3, ISVC_Acct4, Role3Role (all) - Reprocessing (Bad Name) - Pulled name from Documentation. Docs were later corrected. Docs written by maglio-s. see 44447
44449 - Db4, ISVC_Acct4, Role3 (all)
44448 - Db3, ISVC_Acct4, Role3 (all) - Reprocessing (Wrong Database) - Developer didn't read documentation closely enough. Docs written by maglio-s. see 44447
44447 - Db1, ISVC_Acct4, Role3 (all)
44360 - Db5, ISVC_Acct1, Role4 (all)
44359 - Db6, ISVC_Acct5, Role5 (all)
44358 - Db6, ISVC_Acct1, Role6 (all)
43965 - Db1, IUSR_Acct6, Role1 (all) - Reprocessing (Bad Name) - Pulled name from Documentation. Docs were later corrected. Docs written by maglio-s. see 43960
43964 - Db7, IUSR_Acct6, Role7 (all)
43963 - Db7, IUSR_Acct6, Role8 (all)
43962 - Db7, IUSR_Acct6, Role9 (all)
43961 - Db7, IUSR_Acct6, Role1Role (all)
43960 - Db1, IUSR_Acct6, Role1Role (all)
43959 - Db8, IUSR_Acct6, Role10 (all) - Extra Message - Db8 didn't yet exist in Prod. This wasn't a problem that affected the results or required reprocessing.
43585 - Db9, IUSR_Acct7, Role11 (dev) - Extra Processing - Detected problem with script (updated bot after next Deployments / Dev Support meeting)
43295 - Db11, SVC_Acct8, Role12 (prod)
43294 - Db11, SVC_Acct8, Role12 (test)
43256 - Db7, IUSR_Acct9, Role8 (all)
43255 - Db7, IUSR_Acct9, Role9 (all)
43254 - Db7, IUSR_Acct9, Role7 (all)
43144 - Db3, ISVC_Acct10, Role3Role (all)
43088 - Db10, SVC_Acct11, Role13 (all)
43087 - Db1, SVC_Acct11, Role1 (all)
43086 - Db1, SVC_Acct11, Role14 (all)
43063 - Db11, SVC_Acct12, Role15 (prod)
42918 - Db11, SVC_Acct12, Role15 (test)
42920 - Db12, SVC_Acct12, Role16 (all) - Reviewed Before Running / Reprocessing (Bad Name), see also 42919
42921 - Db12, SVC_Acct13, Role16 (all) - Reviewed Before Running - CJ determined it wasn't necessary (requestor: maglio-s)

(*maglio-s = me; I figured I might as well out myself as the guilty party for a lot of these.)

It doesn’t look like too much reprocessing, until you break down the overall defect rates:


Overall there were 6 defects: 4 requests needing reprocessing, 1 reviewed and rejected, and 1 bug during processing. That’s a 20% defect rate, with 13.3% reprocessing.
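The arithmetic behind those percentages can be checked quickly (a Python sketch; the total of 30 requests is assumed from the quoted rates, since 6/30 = 20%):

```python
# Defect-rate breakdown for the request log above.
# total_requests = 30 is assumed from the quoted percentages.
total_requests = 30
reprocessing = 4      # bad name x3, wrong database x1
rejected = 1          # reviewed and rejected before running
processing_bug = 1    # script problem caught during processing

defects = reprocessing + rejected + processing_bug
defect_rate = defects / total_requests * 100
reprocessing_rate = reprocessing / total_requests * 100

print(f"defects: {defects} ({defect_rate:.1f}%)")   # 6 (20.0%)
print(f"reprocessing: {reprocessing_rate:.1f}%")    # 13.3%
```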

On first review, there did seem to be a data quality issue, but it was more of an issue with my documentation, and with people trusting my documentation. If the engineer reporting this data quality issue was trying to get me to improve my thoroughness without pointing a finger at me, then good job!

But when I talked with the engineer who reported the issue, they were adamant that it wasn’t a single person but an overall quality issue. I couldn’t totally agree with them, but there definitely was a quality problem. Now, how do we improve the quality?

As mentioned earlier, the engineer’s solution was to add a manual review step by another group before the request got to them for processing. But that was something I was adamantly trying to avoid, because:

  • It would move the manual labor to another group rather than eliminate it.
  • The other group would need to be consulted because it was going to increase their workload, and they would need to add their own ideas and solutions into the conversation.
  • It wouldn’t create a shorter feedback loop for the requestor to find out whether they had submitted bad input.

I’m a fan of Henrik Kniberg’s saying, “Manage for the normal, treat the exceptions as exceptional.”

Each of these reprocessing issues seemed (to me) to be exceptions. And I wanted to deal with each one as an exceptional case rather than implement a new review step that would become part of the normal process.

The easy part of dealing with each one as an exception is that you don’t have to change the overall process. And, because I had already been involved in resolving some of them earlier, the costs of correcting the documentation and fixing the bug in the bot were already taken care of.

However, neither of these approaches seemed like a surefire way to increase the quality of the data. Both felt like a “let’s wait and see if this fixes the problem” approach. And the reporting engineer had a really good point: we needed to improve the quality and lower the amount of reprocessing work.

But then something new started to stand out. At the top of this article I mentioned the inputs to the system. One of the inputs to the system that didn’t make it into the analysis data was the parameter CreateRole. In the original implementation of the system, if the role in the database didn’t exist, the script which added the database role would fail. The CreateRole flag was asked for by the development team, so they could indicate to the engineering team that the role would need to be created. The engineering team looked at this problem and fixed the system by ALWAYS creating the role if it didn’t exist. And this is where the heart of the confusion occurred. The development team thought that if CreateRole was set to ‘false’, and the role didn’t exist, then the system would throw an error. The assumption was that even if they got the name wrong, it would be fine because the system wouldn’t create a new role that wasn’t asked for.

After looking at the new information, 3 out of the 4 reprocessed requests (75%) were attributable to the CreateRole flag being ignored. So how do we improve the system?

The answer was multifold:

  • Hold myself to a higher standard when writing documentation in order to prevent downstream team members from using the wrong names.
  • Ensure that Role names are unique enough to not be confused with each other. (The ones that needed to be reprocessed had Role names that were really similar to other Role names.)
  • Add a fast feedback loop by setting up the input mechanism to verify whether a role exists at the time the request is put in (if the CreateRole flag is set to false).
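That last check can be sketched as follows (a hedged Python sketch, not the production implementation; the database/role names and the function name are illustrative):

```python
# Sketch of the fast feedback loop: validate the request at submission
# time instead of at processing time. All names are illustrative.
EXISTING_ROLES = {
    ("Db1", "Role1"),
    ("Db3", "Role3"),
}

def validate_request(database, role, create_role):
    """Return a list of validation errors for a role-grant request."""
    errors = []
    if not create_role and (database, role) not in EXISTING_ROLES:
        errors.append(
            f"Role '{role}' does not exist in '{database}' and "
            "CreateRole is false. Check the role name, or set "
            "CreateRole to true if it really should be created."
        )
    return errors

# A near-miss name like 'Role3Role' is now caught when the request
# is submitted, rather than after the work has been done.
print(validate_request("Db3", "Role3Role", create_role=False))
```

The key design point is that the check runs before the request enters the queue, so the requestor gets the feedback while the context is still fresh.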

The most important change that came from the data analysis was introducing a new fast feedback loop. And, I don’t think we would have found it without analyzing the data. It’s a hard discipline to gather the metrics, but we need to start doing it much more frequently and with greater detail.

Promoting Vision and Behavior through Validation

on Monday, August 19, 2019

Building out a new architecture requires many processes and behaviors to be re-evaluated and modified. Those changes take time to refine, communicate, and become embedded within a department’s teams. And when it comes to changing the understanding of how a process works, progress is very slow: it takes a large amount of repetition of intentions to build a communal understanding of a larger vision. Usually, a lot of the details get lost when you’re trying to shift your mindset from one large vision over to a new one. The small stuff doesn’t really sink in until the large stuff has taken root.

But there are minor opportunities to remind team members of those smaller details when implementing validation checks within your systems. Software developers do this all the time when developing websites for their end customers, but it isn’t done as often for internal-facing applications or infrastructure pieces. For example, Domino’s requires you to put in a telephone number when you order a pizza for delivery, because if their driver becomes lost they will need to call you. The error message on the Domino’s order form says the field is required, but it also has a little pop-up tip explaining why it’s required. When was the last time you saw an internal-facing application explain why it wants you to do something in a validation message?

It’s difficult to know when it will be worth the extra time to make an error message more informative. So keeping an eye out for anyone reporting an issue who directly says, “I didn’t know we should do it that way” is very helpful in locating places where a little more attention could go a long way.

Here’s an example. This was the original error message:

Service Accounts should start with ISVC_, IUSR_, or SVC_. Domain Account = '{0}'. AD User and Group accounts can also be used. Service Accounts should not start with '{domain name}\'

And after the update, here’s the new error message:

Service Accounts should start with ISVC_, IUSR_, or SVC_. Domain Account = '{0}'. AD User and Group accounts can also be used. Service Accounts should not start with '{domain name}\'. Please see `serviceaccount guidelines` for more information.

And this is what `serviceaccount guidelines` provides:

Service Accounts

Service accounts cannot be over 20 characters long, including the ‘Dev’ and ‘Test’ endings. Because of that, the maximum allowed length for the base name is 16 characters, which includes the prefix.

Prefixes

    IUSR_   Stands for IIS User Account. Used with websites that are seen/used by end users.
    ISVC_   Stands for IIS Service Account. Used with webservices and webjobs.
    SVC_    Stands for Win Services Account. Used with Windows Services and Scheduled Tasks.

Example

    ISVC_OrgAppName{env}


When choosing names for accounts, try to avoid redundancies within the name. For example, IUSR_DoorWebApp{env} would have 2 redundancies. 'Web' is redundant because the prefix of 'IUSR_' indicates it's a top level web application. And, 'App' is redundant because the prefix of 'IUSR_' indicates it's a web application. Another example of redundancy would be to add 'Svc' in an 'ISVC_' account name, eg. 'ISVC_DoorSvc{env}'.
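The prefix and length rules above lend themselves to an automated check. A minimal sketch (in Python for illustration; the function name is hypothetical, and the 16-character limit is the 20-character limit minus the 4-character ‘Test’ suffix, as the guidelines state):

```python
# Sketch of a validator for the service account guidelines above.
PREFIXES = ("IUSR_", "ISVC_", "SVC_")
MAX_BASE_LENGTH = 16  # 20-char limit minus the 4-char 'Test' suffix

def validate_account_name(name):
    """Return a list of guideline violations for a base account name."""
    errors = []
    if not name.startswith(PREFIXES):
        errors.append(
            f"'{name}' should start with one of {PREFIXES}."
        )
    if len(name) > MAX_BASE_LENGTH:
        errors.append(
            f"'{name}' is {len(name)} characters; the maximum is "
            f"{MAX_BASE_LENGTH} so the 'Dev'/'Test' endings still fit "
            "within the 20-character limit."
        )
    return errors

print(validate_account_name("ISVC_OrgAppName"))  # [] -- valid
print(validate_account_name("Acct_WayTooLongAccountName"))
```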

It’s a small addition, but it has two effects. First, it communicates a group of standardized application types and how to signal to others what the role of your application is. Second, it empowers team members with the information necessary to make solid decisions without needing to reach out to other teams (who may be busy with other work).

It’s extra overhead to put together the extra documentation, but it can definitely be worth it.

Powershell: Using a file hash to test for a change

on Monday, August 12, 2019

The PowerShellForGitHub module is great! But … sometimes it can be a bit verbose when it’s trying to help out new users/developers of the module. This isn’t a bad thing in any way, just a personal preference. And the module owner, Howard Wolosky, is really open to suggestions, which is great!

So, I opened a ticket (PowerShellForGitHub Issue #124) to explain my confusion over the warning messages. And, to be fair, I explained my confusion in a very confusing way. But he was nice enough to work through it with me, and we found that what we needed was a way to tell whether someone had updated a settings file after downloading the module to their machine.

Enter, Get-FileHash.

This command has been around for quite a while, and it does the classic job of computing a hash of a file. That hash can be stored in code and used later to check whether the file has changed.
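In rough terms, the pattern looks like this (a hedged Python analogue using hashlib; the module itself does this in PowerShell with Get-FileHash, and the file name and settings content here are purely illustrative):

```python
import hashlib
import os
import tempfile

def file_sha256(path):
    """Hash a file's contents, analogous to Get-FileHash."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Demo in a temp directory: record the shipped file's hash, then
# modify the file and detect the change.
settings = os.path.join(tempfile.mkdtemp(), "settings.json")
with open(settings, "w") as f:
    f.write('{"disableTelemetry": false}')
shipped_hash = file_sha256(settings)  # stored in code in the real module

with open(settings, "w") as f:
    f.write('{"disableTelemetry": true}')  # user edits the file

if file_sha256(settings) != shipped_hash:
    print("Settings file was modified after download.")
```

The stored hash acts as a fingerprint of the file as shipped, so any later edit is detectable without keeping a copy of the original contents.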

So, how do you use the check?

Here’s the original code:

And, here’s the updated code using Get-FileHash:

PowerShellForGitHub–Adding Get-GitHubRelease

on Monday, August 5, 2019

PowerShellForGitHub is an awesome PowerShell module for interacting with the GitHub API. It has a wide set of features that are already implemented, and it’s supported by Microsoft(!!). You can also tell the amount of care the maintainer, Howard Wolosky, has put into it when you dig into the code and read through the inline documentation, contributing documentation, and telemetry support(!!). BTW, if you ever need to create a PII-safe transmittable string, check out Get-PiiSafeString.

One of the features I was looking for the other day was the ability to retrieve a list of releases for a repository. (I was building a script to monitor the dotnet/core releases; in order to build an ASP.NET Core Hosting Bundle auto-installer.)

I submitted a pull request of the update last week and was really impressed with all the automation in the pull request processing. The first thing that surprised me was the integrated msftclas bot (Microsoft Contribution License Agreements), which posted a legal agreement form confirming that I (or the company I represent) consent to give Microsoft ownership of the code we contribute. It was so smooth and easy to do.

Next was the meticulous level of comments and review notes on the pull request. If he made all those comments by hand, holy moly! That would be amazing, and I would want to praise him for his patience and level of detail. Hopefully, some of the comments were stubbed out by a script/bot, which would be a really cool script to know about.

So, I’m gonna go through the comments and see if I can update this pull request.

  • *facepalm* Wow, I really missed changing the name GitHubLabels.ps1 to GitHubReleases.ps1 in the .tests.ps1 file.
  • White space in .tests.ps1: Ahhh … I can better see the white space formatting style now.
  • Examples missing documentation: Hahaha! My mistake. It looks like I started writing them and got distracted.
  • Telemetry: I loved the note:

    For these, I think it's less interesting to store the encrypted value of the input, but more so that the input was provided (simply in terms of tracking how a command is being used).

    Thank you for pointing that out! It makes complete sense.

In summary, a big thank you to Howard Wolosky and the Microsoft team for making this module! It was a huge time saver and really informative on how to write PowerShell code in a better way.

