A somewhat scientific analysis of a quality issue

on Monday, August 26, 2019

I was recently working through a concern about the quality of manually entered data, which was causing a number of reprocessing requests. The person who processed the incoming requests had noticed that a number of them were near-duplicates of previous work, so they had to remove or clean up the original work in addition to processing the new request. This was classic Muda (waste, in Lean terms), and I wanted to figure out why.

The person who had reported the issue was convinced that the way to prevent the reprocessing was to add a manual review step to the process: the original request would be routed to another group for review before continuing to his station for execution. His description of the problem made sense, but I had worked directly with some of the people putting in the requests, and I had been somewhat involved when the requests were submitted. Something didn’t feel quite right, so I wanted to dig into the data.

Background: The process was the addition of a database role onto a service account. The input data for the process were: Database Name, Service Account Name, Role Name, CreateRole (true/false), and Environment (dev/test/prod/all).

After getting hold of the data, this is what it looked like:

44584 - Db1, ISVC_Acct1, Role1 (all)
44582 - Db1, IUSR_Acct2, Role1 (all)
44536 - Db2, ISVC_Acct3, Role2 (all)
44504 - Db3, ISVC_Acct4, Role3Role (all) - Reprocessing (Bad Name) - Pulled name from Documentation. Docs were later corrected. Docs written by maglio-s. see 44447
44449 - Db4, ISVC_Acct4, Role3 (all)
44448 - Db3, ISVC_Acct4, Role3 (all) - Reprocessing (Wrong Database) - Developer didn't read documentation closely enough. Docs written by maglio-s. see 44447
44447 - Db1, ISVC_Acct4, Role3 (all)
44360 - Db5, ISVC_Acct1, Role4 (all)
44359 - Db6, ISVC_Acct5, Role5 (all)
44358 - Db6, ISVC_Acct1, Role6 (all)
43965 - Db1, IUSR_Acct6, Role1 (all) - Reprocessing (Bad Name) - Pulled name from Documentation. Docs were later corrected. Docs written by maglio-s. see 43960
43964 - Db7, IUSR_Acct6, Role7 (all)
43963 - Db7, IUSR_Acct6, Role8 (all)
43962 - Db7, IUSR_Acct6, Role9 (all)
43961 - Db7, IUSR_Acct6, Role1Role (all)
43960 - Db1, IUSR_Acct6, Role1Role (all)
43959 - Db8, IUSR_Acct6, Role10 (all) - Extra Message - Db8 didn't yet exist in Prod. This wasn't a problem that affected the results or required reprocessing.
43585 - Db9, IUSR_Acct7, Role11 (dev) - Extra Processing - Detected problem with script (updated bot after next Deployments / Dev Support meeting)
43295 - Db11, SVC_Acct8, Role12 (prod)
43294 - Db11, SVC_Acct8, Role12 (test)
43256 - Db7, IUSR_Acct9, Role8 (all)
43255 - Db7, IUSR_Acct9, Role9 (all)
43254 - Db7, IUSR_Acct9, Role7 (all)
43144 - Db3, ISVC_Acct10, Role3Role (all)
43088 - Db10, SVC_Acct11, Role13 (all)
43087 - Db1, SVC_Acct11, Role1 (all)
43086 - Db1, SVC_Acct11, Role14 (all)
43063 - Db11, SVC_Acct12, Role15 (prod)
42918 - Db11, SVC_Acct12, Role15 (test)
42920 - Db12, SVC_Acct12, Role16 (all) - Reviewed Before Running / Reprocessing (Bad Name), see also 42919
42921 - Db12, SVC_Acct13, Role16 (all) - Reviewed Before Running - CJ determined it wasn't necessary (requestor: maglio-s)

(*maglio-s = me; I figured I might as well out myself as the guilty party for a lot of these.)

It doesn’t look like too much reprocessing, until you look at a breakdown of the overall defect rates:

[Image: breakdown of the overall defect rates]

Overall there were 6 defects: 4 reprocessing needed, 1 reviewed and rejected, and 1 bug during processing. That’s 20% defects, with 13.3% reprocessing.
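The arithmetic behind those two percentages can be reproduced with a short sketch. The defect counts come straight from the breakdown above; the total of 30 requests is an assumption, inferred from the stated percentages rather than counted from the list, and the category names are just labels I picked for illustration.

```python
from collections import Counter

# Assumption: 30 is the denominator implied by the stated 20% and 13.3%.
TOTAL_REQUESTS = 30

# Defect counts as given in the text; the key names are illustrative labels.
defects = Counter({
    "reprocessing": 4,
    "reviewed_and_rejected": 1,
    "bug_during_processing": 1,
})


def rate(count, total=TOTAL_REQUESTS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


total_defects = sum(defects.values())
defect_rate = rate(total_defects)
reprocessing_rate = rate(defects["reprocessing"])

print(f"defects: {total_defects} of {TOTAL_REQUESTS} ({defect_rate}%)")
print(f"reprocessing: {defects['reprocessing']} of {TOTAL_REQUESTS} ({reprocessing_rate}%)")
```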

Upon the first review, there did seem to be a data quality issue, but it was more an issue with my documentation and with people trusting my documentation. If the engineer reporting this data quality issue was trying to get me to improve my thoroughness without pointing a finger at me, then good job!

But when I talked with the engineer who reported the issue, they were adamant that it wasn’t a single person but an overall quality issue. I couldn’t totally agree with them, but there definitely was a quality problem. Now, how do we improve the quality?

As mentioned earlier, for the engineer, the solution was to add a manual review step by another group before it got to him for processing. But, that was something I was adamantly trying to avoid. I wanted to avoid it because:

  • It would take a manual process and move that manual labor to another group, rather than replace it.
  • The other group would need to be consulted because it was going to increase their workload, and they would need to add their own ideas and solutions into the conversation.
  • It wasn’t creating a smaller feedback loop for the requestor to figure out if they had submitted bad input.

I’m a fan of Henrik Kniberg’s saying, “Manage for the normal, treat the exceptions as exceptional.”

Each of these reprocessing issues seemed (to me) to be exceptions. And I wanted to deal with each one as an exceptional case rather than implement a new review step that would become part of the normal process.

The easy part of dealing with each one as an exception is that you don’t have to change the overall process. And because I had already been involved in resolving some of them earlier, the implementation costs of correcting the documentation and “fixing the bug in the bot” were already taken care of.

However, neither of these approaches really seemed like a surefire way to ensure that the quality of the data increased. They both felt like they required a “let’s wait and see if this fixes the problem” approach. And the reporting engineer had a really good point that we needed to improve the quality and lower the amount of reprocessing work.

But then something new started to stand out. At the top of this article I mentioned the inputs to the system. One of the inputs to the system that didn’t make it into the analysis data was the parameter CreateRole. In the original implementation of the system, if the role in the database didn’t exist, the script which added the database role would fail. The CreateRole flag was asked for by the development team, so they could indicate to the engineering team that the role would need to be created. The engineering team looked at this problem and fixed the system by ALWAYS creating the role if it didn’t exist. And this is where the heart of the confusion occurred. The development team thought that if CreateRole was set to ‘false’, and the role didn’t exist, then the system would throw an error. The assumption was that even if they got the name wrong, it would be fine because the system wouldn’t create a new role that wasn’t asked for.
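The mismatch between the two mental models can be sketched in a few lines. This is not the actual script; the function names, the exception, and the role names are hypothetical, and a plain set stands in for the roles that exist in a database.

```python
class RoleNotFoundError(Exception):
    """Raised when a requested role is missing and must not be created."""


def grant_as_developers_assumed(existing_roles, role_name, create_role):
    """Hypothetical semantics the development team expected: honor CreateRole."""
    if role_name not in existing_roles:
        if not create_role:
            # A mistyped name is caught here instead of becoming a new role.
            raise RoleNotFoundError(f"role {role_name!r} does not exist")
        existing_roles.add(role_name)
    return f"granted {role_name}"


def grant_as_implemented(existing_roles, role_name, create_role):
    """Semantics described in the text: a missing role is always created."""
    # create_role is accepted but ignored.
    existing_roles.add(role_name)
    return f"granted {role_name}"


# Hypothetical example: the real role is "ReportReader"; the request has a typo.
roles = {"ReportReader"}
try:
    grant_as_developers_assumed(roles, "ReportReaders", create_role=False)
    caught = False
except RoleNotFoundError:
    caught = True

result = grant_as_implemented(roles, "ReportReaders", create_role=False)
```

Under the assumed semantics the typo surfaces as an error. Under the implemented semantics it silently creates a second, near-identical role, which later has to be cleaned up.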

After looking at the new information, 3 out of the 4 reprocessed requests (75%) were all attributable to the CreateRole flag being ignored. So how do we improve the system?

Multifold:

  • Hold myself to a higher standard when writing documentation in order to prevent downstream team members from using the wrong names.
  • Ensure that Role names are unique enough to not be confused with each other. (The ones that needed to be reprocessed had Role names that were really similar to other Role names.)
  • Add a fast feedback loop, by setting up the input mechanism to verify if a role exists at the time the request is put in (if the CreateRole flag is set to false).
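The fast feedback loop in the last item could look something like the following sketch. The request fields mirror the inputs listed in the background section; everything else (the `RoleRequest` class, `validate_request`, and the `roles_by_database` lookup, which in practice would be a query against the database itself) is hypothetical and only meant to show the shape of the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleRequest:
    database: str
    service_account: str
    role: str
    create_role: bool
    environment: str  # dev, test, prod, or all


def validate_request(request, roles_by_database):
    """Return a list of problems found at submission time (empty if none)."""
    problems = []
    known_roles = roles_by_database.get(request.database)
    if known_roles is None:
        problems.append(f"unknown database {request.database!r}")
    elif request.role not in known_roles and not request.create_role:
        # Only flag a missing role when the requestor did not ask to create it.
        problems.append(
            f"role {request.role!r} does not exist in {request.database!r} "
            "and CreateRole is false"
        )
    return problems


# Hypothetical lookup of existing roles per database.
roles_by_database = {"Db1": {"Role1"}}

ok = validate_request(
    RoleRequest("Db1", "ISVC_Acct1", "Role1", False, "all"), roles_by_database
)
typo = validate_request(
    RoleRequest("Db1", "ISVC_Acct1", "Role1Role", False, "all"), roles_by_database
)
new_role = validate_request(
    RoleRequest("Db1", "ISVC_Acct1", "Role17", True, "all"), roles_by_database
)
```

The point of the design is that the requestor sees the problem immediately, while they still have the context to fix the name, rather than after the request has been processed.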

The most important change that came from the data analysis was introducing a new fast feedback loop. And I don’t think we would have found it without analyzing the data. It’s a hard discipline to gather the metrics, but we need to start doing it much more frequently and in greater detail.
