I have a nightly scheduled job that sends out a notification email if an error occurs anywhere within it, even when the error is handled. The job sends this email infrequently. However, after a long history of reviewing these emails, I have reached the point where I assume the error is always:
There was a file lock on file X when the file was being saved; the function detected the error, waited briefly for the lock to clear, and then retried the save operation successfully.
I can't remember a time when that wasn't the case. Because of this, I find myself less interested in actually reading the error message and more inclined to simply ignore the email. But I know that will lead to a situation where something unexpected happens and I ignore the warning emails, which would be a failure of the entire warning system.
So, what I have is a very narrowly defined and well-known case in which the exception occurs, and I want to ignore it. If I set up the code to suppress this error once the save operation completes successfully, I should be able to safely reduce the noise in the error messages sent to me. (It should still report the error if the retries never complete successfully.)
This is a very common scenario: teams set up a warning mechanism that is highly effective when a system is first built. At that time, there are a myriad of unforeseen errors that could occur. There also hasn’t been enough operational history to feel that the system is stable, so being notified of every potential problem is still a welcome learning experience. As those problems are reduced or eliminated, trust in the new system grows. However, it’s also very common that once a team completes a project and does a moderate amount of post-deployment bug fixing, they are asked to move on and prioritize a new project, which leaves no allocated time for the small and inconsistent issues that arise in the previous project(s).
Unfortunately, the side effect of not taking the time to maintain the older projects and pay down their technical debt is that you can become used to the “little” problems that occur on them, to the point of ignoring the warning messages they send out. This creates an effect where you start to distrust that the warning messages coming from a system are important, because you believe you already know the warning is “little” or “no big deal”.
The best way to instill confidence in the warning and error messages produced by a system is to ensure that the systems only send out important messages, separating the Signal from the Noise.
For my scenario above, the way I’m going to do this is to prevent these handled errors from sending out notification emails. This goes against best practices because I will need to alter PowerShell’s global error collection, $global:Error. But, given that my end goal is to ensure that I only receive important error messages, this seems like an appropriate time to go against best practices.
Below is a snippet of code which can be used to remove error records from $global:Error that match given criteria. It removes only the most recent entries for that error, in order to keep the historical error log intact.
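A minimal sketch of this idea, under some assumptions: the function name Remove-HandledError and its -MessagePattern and -Count parameters are hypothetical names chosen for illustration, and matching on the exception message with a wildcard is just one possible criterion. It relies on $global:Error being an ArrayList with the newest entry at index 0.

```powershell
# Hypothetical helper (illustrative name and parameters, not a built-in cmdlet).
function Remove-HandledError {
    [CmdletBinding()]
    param(
        # Wildcard pattern identifying the handled error, e.g. '*locked*'.
        [Parameter(Mandatory)]
        [string] $MessagePattern,

        # How many of the most recent matching entries to remove.
        [int] $Count = 1
    )

    # $global:Error keeps the newest entry at index 0, so walking forward
    # from 0 visits the most recent errors first and leaves older history alone.
    $removed = 0
    $index = 0
    while ($index -lt $global:Error.Count -and $removed -lt $Count) {
        $entry = $global:Error[$index]

        # Entries are usually ErrorRecord objects, but can be bare exceptions.
        $message = if ($entry -is [System.Management.Automation.ErrorRecord]) {
            $entry.Exception.Message
        } else {
            [string] $entry
        }

        if ($message -like $MessagePattern) {
            $global:Error.RemoveAt($index)   # the next entry shifts into $index
            $removed++
        } else {
            $index++
        }
    }

    # Emit the number of entries removed so the caller can check it.
    $removed
}
```

Calling Remove-HandledError -MessagePattern '*locked*' after a successful save would drop only the newest matching entry; anything older, or anything that does not match the pattern, stays in $global:Error and is still reported.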
You need to be careful with this. If the error you’re looking for occurs within a loop with a retry policy, you need to keep the errors from attempts that kept failing until the retry policy was exhausted, and remove errors only for the cases where a retry eventually succeeded. You can better handle the retry policy situation by using the -Last 1 parameter.
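One way to respect that rule, sketched here with a hypothetical wrapper named Invoke-WithQuietRetry (the name, parameters, and default delay are assumptions for illustration), is to remember the error entries produced by failed attempts and remove them only once an attempt succeeds. If every attempt fails, nothing is removed and the final error is rethrown.

```powershell
# Hypothetical retry wrapper (illustrative, not a built-in cmdlet).
function Invoke-WithQuietRetry {
    param(
        [Parameter(Mandatory)] [scriptblock] $Action,
        [int] $MaxAttempts = 3,
        [int] $DelayMilliseconds = 200
    )

    # Error entries produced by failed attempts during this call.
    $handled = New-Object System.Collections.ArrayList

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            $result = & $Action
            # Success: the earlier failures were handled, so drop only those entries.
            foreach ($entry in $handled) { $global:Error.Remove($entry) }
            return $result
        }
        catch {
            # Retries exhausted: rethrow and leave every error in $global:Error.
            if ($attempt -eq $MaxAttempts) { throw }

            # Inside catch, the newest entry ($Error[0]) is the error just caught.
            [void] $handled.Add($global:Error[0])
            Start-Sleep -Milliseconds $DelayMilliseconds
        }
    }
}
```

Because the entries are removed by reference rather than by message text, an unrelated error that happens to be recorded during the same run is left in place.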