What’s an Andon Cord?

on Monday, June 8, 2020

I’ve seen a couple of great explanations of an Andon Cord, but I feel like there’s another side to them. Something that hasn’t already been written about in John Willis’ The Andon Cord, Six Sigma Daily’s The Andon Cord: A Way to Stop Work While Boosting Productivity, or even Amazon’s Andon Cord for their distributed supply chain (example 1, 2).

Personally, I’ve heard two descriptions of the Andon Cord. One is the popular definition, and the other is the one that makes a lot of sense to me. The popular definition is that an Andon Cord is the capability to pause/halt any manufacturing line in order to ensure quality control and understanding. Anyone can pull the Andon Cord and review a product that they are unsure about, which intuitively makes sense.

The second definition I heard was from Mike Rother’s book, Toyota Kata, and wasn’t so much a definition as a small glimpse into the history of the Andon Cord. The concept that we know today started at Toyota before they made cars. This was back when they were making sewing machines. And you need to imagine that the sewing machine manufacturing line was modeled off of Henry Ford’s Model T production lines. So there was a belt that ran across the factory floor and in between the stations that people worked at. The sewing machines would be mounted to the line and would move from one station to the next on a timed interval (let’s say every 5 minutes). So, each person would be working at their station and they would have 5 minutes to complete the work of their station, which was usually installing a part and then some sort of testing. If the person on the line felt that they couldn’t complete their work on time, then they should pull the Andon Cord to freeze the line. This ensured that no defective part/installation continued on down the factory line. The underlying purpose of not having a bad part go down the line is that disassembling and reassembling a machine to replace a defective part is very expensive compared to stopping the line and fixing it while the machine is at the proper assembly level. This makes complete sense to me.

The second definition makes a lot more sense to me than the first because of one unspoken thing:

Anyone on the assembly line can pull the Andon Cord. The Andon Cord can be pulled by anyone, but it’s supposed to be for the people who are actually on the assembly line and have expert knowledge about their step within the overall process. It’s their experience and knowledge of that particular product line that makes them the correct people to pull the cord. It’s not for people from other product lines to come over and pull that line’s cord.

This is a classic problem that I’ve run into time and again. On multiple occasions, I have seen the Ops and Management side of businesses hear that “anyone can pull the Andon Cord” and immediately start contemplating how they can use the cord to add Review Periods into process lines and allow anyone to put the brakes on a production deployment if they don’t understand it.

But those ideas seem counter-productive to the overall goals of DevOps. You don’t want to add a Review Period, as that just delays the business value from getting to the end customer. And you don’t want to stop a release because someone who isn’t an expert on a product has a question about it; you want that person to go ask the experts, and then you want the experts on a Product Line to pull the Andon Cord.

Now, in an idealized world, the people involved in a product’s deployment process would all be on the same Product Team. That team would be made up of Dev, Ops, and other team members. And all of those team members would be experts on the product line and would be the right people to pull the cord.

However, the majority of businesses I’ve talked with have separate Dev and Ops/Engineering teams, simply because that structure has been lauded as a very cost-effective way to reduce the company’s expenditure on Ops and allow for their knowledge to be centralized and therefore non-redundant. But when the Ops team is separate from the Dev team, the Ops team has a sense that they are a part of every product line and that they should have ownership over allowing any release to go to production, even when they are not experts on the product line and have no knowledge of what a change actually does.

This sense of ownership that Ops (and to some degree Management) have often manifests in the form of asking for a review period between the time a deployment has passed all of its testing requirements and the time it actually goes out to production. The request is usually that this review period start with a notification to the customers and end a few hours or a day later, so that Operations, Management, and Customers all have time to review the change and pull the cord if they have concerns about it. Except, the Customer and the Product Team are the only ones on that line who are really experts on the product. And customers that work alongside their product teams usually know what’s coming long before scheduling, while customers that don’t work alongside their product teams usually won’t be involved at all at this point.

So, if the above is true, then Operations (and Management) wouldn’t be experts on the product line at this point in the release process, so why would they be pulling the cord at this point?

For Management, I’m not sure. But, for Operations, they are experts on the Production environment that the deployment will be going into. Ops should be aware if there are any current issues in the production environment and be able to stop a deployment from making a bad situation worse. But that isn’t a product line Andon Cord. That’s an Andon Cord for an entire environment (or a subsection of an environment). The Operations team should have an Andon Cord to pause/halt all deployments from going into Production if something is wrong with that environment. Once the environment has been restored to a sense of stability, then Operations should be able to release the cord and let the queued deployments roll out again. (Sidenote: many companies that are doing DevOps have communication channels set up where everyone should be aware of a Production environment problem; this should allow for “anyone” to pull the Environment Andon Cord and pause deployments for a little while.)
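
To make the Environment Andon Cord a little more concrete, here is a rough sketch of what I have in mind. None of this comes from a real tool: the `EnvironmentAndonCord` class, its method names, and the deployment names in the usage note are all made up for illustration, and "deploying" is just appending to a list. The only point is the behavior: while the cord is pulled, deployments queue up instead of rolling out, and releasing the cord lets them roll out in the order they arrived.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class EnvironmentAndonCord:
    """Hypothetical gate sitting in front of one environment (e.g. Production)."""

    environment: str
    pulled_by: Optional[str] = None   # who pulled the cord; None means released
    reason: Optional[str] = None      # why the environment is considered unhealthy
    queued: Deque[str] = field(default_factory=deque)
    deployed: List[str] = field(default_factory=list)

    def pull(self, who: str, reason: str) -> None:
        """Pause every deployment headed into this environment."""
        self.pulled_by = who
        self.reason = reason

    def submit(self, deployment: str) -> bool:
        """Deploy right away if the cord is released, otherwise queue.

        Returns True when the deployment rolled out, False when it was queued.
        """
        if self.pulled_by is not None:
            self.queued.append(deployment)
            return False
        self.deployed.append(deployment)  # stand-in for a real rollout
        return True

    def release(self) -> List[str]:
        """Release the cord and roll out queued deployments in arrival order."""
        self.pulled_by = None
        self.reason = None
        released = []
        while self.queued:
            deployment = self.queued.popleft()
            self.deployed.append(deployment)
            released.append(deployment)
        return released
```

Used with some invented deployment names, `cord = EnvironmentAndonCord("production")` followed by `cord.pull("ops-oncall", "database failover in progress")` would make `cord.submit("search-2.1.0")` return `False` and hold that deployment until `cord.release()` is called. Notice that nothing here asks whether the person pulling the cord understands the product; it only asks whether the environment is healthy.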

Finally, in the popular definition of the Andon Cord there is a lot of attention paid to human beings pulling the Andon Cord, but not a lot of explicit statements about machines pulling it. For me, both groups can pull the Andon Cord. It seems like everyone intuitively understands that if unit tests, or smoke tests, or a security scan fails, then the process should stop and the product should go back to the developer to fix it. What I don’t think people connect is that this is an Andon Cord pull. It’s an automated pull to stop the process and send the product back to the station that can fix the problem with the least amount of rework required. To see that, though, you have to first recognize that a CI/CD automated build and deployment process is the digital transformation of a factory floor’s product line. Your product moves from station to station through human beings and automated tooling alike (manual commit, CI build, CI unit tests, manual code review, manual PR approval, CI merge, CI packaging, CD, etc.), and at every station there is a possibility of an Andon Cord pull.
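
Here is a small sketch of that idea, treating a pipeline as a line of stations where any station can pull the cord. It is an illustration under my own assumptions, not how any particular CI/CD product works: `Station`, `AndonPull`, `run_line`, and the keys in the `change` dictionary are all hypothetical names, and each check is a stand-in for a real build, test run, or human review.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


class AndonPull(Exception):
    """Raised when a station stops the line; records which station and why."""

    def __init__(self, station: str, reason: str) -> None:
        super().__init__(f"{station}: {reason}")
        self.station = station
        self.reason = reason


@dataclass
class Station:
    name: str
    # Returns a reason to stop the line, or None if the work is fine.
    check: Callable[[Dict[str, bool]], Optional[str]]


def run_line(stations: List[Station], change: Dict[str, bool]) -> List[str]:
    """Move a change from station to station, stopping at the first pull."""
    completed = []
    for station in stations:
        problem = station.check(change)
        if problem is not None:
            # Stop here so the problem is fixed at the station that found it,
            # before any later station builds on top of it.
            raise AndonPull(station.name, problem)
        completed.append(station.name)
    return completed


# A made-up line mixing automated stations and a human one.
LINE = [
    Station("CI build", lambda c: None if c.get("builds") else "build failed"),
    Station("CI unit tests", lambda c: None if c.get("tests_pass") else "unit tests failed"),
    Station("manual code review", lambda c: None if c.get("approved") else "reviewer has concerns"),
    Station("CD", lambda c: None),
]
```

Calling `run_line(LINE, {"builds": True, "tests_pass": False, "approved": True})` raises an `AndonPull` at the "CI unit tests" station, and the code review and CD stations never see the change. The automated unit test station and the human reviewer pull the same cord in exactly the same way.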
