The goal of an IT organization is to help the business show a profit through the delivery of valuable software. As we consider our process, instead of thinking in terms of projects, think in terms of capabilities needed to complete the work. The concept of capability implies measurability. Measurements taken from a stable process establishes a baseline for predictability. Predictability is needed to support planning, including budgeting. Yet we find that accurate communication about what is meant by capability is a significant problem.
“Meaning starts with the concept, which is in somebody’s mind, and only there: it is ineffable.”
— W. Edwards Deming “Out of the Crisis”
Each measurement of a process is effectively a sample, and when we talk about metrics, we’re really talking about a series of samples. In order for us to believe in reported metrics, we need to have confidence that the measurements of the series are of the same thing and taken in a consistent manner. We pin our confidence on a set of statements about our metrics process that are sufficiently unambiguous to allow different people to arrive at a common understanding of what is meant. Definitions of terms used to support metrics are called operational definitions.
“An operational definition puts communicable meaning into a concept”
— W. Edwards Deming “Out of the Crisis”
When I worked on oil pipeline control software, one of the problems that operators faced was discrepancies in batches. A batch of 10,000 cubic meters of gasoline was metered through a pipeline into a storage tank. The buyer off-loaded the fuel into tanker trucks which had a capacity of 44 cubic yards, calculating that they would need to make 227 loads in order to haul away their purchase. When the storage tank was emptied by the 223rd load, they filed a claim for the missing 4 loads.
We explained to the customer that a 10 degree Fahrenheit drop in temperature will reduce the volume of 10,000 cubic meters of gasoline about about 200 cubic meters, which equates to about 4 tanker loads. It takes a long time for 10,000 cubic meters of gasoline to change to the ambient atmospheric temperature, and normally the fuel was delivered before a volume change was noticeable. In this case, the fuel sat in the tank for a long time, and it had gotten colder. The customer simply pointed to his contract, and said “It doesn’t say anything about warm or cold, it just says 10,000 cubic meters”. He got 4 additional loads out of the bargain.
It turned out that writing a contract for the delivery of cubic meters of fuel was an insufficient definition for what was being transacted. The failure to state in advance the operational details of the transaction created a loss for the business and ill will with the customer. The operational definition for future transactions was revised to state that product was to be measured by cubic meter at a specified temperature. The volume could then be calculated based on the actual temperature of the fuel at the time it was loaded.
Metering the productivity running through a software delivery pipeline is a bit more complex. We construct classes and categories in our minds to represent productivity, pipelines and the software running through, things that do not physically exist, yet we come to treat them in the same way as tangible objects of the material world.
Edward Martin Baker said “An operational definition is a procedure agreed upon for translation of a concept into measurement of some kind”.
If we ask everyone how much they think that the capability of their team has changed over the last few weeks, we will likely get different answers, but even when different people provide the same answer, we have no way of knowing if they mean the same thing. In order for the measurement of improvement in capability to be of practical meaning to the business, we will have to define what we mean by capability.
Relying on sentiment as a proxy metric for capability is hardly reliable. Nobody would agree for their salary to be calculated in such an arbitrary way, and we shouldn’t be asking the business to take subjective numbers as a basis for budget allocation.
To get to a more reliable reporting of our actual capabilities, we need to look at the delivery process in a bit more detail, to identify attributes that correlate with productivity and are also objectively measurable. Qualitative attributes may supplement a decision making process, but first we need a solid foundation of objective quantitative measurements.
The delivery process isn’t something that happens apart from the development process. Early integration is a key principle of Agile practice, where development activity and deployment activity are closely intertwined. Agile practice mitigates risk through iterative development and delivery practice mitigates risk by incremental inspection of batches of work product, locking in the the gains throughout defined stages of the delivery process.
When pipeline test gates and release protocols are properly defined, then the objective measurements of events in the delivery pipeline form a reasonable basis for appraising an organization’s capabilities. Key pipeline events indicative of the health of the delivery process are Lead Time, Cycle Time, Release Frequency and Change Failure Rate. Those attributes of the system can be measured objectively and taken together are good signals of a software organization’s capabilities, provided that we properly define what we mean by those terms.
The concept of operational definition was first established in Walter A. Shewhart’s 1932 book “Economic Control of Quality in Manufactured Product”. Shewhart and W. Edwards Deming both worked at the General Electric Hawthorne plant in Chicago, which produced transmitters for telephone systems.
Shewhart later transferred to Bell Labs, and by 1936 Deming was in charge of the Graduate School of the US Department of Agriculture, where he invited Shewhart to teach a masters class on statistics. With Deming as Editor, the four lecture series was published in 1938 under the title of “Statistical Method from the Viewpoint of Quality Control”. The 1986 Dover edition is still the master reference on the topic today.
Deming wrote that a practical approach to operational definitions is essential to the concept of measurement, and thus the key to understanding productivity and quality control. Deming is perhaps best know for his work with Japanese industry in the post war period of reconstruction. His 1950 book “Some Theory of Sampling” represents the material that Deming taught to Japanese engineers, scientists and managers, a significant contribution to the manufacturing revolution of the following decades.
The most well know aspect of the Japanese industrial transformation is the Toyota Product System, built around kanban, Taiichi Ohno’s replacement of the fixed-time assembly line with a signaling mechanism using buffers to allow for variable time at a workstation.
But an equally important aspect of the Toyota Production System is the Improvement Kata: a continuous cycle of improvement. The Kata is based on Shewhart’s “Plan Do Study Act” cycle, which is basically a restatement of the scientific process tailored for practical application on the contemporary shop floor. It comes down to us as the Deming Cycle, because Deming taught it.
The Toyota Kata replaces the notion of quality being defined in terms of meeting specification with the idea of continual improvement of current state. Kaoru Ishikawa appended an Observe step to the Deming Cycle, reflecting the importance of an unambiguous assessment of current condition central to the Kata. Deming taught Shewhart’s principles of operational definitions to Japanese managers as a core skill needed to generate the consistent and reliable metrics needed for current state assessment.
Deming said that his 1982 book Quality, Productivity and Competitive Position was based in large part on his understanding of Dr. Shewhart’s teaching. While Deming spoke of the manufacturing domain, the book eventually came to the attention of software thought leaders who recognized the significance for the problems of knowledge work. Just as continuous integration practice was adapted from the Taiichi Ohno’s kanban signaling mechanism, the conception of metrics that we have today cannot be properly explained without reference to Walter Shewhart’s principle of operational definitions.
Operational Definitions doesn’t just mean that we need to agree on what things mean. It refers to pragmatic definitions suited for a specific context. We don’t need all-encompassing definitions as for scientific work, nor do we need definitions guaranteed to produce the most precise answer possible. An operational definition is tailored to the specific purpose of measurement in given context, and need only be sufficiently precise to support a common understanding of what is meant.
Operational definitions are essentially the instructions for how the measurement should be taken. Measurements are effectively observations, and our goal is to insure that a series of observations are consistent. Our starting point is to establish what event-state changes should trigger our measurement operation.
Consider the concept of Cycle Time: how long it takes for work to get done and delivered. We can have useful discussions about the concept of Cycle Time, but the concept alone is far too ambiguous to be of practical use. We have no way of knowing to what degree we are talking about the same thing, and no way of knowing if a series of measurements taken over time are comparable. Cycle Time represents a measurable capability of a service, but only after the bounds of what constitutes a cycle are expressed.
The conceptual definition may be perfectly adequate for a general discussion, but it is pointless to try to take up the sober work of actual measurement without first establishing an operational definition.
For example, what constitutes the cycle in “Cycle Time”? Does it start…
- when a story gets pulled into a ready queue?
- when it first goes into work?
- or when the first commit is made?
And when does the Cycle close?
- when a story passes a specified test gate?
- when the Product Owner accepts work?
- or when the story goes into production?
There is no “right” answer, only the necessity of applying the same definition to a series of measurements taken from a given service.
“Any measurement is the result of applying a procedure. "
— W. Edwards Deming “Out of the Crisis”
Even when we provide unambiguous direction on what constitutes Cycle Start and Stop, there may be more attributes about what we need to consider. Remember, that while an operational definition need only be as precise as needed for our specific purposes, it must be sufficiently clear to resolve any ambiguity about how to apply the procedure.
Consider an organization that adopts the definition of a Cycle starting when the developer firsts pulls a story into work, and the Cycle ending with production release, counting only business days, excluding weekends and holidays. Several months of measurements are taken. Later, after the team takes time for a company retreat, should the retreat days should be included, because they are business days, or excluded, since the team isn’t available to work on task. If we deduct the retreat days, what about vacation weeks? Or time diverted to some urgent security incident? What constitutes a legitimate deduction to the Cycle Time days?
Facing these questions, the organization may decide just use straight calendar time as a lot less hassle. That represents a change in procedure that in no way affects the service capability. It changes the the values reported for the capability, but not the capability itself.
Any procedure for measurement can be used so long as it’s understood, and people depending on the reported values to make decisions are aware of context of the operational definition.
There can be no notion of the capability of a department, team or organization that does not rest upon a foundation of clearly articulated statements of the object of measurement and the methods used.
“There is no true value of any characteristic, state or condition that is defined in terms of measurement or observation. Change of procedure for measurement (change in operational definition) or observation produces a new number.”
— W. Edwards Deming
It may come as a surprise to many that there is no true value of the speed of light. 299,792,458 meters per second is the “correct” value if you adopt the international standard of the definition of a meter, which as you might imagine has it’s own hotly debated operational definitions, and limit the applicability of the measurement to a vacuum. If you need precision to 7 or 8 decimal positions, and for the measurements to be taken under atmospheric conditions, then you’ll need to adopt a different operational definition, and perhaps incur greater costs of taking and verifying a series of measurements.
The cost of carrying out a procedure is relevant. A cheaper (less time consuming) procedure may be preferred over a more accurate one.
What do you suppose you would have to charge if you were asked develop a procedure to verify the accuracy of the commonly accepted value of circumference of the earth of 40,000 kilometers, give or take 2%?
Before you look into booking some time with SpaceX, consider that in the 2nd century BC, Eratosthenes devised a procedure that that he could execute without leaving the comfort of his desk at the library in Alexandria. His estimate of 252,000 stadia was within 2% of the actual value, perhaps closer. We can’t say how close he was because there is more than one way of specifying the length of a stadia.
Where there are multiple procedures for deriving a value, the preferred method is referred to as the master reference. The master reference can be updated as the objective of the measurement changes, or the means of measurement are improved, for greater accuracy or lower cost.
“The preferred procedure is distinguished by the fact that it supposedly gives or would give results nearest to what are needed for a particular end.”
— W. Edwards Deming “Out of the Crisis”
If we can’t measure the speed of light or the circumference of the earth without operational definitions, how could we ever hope to measure productivity in knowledge work without them?
If our Devops metrics reflect events of the delivery process, it is only reasonable that we sort out what we mean by pipeline, working out operational definitions for the term pipeline, and for the related terms system, environment, stage, protocol and release, all of which we covered in a prior post. Given a delivery process bounded in this way provides a solid foundation to place meaningful criteria for the Devops metrics.
What follows is a draft proposal for operational definitions for the following key metrics of Devops:
- Lead Time
- Cycle Time
- Release Frequency
- Change Failure
- Mean Time to Recovery
They are not offered as a “best practice”, but rather simply as an example. The only correct definition is the one the serves your purposes.
We’ll define Release Frequency as the total number of successful releases to a delivery stage divided by elapsed days. The various deployment stages are accounted independently, such as UAT releases as opposed to Production releases. Deployments are generally considered to be to the system as a whole, or any component of the system.
It sounds simple enough but what constitutes that system? Earlier, we provided the operational definition of system as “the network of interdependent components which work together to achieve the goals of the system. Systems provide value to customers. While projects may be focused on a component of the system, deliveries are always made in the context of the system as a whole.”
Software systems are more often an array of overlapping services. When is a release the collection of services together, and when is each independently releasable service accounted as its own release cycle? You may start accounting several system component together as a single release, because they are closely dependent, and then later decide to account them separately, because they are actually independently releasable.
The only correct answer is that you account for system component releases in a consistent way, ensuring that all aspects of change are accounted for somewhere.
Our definition of Lead Time starts with inception, the first meeting of the stakeholders and the design team, and it ends on delivery of a system increment to production. It sounds simple enough, but often work starts before inception. Then there’s the question of when the Lead Time clock stops for an iterative development process. Does the Cycle end with the first MVP release, or when the last feature requested is in production?
The main thing for Lead Time is that it should be defined in terms of business value realized.
Our definition of Cycle Time starts when a story is first transitioned to an In Progress work state, and ends on release to production. Story criteria that are split and left for a future iteration of course start with their own cycle.
Cycle Time should end when the business can begin receiving benefits from the work: on production release. Organizations that have significant procedural obstacle to releases such as a regulatory clearance process might choose to close Cycle Time cycles at an earlier stage, such as after UAT approval, and subsume the clearance delays in Lead Time accounting.
Change Failure Rate is the relationship of failure demand to feature demand. Feature demand is the routine work that comes out of the product backlog. Failure demand is work that is ticketed after the story’s Cycle Time Cycle has closed. Bugs that are ticketed while the cycle is still open is accounted as feature demand.
A practical objective of change failure tracking is to provide a damper on over zealous attempts to reduce Cycle Time. Another goal is to provide insight into the mix of work in the flow for capacity planning, to have a better idea of how much of a work stream’s capacity might be available for feature demand work.
Mean Time to Recovery is the time it takes to restore all or any part of a system at a given pipeline stage: code, content, library versions, services, routing, authentication, caching, configuration or settings. In short, restoring the system as a whole to a prior state is the subject of measurement.
Finally, we’ll assert that Lead Time, Cycle Time and Release Frequency are measured in straight calendar days. As we discussed earlier, any scheme of deducting non-working days is more complicated that it at first appears.
It is not important that you agree with those definitions, but it is essential that you arrive at and socialize your own operational definitions for the attributes you choose to measure. Version them to pin the meaning and to communicate that they are to be considered something more than a conceptual guide. The purpose of an operational definition is to describe the procedure for taking a measurement.
It is important to realize from the proceeding that there are two aspects of an operational measurement; one is quantitative and the other qualitative. One consists of numbers such as elapsed time in n measurements of events in a process, and the other consists of the tasks accomplished by a team in accordance to accepted practice leading to the generation of the measured event, such as a release. The later qualitative aspect is where value is delivered to the business, the numbers are simply pointers to summarize that value.
For the purposes of our discussion about the software delivery process, and the role of pipelines in realizing our business objectives, the quality we are focused on is the capability of the organization to provide the service of developing and delivering valuable software systems. All metrics should be understood in the context of achieving that goal.
“Operational definitions can help make common and verifiable the meanings of statements used by people who must communicate with each other”
— W. Edwards Deming
Anybody taking the trouble to present metrics should be prepared to answer the question of what decisions their information is intended to inform. The purpose of reporting metrics is to support decisions, so it follows that we need to understand the kinds of decisions they are intended to support, to insure that a given metric is suited to the stated purpose.
In “Fit For Purpose”, David Anderson proposes a taxonomy of metrics as fitness criteria, health monitors, improvement drivers and vanity metrics.
Vanity metrics are measurements that don’t influence decisions. Getting the right information to make decisions is invaluable, but no metrics capture is free, and when metrics are not being used to support decision making, the effort is coming straight out of productivity.
Fitness metrics are typically expressed as threshold criteria. Page load time is an example of a fitness criteria. Exceeding the stated threshold would make a pending release a no-go, not fit for purpose.
Health metrics are expressed as ranges, and have an essential role in providing ground truth. The job of a blood pressure monitor is not to tell you what you wish your blood pressure were, but to provide an unvarnished reading of reality.
Improvement drivers are expressed as targets. The Improvement Kata typically uses health metrics to establish current state and the sets improvement driver targets, relative to current status.
Don’t conflate health metrics with improvement drivers. Health metrics are intended to provide an objective assessment of current state, so we don’t want to set targets in the context of the health metrics. A heart rate monitor samples the patient heart rate, and provides a running summary of the sample series. Based on that information, you might want to set an improvement driver to lower the heart rate. The health metrics are expressed as a range, the improvement driver is expressed as a target. They are related, but distinct. In any case, the real focus should not be on either metric, but the proposed actions to try to achieve the improvement. Don’t get caught up in chasing numbers. Actual improvement is what we’re after.
For example, we might want to work on reducing Cycle Time, a health metric, but that is unlikely to happen as a result of simply doing things sooner or faster. Cycle Times are what they are for a reason, and the process of improvement is not in making faster times, but in identifying what those reasons are, and adapting practices to facilitate the work. Improved Cycle Time is an outcome that can be measured, but not actually the object of improvement.
“An operational definition translates a concept (round, random, safe, conforming, good) into a test and a criterion … the test-method and the criterion to have meaning for business and legal purposes can only be stated in statistical terms: likewise a standard for safety, reliability or performance”
— W. Edwards Deming
Delivery practice is the cornerstone of an IT organization. The reliability of a delivery service is a factor of the attention paid to pipeline implementation; the effectiveness of the test gates which guard stages and especially of the release protocols adopted. The capability of a delivery service can be appraised in terms of health metrics reported from pipeline events. These metrics can only be reliable if we insist on accurate communication about what we mean.
Software delivery pipelines bounded by operational definitions provide a solid foundation to begin to explore the question of variability. Walter A. Shewhart invented the Control Chart to visualize variability, with the goal of communicating what constitutes stability of a process and how to go about the business of improving things. Thus, the next topic in this series is “Understanding Variability”.
Man on factory floor with measuring tape. Photo credit: Spencer Davis
People gathered pointing at a laptop Photo credit: John chnobrich
OMV Oil Refinery in Schwechat, Austria. Photo credit: Dimitry Anikin
Measuring Tape. Photo credit: Darling Arias
Woman with Protractor Photo credit: J J Jordan
Operational Definitions Photo credit: Stephen Dawson
Sunlight on a forest road. Photo credit: Ingmarr
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2464836/Tweet alt="Tweet this" />