Friends Don't Let Friends Compare Velocities
Story Points are based on a parametric estimation technique useful for separating time and effort.
Velocity is a scheme for using Story Points to calculate the team’s progress across Sprints using interval scale math.
There’s nothing sketchy about interval scale math, although having a sketchy understanding of the principles will invariably lead to sketchy results.
Two plus two equals five is not without its attractions.
— Fyodor Dostoevsky
The canonical reference for Velocity calculation is Mike Cohn’s “Agile Estimation and Planning”, which was published over 15 years ago. The myriad posts on the topic are wildly contradictory. Let’s set the record straight.
Measuring a Team's Progress
Outside of Agile practice, governance basically consists of verifying that promised requirements are delivered within budget. Agile practitioners argue that requirements should be left open for discovery during the development process: a compelling position, but one that leaves managers understandably nervous about how to measure progress. The Velocity Calculation is a sort of magic formula that combines estimation with a metric used for governance.
Velocity is a measure of a team’s rate of progress.
— Mike Cohn
There’s nothing so sad as a manager without a metric, and Velocity is particularly satisfying because it’s complicated enough that people don’t feel bad that they don’t fully understand it. Most people don’t get past the Story Points part of it, but there is more to it than that. But first things first.
The Story of Story Points
Imagine we gave up software for a career in landscaping. If we find that it takes us 30 minutes to mow a half-acre lot, then we can say with confidence that it should take 2 hours to mow a 2-acre lot. We could assign 3 Mowing Points to the half-acre lot and then we’d know that the 2-acre lot should be 12 Points.
Story points are a unit of measure for expressing the overall size of a user story, feature, or other piece of work.
— Mike Cohn
The beauty of this method is that it disassociates time and effort. If we get a faster mower, and it takes us only 20 minutes to mow a half acre, then we can apply our Mowing Points to recalculate that it should take 80 minutes to mow the 2 acres.
That convenient technique is called parametric estimation. Like its cousin, analogous estimation, it has its place in our estimation arsenal, but it can be dangerous when used in place of proper analysis.
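The mowing arithmetic above can be written out as a small sketch. The names `mowing_points` and `estimate_minutes` are made up for illustration, and the only inputs are the figures already used in the lawn example: a half-acre baseline worth 3 Mowing Points, timed at 30 minutes with the old mower and 20 minutes with the new one.

```python
BASELINE_ACRES = 0.5    # the lot we actually timed
BASELINE_POINTS = 3.0   # arbitrary value assigned to that lot


def mowing_points(acres: float) -> float:
    """Size a lot relative to the half-acre baseline."""
    return acres / BASELINE_ACRES * BASELINE_POINTS


def estimate_minutes(points: float, baseline_minutes: float) -> float:
    """Scale the time measured on the baseline lot to a lot of `points` size."""
    return points * baseline_minutes / BASELINE_POINTS


two_acre_lot = mowing_points(2.0)
print(two_acre_lot)                          # 12.0 Mowing Points
print(estimate_minutes(two_acre_lot, 30.0))  # 120.0 minutes with the old mower
print(estimate_minutes(two_acre_lot, 20.0))  # 80.0 minutes with the faster mower
```

The size of the lot in Points never changes; only the measured time for the baseline does. That is the separation of time and effort in miniature.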
Story Points Meet Their Match
If lawns were anything like software stories, eventually we’d encounter an acre of genetically modified steel-fiber turf. Not suspecting the trouble we’re in for, we’d burn our 6 Mowing Points allocation before we’re out of the gate. We’ll need to build custom tooling for our mower, mount diamond-edged blades, and swap out inflatable tires for solid rubber. Tell me it ain’t so.
Story Points employ parametric estimation at the stage in the process where analogy can be very deceptive, hiding an 89 Point story in a batch of otherwise innocent-looking 3 and 6 Point stories. Seeking out this kind of trouble should be a feature of our estimation process. Analogous size estimation can give a false sense of certainty, causing us to overlook simple questions that might reduce uncertainty about the true nature of some black swan we’re about to take into work.
Parametric estimation is useful for many estimation scenarios, for example when you have no clue, and you need to establish a starting point. However, it’s not a substitute for analysis.
Story Points as Interval Scale Measurement
Story Point estimation starts with establishing a baseline by assigning an arbitrary value to the smallest task. Recall that in our lawn mowing work, we chose 3 Mowing Points to signify a half-acre lot, but we could just as easily have picked 4, 40 or 400 Mowing Points. It doesn’t matter, because we’re using the half-acre lot as our baseline. The arbitrary value we choose for the baseline is the feature that allows us to separate effort from time in estimating other tasks.
The measurement scale used for Story Points is called an interval scale, a method of measurement used to great advantage in many disciplines. For example, Intelligence Quotient or IQ is an interval scale measurement. Fahrenheit and Celsius are interval scales that everyone is familiar with. Unlike Kelvin, which is a ratio scale with a true zero, neither 0°C nor 0°F indicates an absence of temperature: they are just arbitrary points on their respective scales.
The arbitrary zero is the secret sauce of an interval scale. Given that we know something, we can use it as the baseline for a scale on which other things are compared with it. We’re creating a new scale every time we set a baseline anchored on something we know. We can make reliable comparisons between different values on the same scale, but comparisons of values from scales based on different baselines are unreliable.
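A quick way to see what an arbitrary zero does to the math is to take the same three temperatures in Celsius and Fahrenheit. This is only a sketch using the standard conversion formula; the sample readings are invented. Differences between readings keep their proportions on both scales, while the ratio of two readings changes with the scale, which is why "twice as hot" means nothing on either one.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Standard conversion between the two interval scales."""
    return celsius * 9 / 5 + 32


readings_c = [10.0, 20.0, 30.0]  # invented sample readings
readings_f = [celsius_to_fahrenheit(c) for c in readings_c]
print(readings_f)  # [50.0, 68.0, 86.0]

# The ratio of two readings depends on which scale we happen to use.
print(readings_c[1] / readings_c[0])  # 2.0
print(readings_f[1] / readings_f[0])  # 1.36

# The ratio of two differences is the same on both scales.
print((readings_c[2] - readings_c[1]) / (readings_c[1] - readings_c[0]))  # 1.0
print((readings_f[2] - readings_f[1]) / (readings_f[1] - readings_f[0]))  # 1.0
```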
Friends don’t let friends compare Velocities
Imagine you’re a global construction manager who receives a call from a crew chief who reports that they had to stop working because the temperature reached 40 degrees. You check with another crew chief who declares: “40 degrees: we work! no problem!” Rewarding the crew chilling out in Fahrenheit while punishing the crew facing heat stroke in Celsius would be no more absurd than the common enterprise practice of judging the performance of different Scrum teams by comparing Velocity, something so mathematically goofy as to be worthy of the Dilbert pointy-haired boss award. Still, people do it all the time.
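To put numbers on the absurdity, here is a sketch of two hypothetical mowing crews who do exactly the same work in a Sprint but picked different baseline values for the half-acre lot: 3 Points for one crew and 40 for the other. The lot sizes and the helper names are invented for illustration.

```python
HALF_ACRE = 0.5


def size_in_points(acres: float, baseline_points: float) -> float:
    """Size a lot relative to a half-acre lot worth `baseline_points`."""
    return acres / HALF_ACRE * baseline_points


def velocity(lawns_mowed: list[float], baseline_points: float) -> float:
    """Total Points completed in one Sprint on a given crew's own scale."""
    return sum(size_in_points(acres, baseline_points) for acres in lawns_mowed)


same_work = [0.5, 1.0, 2.0]  # both crews mow these three lots

crew_a = velocity(same_work, baseline_points=3.0)
crew_b = velocity(same_work, baseline_points=40.0)
print(crew_a)  # 21.0
print(crew_b)  # 280.0
```

Identical lawns, identical effort, and one crew reports a Velocity more than thirteen times the other. The difference is entirely in the choice of baseline.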
So hopefully it’s clear by now that comparing Velocity between Sprints of the same Scrum team is the only thing that Velocity was ever intended for and the only application for which it’s suited. But even for that, you should recognize that the practice is bending the math because you have to set the baseline for each Sprint.
Measurements of velocity are imprecise, and we expect velocity to fluctuate.
— Mike Cohn
Every time you set the baseline at the start of Sprint Planning, you’re establishing a new, unique interval scale, incompatible with any other interval scale. Since Velocity calculation is a comparison across Sprints, you are making comparisons across various unique interval scales.
The Necessity of a Consistent Baseline
An experienced team will be careful to keep the baseline story as close as possible to the same size from Sprint to Sprint so that the different interval scales of each Sprint are at least comparable. Many teams cannot be expected to baseline consistently, and in those cases, Velocity is going to be, at best, misleading.
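One way to see how a drifting baseline distorts the picture is to restate each Sprint’s Velocity in multiples of that Sprint’s own baseline story. This is a sketch under a strong assumption: that the team records the value it gave the baseline story in every Sprint, which is not something the Velocity calculation itself asks for. The `Sprint` record and all the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sprint:
    name: str
    baseline_points: float   # value given to the baseline story this Sprint
    completed_points: float  # raw Velocity reported for the Sprint


def baseline_units(sprint: Sprint) -> float:
    """Express completed work as multiples of the Sprint's own baseline story."""
    return sprint.completed_points / sprint.baseline_points


history = [
    Sprint("Sprint 1", baseline_points=3.0, completed_points=30.0),
    Sprint("Sprint 2", baseline_points=3.0, completed_points=33.0),
    Sprint("Sprint 3", baseline_points=4.0, completed_points=40.0),
]

for sprint in history:
    print(sprint.name, sprint.completed_points, baseline_units(sprint))
# Sprint 1 30.0 10.0
# Sprint 2 33.0 11.0
# Sprint 3 40.0 10.0
```

Raw Velocity appears to climb from 30 to 40, yet in baseline units the third Sprint delivered no more than the first. The apparent gain comes from the baseline moving from 3 to 4 Points.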
Story Points are like company scrip because you can inflate or deflate valuation any time you want.
— Michael Godeck
An experienced Scrum Master working under an irrational pointy-haired boss might not be wholly unjustified in selectively rebasing the scale to show a bump in Velocity if it keeps the boss from meddling with the team. Bernie Madoff had quite a long run with his scheme, and Sprint Velocity can be made to work for you, too.
There is nothing sketchy about the mathematics of interval scales; it’s just that the comparison of values from different scales with variant baselines can be misleading. When comparing Sprints from the same team with similar baselines, you could say that it all comes out in the wash; that is to say, the variance is subsumed in the average. Still, allow me to be so bold as to point out that there are simpler and more consistently reliable ways of measuring progress.
Estimate for Decision Support
The purpose of estimation is not to make sure that the hamster is going faster on the wheel but to support decisions. Just as the separation of time from effort is the essence of Story Points, the real magic of estimation only reveals itself when you unbundle evaluation of performance from analysis of the work.
For teams practicing Kanban, after achieving stability, Little’s Law provides a simple and mathematically sound measure of a team’s rate of progress.
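As a rough sketch, Little’s Law says that in a stable system average throughput equals average work in progress divided by average cycle time. The function names and the sample figures below are invented for illustration, and the result only means something once the board really is stable.

```python
def average_throughput(avg_wip: float, avg_cycle_time_days: float) -> float:
    """Little's Law for a stable system: throughput = WIP / cycle time."""
    if avg_cycle_time_days <= 0:
        raise ValueError("average cycle time must be positive")
    return avg_wip / avg_cycle_time_days


def forecast_days(items_remaining: int, avg_wip: float,
                  avg_cycle_time_days: float) -> float:
    """Days needed to clear a backlog at the current average throughput."""
    return items_remaining / average_throughput(avg_wip, avg_cycle_time_days)


# Hypothetical board: 12 items in progress on average, 4 days each on average.
print(average_throughput(12, 4))  # 3.0 items per day
print(forecast_days(30, 12, 4))   # 10.0 days for 30 remaining items
```

No Points and no baseline appear anywhere in the calculation, only counts of items and elapsed days.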
In theory, Scrum teams can derive a comparable metric using Monte Carlo simulations, but before you try that in practice, be sure you understand the mathematics of Regression to the Mean.
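For the curious, one simple form of Monte Carlo forecasting resamples the number of items finished in past Sprints to see how many Sprints a backlog might take. What follows is a minimal sketch under that assumption, not a recommended tool: the throughput history is invented, the function names are hypothetical, and the approach assumes future Sprints will resemble past ones.

```python
import math
import random


def simulate_sprints_needed(history: list[int], backlog: int,
                            trials: int = 10_000, seed: int = 0) -> list[int]:
    """Resample past per-Sprint throughput; return sorted Sprint counts."""
    if backlog <= 0:
        raise ValueError("backlog must be positive")
    if not any(done > 0 for done in history):
        raise ValueError("history needs at least one Sprint with finished items")
    rng = random.Random(seed)  # seeded so that runs are repeatable
    outcomes = []
    for _ in range(trials):
        remaining, sprints = backlog, 0
        while remaining > 0:
            remaining -= rng.choice(history)  # draw one past Sprint at random
            sprints += 1
        outcomes.append(sprints)
    return sorted(outcomes)


def percentile(sorted_values: list[int], pct: float) -> int:
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


past_throughput = [4, 6, 5, 7, 3]  # items finished per Sprint (invented)
outcomes = simulate_sprints_needed(past_throughput, backlog=30)
print("50% of trials finished within", percentile(outcomes, 50), "Sprints")
print("85% of trials finished within", percentile(outcomes, 85), "Sprints")
```

The output is a range with odds attached rather than a single number, which suits decision support better than a point estimate does.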
Let's agree to define productivity in terms of throughput. We can debate the meaning of productivity in terms of additional measurements of the business value of delivered work, but as Eliyahu Goldratt pointed out in his critique of the Balanced Scorecard, there is a virtue in simplicity. Throughput doesn’t answer all our questions about business value, but it is a sufficient metric for evaluating the relationship between practices and productivity.