What does a 5 look like?


How long will it take?

My team, like most agile teams, estimates story size relatively, using the Fibonacci series, but we didn't know what a 5 looked like. We also had the notion that 5 points was about a day's worth of work, but is that still valid? (To put this into context: we're effectively a new team - the team has changed considerably over the last six months.)

We decided that we should identify a few sample stories that we could refer to when sizing new work, and work out how long it takes to do a 5 point story. At the same time, we each had the same thought: estimating is difficult and notoriously unreliable, so why do we do it?

Why do we estimate?

The most obvious reasons why we estimate are:

  • Measure performance (velocity) - We did another 10 points this week. Did we get faster or were our estimates off?
  • Planning - We try to work out when something will be completed. If a 5 person team can do 25 points a week, I should have this 5 point story in a week, right?
  • Prioritisation - I could have a single 5 point story or a 2 and a 3 point story. That's better value, right?

Is there a problem using estimates in that way?

Those things are fine, but a reliable velocity is built over many iterations, with consistent story quality and a mature team. In practice teams change, the quality of stories varies, and we're asked to estimate things we have little or no experience of. We also wrongly think: I built a logon page for another product using a completely different stack, so that'll be the same. Points are not transferable.

Despite that, we'll try and plan (because an estimate is a promise - I kid of course), prioritise lots of smaller stories and pat ourselves on the back when the velocity goes up.

Next steps

So back to the point. I had collected three months of stories that had been completed, but which stories should I use as my reference stories and how do I work out how long a story should take?

I decided to plot the cycle time taken for each story against its story points. I expected to see a nice grouping of cycle times for each point value that increases as the estimate does, meaning I could use any story as my reference, and at worst I would find out that 5 points isn't a day's worth of work any more.

Well I got this:

plain chart image

No correlation

The chart above would suggest that there is no correlation between cycle time and the estimate given to a story. There is a lot of variation at each sizing and some data points that look like they might be outliers.
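
One way to put a number on "no correlation" is a rank correlation between story points and cycle time. The sketch below is an illustration only: the `stories` list is made-up data, not the team's, and Spearman's rho is my choice of measure rather than anything used to produce the chart. A value near 0 would support the visual impression, while a value near 1 would mean bigger estimates reliably take longer.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical (story_points, cycle_time_days) pairs, NOT the team's data.
stories = [(1, 0.1), (1, 3.0), (2, 0.2), (2, 2.5),
           (3, 1.4), (3, 4.4), (5, 0.9), (5, 2.0)]
points = [p for p, _ in stories]
days = [d for _, d in stories]
print(f"Spearman rho = {spearman(points, days):.2f}")
```

Ranks are used instead of raw values because story points are an ordinal scale: a 5 is "bigger" than a 2, but not necessarily 2.5 times bigger.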

Also, those stories that have 0 for an estimate - we didn't do work for free, we just didn't assign an estimate to them. Who's to say what we would've estimated those as if we had, so I'll remove those. I'll also remove the story points that only have a single data point: 0.5 and 8 points.
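
The tidy-up described above could be sketched like this. The records and the `tidy` helper are hypothetical; the only rules taken from the text are "drop stories with a 0 estimate" and "drop point values that have a single data point".

```python
from collections import Counter

# Hypothetical records: (story_points, cycle_time_days). Illustrative only.
stories = [(0, 1.2), (0.5, 0.3), (1, 0.1), (1, 3.0),
           (2, 1.1), (2, 0.2), (8, 6.5), (0, 2.0)]

def tidy(stories, min_count=2):
    """Drop unestimated stories (0 points) and any point value that
    appears fewer than min_count times."""
    estimated = [(p, d) for p, d in stories if p != 0]
    counts = Counter(p for p, _ in estimated)
    return [(p, d) for p, d in estimated if counts[p] >= min_count]

cleaned = tidy(stories)
print(cleaned)
```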

With the chart tidied up I can use BoxPlotR to produce a box plot:

box plot image

                       1 point  2 points  3 points  5 points
Upper whisker             6.01      5.10      4.91      5.88
3rd quartile              3.07      2.52      4.37      3.98
Median                    0.08      1.06      1.44      2.02
1st quartile              0.04      0.22      0.84      0.88
Lower whisker             0.01      0.01      0.01      0.08
Nr. of data points          11        35        16        17

Centre lines show the medians; box limits indicate the 25th and 75th percentiles as determined by R software; whiskers extend 1.5 times the interquartile range from the 25th and 75th percentiles, suspected outliers lie outside of the whiskers; data points are plotted as open circles.
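
For readers who want to reproduce this kind of table without BoxPlotR, here is a rough sketch of the same statistics. It rests on two assumptions: quartiles come from `statistics.quantiles` with `method="inclusive"`, which may differ slightly from the percentiles BoxPlotR gets from R, and each whisker ends at the most extreme data point within 1.5 times the interquartile range of the box. The `box_stats` name is mine.

```python
from statistics import quantiles

def box_stats(values):
    """Five-number-style summary for one box (needs at least two values)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme points still inside the fences.
    inside = [v for v in values if lo_fence <= v <= hi_fence]
    return {
        "upper_whisker": max(inside),
        "q3": q3,
        "median": median,
        "q1": q1,
        "lower_whisker": min(inside),
        # Anything beyond the fences is a suspected outlier.
        "outliers": sorted(v for v in values if v < lo_fence or v > hi_fence),
        "n": len(values),
    }

# Hypothetical cycle times in days for one point value, not the team's data.
summary = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
for name, value in summary.items():
    print(f"{name}: {value}")
```

Running it once per story point value gives one column of the table.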

The box length for each point value is similar, suggesting a similar amount of variation in the data. The data for 1 and 3 points seems to be skewed to the right, based on the distance of the median from the lower quartile. The skew for 3 points is quite slight, as the lengths of the whiskers are very similar.

The median cycle time does seem to trend upwards as the story point increases, but I don't have much confidence in it.

Furthermore, I certainly wouldn't quote the median value: This is a 5, so you'll have it in 2 days. We're just about as likely to deliver a 2 point story in 2 days as we are a 5 point story. Instead, if a customer wanted to know when we'd complete a ticket, I'd rather tell them that, in most cases, it would probably be done within two days.

Cycle time (days)   Percentage of stories delivered
<=2                 63
>2 <=4              18
>4 <=6              10
>6                  9
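
A table like this is just a count of cycle times per bucket, expressed as a percentage. A minimal sketch, assuming a non-empty list of cycle times in days (the `sample` values are invented for illustration, and `bucket_percentages` is a hypothetical helper):

```python
def bucket_percentages(cycle_times, edges=(2, 4, 6)):
    """Percentage of stories per bucket: <=2, >2 <=4, >4 <=6, >6 (days)."""
    counts = [0] * (len(edges) + 1)
    for t in cycle_times:
        # Index of the first edge that t does not exceed; last bucket if none.
        idx = next((i for i, e in enumerate(edges) if t <= e), len(edges))
        counts[idx] += 1
    total = len(cycle_times)
    return [round(100 * c / total) for c in counts]

# Hypothetical cycle times in days, not the team's data.
sample = [0.1, 0.5, 1.0, 1.5, 2.0, 2.0, 2.5, 3.9, 4.5, 7.0]
labels = ["<=2", ">2 <=4", ">4 <=6", ">6"]
for label, pct in zip(labels, bucket_percentages(sample)):
    print(f"{label:<8} {pct}")
```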

What are we estimating?

My team works for different customers, and each of those customers has its own constraints that may contribute to the variation in cycle time. What if I just look at the time it takes for us to actually code a story (I think it's all too easy for a developer to just consider how long it'll take them to do "their" part anyway)? Let's see:

box plot of dev time

                       1 point  2 points  3 points  5 points
Upper whisker             0.07      1.79      1.75      3.98
3rd quartile              0.04      0.80      1.00      1.96
Median                    0.04      0.10      0.91      0.88
1st quartile              0.02      0.05      0.22      0.19
Lower whisker             0.01      0.01      0.01      0.08
Nr. of data points          12        35        17        18

Centre lines show the medians; box limits indicate the 25th and 75th percentiles as determined by R software; whiskers extend 1.5 times the interquartile range from the 25th and 75th percentiles, suspected outliers lie outside of the whiskers; data points are plotted as open circles.

Obviously the time taken is reduced now - note the change in scale. All except the one point stories still have some variation with a skewed distribution. The variation of the two and three point stories is less than it was when we plotted cycle time. Therefore, some of the variation can be attributed to various constraints that slow down the delivery of a ticket.

Furthermore, it seems we're better at estimating smaller stories. I'd back that up from experience: we tend to give a story a bigger estimate if we're a bit unsure, which would explain the greater variation of the 5 point stories.

What does #NoEstimates look like?

What would it look like if I removed the estimate from the plot? I've included the #NoEstimates data points too (the ones with 0 story points in the first chart).

box plot of dev time

                    No estimate
Upper whisker              6.02
3rd quartile               3.10
Median                     1.08
1st quartile               0.19
Lower whisker              0.01
Nr. of data points           89
Mean                       2.34

Centre lines show the medians; box limits indicate the 25th and 75th percentiles as determined by R software; whiskers extend 1.5 times the interquartile range from the 25th and 75th percentiles, suspected outliers are represented by dots; crosses represent sample means; data points are plotted as open circles.

The amount of variation is no worse than the previous plots. And again, the data is skewed to the right - just look at the spread of the swarm in the lower quartile.

We already know that, in most cases (63% of tickets), we would deliver a ticket within two days. If we use the upper quartile, 3.1 days, 74% of tickets would be delivered within that time.

Conclusions

So what does it all mean?

I've struggled to write this post. The theme has changed a lot, flip-flopping between #NoEstimates and #Estimates.

I think the takeaways from this would be:

  • Should you use estimates to plan? We might have it for you in 1 hour or 5 days.
  • Should you prioritise work based on estimates? Was it really a quick win?
  • There is too much variation to produce a meaningful trend.

I've now got my reference stories (a few stories at each point value closest to their median). I know 5 points does not equal a day's effort. I can tell customers that their story will be live in about three days. And the team will work with our customers to rein in the varied cycle time.
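
Picking reference stories "closest to their median" can be sketched as below. The story IDs, the data and the `reference_stories` helper are all hypothetical; the only rule taken from the text is choosing, for each point value, the few stories whose cycle time is nearest that group's median.

```python
from collections import defaultdict
from statistics import median

def reference_stories(stories, per_point=2):
    """stories: iterable of (story_id, points, cycle_time_days).
    Returns {points: [story_id, ...]} with the stories nearest the median."""
    by_points = defaultdict(list)
    for story_id, points, days in stories:
        by_points[points].append((story_id, days))
    refs = {}
    for points, group in by_points.items():
        mid = median(d for _, d in group)
        # Sort by distance from the group's median and keep the nearest few.
        closest = sorted(group, key=lambda s: abs(s[1] - mid))[:per_point]
        refs[points] = [story_id for story_id, _ in closest]
    return refs

# Hypothetical stories, not the team's backlog.
stories = [("A", 5, 0.5), ("B", 5, 2.0), ("C", 5, 2.4),
           ("D", 5, 6.0), ("F", 5, 1.0), ("E", 2, 1.0)]
refs = reference_stories(stories)
print(refs)
```

Given how wide the boxes are, stories chosen this way are typical rather than guaranteed: they only describe the middle of each group.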

Hypothesis, Measure and Learn

Now that I've got my sample stories for each story point, I think that by using them as a reference when we estimate, we should see a tighter grouping at each point when I next plot cycle time against story point.