Fudge Sunday - Cloud in Public: Mean Time To RCA
This week we continue to take a look at public things for a public cloud.
This issue is part 4 of a 5-part series.
As of this issue, we now have historical perspectives and definitions for status dashboards, Engineering SLO, and DevCommsOps. Next, let’s talk about the cultural values and innovations that drive continuous improvement in pursuit of publishing a timely Root Cause Analysis (RCA) that, in turn, helps further the development of key performance indicators (KPIs).
Mean Time To Root Cause Analysis in practice
Last week we covered Who, What, and Where for cloud companies that “write it down” to pursue goals for The Perfect Team. This issue will get to one of the two remaining questions, When, and next week we will explore Why.
Now, perhaps, is the time for another neologism. This neologism is Mean Time To RCA. As of now, a search for “Mean Time To RCA” will likely return only this newsletter, and a search for “Mean Time To Root Cause Analysis” will likely return Splunk too.
“Mean Time To RCA” can be viewed through several lenses or perspectives within a learning-focused postmortem culture. Vendors of tooling used by SRE and incident management practitioners have a variety of perspectives on the fastest or most complete approach to get to RCA (Ishikawa diagrams, Kaizen methods, Cause Maps, Postmortem Templates, etc.), yet they all tend to build on some other Mean Time To X as a foundation. That said, marketing teams for tooling vendors may look for a way to, at best, differentiate or, at worst, obfuscate with a thesaurus approach to naming conventions.
If X = R = Respond, Repair, Recovery, Resolve, or Resolution
If X = I = Identify, Isolate, or Insights
If X = F = Failure, Fix, Fidelity, or Facilitate
If X = A = Acknowledge, Activity, or Action
If X = D = Determine, Detect, or Diagnose
If X = V = Verify or Validate
If X = T = Triage or Telemetry
If X = C = Confirm, Clarity, or Closure
If X = RR… 🤣🤣🤣🤣
and so on
but it ALL adds up to the time it takes to get to RCA
So, one may wonder if MTTAA is the Mean Time To Another Acronym.🤔
Effectively, Mean Time To RCA (for this series) refers to the time it takes to produce actionable insights from a root cause analysis. The lessons learned will inform, refine, or result in creating KPIs or Objectives and Key Results (OKRs) for the organization as part of a commitment to conspicuous and continuous improvement.
We know there is an increasingly personalized approach to DevCommsOps among hyperscale public cloud service providers. So, we need to understand the impact on Mean Time To RCA from both general public DevCommsOps and the effect from personalized approaches.
To provide examples, let’s examine where Mean Time To RCA is found within the hyperscale public cloud service providers today using our previous searches for “Root Cause Analyses (RCAs) / Incidents.” Once again, the list is in no particular order or weighting other than shorter names to longer names.
IBM Cloud Mean Time to RCA examples:
~5 days for an outage duration of ~3 days
~10 days for an outage duration of ~12 hours
~10 days for an outage duration of ~9 hours
~10 days for an outage duration of ~6 hours
~2 days for an outage duration of ~3 hours
~3 days for an outage duration of ~2 hours
And so on
Alibaba Cloud Mean Time to RCA examples:
Unable to find any notices that include outage duration
Unable to find any links from news coverage of outages
And so on?
Microsoft Azure Mean Time to RCA examples:
Unable to find any notices with an actual publication date
RCA publishing is organized by the start date of an outage
Several RCAs reference outages lasting into the following day
Otherwise, ~1 day for an outage duration of any length (unlikely?)
And so on?
Amazon Web Services Mean Time to RCA examples:
~9 days for the April 21, 2011 “disruption” and no duration calculated
~5 days for the July 2, 2012 “event” and no duration calculated
~5 days for the October 22, 2012 “event” based on Twitter update
~5 days for the December 24, 2012 “event” based on Twitter update
~3 days for the December 17, 2012 “event”
~5 days for the June 13, 2014 “disruption” based on Twitter update
The August 7, 2014 message URI seems to be recycled from 2011 🤷‍♂️
~3 days for the November 25, 2020 “event”
And so on
Google Cloud Platform Mean Time to RCA examples:
~9 days for the October 31, 2019 “incident” duration of ~3 days
~14 days for the May 20, 2021 “incident” duration of ~1 hour
And so on
Oracle Cloud Infrastructure Mean Time to RCA examples:
~3 days for the July 7, 2021 “Production Event” duration of ~16 hours
And so on?
As noted previously, AWS has relatively few (major) “post event summaries,” Google Cloud Platform has “incidents,” Oracle Cloud Infrastructure has “incidents,” Microsoft Azure has RCAs, and IBM Cloud has “incidents.”
For this sampling, there was no access to consoles (portals) required.
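To make the arithmetic behind the name concrete, here is a minimal sketch that averages the approximate days-to-RCA figures from the lists above. The dictionary layout and the function name mean_time_to_rca are assumptions made for illustration only, and the numbers are the rough "~" values already listed, not precise measurements. Alibaba Cloud and Microsoft Azure are left out because the sampling above found no usable publication dates for them.

```python
from statistics import mean

# Approximate days from outage to published RCA, copied from the lists above.
# Providers without usable dates in this sampling are intentionally omitted.
days_to_rca = {
    "IBM Cloud": [5, 10, 10, 10, 2, 3],
    "Amazon Web Services": [9, 5, 5, 5, 3, 5, 3],
    "Google Cloud Platform": [9, 14],
    "Oracle Cloud Infrastructure": [3],
}


def mean_time_to_rca(days):
    """Return the arithmetic mean of a list of days-to-RCA values."""
    return mean(days)


for provider, days in days_to_rca.items():
    # Small samples: treat these averages as rough indicators only.
    print(f"{provider}: ~{mean_time_to_rca(days):.1f} days across {len(days)} sampled RCAs")
```

With so few data points per provider, the averages are only a starting point for comparison. A fuller treatment would also track outage duration next to each figure.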
In summary, there are stark variations amongst the hyperscalers in expressing Mean Time To RCA. Further, it is reasonable to expect the market will drive demand for standards that normalize the variations.
At the same time, DevCommsOps mixes public and personalized views that are unique to the customer experience. Further, the drive for personalization will result in a Mean Time To RCA for each customer that is informed by that customer's specific dependency mapping. The Azure and Oracle Cloud approaches will appeal to particular Enterprise customers.
As a reminder, we have established definitions for status dashboards, Engineering SLO, DevCommsOps, and Mean Time To RCA. We have a baseline that is ready to compare general public dependencies and customer personalized views of the underlying dependencies among hyperscale public cloud service providers.
Our last issue in the series will look at the increasing importance of dependency mapping across hyperscale public cloud service providers. Finally, we will consider business value engineering and customer journeys.
If you read this far, when you think about a multicloud journey, keep Faction in mind as a strategic partner for maximizing access to hyperscale public cloud service provider innovations.
We’re hiring at Faction!🎉🤓☁️🚀
To see our current openings click here. ⬅️🤓☁️🚀
See a fit for you or someone in your network? ✅🤓☁️🚀
Please don’t hesitate to reach out to me.☎️🤓☁️🚀
Want to learn more? Here are some recent Faction-related articles:
Storage & Data Protection: A Cloud-First Strategy by Alyson Langon
Intelligent CIO’s Myths of Multi-Cloud by Matt Wallace
Dataversity’s The Hidden Costs of Cyber Attacks by Mike Phelan
I am linking to my disclosure.
Created with Revue by Twitter.
1903 Live Oak St #92 Beaufort, NC 28516-0092