
Fudge Sunday - Cloud in Public: Mean Time To RCA

As of this issue, we now have historical perspectives and definitions for status dashboards, Engineering SLO, and DevCommsOps. Next, let’s talk about the cultural values and innovations that drive continuous improvement in pursuit of publishing a timely Root Cause Analysis (RCA) that, in turn, helps further the development of key performance indicators (KPIs).
Mean Time To Root Cause Analysis in practice
Last week we covered Who, What, and Where for cloud companies that “write it down” to pursue goals for The Perfect Team. This issue will get to one of the two remaining questions, When, and next week we will explore Why.
Now, perhaps, is time for another neologism. This neologism is Mean Time To RCA. As of now, the only search engine results for “Mean Time To RCA” will likely return this newsletter, and “Mean Time To Root Cause Analysis” will likely return Splunk too.
“Mean Time To RCA” can be viewed through several lenses within a learning-focused postmortem culture. Vendors of tooling used by SRE and incident management practitioners have a variety of perspectives on the fastest or most complete approach to reach RCA, yet they all tend to build on some other Mean Time To X as a foundation (Ishikawa diagrams, Kaizen methods, Cause Maps, Postmortem Templates, etc.). That said, marketing teams for tooling vendors may look for a way to, at best, differentiate or, at worst, obfuscate with a thesaurus approach to naming conventions.
  • If X = R = Respond, Repair, Recovery, Resolve, or Resolution
  • If X = I = Identify, Isolate, or Insights
  • If X = F = Failure, Fix, Fidelity, or Facilitate
  • If X = A = Acknowledge, Activity, or Action
  • If X = D = Determine, Detect, or Diagnose
  • If X = V = Verify or Validate
  • If X = T = Triage or Telemetry
  • If X = C = Confirm, Clarity, or Closure
  • If X = RR… 🤣🤣🤣🤣
  • and so on
  • but it ALL adds up to the time it takes to get to RCA
So, one may wonder if MTTAA is the Mean Time To Another Acronym.🤔
Effectively, Mean Time To RCA (for this series) refers to the time it takes to produce actionable insights from a root cause analysis. The lessons learned will inform, refine, or result in creating KPIs or Objectives and Key Results (OKRs) for the organization as part of a commitment to conspicuous and continuous improvement.
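As a rough illustration of that definition, here is a minimal Python sketch that treats Mean Time To RCA as the average gap between the start of an incident and the publication of its RCA. The `Incident` class, the function names, the choice of incident start as the reference point, and the dates are all hypothetical and only there to show the arithmetic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime        # when the outage began (assumed reference point)
    rca_published: datetime  # when the RCA became available


def time_to_rca(incident: Incident) -> timedelta:
    """Elapsed time from incident start to RCA publication."""
    return incident.rca_published - incident.started


def mean_time_to_rca(incidents: list[Incident]) -> timedelta:
    """Average time to RCA across a set of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    seconds = mean(time_to_rca(i).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Hypothetical dates, not taken from any provider.
incidents = [
    Incident(datetime(2021, 1, 4), datetime(2021, 1, 9)),  # 5 days to RCA
    Incident(datetime(2021, 2, 1), datetime(2021, 2, 4)),  # 3 days to RCA
]
print(mean_time_to_rca(incidents))
```

Measuring from the end of the outage instead of the start would be an equally defensible choice, and it would change the numbers for long outages.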
We know there is an increasingly personalized approach to DevCommsOps among hyperscale public cloud service providers. So, we need to understand the impact on Mean Time To RCA from both general public DevCommsOps and the effect from personalized approaches.
To provide examples, let’s examine where Mean Time To RCA is found within the hyperscale public cloud service providers today using our previous searches for “Root Cause Analyses (RCAs) / Incidents.” Once again, the list is in no particular order or weighting other than shorter names to longer names.
IBM Cloud Mean Time to RCA examples:
  • ~5 days for an outage duration of ~3 days
  • ~10 days for an outage duration of ~12 hours
  • ~10 days for an outage duration of ~9 hours
  • ~10 days for an outage duration of ~6 hours
  • ~2 days for an outage duration of ~3 hours
  • ~3 days for an outage duration of ~2 hours
  • And so on
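For a rough sense of scale, the IBM Cloud figures above can be averaged and compared against the outage durations. This is only a sketch: the numbers are the approximate values from the list, the ~3 day outage is converted to 72 hours, and the variable names are made up for illustration.

```python
from statistics import mean

# (approx. days until RCA, approx. outage duration in hours), from the list above
ibm_examples = [(5, 72), (10, 12), (10, 9), (10, 6), (2, 3), (3, 2)]

# Average the days-to-RCA column.
mean_days_to_rca = mean(days for days, _ in ibm_examples)
print(f"Mean Time To RCA: ~{mean_days_to_rca:.1f} days")  # ~6.7 days

# Express each RCA delay as a multiple of the outage it explains.
for days, outage_hours in ibm_examples:
    ratio = (days * 24) / outage_hours
    print(f"~{outage_hours}h outage -> RCA in ~{days}d ({ratio:.1f}x the outage)")
```

The ratio column suggests that, in this small sample, the time to publish an RCA does not track the length of the outage itself.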
Alibaba Cloud Mean Time to RCA examples:
  • Unable to find any notices that include outage duration
  • Unable to find any links from news coverage of outages
  • And so on?
Microsoft Azure Mean Time to RCA examples:
  • RCA (detailed) can be made available upon request
  • Unable to find any notices with an actual publication date
  • RCA publishing is organized by the start date of an outage
  • Several RCA reference outages lasting to the following day
  • Otherwise, ~1 day for an outage duration of any length (unlikely?)
  • And so on?
Amazon Web Services Mean Time to RCA examples:
  • ~9 days for the April 21, 2011 “disruption” and no duration calculated
  • ~5 days for the July 2, 2012 “event” and no duration calculated
  • ~5 days for the October 22, 2012 “event” based on Twitter update
  • ~5 days for the December 24, 2012 “event” based on Twitter update
  • ~3 days for the December 17, 2012 “event”
  • ~5 days for the June 13, 2014 “disruption” based on Twitter update
  • The August 7, 2014 message URI seems to be recycled from 2011 🤷‍♂️
  • ~3 days for the November 25, 2020 “event”
  • And so on
Google Cloud Platform Mean Time to RCA examples:
  • ~9 days for the October 31, 2019 “incident” duration of ~3 days
  • ~14 days for the May 20, 2021 “incident” duration of ~1 hour
  • And so on
Oracle Cloud Infrastructure Mean Time to RCA examples:
Notes:
In summary, there are stark variations amongst the hyperscalers in expressing Mean Time To RCA. Further, it is reasonable to expect the market will drive demand for standards that normalize the variations.
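To put rough numbers on those variations, here is a small Python sketch that summarizes the approximate days-to-RCA figures listed earlier for the three providers where day counts could be found. The grouping and names are illustrative, and the samples are far too small to support firm conclusions.

```python
from statistics import mean

# Approximate days-to-RCA figures copied from the provider lists above.
days_to_rca = {
    "IBM Cloud": [5, 10, 10, 10, 2, 3],
    "Amazon Web Services": [9, 5, 5, 5, 3, 5, 3],
    "Google Cloud Platform": [9, 14],
}

# (min, mean, max) per provider.
summary = {
    provider: (min(days), mean(days), max(days))
    for provider, days in days_to_rca.items()
}

for provider, (low, avg, high) in summary.items():
    print(f"{provider}: min ~{low}d, mean ~{avg:.1f}d, max ~{high}d")
```

Alibaba Cloud, Microsoft Azure, and Oracle Cloud Infrastructure are left out because no comparable day counts were found for them, which is itself part of the variation.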
At the same time, DevCommsOps mixes public and personalized views that are unique to the customer experience. Further, the drive for personalization will result in a Mean Time To RCA for each customer that is informed by their unique dependency mapping. The Azure and Oracle Cloud approaches will appeal to particular Enterprise customers.
As a reminder, we have established definitions for status dashboards, Engineering SLO, DevCommsOps, and Mean Time To RCA. We have a baseline that is ready to compare general public dependencies and customer personalized views of the underlying dependencies among hyperscale public cloud service providers.
Our last issue in the series will look at the increasing importance of dependency mapping across hyperscale public cloud service providers. Finally, we will consider business value engineering and customer journeys.
Stay tuned!
Work Plug!
If you read this far, when you think about a multicloud journey, keep Faction in mind as a strategic partner for maximizing access to hyperscale public cloud service provider innovations.
We’re hiring at Faction!🎉🤓☁️🚀
To see our current openings click here. ⬅️🤓☁️🚀
Please don’t hesitate to reach out to me.☎️🤓☁️🚀
Want to learn more? Here are some recent Faction related articles:
  1. Unlocking the Opportunities of Multi-Cloud by Travis Vigil
  2. Storage & Data Protection: A Cloud-First Strategy by Alyson Langon
  3. Intelligent CIO’s Myths of Multi-Cloud by Matt Wallace
  4. Dataversity’s The Hidden Costs of Cyber Attacks by Mike Phelan
  5. Multi-cloud data fabric use cases with CTOAdvisor
  6. Multi-cloud technical overview with CTOAdvisor
  7. Multi-cloud data security with CTOAdvisor
  8. Multi-cloud data access with CTOAdvisor
  9. Multi-Cloud at VMWorld with CTOAdvisor
Disclosure
I am linking to my disclosure.
Jay Cuthrell @JayCuthrell

Start the week more informed than the week before

Created with Revue by Twitter.