Effective Ways to Approach Site Reliability Engineering in an Organization
Ram is a seasoned IT professional with over 25 years of experience in IT Strategy & Execution, Business/Digital Transformation, Operations Management, Vendor Governance, Security, Risk & Compliance Management and many other critical areas. Prior to joining S&P in 2022, he worked with Tata Tele Business Services and Tech Mahindra in various capacities.
In a recent interaction with Keerthana H K (Correspondent, CIO Insider), Ram shared his insights on site reliability engineering (SRE), its importance, and how business leaders should approach it in today’s dynamic business landscape. Below are excerpts from the exclusive interview.
Tell us about a few KPIs that are considered for site reliability engineering.
Stability and reliability are the two pillars of site reliability engineering. Since stability is mainly about the availability of a system, application uptime and mean time between failures (MTBF) become the main KPIs. For reliability, the performance and quality of a system are the two most important factors. The mean time to resolution (MTTR) when a problem occurs, a performance index that tracks performance against the agreed SLAs, and error budget compliance are also critical measures in SRE.
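To make these KPIs concrete, here is a minimal Python sketch of how availability, MTBF, MTTR and error budget consumption might be computed from a list of outages. It is an illustration only: the `Incident` type, the `sre_kpis` function, the 30-day window and the 99.9% target are assumptions made for the example, not figures from the interview, and teams define these metrics in slightly different ways.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One outage, in minutes since the start of the reporting window (hypothetical type)."""
    start_min: float
    end_min: float


def sre_kpis(incidents, window_min, slo_target):
    """Compute basic SRE KPIs for one reporting window.

    Assumes incidents do not overlap and fall entirely inside the window.
    slo_target is the availability objective as a fraction, e.g. 0.999.
    """
    downtime = sum(i.end_min - i.start_min for i in incidents)
    uptime = window_min - downtime
    count = len(incidents)

    # Error budget: the downtime the availability target allows in this window.
    budget_min = (1 - slo_target) * window_min

    return {
        "availability": uptime / window_min,
        # MTBF: average operating time between failures.
        "mtbf_min": uptime / count if count else None,
        # MTTR: average time taken to resolve an incident.
        "mttr_min": downtime / count if count else None,
        "error_budget_min": budget_min,
        "error_budget_used": downtime / budget_min if budget_min else None,
        "within_budget": downtime <= budget_min,
    }


# Example: a 30-day window, a 99.9% target and two outages (illustrative numbers).
example = sre_kpis(
    [Incident(100, 110), Incident(5000, 5020)],
    window_min=30 * 24 * 60,
    slo_target=0.999,
)
for name, value in example.items():
    print(name, value)
```

In this sketch the error budget is simply the downtime that the availability target permits within the window, so "error budget compliance" reduces to checking that actual downtime stays below that allowance.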
How can business leaders stay updated with emerging trends and technologies in SRE and IT operations?
With the advent of the internet, there is a great deal of information overload today. A huge number of companies and startups come up each day, and when you occupy a leadership position, a large number of vendors approach you to pitch their products and services. You also get invited to many industry forums, most of which have been virtual or hybrid events since the Covid pandemic. This has made it much easier for business leaders to attend and learn about the latest happenings in the market. You can also form focus groups with similar companies in the industry, so that members can share their experiences, the value they are seeing and many other aspects. While it is critical for every business leader to stay updated with the latest tech advancements, identifying the technologies that are relevant to your organization is a major challenge. To solve this, business leaders must have a clear understanding of their business challenges and technology pain points.
Suggest a few strategies to foster a culture of collaboration and accountability with an SRE team.
Firstly, we baseline our current position, identify our dream state and chalk out a detailed plan of action to get there. Secondly, we define responsibilities at both the individual and team levels for accomplishing those goals. After this, you need lead and lag indicators that measure the journey from your current position to where you want to be, which also helps you measure maturity as you go along. When you share this perspective with your team and show them the bigger picture, they unite around the purpose, which results in effective collaboration. Also, when you have well-defined organizational goals, the best way to bring in accountability is to set goals at the individual level using the SMART criteria (Specific, Measurable, Achievable, Relevant and Time-bound). Another way in which I have built a culture of collaboration within my team is by creating avenues for knowledge sharing based on common interests.
While stability and reliability are the first two pillars of SRE, engineering is the third, and automation is an integral part of every SRE team. Unless you have clear visibility of what you are doing manually, it is very difficult for a company to implement automation or even semi-automation. Once you have properly documented the tasks that are being done manually, you need to look out for opportunities to automate any task that is routine and repetitive. Automation enables an organization to enhance the efficiency of its processes, reduce errors, cut down security threats and free up the bandwidth of critical resources, who can then focus on more complex scenarios and innovate faster. Most importantly, keep in mind that automation is an ongoing process and not a one-time exercise.
How should organizations ensure SRE practices are aligned with their business goals & objectives?
Since IT or SRE teams cannot form their own strategies without understanding the business, the SRE strategy is always unique to each organization. Thus, it is important for SRE teams to stay aligned with what the business does, understand its pain points and identify ways in which they can bring in efficiency, increase revenue, reduce time to market (TTM) and acquire customers more efficiently. This way, SRE teams can maximize value for the business with minimum resources and risk.