Develop An Understanding Of Site Reliability Engineering Jargon

Site reliability engineering has its own jargon, and each term serves a specific purpose in the software development life cycle (SDLC). Let’s look at how these concepts help maintain the health of a system.

Alka Singh
2 min read · Nov 7, 2022

Site Reliability Engineering

Site reliability engineering, or SRE, is a discipline focused on a software system’s reliability, stability, and resiliency. Google laid the foundation of SRE to gain visibility into every service, node, cluster, and data center, to support its increasing product velocity, and to ensure that a change or push of any kind wouldn’t hinder release consistency.

In an always-on world, companies that want their services up and running at all times adopt SRE principles and practices up front. SRE lets them define service level indicators (SLIs), service level objectives (SLOs), service level agreements (SLAs), and error budgets. These keep teams (developers, operations, and system administrators) aligned on what customers want and on what engineering must do to keep those customers satisfied. SRE also helps improve engineering efficiency and the team’s day-to-day productivity. Adopting SRE means creating incident playbooks as well, and rehearsing them so that teams are prepared for the inevitable.
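To make the relationship between an SLI, an SLO, and an error budget concrete, here is a minimal sketch in Python. It assumes a request-based availability SLI (successful requests divided by total requests); the names `SLO`, `availability_sli`, and `error_budget_remaining` are hypothetical and chosen only for illustration, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical availability objective over a fixed window."""
    target: float          # e.g. 0.999 means 99.9% of requests should succeed
    window_days: int = 30  # length of the measurement window (informational only)


def availability_sli(total_requests: int, failed_requests: int) -> float:
    """SLI: the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has failed
    return (total_requests - failed_requests) / total_requests


def error_budget_remaining(slo: SLO, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1.0 - slo.target) * total_requests
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures
```

With a 99.9% target and 1,000,000 requests, about 1,000 failures are tolerated, so 250 failed requests leave roughly three quarters of the budget. A team might use a number like this to decide whether to keep shipping features or to pause and work on reliability.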

Incident Management / Incident Response

Incident management is an umbrella term for the emergency response strategy an organization sets up to deal with incidents. Incident response, by contrast, is the structured framework that guides teams through managing a specific incident. Setting up incident management and incident response practices and policies gives teams a structure in advance and eliminates the last-minute chaos over who should do what, or what needs to be done, during a service outage.

A framework and a set of defined processes give the distributed teams of a global organization a standardized way to communicate, act, and respond to problems. The goal is to handle the situation in a way that limits damage and minimizes recovery time and cost. Responding quickly to an incident helps the organization minimize losses, reduce exploited vulnerabilities, restore services and processes, and mitigate the risk of future incidents.

On-call

SRE on-call means putting IT and development teams on call so that the organization has the right people available to address a problem during an incident, regardless of the time of day it occurs. Companies had setups of this kind before, but the processes were slow because escalation teams couldn’t see accurately what the incident was or where in the system it had happened. On top of that, a workforce distributed around the world made on-call schedules difficult to maintain.

SRE on-call relies on standardized tools and processes integrated into monitoring and alerting systems. An automated scheduling system and escalation alerts notify the right person about the issues that occur, and give them complete visibility into the alert that fired so that they can form better hypotheses about what caused the problem. Because everyone (developers, ITOps, project managers, etc.) can join and communicate through a centralized channel, everyone can see what’s happening and contribute as and when required. Once an incident is over, the system automatically generates reports for postmortem analysis.
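The escalation behavior described above can be sketched in a few lines of Python. This is an illustrative model only, assuming a simple linear policy in which each responder has a fixed window to acknowledge before the alert moves on; `EscalationStep` and `who_to_page` are hypothetical names, not the API of any real paging product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationStep:
    """One rung of a hypothetical escalation policy."""
    responder: str     # who gets paged at this step
    wait_minutes: int  # how long to wait for an acknowledgement before escalating


def who_to_page(policy: List[EscalationStep], minutes_unacknowledged: int) -> Optional[str]:
    """Return who should be paged, given how long the alert has gone unacknowledged."""
    elapsed = 0
    for step in policy:
        elapsed += step.wait_minutes
        if minutes_unacknowledged < elapsed:
            return step.responder
    # Every window has expired: keep paging the last step (or nobody if the policy is empty).
    return policy[-1].responder if policy else None


policy = [
    EscalationStep("primary on-call", 5),
    EscalationStep("secondary on-call", 10),
    EscalationStep("engineering manager", 15),
]
```

Under this policy, an alert that nobody has acknowledged after 5 minutes moves from the primary to the secondary responder, and after 15 minutes it reaches the engineering manager. Real scheduling systems typically layer rotations, time zones, and overrides on top of a core like this.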

Alka Singh

Technical Writer & Content Strategist | Active since 2011 | Worked at OnGraph | Working on writingtrick.com