Definition "SRE": What is Site Reliability Engineering?
Site Reliability Engineering, or SRE for short, is a service management model developed by Google. In this model, the development and operation of large distributed systems are closely linked. Its processes are a concrete implementation of the DevOps philosophy.
In Site Reliability Engineering, the development team also includes members with an operations background.
(© vladimircaribb – stock.adobe.com)
Site Reliability Engineers build a bridge between development and operations by applying a software engineering mindset to system administration topics. They divide their work equally between operational and development tasks. The ideal candidate for an SRE position is therefore either a software engineer with a strong background as an administrator or a highly qualified system administrator with experience in programming and automation.
During operation, SREs investigate the resilience and weaknesses of the systems in order to identify opportunities for optimization and scaling. The search for ways to simplify the handling of the systems has the same priority. Experience gained in operating large IT infrastructures flows directly into the development of those systems.
Part of the concept is that SREs do not spend more than half their time on operations. A violation of this rule is considered a sign of poor implementation.
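The 50 percent rule can be checked with a few lines of Python. This is only a minimal sketch: the function names and the idea of tracking hours per category are assumptions for illustration, not part of the SRE concept as described here.

```python
OPS_TIME_CAP = 0.5  # at most half of the working time may go to operations


def ops_share(ops_hours: float, dev_hours: float) -> float:
    """Return the fraction of total working time spent on operations."""
    total = ops_hours + dev_hours
    if total <= 0:
        raise ValueError("total hours must be positive")
    return ops_hours / total


def exceeds_ops_cap(ops_hours: float, dev_hours: float,
                    cap: float = OPS_TIME_CAP) -> bool:
    """True if operations work takes up more than the allowed share."""
    return ops_share(ops_hours, dev_hours) > cap
```

For example, `exceeds_ops_cap(25, 15)` flags a week in which 25 of 40 hours went to operational work.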
The SRE concept became successful and well known through Google, which practiced it long before the company made the principles public. It shares the process improvement goals of the DevOps philosophy. The term DevOps was coined around 2008 and stands for a corporate culture of cross-team collaboration. Everyone involved should be aligned with the same vision, and joint responsibility for success should be established.
SRE and DevOps
Before DevOps became widespread, development and operations teams worked independently, each with its own goals and targets. To communicate better and work together more smoothly, many companies introduced DevOps teams, which quickly gained in importance.
DevOps and SRE both aim to narrow the gap between software development and software operations, with the goal of improving the release cycle for complex distributed systems. The DevOps concept defines what the results should look like; it stands for a cultural change within a company. SRE turns the theoretical DevOps approach into a workflow with appropriate methods and tools.
SRE includes the continuous automation of manual tasks and continuous integration and delivery. SREs take responsibility for operational reliability and automation throughout the infrastructure lifecycle, monitoring deployment and operation of releases.
The 5 basic principles of the DevOps philosophy and their implementation by SRE
1. Dismantling organizational silos
Large companies have a complex organizational structure with a variety of teams, often working separately in "silos". Each team has a different view of the whole, which encourages inefficiency. The task of DevOps and SREs is to better align the teams with each other, with the overall goals, and with a common vision.
2. View failures as part of the process
DevOps assumes that failures are part of the process and that they help teams learn. SREs ensure that there are not too many errors or outages. To do this, they use metrics to weigh failures against the release of new versions: service level indicators (SLIs) and service level objectives (SLOs). SLIs measure failures over time. An SLO is the agreed target, within a service level agreement, for a specific metric such as uptime or response time.
From the SRE perspective, a clear agreement between the business and IT levels is required to establish suitable service level objectives and service level indicators. Each violation leads to a re-evaluation and optimization of the targets.
The SRE guidelines encourage radical change within certain limits. SREs have a risk budget to test these limits and potentially innovate faster. SRE quantifies this acceptable risk as an "error budget". When the error budget is exhausted, the focus shifts from development to improving reliability. Availability and further development are thus kept in balance.
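The relationship between SLI, SLO and error budget can be sketched in a few lines of Python. The request-based SLI, the function names and the example target of 99.9 percent are assumptions chosen for illustration; the concept itself does not prescribe them.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events (for example requests) that succeeded."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(slo: float, good_events: int,
                           total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    The error budget is the share of events that may fail: 1 - slo.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_failures = (1.0 - slo) * total_events
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures


def current_focus(slo: float, good_events: int, total_events: int) -> str:
    """Suggest where effort should go, based on the remaining budget."""
    remaining = error_budget_remaining(slo, good_events, total_events)
    return "development" if remaining > 0 else "reliability"
```

With an SLO of 0.999 and one million requests, roughly 1,000 failed requests are tolerated. Once more than that have failed, `current_focus` switches from "development" to "reliability".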
3. Implement changes in quick small steps
Like DevOps, SRE promotes continuous improvement through small and frequent development steps. With short iteration cycles, any negative effects are less severe, and low-risk improvements can be easily tested and implemented (as automated as possible).
4. Use common tools and automation
Incompatibility and integration problems between technologies from different vendors, eras and use cases can create silos even in a DevOps environment. SRE introduces uniform technologies and comprehensive access to information across the various IT teams. SRE demands that all teams working on the same service use the same technologies.
SRE pursues the principle of automating manual tasks that are repetitive and reactive and do not bring lasting improvement. Automation should free up capacity for work that brings long-term benefits.
5. Reliability based on measurement data
Determining appropriate reliability targets is a context-dependent challenge for DevOps and SREs. SREs ensure that all levels in the company agree on how to measure reliability and what to do if the measured value does not meet the specifications.
The key DevOps metrics are the deployment frequency, the lead time from commit to release, the number of failed deployments, and the time required to recover from a failure.
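The four metrics can be derived from a simple deployment log, as the following Python sketch shows. The `Deployment` record and its field names are hypothetical; real pipelines store this data in different forms.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime
    released_at: datetime
    failed: bool = False
    recovered_at: Optional[datetime] = None  # set only for failed deployments


def devops_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Compute the four key metrics for one observation period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    failed = [d for d in deployments if d.failed]
    lead_times = [d.released_at - d.committed_at for d in deployments]
    recoveries = [d.recovered_at - d.released_at
                  for d in failed if d.recovered_at is not None]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / len(lead_times),
        "failed_deployments": len(failed),
        # None when no failed deployment has been recovered in the period
        "mean_recovery_time": (sum(recoveries, timedelta()) / len(recoveries)
                               if recoveries else None),
    }
```

Averages are used here for brevity; medians or percentiles are an equally valid choice when a few outliers would distort the picture.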
The basis for SRE is the service level objective (SLO) and the service level indicator (SLI). SREs use these metrics to determine whether a release containing a change goes into operation or not.
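Such a release decision can be sketched as a simple gate that compares measured SLIs with their SLOs. The `Objective` class, the metric names and the thresholds below are assumptions for illustration, not a prescribed implementation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Objective:
    name: str                      # name of the SLI this objective refers to
    target: float                  # the SLO value to be met
    higher_is_better: bool = True  # False for metrics such as latency

    def is_met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


def release_allowed(objectives: Iterable[Objective],
                    measured_slis: Dict[str, float]) -> bool:
    """Allow a release only if every objective is met.

    A missing measurement counts as a violation, so that a gap in
    monitoring cannot silently let a release through.
    """
    return all(
        obj.name in measured_slis and obj.is_met(measured_slis[obj.name])
        for obj in objectives
    )
```

A call might look like this: with objectives for an availability of at least 0.999 and a 95th percentile latency of at most 300 ms, `release_allowed` returns `False` as soon as either measured value misses its target.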