Traditionally, developers built applications and an operations team deployed and maintained them.
But there needs to be a team that is reasonably familiar with both worlds, application development and operations, and that acts as a quick liaison between the two. This team ensures the deployed application performs well, plans the resources and capacity it needs, keeps it available and running, and makes sure it can be recovered quickly if a disaster occurs.
This team is the SRE (Site Reliability Engineering) team. It bridges the gap between Dev and Ops. In some organizations the SRE team is the DevOps team; in others the two are separate.
This does not mean that SREs start developing applications or take over all operations activity. They ensure that both teams address everything the application needs to perform well in the field. For example, developers have to provide proper logs and metrics and perform the required testing, while the operations team should know which logs and metrics are available and monitor and manage them. The SRE team, sitting in the middle, ensures both parties are aware of these needs and work towards the common goal.
Skills needed for an SRE
- Infrastructure management through code: should be able to write automation scripts and code to administer the environment and the application.
- Act as a bridge between administrators and application developers.
- Focus on improving the reliability of the application by writing code.
- Should be able to identify where a problem lies (system, process, environment or technology) and resolve it through coordination.
- Should work to the defined SLA (Service Level Agreement) and SLO (Service Level Objective).
- Should be able to put a practice or process through repeated tests and keep fine-tuning it. Should have very good debugging skills.
Responsibilities of SRE
- Performance
- Latency
- Availability
- Capacity planning
- Change management
- Monitoring and tracking
- Disaster recovery
- Efficiency
A day in the life of an SRE
- 50% of time spent on operational work
- 50% of time spent coding to automate the operational work and make it more efficient.
- Use the freed-up time to work closely with the development team, so that this kind of operational overhead does not arise in the application in the first place.
- Should work within the error budget. The error budget is the acceptable level of downtime in the environment, that is, the tolerance for non-compliance with the SLO. Downtime usually occurs during upgrades and similar changes, which are themselves aimed at improving reliability.
- Should monitor and notify users. Monitoring should attempt automatic healing first and notify a person only when manual intervention is required. Monitoring output falls into three kinds: alerts (immediate action), tickets (action later) and logging (no action, just a record).
- Should anticipate failures and be prepared to face them and act on them. Should maintain a playbook capturing MTTF (Mean Time To Failure: how long the system runs on average before it fails) and MTTR (Mean Time To Repair: how long it takes to recover from a failure).
- Should do forecasting and capacity planning. Actions taken: provisioning (adding new locations, modifying existing locations) and testing.
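The error budget and the MTTF/MTTR figures mentioned above reduce to simple arithmetic. The sketch below is only an illustration: the function names and the 30-day window are assumptions, not part of these notes, and the MTTF / (MTTF + MTTR) ratio is the commonly used steady-state availability estimate rather than something defined here.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO tolerates in the window.

    slo is a fraction in (0, 1], e.g. 0.999 for 99.9%.
    The 30-day default window is an assumption for illustration.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining_minutes(slo: float, downtime_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after the observed downtime; negative means overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def availability_from_mttf_mttr(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability estimate: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be > 0 and mttr_hours must be >= 0")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # A 99.9% SLO over 30 days, with 10 minutes of downtime already used.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining_minutes(0.999, 10.0), 1))
    print(round(availability_from_mttf_mttr(999.0, 1.0), 4))
```

With a 99.9% SLO the 30-day budget is about 43 minutes, so a single long upgrade can consume most of it. A negative remaining budget is the usual signal to slow down changes until reliability recovers.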
How SRE is different from DevOps
- DevOps: what needs to be done. Accept failures and address them when they occur. Automate operational tasks.
- SRE: how it needs to be done. Enables the DevOps team and ensures everything they need is made available to them. Minimizes errors and failures. Focuses on minimizing the cost of failure through gradual change. Checks whether the automation of operational tasks is itself a risk to reliability. Measures operations and checks whether they are reliable.
What SRE and DevOps have in common
- Skill overlap between the application and operations teams.
- Want to reduce the cost of failure through gradual change wherever needed
- Encourage automation
Advantages of SRE
- Break silos
- Enable coordination and collaboration
SLA, SLI, SLO
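SLI - Service Level Indicator
An SLI is a measured value that is compared against the SLO (the target). For indicators where lower is better, such as latency or error rate, the SLI should be less than or equal to the SLO. For indicators where higher is better, such as availability, it should be greater than or equal to the SLO. Common SLIs: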
- Latency - Time to respond to a request
- Error rate - % of requests that failed
- System throughput - Number of requests handled per second
- Availability - Fraction of time the service is in a usable state
- Yield - Fraction of requests that succeed
- Durability - Likelihood that data stored by the service is retained over a long period
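Several of the indicators listed above can be derived from a plain list of request records. The following is a minimal sketch under stated assumptions: the `Request` record, the `compute_slis` function and the choice of a nearest-rank 95th percentile for latency are all hypothetical and used only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical request record used only for this illustration."""
    latency_ms: float
    ok: bool


def compute_slis(requests: Sequence[Request], window_seconds: float) -> Dict[str, float]:
    """Compute a few SLIs over one measurement window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile: smallest value with >= 95% of samples at or below it.
    rank = max(1, math.ceil(95 * total / 100))
    return {
        "error_rate_pct": 100.0 * failed / total,   # % of requests that failed
        "yield": (total - failed) / total,          # fraction of successful requests
        "throughput_rps": total / window_seconds,   # requests handled per second
        "latency_p95_ms": latencies[rank - 1],      # time to respond, 95th percentile
    }


if __name__ == "__main__":
    sample = [Request(100.0, True)] * 9 + [Request(500.0, False)]
    print(compute_slis(sample, window_seconds=5.0))
```

A percentile is used for latency because an average hides slow outliers. Which percentile to track is a design choice for each service.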
Availability is the most critical. It is usually expressed in nines:
- 99% - 2 nines
- 99.95% - 3 and a half nines
- 99.999% - 5 nines
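To make the nines concrete, an availability percentage can be converted into the downtime it allows per year. This is a small sketch; the function name is an assumption, and a 365-day year is assumed (leap years ignored).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Downtime per year (in minutes) allowed by an availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return (100.0 - availability_pct) / 100.0 * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.95, 99.999):
        print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} minutes/year")
```

Under these assumptions, 2 nines allows roughly 5,256 minutes (about 3.65 days) of downtime per year, 3 and a half nines about 263 minutes, and 5 nines a little over 5 minutes.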
SLO - Service Level Objective (the desirable target; what we want should be defined)
- How available the system should be
- A precise numerical target for availability. The target is the objective.
- Challenges in defining it: higher reliability increases cost, lower reliability allows greater velocity, and periodic downtime and maintenance should be taken into account.
- Publishing the SLO sets user expectations and reduces user complaints.
SLA - Service Level Agreement
- Contract between provider and consumer, covering the SLOs and the consequences of not meeting them
- Clearly defines metrics, such as speed of service
- Responsibilities of the service and procedures for escalation
- Expectations of each party
- Protects all parties and ensures that all of them understand the requirements.
- Contains 2 components: service and management
- Service covers the metrics, responsibilities, escalation procedures etc.
- Management covers how the service is measured and monitored, the processes involved etc.
SRE defines the SLIs and provides inputs for the definition of the SLA.
Risk
Reliability is achieved through redundant hardware (which helps with failover and maintenance).
Opportunity cost: people are tied up maintaining the redundant servers, so these engineers cannot take up other work or new opportunities.
Measuring level of Risk
- Unplanned downtime: the desirable and acceptable levels should be defined.
- Time-based availability = uptime / (uptime + downtime). For example, 99.999% means approximately 5 minutes of downtime per year (99.99% means approximately 52 minutes).
- Aggregate availability = successful requests / total requests
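The two availability measures above translate directly into code. A minimal sketch, with function and parameter names chosen only for illustration:

```python
def time_based_availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """uptime / (uptime + downtime), as a fraction between 0 and 1."""
    if uptime_seconds < 0 or downtime_seconds < 0:
        raise ValueError("durations must be non-negative")
    total = uptime_seconds + downtime_seconds
    if total == 0:
        raise ValueError("uptime and downtime cannot both be zero")
    return uptime_seconds / total


def aggregate_availability(successful_requests: int, total_requests: int) -> float:
    """successful requests / total requests, as a fraction between 0 and 1."""
    if total_requests <= 0 or not 0 <= successful_requests <= total_requests:
        raise ValueError("need 0 <= successful_requests <= total_requests, total > 0")
    return successful_requests / total_requests


if __name__ == "__main__":
    # 1 second of downtime in 1,000 seconds, and 10 failures in 10,000 requests.
    print(time_based_availability(999.0, 1.0))
    print(aggregate_availability(9_990, 10_000))
```

The two can differ for the same service. The aggregate measure weights each request equally, so an outage during peak traffic costs more than one during a quiet period, while the time-based measure treats every minute the same.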