Anjamma Special: December 2022

SRE lifecycle

Architecture design stage - Most SRE involvement happens here
Development stage - Less involvement, mainly inputs on testing
Limited availability period - Involvement increases again: measure reliability and get feedback from users
General availability - Full support from an operations perspective. Most reliability inputs are gathered here.
Deprecation stage - The service is being retired and no new features are added. The SRE team continues to support the old service. Once no one is using it any longer, they turn off the resources for the old service, help with the transition to the new service, and support the new service.

The SRE team provides guidance on

Architecture design
How to write code that is reliable
Best practices
Identifying pitfalls
Defining the SLA

The SRE team should work closely with the Development team, DevOps team and Product team towards a common goal of reliability.

They should identify risks and threats, rank them, and then monitor and review them.

SRE teams also participate in sprints, where they focus on reliability. Key milestones:

Production Readiness Review
Launch Meetings
Retrospective meetings

Production changes and outages

How to measure SRE performance

Impact - Metrics
Maturity Matrix
Re-evaluate

Measure

Customer experience - Usability, measured through server-side latency and client-side latency
Percentile latency:
Median latency - The 50th percentile: half of all requests complete faster than this value. It doesn't take slow responses into consideration.
Long tail latency - Covers the slowest responses. Responses that pass the acceptable cutoff time are counted as errors.
99th percentile latency - How the cutoff is decided: the latency that 99% of requests stay within.
Future load - Demand forecast: how much load might be expected in the future. Memory usage, storage size, bandwidth, queries per second.
Organic growth - Statistical model, based on history
Inorganic growth - Cannot be extrapolated from past data. Example: a temporary surge
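
The difference between median and 99th percentile latency can be sketched in a few lines of Python. This is only an illustration, assuming the simple nearest-rank definition of a percentile. The function name percentile and the sample latencies are made up for the example and do not come from any monitoring tool.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to limit floating point rounding error.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: mostly fast, one very slow response.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 900]

p50 = percentile(latencies_ms, 50)   # median, hides the slow response
p99 = percentile(latencies_ms, 99)   # tail, exposes the slow response
mean = sum(latencies_ms) / len(latencies_ms)

print(f"median={p50} ms, p99={p99} ms, mean={mean} ms")
```

With this sample the median stays at 14 ms while the 99th percentile is 900 ms, which is why the median alone says nothing about slow responses.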

Service Efficiency

Load testing should be performed frequently by SREs.
Outage prevention should be built into the system so that service outages do not occur.
Cascading outages should be prevented.

Working Model

Strategic workshop for reliability with Dev and SRE
Service Definition - Consider network architecture, security, reliability, business process, etc.
Service Design - The Dev team creates the service, SRE focuses on infrastructure needs
Testing - Done by Dev and SRE. Involve the customer as well.
Pilot release - Gather further feedback and get closer to the final product

Traditionally, developers build applications and an operations team then deploys and maintains them.

But there needs to be a team that is familiar, to a certain extent, with both worlds of application development and operations. This team acts as a quick liaison between the two, ensures good performance of the deployed application, plans the resources and capacity the application requires, and ensures the application is always up and running and can be recovered quickly if a disaster occurs.

This team is the SRE team - the Site Reliability Engineering team. It bridges the gap between Dev and Ops. In some organizations this is the DevOps team, in others it is not.

It doesn't mean these people start developing applications or take over all operations activity. Instead, they ensure that everything the application needs to perform well in the field is addressed by both teams. For example, the developers have to provide proper logs and proper metrics, perform the required testing, etc. The operations team should know which logs and metrics are available, and monitor and manage them. The middle team ensures both parties are aware of these needs and work towards the common goal.

What skills are needed for an SRE

Infrastructure management through code - Should be able to create automation scripts and code to administer the environment and the application.
Bridge between administrators and application developers
Focus should be on improving the reliability of the application by writing code
Should be able to identify where a problem lies - system, process, environment or technology - and resolve it through coordination.
Should work based on the defined SLA and SLO (Service Level Objective)
Should be able to put a practice/process through multiple tests and keep fine-tuning it. Should have very good debugging skills.

Responsibilities of SRE

Performance
Latency
Availability
Capacity planning
Change management
Monitoring and tracking
Disaster recovery
Efficiency

A day in the life of an SRE

50% of time spent on operational work
50% of time spent on coding to automate the operational work and make it more efficient.
Use the freed-up time to ensure these kinds of operational overheads do not occur in the application in the first place, by working closely with the development team
Should work within the error budget - The acceptable level of downtime in an environment is called the error budget. The error budget is the tolerance for noncompliance with the SLO. Downtime usually occurs during upgrades and similar changes, and these upgrades are themselves aimed at improving reliability.
Should monitor and notify users. Monitoring should attempt automatic healing first and notify a person only when human intervention is required. Outputs: alerts (immediate action), tickets (action later), logging (no action, just a record)
Should anticipate failures and be prepared to face them and act on them. Should maintain a playbook capturing MTTF (Mean Time To Failure - how long the service runs on average before it fails) and MTTR (Mean Time To Repair - how long it takes to recover from a failure)
Should do forecasting and capacity planning - Actions taken: provisioning, adding new locations, modifying existing locations, and testing
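
The error budget and the MTTF/MTTR figures above reduce to simple arithmetic. The sketch below is a rough illustration only: it assumes a 30-day window, treats the budget as allowed downtime in minutes, and uses the common steady-state approximation availability = MTTF / (MTTF + MTTR). All function names and numbers are hypothetical.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime for the window: (100% - SLO) of the window length."""
    window_minutes = window_days * 24 * 60
    return (100 - slo_percent) / 100 * window_minutes


def budget_remaining_minutes(slo_percent, downtime_minutes, window_days=30):
    """Budget left after the downtime already incurred (negative = SLO missed)."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


def steady_state_availability(mttf_hours, mttr_hours):
    """Fraction of time the service is up, given average run and repair times."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service: 99.9% SLO, 30 minutes of downtime so far this window.
budget = error_budget_minutes(99.9)
remaining = budget_remaining_minutes(99.9, downtime_minutes=30)
availability = steady_state_availability(mttf_hours=999, mttr_hours=1)

print(f"budget={budget:.1f} min, remaining={remaining:.1f} min")
print(f"availability={availability:.3%}")
```

For a 99.9% SLO over 30 days the budget comes to about 43.2 minutes, so the hypothetical 30 minutes of downtime leaves roughly 13.2 minutes for further upgrades in that window.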

How SRE is different from DevOps

DevOps - What needs to be done. Accept failures and address them when they occur. Automate operational tasks.
SRE - How it needs to be done. Enables the DevOps team and ensures everything they need is made available to them. Minimizes errors and failures. Focuses on minimizing the cost of failure through gradual change. Checks whether the automation of operational tasks is itself a problem for reliability. Measures operations and checks whether they are reliable.

What SRE and DevOps have in common

Skill overlap between application and operations teams.
Want to reduce the cost of failure through gradual change wherever needed
Encourage automation

Advantages of SRE

Break silos
Enable coordination and collaboration

SLA, SLI, SLO

SLI

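An SLO check usually has the form SLI ≤ target for indicators where lower is better (latency, error rate), and SLI ≥ target for indicators where higher is better (availability, yield).

Latency - Time to respond to a request
Error rate - % of requests that failed
System throughput - Total number of requests handled per second
Availability - Fraction of time the service is in a usable state
Yield - Fraction of requests that succeeded
Durability - Likelihood that data will be retained over a long period of time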
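
A few of the indicators listed above can be computed directly from request records. The following is a minimal sketch, assuming each request is logged with its latency and a success flag. The Request class, the compute_slis function and the 1% error-rate target are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool  # True if the request was served successfully


def compute_slis(requests, window_seconds):
    """Derive error rate, yield and throughput for one measurement window."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in the window")
    failed = sum(1 for r in requests if not r.ok)
    return {
        "error_rate": failed / total,
        "yield": (total - failed) / total,
        "throughput_rps": total / window_seconds,
    }


# Hypothetical window: 10 requests in 5 seconds, one of them failed.
sample = [Request(20.0, True) for _ in range(9)] + [Request(500.0, False)]
slis = compute_slis(sample, window_seconds=5)

# Assumed target for a lower-is-better SLI: error rate of at most 1%.
ERROR_RATE_TARGET = 0.01
meets_slo = slis["error_rate"] <= ERROR_RATE_TARGET

print(slis, "meets SLO:", meets_slo)
```

In this sample the error rate is 10%, so the check against the assumed 1% target fails.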

Availability is most critical.

99% - 2 nines
99.999% - 5 nines
99.95% - 3 and a half nines
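
To make the "nines" concrete, each availability level can be converted into allowed downtime per year. This is a small sketch that assumes a 365-day year and ignores leap years. The function name is made up for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, leap years ignored


def downtime_minutes_per_year(availability_percent):
    """Downtime allowed per year at the given availability level."""
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


for target in (99.0, 99.95, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.2f} minutes per year")
```

Under these assumptions 99% allows about 5,256 minutes (roughly 3.65 days) per year, 99.95% about 262.8 minutes, and 99.999% about 5.26 minutes.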

SLO (the desired target; what we want should be defined)

How available the system should be
Precise numerical target for availability. The target is the objective.
Challenges in defining - Increasing reliability increases cost, lower reliability allows greater velocity, and periodic downtime and maintenance should be considered
Publishing the SLO - Sets user expectations and reduces user complaints

SLA - Service Level Agreement

Contract between provider and consumer - Covers the SLOs and the consequences of not meeting them
Clearly defines metrics - Like speed of service
Responsibilities of the service. Procedures for escalation.
Expectations of each party
Protects all parties and ensures all of them understand the requirements.
Contains 2 components - Service and management
Service is about the metrics, responsibilities, escalation procedures, etc.
Management is about how the service is measured and monitored, the processes involved, etc.

SRE defines the SLIs and helps by providing inputs for the definition of the SLA

Risk

Reliability is achieved through redundant hardware (helps with failover and maintenance)

Opportunity cost - People are tied up maintaining redundant servers. As a result, these engineers cannot take up other work or new opportunities.

Measuring level of Risk

Unplanned downtime - Desirable and acceptable levels should be defined.
Time-based availability = uptime / (uptime + downtime). Example: 99.99% means approximately 52.6 minutes of downtime per year, while 99.999% means only about 5.26 minutes.
Aggregate availability = successful requests / total requests
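
The two availability formulas above can be compared side by side. This is an illustrative sketch with made-up numbers. The function names are hypothetical, and both functions assume their inputs use consistent units.

```python
def time_based_availability(uptime, downtime):
    """uptime / (uptime + downtime), both in the same time unit."""
    return uptime / (uptime + downtime)


def aggregate_availability(successful_requests, total_requests):
    """successful requests / total requests over the measurement window."""
    return successful_requests / total_requests


# Hypothetical day: 1,439 minutes up, 1 minute down.
by_time = time_based_availability(uptime=1439, downtime=1)

# Same hypothetical day measured by requests: 2.5 million served, 2,500 failed.
by_requests = aggregate_availability(2_497_500, 2_500_000)

print(f"time-based={by_time:.4%}, aggregate={by_requests:.4%}")
```

The two numbers need not agree: a one-minute outage during peak traffic costs more requests than the same outage at night, which the request-based measure captures and the time-based measure does not.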


Friday, December 9, 2022
