SRE lifecycle
- Architecture and design stage - High SRE involvement
- Development stage - Less involvement; SREs mainly give input on testing
- Limited availability period - Involvement increases; SREs measure reliability and get feedback from users
- General availability - Full support from an operations perspective. More reliability input is gathered here.
- Deprecation stage - The service is retired and gets no new features. The SRE team continues to support the old service and helps users transition to the new one. Once no one is using the old service, they turn its resources off and support the new service.
SREs provide guidance on
- Architecture design
- How to write code that is reliable
- Best practices
- Identifying pitfalls
- Defining the SLA
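As a rough illustration of what defining an SLA involves, an availability target can be converted into the downtime it allows. This is a minimal sketch, not a prescribed method: the function name `allowed_downtime_minutes` and the 30-day period are assumptions made for the example.

```python
def allowed_downtime_minutes(availability_pct, period_days=30):
    """Return the minutes of downtime an availability target permits.

    availability_pct: target such as 99.9 (percent of time the service is up).
    period_days: length of the measurement window (assumed 30 days here).
    """
    if not 0 < availability_pct <= 100:
        raise ValueError("availability_pct must be in the range (0, 100]")
    total_minutes = period_days * 24 * 60
    # The permitted downtime is the share of the window not covered by the target.
    return total_minutes * (100 - availability_pct) / 100
```

For a 99.9% target over 30 days this comes to roughly 43 minutes, which gives the Dev, SRE and Product teams a concrete number to discuss.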
The SRE team should work closely with the Development, DevOps and Product teams towards the common goal of reliability.
They should identify risks and threats, rank them, and then monitor and review them.
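One simple way to rank risks is to score each one as likelihood times impact and sort by the score. The notes do not say which ranking method is used, so the 1 to 5 scales, the `Risk` class and the `rank_risks` function below are all hypothetical, meant only as a sketch.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self):
        # Simple product scoring; other weightings are equally valid.
        return self.likelihood * self.impact


def rank_risks(risks):
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)
```

The ranked list can then be reviewed on a regular schedule, with scores adjusted as monitoring data comes in.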
SRE teams also participate in sprints, where they focus on reliability. Key milestones
- Production Readiness Review
- Launch Meetings
- Retrospective meetings
Production changes, outages,
How to measure SRE performance
- Impact - Metrics
- Maturity Matrix
- Re-evaluate
Measure
- Customer experience - Usability, measured as server-side latency and client-side latency
- Percentile latency - Broken down as follows
- Median latency - The 50th percentile: half of the responses are faster than this value. It doesn't take slow responses into consideration.
- Long-tail latency - Covers the slowest responses. Responses that go past the acceptable cut-off time are counted as errors.
- 99th percentile latency - The latency that 99% of responses stay within. This is how the cut-off is decided.
- Future load - Demand forecast. How much load might be expected in the future: memory usage, storage size, bandwidth, queries per second.
- Organic growth - Forecast with a statistical model based on history
- Inorganic growth - Cannot be extrapolated from past data. Ex: a temporary surge
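The measurements above can be sketched in a few lines. This is an illustration under assumptions, not a standard implementation: the function names, the nearest-rank percentile method and the straight-line growth model are choices made for the example. The straight-line fit only applies to organic growth; an inorganic surge has to be added by hand.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms, cutoff_ms):
    """Summarise latency samples (milliseconds) against a cut-off."""
    slow = sum(1 for s in samples_ms if s > cutoff_ms)
    return {
        "median_ms": statistics.median(samples_ms),
        "p99_ms": percentile(samples_ms, 99),
        # Responses slower than the cut-off are treated as errors.
        "over_cutoff_fraction": slow / len(samples_ms),
    }


def linear_forecast(history, periods_ahead):
    """Fit a least-squares straight line through equally spaced history
    values and extrapolate it periods_ahead steps past the last one."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two history points")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)
```

With 98 responses at 10 ms and two at 500 ms and 900 ms, the median stays at 10 ms while the 99th percentile is 500 ms, which shows why the median alone hides slow responses.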
Service Efficiency
- Load testing should be performed frequently by SREs.
- Outage prevention should be built into the system so that service outages don't occur.
- Cascading outages should be prevented.
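One common way to keep a failing dependency from dragging down its callers is a circuit breaker. The notes do not name a specific technique, so this is only one possible sketch; the class name, the thresholds and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock      # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of piling more load on a sick dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: allow a trial call; one more failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

Wrapping calls to a downstream service in `breaker.call(...)` means that after repeated failures the caller stops sending requests for a while, which limits how far one outage spreads.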
Working Model
- Strategic workshop for reliability with Dev and SRE
- Service Definition - Consider network architecture, security, reliability, business process, etc.
- Service Design - The Dev team creates the service; SRE focuses on infrastructure needs
- Testing - Done by Dev and SRE. Involve the customer as well
- Pilot release - Gather further feedback and get closer to the final product