SRE lifecycle
- Architecture design stage - SRE involvement is high
- Development stage - less involvement; SRE mainly gives input on testing
- Limited availability period - involvement increases; measure reliability and get feedback from users
- General availability - full support from the operations perspective. Most reliability input is gathered here.
- Deprecation stage - The service is being retired and no new features are added. The SRE team continues to support the old service. Once no one is using it any longer, they turn off the old service's resources, help with the transition to the new service, and support the new service.
- Architecture design
- Provide input on how to write reliable code
- Best practices
- Identify pitfalls
- Help define the SLA
- Production Readiness Review
- Launch Meetings
- Retrospective meetings
- Impact - Metrics
- Maturity Matrix
- Re-evaluate
- Customer experience - Usability - Server-side latency and client-side latency
- Percentile latency
- Median latency - the 50th percentile, i.e. the response time that half of all requests stay within. It describes the typical request and doesn't take slow responses into consideration.
- Long-tail latency - covers the slowest responses; responses that exceed the acceptable cutoff time are counted as errors
- 99th percentile latency - how the cutoff is decided: it is set at the latency that 99 percent of requests stay within.
- Future load - Demand forecast: how much load can be expected in the future, in terms of memory usage, storage size, bandwidth, and queries per second.
- Organic growth - Statistical model based on historical data
- Inorganic growth - Cannot be extrapolated from past data, e.g. a temporary surge
- Load testing should be performed frequently by SREs.
- Build outage prevention into the system so that service outages don't occur
- Cascading outages should be prevented
- Strategic workshop for reliability with Dev and SRE
- Service Definition - Consider network architecture, security, reliability, business processes, etc.
- Service Design - Dev team creates the service; SRE focuses on infrastructure needs
- Testing - Dev and SRE. Involve the customer as well
- Pilot release - Gather further feedback and get closer to the final product
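
The latency percentiles and the organic-growth forecast from the notes above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the sample latencies, the 500 ms cutoff, the QPS history, and the helper names (percentile, fraction_over_cutoff, linear_forecast) are all hypothetical. The percentile uses the nearest-rank method, and a straight-line least-squares fit stands in for whatever statistical model a team would actually use.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that at least pct% of samples do not exceed."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def fraction_over_cutoff(samples, cutoff_ms):
    """Share of responses slower than the cutoff, i.e. those counted as errors."""
    return sum(1 for s in samples if s > cutoff_ms) / len(samples)


def linear_forecast(history, periods_ahead):
    """Fit a least-squares line to evenly spaced history and extrapolate it."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history)) / sum(
        (x - mean_x) ** 2 for x in range(n)
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)


# Hypothetical data: 100 response times in ms, two of them very slow.
latencies_ms = [20] * 90 + [40] * 8 + [900] * 2
# Hypothetical organic growth: queries per second over four past periods.
qps_history = [100, 110, 120, 130]

print("median:", statistics.median(latencies_ms))
print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))
print("share over 500 ms cutoff:", fraction_over_cutoff(latencies_ms, 500))
print("forecast QPS two periods ahead:", linear_forecast(qps_history, 2))
```

With two slow outliers in 100 samples, the median stays at the typical value while the 99th percentile lands on the slow responses, which is why the notes track both. A straight-line fit only suits organic growth; inorganic events such as a temporary surge have to be planned for separately.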