Anjamma Special

Friday, December 9, 2022

Site Reliability Engineering - Part 2

SRE lifecycle

More involvement in architecture design
Less involvement in the development stage - inputs on testing
Increased involvement in the limited availability period - measure reliability and get feedback from users
General availability - full support from an operations perspective. More reliability inputs are gathered here.
Deprecation stage - The service is abandoned and no new features are added. The SRE team continues to support the old service. When no one is using it any longer, they turn off the resources for the old service, help with the transition to the new service and support the new service.

They provide guidance on

Architecture design
Provide inputs on how to create code that can be reliable
best practices
identify pitfalls
Help define SLA

SRE team should work closely with Development team, DevOps Team and Product Team towards a common goal of reliability.

They should identify the risks, threats and rank them. Monitor and review them.

SRE teams also participate in sprints and focus on Reliability. Key milestones

Production Readiness Review
Launch Meetings
Retrospective meetings

Production changes, outages,

How to measure SRE performance

Impact - Metrics
Maturity Matrix
Re-evaluate

Measure

Customer experience - Usability - Server side latency and client side latency
Percentile latency -
Median latency - The 50th percentile: half of the responses are faster than this value. Doesn't take slow responses into consideration
Long tail latency - counts responses that go past the acceptable cut-off time as errors
99th percentile latency - How the cut-off is decided: it is the latency within which 99 percent of responses complete.
Future Load - Demand forecast. How much load might be expected in future: memory usage, storage size, bandwidth, queries per second.
Organic growth - Statistical model, based on history
Inorganic growth - This cannot be extrapolated from past data. Ex temporary surge
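
To make the latency measures above concrete, here is a minimal sketch in Python. The latency values and the 500 ms cut-off are made-up assumptions for illustration, and the nearest-rank method is only one of several common ways to compute a percentile.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that
    at least pct percent of all samples are at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Hypothetical server-side latencies in milliseconds (made-up values).
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 900, 15, 14]

median = percentile(latencies_ms, 50)  # ignores the one very slow response
p99 = percentile(latencies_ms, 99)     # dominated by the slow response

# Long tail: treat every response slower than the cut-off as an error.
CUTOFF_MS = 500  # assumed acceptable cut-off, not a recommended value
too_slow = sum(1 for value in latencies_ms if value > CUTOFF_MS)
long_tail_error_rate = too_slow / len(latencies_ms)

print(median, p99, long_tail_error_rate)
```

The median stays low even though one request took 900 ms, which is why the notes say it does not take slow responses into consideration.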

Service Efficiency

Load testing should be performed frequently by SREs.
Should build outage prevention into the system so that service outages don't occur
Cascading of outages should be prevented

Working Model

Strategic workshop for reliability with Dev and SRE
Service Definition - Consider network architecture, security, reliability, business process etc
Service Design - Dev team creates the service, SRE focuses on infra needs
Testing - Dev and SRE. Involve the customer also
Pilot release - Further feedback and get closer to final product

Site Reliability Engineering- Part 1

Traditionally it has been developers developing applications and then operations team deploying it and maintaining the applications developed.

But there needs to be a team that is familiar, to a certain extent, with both worlds of app development and operations and acts as a quick liaison between these two teams. This team ensures good performance of the deployed application, plans the required resources and capacity, and ensures the applications are always up and running and can be recovered quickly if any disaster occurs.

This team is the SRE team - Site Reliability Engineering Team. It bridges the gap between Dev and Ops. In some organizations they may be the DevOps team, in others they may not.

It doesn't mean these people start developing applications or performing the complete operations activity, but they ensure that everything required for the application to perform well in the field is addressed by both teams. For example, the developer has to provide proper logs and proper metrics, perform the required testing etc., and the operations team should know which logs are available and monitor them, and should know the details of the available metrics and monitor and manage them. The middle team ensures both parties are aware of these and work towards the common goal.

What are skills needed for an SRE

Infrastructure management through code - Should be able to create automation scripts and code to administer the environment and the application.
Bridge between administrators and application developers
Focus should be to improve the reliability of the application by writing code
Should be able to identify where the problem lies - system, process, environment or technology - and resolve it through coordination.
Should work based on the defined SLA and SLO (Service Level Objective)
Should be able to put a practice/process through multiple tests and keep fine-tuning it. Should have very good debugging skills.

Responsibilities of SRE

Performance
Latency
Availability
Capacity planning
Change management
Monitoring and tracking
Disaster recovery
Efficiency

A day in the life of an SRE engineer

50% of time spent in Operational work
50% of time spent in coding to automate the operational work and make it more efficient.
Use the freed-up time to ensure this kind of operational overhead does not occur in the application in the first place, by working closely with the development team
Should work within the error budget - The acceptable level of downtime in the environment is called the error budget. Error budgets are the noncompliance tolerance for the SLO. Usually downtime occurs during upgrades etc. These upgrades are in turn aimed at reliability.
Should monitor and notify users. Should monitor and try automatic healing etc.; a person should be notified only if human intervention is required. - Alerts (immediate action), tickets (action later), logging (no action, just record it)
Should anticipate failures and be prepared to face them and act on them. Should maintain a playbook capturing MTTF (Mean Time To Failure - how long the system runs on average before it fails) and MTTR (Mean Time To Repair - how long it takes to recover from a failure)
Should do forecasting and capacity planning - Actions taken: provisioning (adding new locations, modifying existing locations) and testing
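
Forecasting for organic growth can be sketched with a simple statistical model based on history. The example below fits a straight line to past load by least squares. The monthly numbers and the 30% headroom are made-up assumptions, and a straight line is only one possible model, so real capacity planning may need something richer.

```python
def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical peak queries per second for the last six months.
monthly_peak_qps = [100, 110, 120, 130, 140, 150]
slope, intercept = fit_line(monthly_peak_qps)

def forecast(months_ahead):
    """Extrapolate the fitted trend months_ahead past the last data point."""
    x = len(monthly_peak_qps) - 1 + months_ahead
    return slope * x + intercept

# Assumed safety margin, since inorganic surges cannot be extrapolated.
HEADROOM = 1.3
capacity_needed = forecast(3) * HEADROOM
print(forecast(3), capacity_needed)
```

Inorganic growth, such as a temporary surge, does not show up in the history, which is why the sketch only adds a fixed margin for it instead of trying to predict it.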

How SRE is different from DevOps

DevOps - What needs to be done. Accept failures and address them when they occur. Automate operational tasks.
SRE - How it needs to be done. Enables the DevOps team and ensures everything they need is made available to them. Minimize errors and failures. Focus on minimizing the cost of failure through gradual change. Check whether automation of operational tasks is itself a problem for reliability. Measure operations and check whether they are reliable

Common between SRE and DevOps

Have skill overlap between the apps and operations teams.
Want to reduce the cost of failure by gradual change wherever needed
Encourage automation

Advantages of SRE

Break silos
Enable coordination and collaboration

SLA, SLI, SLO

SLI

SLI is less than or equal to SLO(Target)

Latency - Time to respond to a request
Error rate - % of requests that failed
System Throughput - total number of requests per second handled
Availability - Fraction of time service is in usable state
Yield - fraction of successful request
Durability - How long data will be retained for a service

Availability is most critical.

99% - 2 nines
99.999% - 5 nines
99.95% - 3 and a half nines
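
What the nines mean in practice is easier to see as allowed downtime. This small Python sketch converts an availability percentage into minutes of downtime per year, assuming a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability_pct):
    """Minutes per year a service may be down at the given availability."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for pct in (99.0, 99.95, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.1f} minutes per year")
```

Two nines allow about 5,256 minutes (roughly 3.65 days) per year, three and a half nines about 263 minutes, and five nines only a little over 5 minutes.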

SLO (Desirable target. What we want should be defined)

How much system is available
Precise numerical target for availability. Target is objective
Challenges in defining - Increased reliability leads to increased cost, less reliability allows greater velocity, should consider periodic downtime and maintenance
Publishing SLO - sets user expectations, reduces user complaints
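
An SLO target implies an error budget: the part of the traffic that is allowed to fail without breaking the objective. The sketch below is one way to track it per request. The function name, the 99.9% target and the request counts are assumptions for illustration only.

```python
def error_budget(slo_pct, total_requests, failed_requests):
    """Return (allowed_failures, remaining_budget) for a measurement window."""
    allowed = (1 - slo_pct / 100) * total_requests
    remaining = allowed - failed_requests
    return allowed, remaining

# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% SLO.
allowed, remaining = error_budget(
    slo_pct=99.9, total_requests=1_000_000, failed_requests=400
)

# One possible policy (an assumption): risky changes pause once the budget is spent.
can_release = remaining > 0
print(round(allowed), round(remaining), can_release)
```

With these numbers about 1,000 failures are allowed and about 600 remain, so the trade-off between reliability and velocity becomes a number both teams can look at.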

SLA - Service Level Agreement

Contract between provider and consumer - SLOs and consequences of not meeting SLO
Clearly defines metrics - Like speed of service
Responsibilities of service. Procedures for escalation,
Expectations of each party
Protects all parties and ensure understanding of requirements by all.
Contains 2 components - Service and management
Service is about the metrics, responsibilities, escalation procedures etc
Management is about how the service is measured, monitored, process etc

SRE defines SLI, helps in providing inputs for definition of SLA

Risk

Reliability is achieved through - Redundant hardware ( helps failover, maintenance)

Opportunity Cost - people are tied up maintaining the redundant servers. This keeps those engineers from taking up other work or new opportunities

Measuring level of Risk

Unplanned down time - desirable and acceptable level to be defined.
Time based availability = uptime / (uptime + downtime). Ex: 99.99% means approx. 52 minutes of downtime per year; 99.999% means approx. 5 minutes per year
Aggregate availability = successful request/ total requests
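
The two availability formulas above can be written directly as code. This is a minimal sketch; the uptime and request numbers are made up for illustration.

```python
def time_based_availability(uptime, downtime):
    """uptime / (uptime + downtime); both values in the same time unit."""
    return uptime / (uptime + downtime)

def aggregate_availability(successful_requests, total_requests):
    """Fraction of all requests that were served successfully."""
    return successful_requests / total_requests

# Hypothetical 30-day month with 43.2 minutes of downtime.
minutes_in_month = 30 * 24 * 60  # 43,200 minutes
downtime_minutes = 43.2
by_time = time_based_availability(minutes_in_month - downtime_minutes, downtime_minutes)

# Hypothetical traffic: 999,500 successful requests out of 1,000,000.
by_requests = aggregate_availability(999_500, 1_000_000)

print(f"{by_time:.4%} time based, {by_requests:.4%} aggregate")
```

The request-based number is often more useful for services whose traffic varies a lot during the day, because an outage at a quiet hour hurts fewer users than the same outage at peak time.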

Tuesday, December 21, 2021

Opportunity

Today I will talk about opportunities we get in life. Are you thinking that a great job in a big company in some foreign land is an opportunity? Are you thinking that getting funding from a big funding agency to start your business is an opportunity? Of course these are opportunities. But there are many more opportunities in life that God bestows on you constantly, which we fail to recognize, and so we think that we don't get enough opportunities. God floods us with opportunities always, but it is we who fail to recognize them.

Your younger sister asking for your help with her Maths exam is an opportunity. She will be so happy when she gets your help and will always think you are the smartest and brightest kid on the planet. There are so many children who do not have a sister and miss the happiness of having a younger one who thinks so highly of them and keeps them as a role model for her growth. The love you get from a sister is incomparable. So always think of everything you can do for your sister as an opportunity.

When your mother asks for your help to dry the clothes, think of that as an opportunity to serve her. It is a small thing you can do for her in return for the million things she constantly does for you. Never ever try to dodge any small opportunity you get to please your mother. There are millions who do not have a mother and are waiting to feel what it is like to serve a mother so dear.

When your father asks you to write a grocery list, it is an opportunity. When your grandma asks you to help her open the Chrome browser on her mobile phone, it is an opportunity. When your friend asks you to help her with homework, it is an opportunity. When a beggar asks you for some food, it is an opportunity. When somebody asks you for a favor, it is an opportunity. Why should he/she ask you for a favor when there are billions of people in this world?

So never attach opportunities to only materials, attach it to every act you can do to please any living thing in this world. It is difficult to get such opportunities, if we are born as an insect.

We have the sense organs and can sing, dance, feel, taste and do so many things. Use the sense organs to effectively pray to the God and use this unique opportunity you got to be born as a human and try to sing his name, chant his name, dance to his bhajans, eat his prashadh, feel his presence and enjoy the life to the fullest. Start looking at this life full of opportunities.

Thursday, December 16, 2021

Marghazhi - Margasheersha

To both my daughters, I want to tell you something very special today. It is about the month Marghazhi and what makes it so special. Margasheersha, as we call it, is considered very auspicious.

Lord Krishna says in Bhagavadh Geetha that he is the month Margasheersha among the 12 months. It is the dawn period of the Devas. The festivals of Vaikunta Ekadheshi, Hanuman Jayanthi, Aruithira darshan fall in this month. Thirupavai is recited during this month. It is very good especially for girls like you. Also during this month there are many Ayyappan devotees who go to Sabari Mala.

I will be really happy if you both do Pavai Nombu. Andal performed this to merge with Lord Ranganathan. It is performed by girls and women to get a good husband and for the longevity of the husband. Girls should get up early before sunrise, pray to God and read Thirupavai. This is so simple, right? Can you do this? Getting a good and healthy husband is very important to a girl. The same is important for a man too, to get a good wife. There is a way for us to pray for this and there is nothing wrong in following that. After all, I have taught both of you Carnatic music, so now is the time to practice that by singing the Pavai songs and also get God's blessings. So start right from today, what are you waiting for?

Dhana

Today I want to tell you about Dhana. Dhana is something you give to others, which can be utilized by them. This gives Punya to the person giving Dhana, so he can get a good life here on earth and also after he leaves the earth.

There is a story that comes in the Kathopanishad. The Rishi gives away cows in Dhana, and he gives away cows that have become old and would no longer be able to give milk. His son Nachiketha was seeing this and he felt what his father was doing was wrong. Actually it is wrong.

So what are the things you should consider when giving the Dhana.

Give to most deserving people. Do not give to relatives, friends who are already well off and do not really value what you are giving as Dhana.
The material you give in Dhana should be of the best quality. Not something that you no longer need and want to throw away. Giving something as Dhana just to avoid throwing it away is a sin.
It will be really good if the material you give is your most favorite.

There are people who say Anna Dhana ( giving food) is the highest form of Dhana, as this is the only Dhana in which people say enough, as they cannot eat more than what their stomach can hold.

I am specially amazed by the Kanya Dhana. It is not about commoditizing women and considering them as a material that can be given to someone. But just think about the girl child's parents, who raise the child by pouring out their affection and love and do all that is required for the girl. The most precious gem in their life is the girl child, and relinquishing all their claims on her after giving her to a person in marriage is something extraordinary. Giving away what is most precious to you is true Dhana.

Wednesday, December 15, 2021

Automation - Basics

Today I will tell you something about automation. Automation means you do certain tasks using computer software/hardware without involving human action. For example, if you submit the details for opening a bank account in a proper format, then there is no need to talk to bank staff etc.; the account is opened completely automatically and your account details are mailed to your inbox.

A more complex example is the insurance claims process. Here multiple discussions happen between the stakeholder parties before a claim is settled. But if we automate it, then the software will take care of everything, provided we give the needed inputs in the required standard format.

You can think of many complex examples. Think about any domain: in any task where humans are needed, they could be replaced with software/hardware. Other than the emotions of human beings and situation-based intelligence, everything else is getting automated. Now scientists are trying to have software make decisions as well as mimic human emotions.

If the task is simple like checking the mail content for some keywords and then doing a paper work and sending a mail, a dumb rule based program can automate this work. But if the work is complex, like based on analysis of a huge set of data, if some decisions are to be made, then we need to go for machine learning programs, that are not as dumb, but try to match the human thinking and at times, try to better the human thinking.
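
A "dumb" rule based program of the kind described above can be very short. The sketch below checks the mail content for keywords and decides where the mail should go. The keywords and queue names are made up for illustration; a real process would define its own rules.

```python
# Hypothetical keyword -> work queue rules (illustrative only).
RULES = {
    "claim": "claims",
    "refund": "billing",
    "password": "it-support",
}

def route_mail(body):
    """Return the queue of the first rule whose keyword appears in the mail."""
    text = body.lower()
    for keyword, queue in RULES.items():
        if keyword in text:
            return queue
    # No rule matched, so a human has to look at it.
    return "manual-review"

print(route_mail("Please process my REFUND for last month"))
print(route_mail("Hello, I have a general question"))
```

The program never learns anything: it only follows the fixed rules it is given, which is exactly what separates it from the machine learning programs mentioned above.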

But now, if you want to automate a business process, what are the things you should do? Think first, strategize and consider what complexities may be involved before moving further. Take 5 minutes and think. Read on. Automated means the task needs to be performed by a computer by itself, using the instructions you provide to it. So if it has to work properly, you should be able to provide proper instructions in such a way that the computer won't get confused and will be able to perform by itself, just by following the instructions you provide. To reach this level, we need to have many things sorted out.

Process fitness - See whether the whole process we are trying to automate is fit to be automated. How can you decide this? We can decide based on the few criteria below.

Whether the instructions can be given as well-defined rules. They should not be ambiguous. They should be clearly expressed as rules.
The process should be repetitive. If the process goes through unique workflow steps each time, it is difficult to provide a common instruction to the computer, right?
Whether the whole process is standardized and stable. It should not keep changing constantly due to external factors which are not in our control.
Whether we can give the input in a standard format. This should also not keep on changing.

Automation Complexity - We need to see how complex the automation is going to be. This will consider factors like what type of application, number of screens, what are the business logic algorithms in place, what and how many type of inputs to be given etc.

Based on this assessment, we can classify any business process that we consider for automation into 4 categories

No Automation - Cannot do any automation
Semi automate
High Cost - Fully Automate
Low Cost - Fully Automate

Tuesday, December 7, 2021

Starting LOVING

I know this topic entices you. You and many teenagers like you think love is lovely and the moment I say LOVE, what you people imagine is pink, balloons, cake, costly gifts, new dress, chocolates, dinner in nice restaurants, picnics, and the so many good things young lovers do the moment they fall for each other.

I am not talking about this imaginary love and this love that is short lived. I want to talk to you about the love that you need to develop towards all.

Just think: in your home, everything is neat and tidy, including the bathrooms and toilets. Have you ever thought about who keeps these tidy for you to use? There are times you throw things hither and thither and rush to school, and when you come back everything is neat and tidy. There are daily chores like cooking, washing, sweeping, buying vegetables, buying groceries etc. that get done meticulously without being missed even for a day. These make your life easy. What is it that you can do to bring a small smile to the face of the people who do this? Think, that is love.

Just think about your teachers who teach you. They do have lot of personal life problems, but once inside classroom, they are only concerned about the subject and fully work to make sure what they teach, reaches your brain. They don't allow their personal life problems inside the classroom for your welfare. What can you do to bring a small smile to their face?

There are so many people whom you come across in life who contribute to your welfare directly, like the doctor you visit, the shopkeeper of the shop where we get our idli flour daily, our servant maid, our milkman, our garbage collector, our newspaper person. Is there any small gesture you can do to bring a small smile to their face? Do you remember their names? Do you greet them by their names? Is there any time you thanked them for their service?

There are many people whose faces you have not seen, but they are also working for your welfare. The farmer, the prime minister, the chief minister, the army staff, the railways staff, the roadways staff, the numerous government and non-government staff, and so many more. Is there any gesture you can do to bring a small smile to their faces?

Why are you studying? Education is a tool. It should be used effectively and powerfully to bring a small difference to all the people you are connected with. Think about it. You should not always think about the by-product, which is money, but think about what you can do using your education to bring a smile to everyone around, how you can help them, how you can make the world a better place to live. That is LOVE. So start loving everyone from today, and when there is Love, there is care, there is sincerity and there is genuineness to do something for the LOVED ONE. So Start LOVING everyone from today. All the best, child.