Ten best practices for building resilient systems
Resilience (the IT definition): The ability of a software system to continue functioning despite encountering various unexpected events or challenges.
Stable systems are not good enough anymore
We used to architect and build systems chasing the elusive goal of stability. Stable systems were the crowning achievement. Back then, we did not have to update components, add services, respond to legislation, and take on external dependencies every month. Back then, stable systems might have been possible. Today, however, stability is increasingly unrealistic. Change tolerance is a much better target than stability moving forward. One way to ensure change tolerance is by designing your teams and systems for maximum resilience.
TL;DR: You can make all the lists you want, but you will not gain much if your staff does not practice responding to failovers, executing backup recoveries, making deployments and rollbacks, and responding to monitor alerts. Only once the human components of the system are committed to resilience will implementing the list below give you resilience and change tolerance.
I recently spoke with Veli Pehlivanov, CTO at Resolute Software, about his 10 best practices for building resilient systems. You can watch the video here:
In this article, I list and expand on some of Veli’s answers for further information.
Here is Veli’s list:
How you deploy updates or new features can significantly impact system stability. The goal is to prevent updates or new features from becoming "Unexpected events or challenges." Through that lens, Deploy in Place (where you update every node at once) can put more pressure on a system than Rolling or Blue/Green updates. Often, the kind of deployment you are making will determine how you need to deploy, so if you are required to use a specific rollout method, being aware of how your update plans put stress on your system can help you build resilience to that stressor.
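To make the difference between rollout styles concrete, here is a minimal sketch of a rolling update in Python. The names `rolling_update`, `update`, and `is_healthy` are hypothetical, not part of any real deployment tool; the sketch assumes you can update a node and ask whether it is healthy afterwards.

```python
from typing import Callable, List


def rolling_update(
    nodes: List[str],
    update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 1,
) -> List[str]:
    """Update nodes in small batches and stop at the first unhealthy node.

    Returns the nodes that were updated and passed their health check.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    updated: List[str] = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            update(node)
        for node in batch:
            if not is_healthy(node):
                # Halt here so the remaining nodes keep serving the old version.
                return updated
            updated.append(node)
    return updated
```

Setting `batch_size` to the number of nodes behaves like Deploy in Place: every node is touched before any health check runs, which is exactly the extra pressure described above.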
Redundancy is not just about adding VMs in the cloud. Many of us manage hybrid systems that depend on external services (e.g., a data feed service provided by an external vendor).
Have a backup strategy
Making a backup is an action. Having a backup strategy includes processes. What you back up and how often depends on the data and how critical it is. Backing up a customer-facing transactional database in use 24/7 will require a different backup approach than an internal-facing reporting database that gets used heavily twice a week. You can use many backup and backup retention strategies as a template: full backup, differential backup, mirror, snapshot, Grandfather-Father-Son, and more. Let your business needs drive your recovery plan, and let the recovery plan requirements drive the backup method you choose.
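As an illustration of a retention strategy, here is a rough sketch of a Grandfather-Father-Son rule in Python. The function name `gfs_keep` and the specific choices (7 daily copies, Sunday backups for 4 weeks, first-of-month backups for roughly 12 months) are assumptions for the example, not a recommendation; your recovery plan should set the real numbers.

```python
from datetime import date
from typing import Iterable, Set


def gfs_keep(
    backups: Iterable[date],
    today: date,
    daily_days: int = 7,
    weekly_weeks: int = 4,
    monthly_months: int = 12,
) -> Set[date]:
    """Return the backup dates a simple Grandfather-Father-Son policy keeps."""
    keep: Set[date] = set()
    for backup_day in backups:
        age = (today - backup_day).days
        if age < 0:
            continue  # ignore backups dated in the future
        if age < daily_days:
            keep.add(backup_day)  # son: every recent daily backup
        elif backup_day.weekday() == 6 and age < weekly_weeks * 7:
            keep.add(backup_day)  # father: Sunday backups for a few weeks
        elif backup_day.day == 1 and age < monthly_months * 31:
            keep.add(backup_day)  # grandfather: first-of-month backups
    return keep
```

Anything not in the returned set is a candidate for deletion, which is worth reviewing by hand the first few times you run a rule like this.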
Recovery and restoration planning
In the event of a server failure, how long can your surviving servers handle the load before the failed server is replaced or restored? If you do not have a failover plan for a server that is a critical business asset, your recovery and backup restoration plan must accommodate business needs. This should drive your backup strategy.
You will get plenty of experience making backups, but remember to get experience finding and restoring backups according to the Deployment Planning you have done. Spend more time practicing restoration than you want to. Not getting this right negates the value of all the time and money you have spent making backups.
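One way to start answering the first question is with simple arithmetic. The sketch below assumes identical servers and an evenly spread load, which real systems rarely have; the function name and parameters are hypothetical.

```python
def utilization_after_failures(
    servers: int,
    capacity_per_server: float,
    peak_load: float,
    failed: int = 1,
) -> float:
    """Fraction of surviving capacity that peak load would consume."""
    survivors = servers - failed
    if survivors <= 0:
        raise ValueError("no surviving servers")
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    return peak_load / (survivors * capacity_per_server)
```

For example, four servers rated at 1,000 requests per second each, with a peak load of 2,700, would run at about 90% after losing one server. A result above 1.0 means the survivors cannot carry the load at all, so restoration time becomes outage time.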
When failover servers come online, what data replication do you need to do from your primary servers before the failover can take traffic? Do you need to include your failover server with your main servers in your data replication loop? Practice both ways and see how much time is required. If your automation spins up a backup VM, or another microservice and its container, what parts of your system need to know when they have come online? How will those components realize they are ready for traffic?
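The "ready for traffic" question often comes down to polling a readiness check with a deadline. Here is a minimal sketch; `wait_until_ready` and its `check` callable are hypothetical, and what `check` actually tests (replication caught up, health endpoint responding) is up to your system.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False  # caller decides whether to retry, alert, or roll back
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the function can be exercised in a test without real waiting. Timing how long this takes during practice runs gives you a real number for your recovery plan.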
Just as with backups, there are many flavors of load-balancing strategies (Round-Robin, Least Connections, Least Response Time, IP Hash, Random, Chained Failover, weighted versions of many of these, and dynamic and adaptive balancing). Each is well suited to a specific set of conditions and budgets. Let your business requirements drive this. There is no need to spend more money on a dynamic and adaptive solution when a simple Least Connections algorithm will do. Include your load-balancing controller in your restoration and recovery plans.
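To show how small the simple strategies are, here is a sketch of Round-Robin and Least Connections selection in Python. The function names and the backend names are made up for the example, and a real load balancer also handles health checks and connection tracking.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def round_robin(backends: List[str]) -> Iterator[str]:
    """Yield backends in a fixed repeating order."""
    return cycle(backends)


def least_connections(active: Dict[str, int]) -> str:
    """Pick the backend with the fewest active connections.

    Ties are broken by name so the choice is deterministic.
    """
    return min(active, key=lambda name: (active[name], name))
```

A quick usage example: `least_connections({"a": 5, "b": 2, "c": 2})` picks `"b"`, while `round_robin(["a", "b", "c"])` hands out `a`, `b`, `c`, `a`, and so on regardless of load.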
Sure, isolation sounds depressing, but it can bring you much joy when an isolated service fails and does not take down anything else in the process! Isolation is all about separating different components, applications, or processes to minimize their interactions and potential impact on each other.
The primary objective of isolation is to enhance security, improve system stability, and protect critical resources from unauthorized access or interference. That almost sounds like resilience in a bottle! There are key components of a system you can isolate effectively:
Segregating different networks or network segments to prevent unauthorized access between them. This can be achieved through firewalls, virtual LANs (VLANs), network segmentation, or software-defined networking (SDN) techniques.
Running applications or services in separate containers or virtual machines to prevent one application's issues from affecting others. A microservice (a service that does only one part of a larger offering) can run in its own container along with everything it requires. No other services are affected if the microservice causes its data access layer to become unresponsive, not even other instances of the service that failed. (So, with good Monitoring and Detection, you may know that a container has failed.)
Restricting access to sensitive data based on user roles and permissions. Implementing proper access controls allows data to be isolated from unauthorized users and prevents potential breaches. API-accessible data access layers are essential here. No service should write directly to your database. If it did, updating database permissions, access, or the schema would involve editing and re-compiling the service. That is asking for trouble.
Separating processes within an operating system to ensure that if one process crashes or becomes compromised, it does not affect the stability or performance of other processes.
Utilizing virtualization technologies to create isolated virtual machines or containers that act as independent systems. This approach allows running multiple isolated environments on the same physical hardware.
Cloud resource isolation
In a cloud environment, isolating resources like virtual machines, networks, and storage helps maintain the privacy and security of data and services belonging to different customers or tenants.
Design systems with redundancy and fault tolerance to isolate and contain failures, ensuring they do not cascade and cause widespread outages.
Failover automation – replication and restarting
This is the process of automatically detecting and responding to a failure by transferring its operation from the primary or active component to a standby or redundant one. It can be an entire server with a monolithic block of server code or a microservice in a container if you have good isolation practices.
The primary goal of failover automation is to minimize downtime and ensure continuous availability of critical applications and services. For this to work, you will need Monitoring, Decision-Making (rules-based algorithms are fine), Failover process, Validation, and Recovery-Restore to return to normal. The replacement service or container may take the place of the failed component. This is the least disruptive, and with microservices, containers, and VMs, easy too.
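The monitoring, decision-making, failover, and validation steps above can be sketched as a small rules-based controller. Everything here is hypothetical: the class name, the three callables, and the threshold of three consecutive failed probes are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverController:
    probe_primary: Callable[[], bool]      # monitoring
    promote_standby: Callable[[], None]    # failover process
    probe_standby: Callable[[], bool]      # validation
    failure_threshold: int = 3             # decision rule
    consecutive_failures: int = 0
    active: str = "primary"

    def tick(self) -> str:
        """Run one monitoring cycle and return the active component."""
        if self.active != "primary":
            return self.active  # already failed over; recovery is a separate step
        if self.probe_primary():
            self.consecutive_failures = 0
            return self.active
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.promote_standby()
            if not self.probe_standby():
                raise RuntimeError("standby failed validation after promotion")
            self.active = "standby"
        return self.active
```

Requiring several consecutive failures before promoting is a deliberate choice in this sketch: it trades a slightly slower failover for fewer false alarms from a single dropped probe.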
If you maintain a low-availability resource, letting a failed service be unavailable while executing your Recovery and Restore plan might be okay. In that case, you did not need to read this section.
Monitoring and alerting
What you do not know cannot hurt you. But it can get you fired. With so many moving parts, most systems are impossible to monitor manually. The days of running a sysadmin window on a computer in the data center are over. Good monitoring and alerting will tell you what has failed and what it was working on when it failed. Great monitoring will tell you when something is out of a specified tolerance or threshold and alert you to what may become a failure soon enough for you to prevent it.
Integrate your system monitoring with your Incident Response plan. Imagine spinning up a new VM and transferring the load to the new server before the old server even fails! Of course, you will want to know why your server was about to fail so you can address that. The more you monitor, the more you can use analytics to correlate events to find complicated causes or contributors to failures. Remember to include Monitoring and Alerting systems in your high-availability failover, backup, and recovery plans. This will cost more than you want and is not fun or exciting work. But not doing this is like trying to do your job while blindfolded.
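Here is a minimal sketch of the two ideas above: classifying a reading against thresholds, and estimating how long until a limit is reached. The function names are hypothetical, and the straight-line projection is a deliberate simplification; real metrics are rarely this well behaved.

```python
from typing import List, Optional, Tuple


def classify(value: float, warn: float, critical: float) -> str:
    """Map a metric reading to an alert level."""
    if value >= critical:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


def hours_until_limit(
    samples: List[Tuple[float, float]], limit: float
) -> Optional[float]:
    """Estimate hours until limit from (hour, value) samples ordered by time.

    Uses a straight line between the first and last sample. Returns None
    when there is too little data or the metric is not rising.
    """
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    rate = (v1 - v0) / (t1 - t0)
    if rate <= 0:
        return None
    return max(0.0, (limit - v1) / rate)
```

For instance, disk usage that climbed from 70% to 80% over ten hours would be projected to hit a 95% limit in about fifteen more hours, which is the kind of lead time that lets you act before the failure.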
Versioning and rollbacks
It is not a question of IF a rollback will happen; it is WHEN a rollback will happen. Even if your developers and testers are flawless (I will wait for you to stop laughing), things out of your control, like privacy legislation or industry security certification requirements, may force the rollback of one of your releases. Know how to do this. Practice this so it does not become a vector for instability or a way to introduce service/schema versioning problems.
Try everything in your resilience plan. Practice everything. Think back and recall a time when fixing some unexpected failure made things worse before they got better. What if restoring to last week's server backup means restoring a data access layer incompatible with the schema change made 3 days ago?
How do you prevent this? How do you recover from it if you did not discover this before the restoration? Test software. Test execution. Test assumptions. Fix or update as required to address issues you find.
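One way to catch the schema mismatch before a restore is a pre-restore compatibility check. This sketch assumes each backup records the range of schema versions its data access layer supports, using dotted numeric version strings; that bookkeeping, and both function names, are assumptions for the example.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version like '4.10' into (4, 10) for comparison."""
    return tuple(int(part) for part in text.split("."))


def restore_is_compatible(
    current_schema: str, dal_min_schema: str, dal_max_schema: str
) -> bool:
    """True if the live schema falls inside the range the backup's DAL supports."""
    low = parse_version(dal_min_schema)
    high = parse_version(dal_max_schema)
    return low <= parse_version(current_schema) <= high
```

If last week's backup supports schemas 3.0 through 4.1 and the live database moved to 4.2 three days ago, `restore_is_compatible("4.2", "3.0", "4.1")` returns `False`, and you find out before the restore instead of after.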
This is not how I want to spend my month
No, probably not. But do not think of it as something you do for a month, and then you are done. It is something you schedule for a few hours every week. Forever. It is writing Deployment Playbooks, Recovery Playbooks, and Rollback Playbooks. It is practicing one playbook every week. It is using monitoring and programmatic responses to prevent failures.
Solid data center operations are not sexy or glamorous IT work (like building out dual 64-core Threadripper Pro CPUs with 8 Nvidia A100s with 1024GB of HBM2 memory for screaming fast AI servers), but doing this less exciting work is how you build resilience, so your screaming fast AI server always has something to work on.