High Availability & Disaster Recovery
In general, our workloads should be highly available and able to recover from disasters quickly, with little to no data loss.
This document refers to failure domains frequently. In the cloud, a failure domain is usually an Availability Zone (in most clouds' terminology); in on-premise deployments it depends on the project's budget and the customer's definition, but at a minimum the servers in separate failure domains should have separate power and networking infrastructure.
Our Disaster Recovery Strategy:
Multi-Zone Deployment within Dammam Region:
Within the Dammam region (me-central2), distribute your resources across multiple zones (e.g., me-central2-a, me-central2-b, me-central2-c). This mitigates the risk of downtime due to zone-specific failures or maintenance events.
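As an illustration only, the sketch below uses the google-cloud-compute Python client to spread a set of Compute Engine instances round-robin across the three Dammam zones; the project ID, instance names, image, and machine type are placeholder assumptions, not values from this deployment.

```python
from google.cloud import compute_v1

PROJECT = "my-project"                      # placeholder project ID
ZONES = ["me-central2-a", "me-central2-b", "me-central2-c"]
NAMES = ["web-1", "web-2", "web-3"]         # hypothetical instance names

instances = compute_v1.InstancesClient()

for name, zone in zip(NAMES, ZONES):
    boot_disk = compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            source_image="projects/debian-cloud/global/images/family/debian-12",
            disk_size_gb=10,
        ),
    )
    instance = compute_v1.Instance(
        name=name,
        machine_type=f"zones/{zone}/machineTypes/e2-medium",
        disks=[boot_disk],
        network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
    )
    # Each instance lands in a different zone, so a single-zone outage
    # takes down at most one replica.
    operation = instances.insert(project=PROJECT, zone=zone, instance_resource=instance)
    operation.result()  # wait for the zonal operation to finish
```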
Data Replication:
Implement replication of critical data across zones within the Dammam region and, where required, across regions. Leverage GCP services such as Cloud Storage for object storage, Cloud SQL for relational databases, or Cloud Spanner for globally distributed databases to ensure data redundancy and availability.
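For object data, one illustrative approach is to copy critical objects from a bucket in me-central2 into a bucket in another region with the google-cloud-storage Python client; the bucket names and prefix below are placeholders, and managed alternatives (dual-region buckets, Cloud SQL read replicas, Spanner) can replace this pattern.

```python
from google.cloud import storage

client = storage.Client()

# Placeholder bucket names: primary bucket in me-central2, copy in another region.
source_bucket = client.bucket("app-data-me-central2")
dest_bucket = client.bucket("app-data-europe-west1")

for blob in client.list_blobs(source_bucket, prefix="critical/"):
    # Copy each critical object into the out-of-region bucket.
    source_bucket.copy_blob(blob, dest_bucket, new_name=blob.name)
```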
Load Balancing and Traffic Distribution:
Utilize GCP's load balancing services, such as HTTP(S) Load Balancing or Network Load Balancing, to distribute incoming traffic across multiple instances and zones within the Dammam region. This optimizes performance and ensures high availability by automatically routing traffic away from failed instances or zones.
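If the workloads run on GKE, the external HTTP(S) load balancer can be provisioned declaratively through a standard Kubernetes Ingress handled by the GKE ingress controller. A minimal sketch with the Python Kubernetes client, assuming a Service named web on port 80 already exists (all names are placeholders):

```python
from kubernetes import client, config

config.load_kube_config()

ingress = client.V1Ingress(
    metadata=client.V1ObjectMeta(name="web-ingress"),
    spec=client.V1IngressSpec(
        rules=[
            client.V1IngressRule(
                http=client.V1HTTPIngressRuleValue(
                    paths=[
                        client.V1HTTPIngressPath(
                            path="/",
                            path_type="Prefix",
                            backend=client.V1IngressBackend(
                                service=client.V1IngressServiceBackend(
                                    name="web",
                                    port=client.V1ServiceBackendPort(number=80),
                                )
                            ),
                        )
                    ]
                )
            )
        ]
    ),
)

# On GKE this Ingress is reconciled into an external HTTP(S) load balancer
# that routes traffic to healthy backends across zones.
client.NetworkingV1Api().create_namespaced_ingress(namespace="default", body=ingress)
```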
Automated Backup and Restore:
Set up automated backup schedules for critical data and configurations, storing backups in alternative regions or storage locations outside the Dammam region. Utilize GCP services like Cloud Storage for backups and implement automated backup policies to ensure data integrity and rapid restoration in case of data loss or corruption.
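A minimal sketch of the "store backups outside the Dammam region" step, assuming the backup job has already produced a local database dump and using a placeholder bucket located in another region; in practice this would run from a scheduled job such as a Kubernetes CronJob or Cloud Scheduler task.

```python
from datetime import datetime, timezone
from google.cloud import storage

BACKUP_BUCKET = "dr-backups-europe-west1"   # placeholder bucket outside me-central2
LOCAL_DUMP = "/backups/app-db.dump"         # produced by the backup job

client = storage.Client()
bucket = client.bucket(BACKUP_BUCKET)

# Timestamped object name so older backups are kept until retention cleanup.
object_name = f"app-db/{datetime.now(timezone.utc):%Y-%m-%dT%H%M%SZ}.dump"
bucket.blob(object_name).upload_from_filename(LOCAL_DUMP)
```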
Monitoring and Alerting:
Implement comprehensive monitoring and alerting using GCP's monitoring and logging services, Cloud Monitoring and Cloud Logging (formerly Stackdriver). Monitor key metrics, including latency, error rates, and resource utilization across zones within the Dammam region, and configure alerts to notify you of any anomalies or potential issues.
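As an illustration, the sketch below creates a simple Cloud Monitoring alert policy with the monitoring_v3 Python client that fires when average instance CPU utilization stays above 80% for five minutes; the project ID is a placeholder, and the metric, threshold, and notification channels would be chosen per workload.

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

PROJECT_ID = "my-project"  # placeholder

client = monitoring_v3.AlertPolicyServiceClient()

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="CPU utilization > 80% for 5 minutes",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type="compute.googleapis.com/instance/cpu/utilization" '
            'AND resource.type="gce_instance"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.8,
        duration=duration_pb2.Duration(seconds=300),
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=60),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
            )
        ],
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="High CPU utilization",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
)

client.create_alert_policy(name=f"projects/{PROJECT_ID}", alert_policy=policy)
```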
Disaster Recovery Testing:
Regularly test your disaster recovery procedures and failover mechanisms to validate their effectiveness and reliability. Conduct simulated failover drills and recovery exercises to confirm readiness and to identify and address any gaps or weaknesses in your DR strategy.
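One drill that is cheap to automate is simulating the loss of a single failure domain on the Kubernetes side by cordoning every node in one zone and verifying that the workloads stay available. A sketch with the Python Kubernetes client, assuming nodes carry the standard topology.kubernetes.io/zone label:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

FAILED_ZONE = "me-central2-a"  # the failure domain whose loss we simulate

for node in v1.list_node().items:
    zone = node.metadata.labels.get("topology.kubernetes.io/zone")
    if zone == FAILED_ZONE:
        # Equivalent of `kubectl cordon`: no new pods are scheduled here.
        v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
        print(f"cordoned {node.metadata.name} in {zone}")

# After the drill, verify the apps still serve traffic, then uncordon the nodes
# with {"spec": {"unschedulable": False}}. A full drill would also drain the pods.
```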
By following this disaster recovery strategy for Google Cloud Platform, focused on the Dammam region (me-central2) in KSA and its zones, you can ensure robust resilience and continuity for your applications and services, even in the face of unexpected disruptions or disasters.
High Availability:
The Kubernetes control plane should be highly available across different failure domains:
etcd needs 3 replicas for consensus and can tolerate only one replica being down, so the replicas need to be distributed across three different failure domains.
The other control plane components (scheduler, controller-manager, api-server) need at least 2 replicas across two different failure domains.
Kubernetes nodes need to be distributed across at least two failure domains (and if we run a quorum-based workload that must keep at least two replicas up, such as etcd, Kafka with KRaft, or ZooKeeper, then we need nodes across three failure domains as well).
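To make these placement requirements auditable, a small check such as the sketch below (Python Kubernetes client) can report how nodes and control plane pods are spread across failure domains; it assumes nodes carry the standard topology.kubernetes.io/zone label and that the control plane runs as static pods in kube-system (as with kubeadm).

```python
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Map node name -> failure domain (zone) from the standard topology label.
node_zone = {
    n.metadata.name: n.metadata.labels.get("topology.kubernetes.io/zone", "unknown")
    for n in v1.list_node().items
}
print("Nodes per failure domain:", Counter(node_zone.values()))

# Report where the static control plane pods land.
prefixes = ("etcd", "kube-apiserver", "kube-scheduler", "kube-controller-manager")
for pod in v1.list_namespaced_pod("kube-system").items:
    if pod.metadata.name.startswith(prefixes):
        print(pod.metadata.name, "->", node_zone.get(pod.spec.node_name))
```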
All our apps are stateless and scalable, and we run them with a minimum of 2 replicas.
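A minimal sketch of such a deployment with the Python Kubernetes client: 2 replicas plus a topology spread constraint so the scheduler places them in different failure domains; the app name and image are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()

labels = {"app": "web"}  # placeholder app name

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="web"),
    spec=client.V1DeploymentSpec(
        replicas=2,  # minimum of 2 replicas for every stateless app
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(name="web", image="registry.example.com/web:1.0")
                ],
                topology_spread_constraints=[
                    client.V1TopologySpreadConstraint(
                        max_skew=1,
                        topology_key="topology.kubernetes.io/zone",
                        when_unsatisfiable="DoNotSchedule",
                        label_selector=client.V1LabelSelector(match_labels=labels),
                    )
                ],
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```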
SQL databases should have at least one synchronous replica in a failure domain different from the primary's, so that the database can fail over if the primary fails.
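Assuming PostgreSQL with streaming replication (the connection details below are placeholders), the synchronous replica can be verified from the primary via pg_stat_replication:

```python
import psycopg2

# Placeholder connection parameters for the primary instance.
conn = psycopg2.connect(host="primary.db.internal", dbname="postgres",
                        user="postgres", password="***")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT application_name, client_addr, sync_state "
        "FROM pg_stat_replication;"
    )
    for app, addr, sync_state in cur.fetchall():
        # Expect at least one row with sync_state = 'sync'.
        print(app, addr, sync_state)
```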
The provided CSI/storage solutions, whether RWX or RWO, should also be highly available.
For SQL databases and Persistent Volumes, scheduled point-in-time backups should be taken and stored in durable, highly available storage, with backup frequency and retention decided by the customer.
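For Persistent Volumes, one way to take such backups is through the CSI snapshot API; below is a sketch that creates a VolumeSnapshot object with the Python Kubernetes client, assuming a snapshot-capable CSI driver and placeholder names for the VolumeSnapshotClass and PVC. A CronJob or external scheduler would run this at the frequency and retention the customer chooses.

```python
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Hypothetical PVC and VolumeSnapshotClass names.
snapshot = {
    "apiVersion": "snapshot.storage.k8s.io/v1",
    "kind": "VolumeSnapshot",
    "metadata": {"name": f"data-pvc-{datetime.now(timezone.utc):%Y%m%d%H%M%S}"},
    "spec": {
        "volumeSnapshotClassName": "csi-snapclass",
        "source": {"persistentVolumeClaimName": "data-pvc"},
    },
}

api.create_namespaced_custom_object(
    group="snapshot.storage.k8s.io",
    version="v1",
    namespace="default",
    plural="volumesnapshots",
    body=snapshot,
)
```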
Currently we use Redis for caching purposes only, so no backups are needed for it.