Although Kubernetes (and especially managed Kubernetes services such as GKE, EKS, and AKS) provides out-of-the-box reliability and resiliency through self-healing and horizontal scaling, production systems still require disaster recovery solutions to protect against human error (e.g. accidentally deleting a namespace or secret) and infrastructure failures outside of Kubernetes (e.g. persistent volume failures). While more companies are embracing multi-region deployments, these are complicated and potentially expensive if all you need is simple backup and restore. In this post, we’ll look at using Velero to back up and restore Kubernetes resources, and demonstrate its use as a disaster recovery or migration tool.

Are Backups Still Needed?

A key point that is often lost when running services in high availability (HA) mode is that HA (and thus replication) is not the same as having backups. HA protects against zonal failures, but it will not protect against data corruption or accidental deletion: a corrupted or deleted object is simply replicated in its corrupted or deleted state. It is very easy to mix up contexts or namespaces and accidentally delete or update the wrong Kubernetes resource, whether that is a Custom Resource Definition (CRD), a secret, or an entire namespace. Some may argue that with infrastructure-as-code tools like Terraform and external solutions to manage some of these Kubernetes resources (e.g. Vault for secrets, ChartMuseum for Helm charts), backups become unnecessary. Still, if you are running a StatefulSet in your cluster (e.g. an ELK stack for logging, or self-hosted Postgres to install plugins not supported on RDS or Cloud SQL), backups are needed to recover from persistent volume failures.

Velero

Velero (formerly known as Ark) is an open-source tool from Heptio (since acquired by VMware) for backing up and restoring Kubernetes cluster resources and persistent volumes. Velero runs inside the Kubernetes cluster and integrates with various storage providers (e.g. AWS S3, Google Cloud Storage, MinIO) as well as restic to take snapshots either on demand or on a schedule.
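To make "on demand or on a schedule" concrete, here is a minimal sketch that assembles the corresponding Velero CLI invocations from Python. The backup, schedule, and namespace names (logging-backup, nightly-logging, logging) are placeholders made up for illustration, and the 72h retention is an arbitrary choice. The helper only prints each command unless you explicitly ask it to execute, so it is safe to try without a cluster.

```python
import shlex
import shutil
import subprocess


def backup_cmd(name, namespaces):
    """Build the argv for a one-off (on-demand) backup of some namespaces."""
    return ["velero", "backup", "create", name,
            "--include-namespaces", ",".join(namespaces)]


def schedule_cmd(name, cron, namespaces, ttl="72h"):
    """Build the argv for a recurring backup driven by a cron expression.

    ttl is how long each backup is kept; 72h is an arbitrary example value.
    """
    return ["velero", "schedule", "create", name,
            "--schedule", cron,
            "--include-namespaces", ",".join(namespaces),
            "--ttl", ttl]


def restore_cmd(backup_name):
    """Build the argv for restoring from an existing backup."""
    return ["velero", "restore", "create", "--from-backup", backup_name]


def run(cmd, execute=False):
    """Print the command; run it only if asked and the velero CLI is on PATH."""
    print(shlex.join(cmd))
    if execute and shutil.which("velero"):
        subprocess.run(cmd, check=True)
    return cmd


# Placeholder names below are assumptions for illustration only.
run(backup_cmd("logging-backup", ["logging"]))
run(schedule_cmd("nightly-logging", "0 1 * * *", ["logging"]))
run(restore_cmd("logging-backup"))
```

Passing execute=True to run would hand the same arguments to the real CLI, which assumes Velero is already installed in the cluster and a backup storage location is configured.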

Installation

Velero can be installed via Helm or via the CLI tool. In general, the CLI seems to receive updates first, while the Helm chart lags slightly behind until it is updated for the matching container images. However, with each release, the Velero team does a great job of documenting how to patch the CRDs and switch to the new Velero container image, so upgrading a Helm-based installation to the latest version isn’t a huge concern.
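As a rough sketch of what the two installation routes look like, the snippet below builds the command lines for both. It assumes Velero 1.2 or later, where the storage provider plugin is passed separately via --plugins, and Helm 3.2 or later for --create-namespace. The bucket name, region, credentials path, and values file are hypothetical, and the plugin tag is deliberately left as <tag>; check the Velero documentation for the version that matches your release.

```python
import shlex


def install_cmd(provider, plugin_image, bucket, secret_file, region):
    """Build the argv for installing Velero with the CLI."""
    return [
        "velero", "install",
        "--provider", provider,
        "--plugins", plugin_image,
        "--bucket", bucket,
        "--secret-file", secret_file,
        "--backup-location-config", f"region={region}",
    ]


def helm_cmds(values_file, release="velero", namespace="velero"):
    """Build the argv lists for installing the vmware-tanzu/velero chart.

    values_file is a hypothetical values.yaml holding the provider,
    bucket, and credential settings that the chart expects.
    """
    return [
        ["helm", "repo", "add", "vmware-tanzu",
         "https://vmware-tanzu.github.io/helm-charts"],
        ["helm", "install", release, "vmware-tanzu/velero",
         "--namespace", namespace, "--create-namespace",
         "-f", values_file],
    ]


# All concrete values below are placeholders, not recommendations.
cli = install_cmd(
    provider="aws",
    plugin_image="velero/velero-plugin-for-aws:<tag>",
    bucket="my-velero-backups",
    secret_file="./credentials-velero",
    region="us-east-1",
)
print(shlex.join(cli))
for cmd in helm_cmds("values.yaml"):
    print(shlex.join(cmd))
```

Either route ends with a Velero deployment in the cluster; the printed commands are meant to be reviewed and adapted rather than pasted as-is.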
