Recover a Rancher Kubernetes Cluster from a Backup



Etcd is a highly available, distributed key-value store that provides a reliable way to store data across a cluster of machines. More importantly, it is used as Kubernetes' backing store for all of a cluster's data.

In this post, we discuss how to back up etcd and how to recover from a backup to restore operations to a Kubernetes cluster.

Etcd in Rancher 1.6

In Rancher 1.6, we use our own Docker image for etcd, which pulls the official etcd image and adds scripts and Go binaries for orchestration, backup, disaster recovery, and health checks.

The scripts communicate with Rancher's metadata service to get important information, such as how many etcd nodes are running in the cluster and which one is the leader. In Rancher 1.6, we introduced etcd backup, a service that runs alongside the main etcd process in the background and is responsible for backup operations.

The backup service performs rolling backups of etcd at specified intervals and also supports retention of old backups. Rancher-etcd is configured through three environment variables on the Docker image:

  • EMBEDDED_BACKUPS: boolean variable to enable/disable backup.

  • BACKUP_PERIOD: etcd will perform backups at this time interval.

  • BACKUP_RETENTION: etcd will retain backups for this time interval.

Backups are stored under /var/etcd/backups on the host and are created with the following command:

etcdctl backup --data-dir <dataDir> --backup-dir <backupDir>
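
For example, with one of the timestamped backup directories shown later in this post, the invocation might look like this (the data directory path is an assumption; in practice the image's scripts run this for you):

etcdctl backup --data-dir /var/etcd/data --backup-dir /var/etcd/backups/2018-04-09T15:23:54Z_etcd_1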

To configure the backup operations for etcd in Rancher 1.6, supply the environment variables mentioned above in the Kubernetes configuration template when launching the stack.
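
Purely as an illustration of how these variables reach the container (the template normally sets them for you; the image name and values here are assumptions), they map to ordinary Docker environment variables:

docker run -d \
  -e EMBEDDED_BACKUPS=true \
  -e BACKUP_PERIOD=15m \
  -e BACKUP_RETENTION=24h \
  rancher/etcd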

After configuring and launching Kubernetes, etcd should automatically take backups every 15 minutes by default.

Restoring backup

Recovering etcd from a backup in Rancher 1.6 requires the backup data to be placed in the volume created for etcd. For example, suppose you have 3 nodes and backups created in the /var/etcd/backups directory:

# ls /var/etcd/backups/ -l
total 44
drwx------ 3 root root 4096 Apr  9 15:03 2018-04-09T15:03:54Z_etcd_1
drwx------ 3 root root 4096 Apr  9 15:05 2018-04-09T15:05:54Z_etcd_1
drwx------ 3 root root 4096 Apr  9 15:07 2018-04-09T15:07:54Z_etcd_1
drwx------ 3 root root 4096 Apr  9 15:09 2018-04-09T15:09:54Z_etcd_1
drwx------ 3 root root 4096 Apr  9 15:11 2018-04-09T15:11:54Z_etcd_1
drwx------ 3 root root 4096 Apr  9 15:13 2018-04-09T15:13:54Z_etcd_1
drwx------ 3 root root 4096 Apr  9 15:15 2018-04-09T15:15:54Z_etcd_1
drwx------ 3 root root 4096 Apr  9 15:17 2018-04-09T15:17:54Z_etcd_1
drwx------ 3 root root 4096 Apr  9 15:19 2018-04-09T15:19:54Z_etcd_1
drwx------ 3 root root 4096 Apr  9 15:21 2018-04-09T15:21:54Z_etcd_1
drwx------ 3 root root 4096 Apr  9 15:23 2018-04-09T15:23:54Z_etcd_1

Then you should be able to restore operations to etcd. Start with one node only, so that a single etcd instance restores from the backup; the remaining etcd instances will then join the cluster. To begin the restoration, use the following steps:

# Choose the backup to restore from
target=2018-04-09T15:23:54Z_etcd_1
# Create the volume that the new etcd container will use
docker volume create --name etcd
# Start a throwaway container to populate the volume
docker run -d -v etcd:/data --name etcd-restore busybox
# Copy the chosen backup into the volume as the current data set
docker cp /var/etcd/backups/$target etcd-restore:/data/data.current
# Remove the throwaway container; the data stays in the volume
docker rm etcd-restore

The next step is to start Kubernetes on this node normally.

After that, you can add new hosts to the setup. Note that you must make sure the new hosts don't have pre-existing etcd volumes.

It's also preferable to have the etcd backup directory mounted on an NFS mount point, so that if the hosts go down for any reason, the backups created for etcd are not affected.
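
As a minimal sketch (the NFS server and export path are assumptions), mounting the backup directory on each host could look like this:

mount -t nfs nfs-server:/exports/etcd-backups /var/etcd/backups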

Etcd in Rancher 2.0

Rancher recently announced that Rancher 2.0 has reached GA and is ready for production deployments. Rancher 2.0 provides unified cluster management for different cloud providers, including GKE, AKS, and EKS, as well as providers that do not yet offer a managed Kubernetes service.

Starting with RKE v0.1.7, the user can enable regular automatic etcd snapshots. In addition, RKE lets the user restore etcd from a snapshot stored on the cluster instances.
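
As a minimal sketch of enabling recurring snapshots in cluster.yml, assuming the legacy RKE configuration schema (the interval and retention values are just examples):

services:
  etcd:
    snapshot: true     # enable recurring snapshots
    creation: 6h       # take a snapshot every 6 hours
    retention: 24h     # keep snapshots for 24 hours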

In this section, we explain how to back up and restore your Rancher installation on an RKE-managed cluster. The steps for this kind of Rancher installation are explained in more detail in the official documentation.

After Rancher Installation

After you install Rancher using RKE as explained in the documentation, you should see output similar to the following when you execute this command:

# kubectl get pods --all-namespaces
NAMESPACE       NAME                                    READY     STATUS    RESTARTS   AGE
cattle-system   cattle-859b6cdc6b-tns6g                 1/1       Running   0          19s
ingress-nginx   default-http-backend-564b9b6c5b-7wbkx   1/1       Running   0          25s
ingress-nginx   nginx-ingress-controller-shpn4          1/1       Running   0          25s
kube-system     canal-5xj2r                             3/3       Running   0          37s
kube-system     kube-dns-5ccb66df65-c72t9               3/3       Running   0          31s
kube-system     kube-dns-autoscaler-6c4b786f5-xtj26     1/1       Running   0          30s

You will notice that the cattle pod is up and running in the cattle-system namespace; this pod is the Rancher server, installed as a Kubernetes deployment.
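
For example, you can confirm this by listing the deployments in that namespace (the exact deployment name may differ between Rancher versions):

kubectl get deployments -n cattle-system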

RKE etcd Snapshots

RKE introduced two commands to save and restore etcd snapshots of a running RKE cluster:

rke etcd snapshot-save --config <config-path> --name <snapshot-name>

and

rke etcd snapshot-restore --config <config-path> --name <snapshot-name>

For more information about etcd snapshot save/restore in RKE, please refer to the official documentation.

First, we will take a snapshot of the etcd running on the cluster. To do that, let's run the following command:

# rke etcd snapshot-save --name rancher.snapshot --config cluster.yml
INFO[0000] Starting saving snapshot on etcd hosts       
INFO[0000] [dialer] Setup tunnel for host [x.x.x.x] 
INFO[0003] [etcd] Saving snapshot [rancher.snapshot] on host [x.x.x.x] 
INFO[0004] [etcd] Successfully started [etcd-snapshot-once] container on host [x.x.x.x] 
INFO[0010] Finished saving snapshot [rancher.snapshot] on all etcd hosts

RKE etcd Snapshot Restore

Assuming the Kubernetes cluster has failed for any reason, we can restore it from the snapshot we took (RKE stores snapshots under /opt/rke/etcd-snapshots on each etcd host), using the following command:

# rke etcd snapshot-restore --name rancher.snapshot --config cluster.yml

INFO[0000] Starting restoring snapshot on etcd hosts    
INFO[0000] [dialer] Setup tunnel for host [x.x.x.x] 
INFO[0001] [remove/etcd] Successfully removed container on host [x.x.x.x] 
INFO[0001] [hosts] Cleaning up host [x.x.x.x]      
INFO[0001] [hosts] Running cleaner container on host [x.x.x.x] 
INFO[0002] [kube-cleaner] Successfully started [kube-cleaner] container on host [x.x.x.x] 
INFO[0002] [hosts] Removing cleaner container on host [x.x.x.x] 
INFO[0003] [hosts] Successfully cleaned up host [x.x.x.x] 
INFO[0003] [etcd] Restoring [rancher.snapshot] snapshot on etcd host [x.x.x.x] 
INFO[0003] [etcd] Successfully started [etcd-restore] container on host [x.x.x.x] 
INFO[0004] [etcd] Building up etcd plane..              
INFO[0004] [etcd] Successfully started [etcd] container on host [x.x.x.x] 
INFO[0005] [etcd] Successfully started [rke-log-linker] container on host [x.x.x.x] 
INFO[0006] [remove/rke-log-linker] Successfully removed container on host [x.x.x.x] 
INFO[0006] [etcd] Successfully started etcd plane..     
INFO[0007] Finished restoring snapshot [rancher.snapshot] on all etcd hosts

Notes

There are some important notes about the etcd restore process in RKE:

1. Restarting Kubernetes components

After restoring the cluster, you have to restart the Kubernetes components on all nodes; otherwise, there will be conflicts with the resource versions of objects stored in etcd. This includes restarting the Kubernetes components as well as the network components. For more information, please refer to the Kubernetes documentation. To restart the components, you can run the following on each node:

# Restart the core Kubernetes components
docker restart kube-apiserver kubelet kube-controller-manager kube-scheduler kube-proxy
# Restart the network plugin containers (flannel and/or calico, depending on your setup)
docker ps | grep flannel | cut -f 1 -d " " | xargs docker restart
docker ps | grep calico | cut -f 1 -d " " | xargs docker restart

2. Restoring etcd on a multi-node cluster

If you are restoring etcd on a cluster with multiple etcd nodes, the exact same snapshot must be present in /opt/rke/etcd-snapshots on every node. rke etcd snapshot-save takes a different snapshot on each node, so you will need to copy one of the created snapshots manually to all nodes before restoring, as shown in the sketch below.
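
As a minimal sketch (the hostnames are assumptions), copying one node's snapshot to the remaining etcd nodes could look like this:

# Run on the node whose snapshot you want to keep
for host in etcd-node-2 etcd-node-3; do
  scp /opt/rke/etcd-snapshots/rancher.snapshot $host:/opt/rke/etcd-snapshots/
done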

3. Invalidated service account tokens

Restoring etcd on a new Kubernetes cluster with new certificates is not currently supported, because the new cluster will contain different private keys, which are used to sign the tokens for all service accounts. This can cause problems for any pod that communicates directly with the Kubernetes API server.

Conclusion

In this post, we saw how backups can be created and restored for etcd in Kubernetes clusters in both Rancher 1.6.x and 2.0.x. Etcd snapshots can be managed in 1.6 using Rancher's etcd image, and in 2.0 using the RKE CLI.

Hussein Galal
DevOps Engineer, Rancher