Docker-based build pipelines (Part 3) - Managing Production Environments


So far in this series we have looked at creating continuous integration pipelines with Jenkins and continuously deploying to integration environments. We have also looked at using Rancher Compose to run deployments, and at Route53 integration for basic DNS management. Today we will cover production deployment strategies and circle back to DNS management to show how to run multi-region and/or multi-data-center deployments with automatic fail-over. We will also look at some rudimentary auto-scaling so that we can respond to request surges automatically and scale back down when the request rate drops again. If you’d like to read this entire series, we’ve made an eBook, “Continuous Integration and Deployment with Docker and Rancher”, available for download.

[toc]

Deployment Strategies

One of the challenges of managing production environments is ensuring minimal or zero downtime during releases. Doing so predictably and safely takes quite a bit of work. Automation and quality assurance can go a long way toward making releases more predictable and safe. Even then, failures can and do happen, and for any good ops team the goal is to recover quickly while minimizing impact. In this section, we’ll cover a few strategies for running production deployments and the trade-offs of each.

In-place updates

The first strategy is called an in-place update; as the name suggests, the idea is to re-use the production environment and update the application in place. These are also sometimes referred to as rolling deployments. We’re going to work with the sample application (go-auth) we covered in Part 1 and Part 2, and we’ll assume you already have the service running in Rancher. To do an in-place update, you can use the upgrade command:

auth_version=${GO_AUTH_VERSION} rancher-compose --project-name go-auth \
 --url http://YOUR_RANCHER_SERVER:PORT/v1/   \
 --access-key <API_KEY>                      \
 --secret-key <SECRET_KEY>                   \
 --verbose up -d --force-upgrade --pull auth-service

Behind the scenes, the Rancher agent fetches the new image on each host running an auth-service container. It then stops the old containers and launches new containers in batches. You can control the size of each batch with the --batch-size flag, and you can specify a pause interval (--interval) between batches. A large enough interval allows you to verify that the new containers are behaving as expected and that the service as a whole is healthy. By default, old containers are stopped before new ones are launched in their place. Alternatively, you can tell Rancher to start the new containers before stopping the old ones by setting the start_first flag in your rancher-compose.yml.

auth-service:
  upgrade_strategy:
    start_first: true
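
For reference, a batched upgrade with an explicit batch size and pause interval looks roughly like the sketch below. The flag names and the interval units are taken from the rancher-compose help for the versions we used; check rancher-compose up --help for your installation before relying on them.

auth_version=${GO_AUTH_VERSION} rancher-compose --project-name go-auth \
 --url http://YOUR_RANCHER_SERVER:PORT/v1/   \
 --access-key <API_KEY>                      \
 --secret-key <SECRET_KEY>                   \
 --verbose up -d --force-upgrade --pull      \
 --batch-size 2 --interval 10000 auth-service  # upgrade 2 containers at a time, pausing between batches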

If you are not happy with the update and want to roll back, you can do so with the rollback flag. Alternatively, if you are satisfied, tell Rancher to complete the update by specifying the confirm-upgrade flag.

auth_version=${GO_AUTH_VERSION} rancher-compose --project-name go-auth \
 --url http://YOUR_RANCHER_SERVER:PORT/v1/   \
 --access-key <API_KEY>                      \
 --secret-key <SECRET_KEY>                   \
 --verbose up -d --[rollback|confirm-upgrade] auth-service  # confirm or roll back an upgrade

You can also perform these updates using the Rancher UI, by selecting “upgrade” from a service’s menu (shown below).

[Figure: in-place upgrade of a service in the Rancher UI]

In-place updates are simple to perform and don’t require the additional investment of managing multiple stacks. There are, however, downsides to this approach in production environments. First, it is typically difficult to have fine-grained control over rolling updates; they tend to be unpredictable under failure scenarios. For example, dealing with partial failures and rolling back a rolling update can get quite messy: you have to know which nodes were deployed to, which failed to deploy, and which are still running the previous revision. Second, you have to make sure all updates are not only backwards compatible but also forwards compatible, because old and new versions of your application run concurrently in the same environment. Last, depending on the use case, in-place updates might not be practical at all, for example if legacy clients need to keep using the old environment while newer clients move forward. In that case, separating client requests is much easier with some of the other approaches described below.

Blue-Green Deployments

A common problem with in-place updates is the lack of predictability. To overcome that, another deployment strategy is to run two parallel stacks for an application: one active and one on standby. To run a new release, the latest version of the application is deployed to the standby stack. Once the new version is verified to be working, traffic is cut over from the active stack to the standby stack; the previously active stack then becomes the standby, and vice versa. This strategy allows for verification of deployed code, fast rollbacks (switching active and standby back again) and, if needed, extended concurrent operation of both stacks. It is commonly referred to as blue-green deployment. To run such deployments with Rancher for our sample application, we simply create two stacks: go-auth-blue and go-auth-green. We’ll assume that the database is not part of these stacks and is managed independently; each stack then just runs the auth-service and auth-lb services. Assuming the go-auth-green stack is live, to perform an update all we need to do is deploy the latest version to the blue stack, validate it, and switch traffic over to it.
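
As a sketch (assuming the same docker-compose.yml and rancher-compose.yml we have been using throughout this series), deploying the new version to the standby blue stack is just a matter of pointing the same rancher-compose command at a different project name:

auth_version=${GO_AUTH_VERSION} rancher-compose --project-name go-auth-blue \
 --url http://YOUR_RANCHER_SERVER:PORT/v1/   \
 --access-key <API_KEY>                      \
 --secret-key <SECRET_KEY>                   \
 --verbose up -d --force-upgrade --pull auth-service

Once validation passes, traffic is switched over as described in the next section.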

[Figure: blue-green deployment with two parallel stacks]

Traffic Switch

There are two main options for switching traffic: updating DNS records to point to the new stack, or placing a proxy or load-balancer in front and re-pointing it to the active stack. We cover both options in detail below.

DNS record update

A simple approach is to update a DNS record to point to the active stack. One advantage of this approach is that we can use weighted DNS records (covered in detail later) to transition traffic over to the new version gradually. This is also a simple way to do canary releases, which are quite useful for safely phasing in new updates on live environments or for running A/B tests. For example, we can deploy an experimental feature to its own feature stack (or to the inactive stack) and then update DNS to forward only a small fraction of traffic to the new version. If there is an issue with the update, we can reverse the DNS changes to roll back. This is also much safer than a hard cut-over in which all traffic switches from one stack to another at once, potentially overwhelming the new stack. Although simple, a DNS record update is not the cleanest approach if you want all traffic to switch over at once: depending on DNS clients and caching, the change can take a long time to propagate, resulting in a long tail of traffic hitting the old version instead of a clean switch-over.
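
As a rough sketch, the weighted split can also be driven from the AWS CLI. The hosted-zone ID, domain names and weights below are placeholders for illustration; in our setup the per-stack hostnames would be the records that Rancher’s Route53 integration creates for each stack’s load-balancer.

# send ~10% of traffic to the new (blue) stack as a canary; raise the weight as confidence grows
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1EXAMPLE \
  --change-batch '{
    "Changes": [
      {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "go-auth.example.com", "Type": "CNAME", "TTL": 60,
        "SetIdentifier": "green", "Weight": 90,
        "ResourceRecords": [{"Value": "go-auth-green-lb.example.com"}]}},
      {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "go-auth.example.com", "Type": "CNAME", "TTL": 60,
        "SetIdentifier": "blue", "Weight": 10,
        "ResourceRecords": [{"Value": "go-auth-blue-lb.example.com"}]}}
    ]}'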

Using a reverse proxy

Using a proxy or load-balancer and simply re-pointing it to the new stack is a cleaner way of switching all traffic at once. This approach is useful in various scenarios, e.g., non-backwards-compatible updates. To do this with Rancher, we first need to create another stack which contains only a load-balancer.

[Figure: configuring the external load-balancer stack]

Next, we specify a port for the load-balancer, configure SSL, and pick the load-balancer of the active stack as the target service from the drop-down menu. Essentially we are load-balancing to a load-balancer, which in turn routes traffic to the actual service containers. With this external load-balancer in place, you don’t need to update DNS records for each release; instead, you simply re-point the external load-balancer at the newly updated stack.

[Figure: external load-balancer routing to the active stack]
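
Although we created this external load-balancer through the UI, conceptually the stack boils down to something like the compose sketch below. This is only an illustration: the rancher/load-balancer-service image and the cross-stack external_links syntax reflect how Rancher load-balancers were commonly expressed in compose files at the time, and the stack/service names (go-auth-green/auth-lb) are the ones used in this series.

external-lb:
  image: rancher/load-balancer-service
  ports:
    - 443:9000          # public HTTPS port -> port exposed by the stack-level load-balancer
  external_links:
    # point at the load-balancer of whichever stack is currently live;
    # re-pointing this target is what performs the blue-green cut-over
    - go-auth-green/auth-lb:auth-lb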

Multi-Region, Multi-Cloud deployments

Now that we are running production deployments, we need to consider availability. For example, Amazon’s SLA commits to 99.95% uptime for each region before Amazon incurs penalties. This is in the three-nines availability bracket and is normally considered a minimum for large-scale customer-facing services; for more critical services, five nines of uptime is a more appropriate target. To get there we need to place resources in multiple Amazon availability zones and regions. We can also use Rancher to manage redundancy across cloud providers; however, that level of redundancy is not required for most deployments.

Launching Tagged Instances

The first step in making multi-region or multi-cloud deployments is to launch Rancher compute nodes in multiple AWS regions. You may also launch compute nodes on other cloud providers such as DigitalOcean. Use labels to tag each host with its provider and region. The figure below shows a three-node cluster with one node in AWS US East, one in AWS US West and one in DigitalOcean.

[Figure: a three-node cluster spanning AWS US East, AWS US West and DigitalOcean]

Cross-Region/Cloud Security

Note that you have to configure your Rancher server security group to allow access from the compute nodes in remote regions. This needs to be done using CIDR blocks, as security-group-based whitelists do not work across regions. If you would like a more secure setup, you will need to use VPC peering connections and Direct Connect or VPNs to connect the regions. For the purposes of this article, however, we are using fairly permissive security group rules and relying on Rancher’s own security instead.
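
A minimal sketch of opening the server’s security group to a remote region’s compute nodes with the AWS CLI is shown below. The group ID, port and CIDR block are placeholders; open whichever ports your Rancher server and cross-host network actually use.

# allow remote compute nodes to reach the Rancher server API (8080 by default)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 8080 \
  --cidr 203.0.113.0/24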

Update Compose Templates

Now that we have compute nodes in two Amazon regions as well as DigitalOcean, we will update our docker-compose.yml and rancher-compose.yml to launch services in the various regions. In docker-compose.yml, find the auth-service entry, copy its contents two more times, and name the three entries as shown below. Similarly, find the auth-lb entry, copy it two more times, and rename the copies to reflect region and provider.

auth-service-aws-east:
  tty: true
  command:
  ....
auth-service-aws-west:
  tty: true
  command:
  ....
auth-service-digitalocean:
  tty: true
  command:
  ....

In addition, add the io.rancher.scheduler.global label to all three auth-service definitions as well as the three load-balancer definitions. This ensures that an instance of the container runs on every host matching the filter defined in io.rancher.scheduler.affinity. In the affinity label, define where you want the service or load-balancer instances to run; for example, the entry below shows the affinity for containers running in AWS us-west. With this setup we ensure that we have at least one auth-service instance and one load-balancer instance in each of our two AWS regions and one in DigitalOcean.

labels:
    io.rancher.scheduler.global: 'true'
    io.rancher.scheduler.affinity:host_label: provider=aws,region=us-west
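
Putting the two pieces together, each regional entry in docker-compose.yml ends up looking like this (only the us-west service is shown; the elided lines are the same command and image settings as before):

auth-service-aws-west:
  tty: true
  command:
  ....
  labels:
    io.rancher.scheduler.global: 'true'
    io.rancher.scheduler.affinity:host_label: provider=aws,region=us-west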


Setup DNS

Now that we have all our containers defined, we can set up DNS to route traffic and fail over automatically. To start, follow the instructions here to set up Route53 integration. Note that we will get three separate DNS records, one for each of the load-balancer services, and each record may contain multiple IPs depending on how many containers are running in that service. Having DNS entries that aggregate the containers in each region is useful, but to be truly cross-region you must present a single domain to external clients and have the server side route traffic efficiently. Route53 offers several ways of setting up this routing, namely weighted, geo-location and latency-based routing. There are pros and cons to each approach, which we discuss next.


Weighted Routing

Weighted routing allows you to specify the portion of traffic that goes to each region and to fail over automatically if one of the regions goes down. This is a good way to control how much traffic goes to each region or cloud; for example, we may want to keep the majority of our traffic on AWS servers if they tend to be more performant. Note that this strategy can also be used for traffic shifting during blue-green deployments. The percentage of traffic going to each load-balancer will be more or less stable. The downside is that this strategy does not take into account where the traffic originates: the configured percentage of traffic will go to us-west regardless of whether the request comes from New York or Seattle.

To use a weighted routing policy, browse to AWS Console > Route 53 > Hosted Zones and select your hosted zone, then click Create Record Set. In the panel on the right, choose a name for your sub-domain (e.g. go-auth-prod) and select type A, since Rancher creates A records for each load-balancer. Select yes for Alias and pick one of the auto-created DNS entries as the alias target; for example, we have selected the us-east load-balancer as our target. Select Weighted as the routing policy, give the route a weight, and enter 1 as the Set ID. Repeat the same process for the other two load-balancers, making sure to use the same Name but a unique Set ID for each of the three record sets. Traffic will now be split based on the relative weight of each route; if you use a weight of 33 for each route, all three regions will receive a third of the traffic.

[Figure: weighted routing policy configuration in Route53]

Latency Based Routing

With latency-based routing, traffic is sent to the AWS data center with the lowest latency from the client making the DNS query. This addresses the drawback of weighted routing: clients from New York are sent to US East, whereas clients from Seattle are sent to US West. The drawback of this approach is that you cannot balance traffic across regions. For example, if 90% of your users are in New York, then 90% of your traffic will go to US East and your US West deployment will sit mostly idle. This may be fine for most use cases, but it is unsuitable if you want to keep deployments of roughly equal size. Also, since peak utilization is strongly correlated with time of day, the traffic to each region will vary more throughout the day because of time-zone differences.

To set up latency-based routing, follow the same procedure as for weighted routing but select Latency as the routing policy. Instead of a weight, you will now be asked to specify an AWS region for each of the three route entries. For US East and US West select the respective regions; for DigitalOcean, select the AWS region closest to your DigitalOcean data center. For example, if you launched nodes in DigitalOcean’s NYC data center, you should select US East as the region for latency-based routing to DigitalOcean. Traffic from the East Coast will then be split evenly between your load-balancer in AWS US East and the one in DigitalOcean.


Geo-location Based Routing

Lastly, you can use geo-location-based routing to specify explicitly where traffic originating in each location should go. In practice this is very similar to latency-based routing; however, you explicitly map source locations to targets rather than relying on latency. This allows you, for example, to send both Portugal and Brazil to the same target if that target hosts your Portuguese-localized servers. The downside of this approach is that you may increase latency by routing to a far-away cluster, and the granularity is continent and country only, while some countries are very large.

For geo-location-based routing, follow a similar process of creating record sets but select Geolocation as the routing policy. In this case you will have to add one record set for each country or continent whose traffic you want to route to a specific data center. You should also specify a record set for the default location to avoid having to explicitly add entries for every possible country and region.


DNS Health Check

If you would like automatic fail-over for your routes (regardless of routing policy), you have to create DNS health checks. To do so, go to AWS Console > Route 53 > Health Checks and click Create Health Check. In the configure-health-check screen, enter a name for your health check, select endpoint monitoring, and choose domain name as the endpoint type. The protocol should be HTTP (or HTTPS if you set it up), and the domain name should be one of the records created by Rancher that we used as alias targets earlier. Specify 9000 as the port and health as the path (the go-auth service exposes a health-check endpoint at /health). Create one health check for each of the three load-balancers you created in Rancher.

[Figure: Route53 health check configuration]
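
The same health check can be created from the AWS CLI; a sketch is shown below, where the domain name is a placeholder standing in for one of the Rancher-created records.

aws route53 create-health-check \
  --caller-reference go-auth-us-east-1 \
  --health-check-config '{
    "Type": "HTTP",
    "FullyQualifiedDomainName": "go-auth-aws-east.example.com",
    "Port": 9000,
    "ResourcePath": "/health",
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'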

Now that you have the health checks created, go back to your hosted zone. For each of the aggregate record sets you created (i.e. the ones using weighted/latency/geolocation routing), select yes for the Associate with Health Check setting and pick the relevant health check. This means that if the health checks for all containers in a given region fail, Route53 will automatically take that route out of rotation and switch traffic to the remaining routes. This lets you react to outages in any one region or cloud provider without human intervention; such automatic fail-over is essential for highly available services.

Building Auto-scaling Arrays

One of the significant differences between a production environment and the testing environments we covered earlier is that production load is variable and unpredictable. One of the major benefits of cloud-based, containerized deployments is that they minimize the overhead, both financial and technical, of dealing with this variability. Auto-scaling arrays are an important part of realizing these goals: they scale out in response to traffic spikes without human intervention, and they save money by scaling resources back down as you move through your daily traffic peaks and troughs. We will use Amazon’s auto-scaling arrays together with Rancher to scale our service stacks.

Creating a launch configuration

The first step in creating your auto-scaling array is to create a launch configuration. To do so, go to AWS Console > EC2 > Launch Configurations and select Create Launch Configuration. Follow the on-screen instructions to create a launch configuration using an AMI of your choice (we normally use Amazon’s stock Linux AMI). When you get to step 3, Configure Details, expand Advanced Details and, in the user data section, enter the script shown below. It uses cloud-init to install Docker and register the instance as a Rancher compute node. Note that we tag the compute node with the name of the service via a host label. The registration URL needed for the docker run command can be retrieved from http://[RANCHER_SERVER]:[RANCHER_PORT]/infra/hosts/add/custom; it is specific to your server and environment.

#!/bin/bash

# Update all packages to pull in latest security updates etc
yum update -y

# Install docker
yum install docker -y
service docker restart

# Start Rancher Compute node
docker run \
    -e CATTLE_HOST_LABELS='service=[SERVICE_NAME]' \
    -d --privileged                                \
    -v /var/run/docker.sock:/var/run/docker.sock   \
    rancher/agent:v0.8.2                           \
    http://[RANCHER_SERVER]:[RANCHER_PORT]/v1/scripts/EFA4EAD....
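
If you prefer to script this step, the equivalent AWS CLI call looks roughly like the following, assuming the cloud-init script above has been saved to user-data.sh; the launch configuration name, instance type, key pair and security group are placeholders.

aws autoscaling create-launch-configuration \
  --launch-configuration-name go-auth-compute \
  --image-id ami-60b6c60a \
  --instance-type t2.medium \
  --key-name my-keypair \
  --security-groups sg-0123456789abcdef0 \
  --user-data file://user-data.sh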

Creating an Auto-scaling Group

Now that you have your launch configuration, you can create the auto-scaling group itself. Browse to AWS Console > EC2 > Auto Scaling Groups and select Create Auto Scaling Group. On the first screen, select the launch configuration you created in the previous section. In step 2, select Use scaling policies to adjust the capacity of this group and specify the scaling range in the “scale between” fields. Note that it is probably best practice to estimate your maximum need and double it for the maximum scale value.

In the Increase Group Size section, select Add new alarm. In the Create Alarm screen, uncheck the send-notification box and specify a scale-up rule; for example, we scale up when the average CPU utilization of the array is higher than 70% for five minutes. Similarly, select Add new alarm in the Decrease Group Size section and specify an alarm for when average CPU utilization is lower than 10% for five minutes.

[Figure: CloudWatch alarm configuration for the scaling policy]

In the Take Action field of the Increase Group Size section choose Add 1 instance, and similarly in the Decrease Group Size section choose Remove 1 instance. Follow the remaining steps and your auto-scaling array is ready to go. We have used CPU-based alarms here, but you can use any CloudWatch metric you like to create alarms (and therefore scaling policies).
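
The same setup can be scripted. The sketch below creates the group, a simple scale-out policy and the CloudWatch alarm that triggers it; the names, subnets and thresholds are placeholders, and a mirror-image policy and alarm would be needed for scaling back in.

# create the auto-scaling group from the launch configuration
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name go-auth-asg \
  --launch-configuration-name go-auth-compute \
  --min-size 2 --max-size 10 \
  --vpc-zone-identifier subnet-aaaa1111,subnet-bbbb2222

# add one instance whenever the scale-out alarm fires
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name go-auth-asg \
  --policy-name go-auth-scale-out \
  --adjustment-type ChangeInCapacity \
  --scaling-adjustment 1

# alarm on average CPU > 70% for five minutes; use the PolicyARN returned above as the action
aws cloudwatch put-metric-alarm \
  --alarm-name go-auth-high-cpu \
  --namespace AWS/EC2 --metric-name CPUUtilization \
  --dimensions Name=AutoScalingGroupName,Value=go-auth-asg \
  --statistic Average --comparison-operator GreaterThanThreshold \
  --threshold 70 --period 300 --evaluation-periods 1 \
  --alarm-actions <POLICY_ARN>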

At this point our auto-scaling array is ready to scale out when CPU utilization gets high. When new instances come up, they register themselves with Rancher and become available for container launches; however, launching containers on these hosts is still a manual process. We can automate it with a small modification to the docker-compose.yml we defined for our service in an earlier article: add the following two labels to the auth-service entry. They enable global scheduling of containers for this service and launch a container of the service on every host whose service host label matches the service name we specified earlier in the launch configuration’s CATTLE_HOST_LABELS.

auth-service:
  labels:
    io.rancher.scheduler.global: 'true'
    io.rancher.scheduler.affinity:host_label: service=[SERVICE_NAME]

With this change, every new auto-scaled host will automatically run an instance of your service container. This allows you to scale out to your daily traffic peak and release resources when you scale back down, all without human intervention. A feature soon to be released by Rancher will allow you to run multiple service instances on each auto-scaled host and thus utilize host resources better. This will further reduce the cost of running our servers, as well as give us more redundancy in case of container failures.

Use Custom AMI

One optimization you can apply when using auto-scaling arrays is to build a custom AMI in which you apply the latest security updates, install Docker, and pre-pull the required Rancher agent container image. This shaves precious seconds off launch time when you need to scale out a large number of instances quickly. In addition, it makes your scale-out operations independent of Docker Hub and package-manager (yum/apt) repositories, which can be critical if, for example, you need to launch instances while Docker Hub is down for maintenance. The easiest way to do this is to launch an instance from an AMI of your choice, SSH into it, and run the required commands. Once you have done so, you can right-click the instance in the Amazon console and select Create Image. You can then use the resulting image as the AMI in your launch configuration.

As an example, if you used the Amazon Linux AMI (ami-60b6c60a in the us-east region), you can use the following commands to prepare your AMI:

yum update -y
yum install docker -y
service docker restart
docker run -d --privileged                            \
    -v /var/run/docker.sock:/var/run/docker.sock      \
    rancher/agent:v0.8.2                              \
    http://[RANCHER_SERVER]/v1/scripts/EFA4EAD......

# Stop and remove the container instance (but not the image)
docker stop $(docker ps -a -q)
docker rm $(docker ps -a -q)
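
Creating the image itself can also be done from the CLI instead of the console; the instance ID and image name below are placeholders.

aws ec2 create-image \
  --instance-id i-0abcdef1234567890 \
  --name rancher-compute-base \
  --description "Amazon Linux with Docker and the Rancher agent image pre-pulled"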

Today we looked at running production deployments safely with zero downtime. We also looked at using DNS to support multi-region and even multi-cloud deployments with automated fail-over. Lastly, we looked at using global scheduling in Rancher in conjunction with Amazon auto-scaling arrays to build services that are elastic to incoming load. This is by no means an exhaustive list of considerations for a production environment, as each production environment is a unique snowflake; however, any large-scale production deployment needs to take these factors into account. In subsequent articles we will look at further considerations that are important for running and maintaining Dockerized production deployments, especially for stateful workloads such as databases. To get started with Rancher, join the beta and start building your container service, or download the eBook on building a CI/CD pipeline with Rancher and Docker.

Usman and Bilal are server and infrastructure engineers with experience in building large-scale distributed services on top of various cloud platforms. You can read more of their work at techtraits.com, or follow them on Twitter at @usman_ismail and @mbsheikh respectively.
