Kubernetes, Mesos, and Swarm: Comparing the Rancher Orchestration Engine Options



Note: You can find an updated comparison of Kubernetes vs. Docker Swarm in a recent blog post here.

Recent versions of Rancher have added support for several common orchestration engines in addition to the standard Cattle. The three newly supported engines, Swarm (soon to be Docker Native Orchestration), Kubernetes and Mesos, are the most widely used orchestration systems in the Docker community and provide a gradient of usability versus feature sets. Although Docker is the de facto standard for containerization, there are no clear winners in the orchestration space. In this article, we go over the features and characteristics of the three systems and make recommendations for the use cases where each may be suitable.

Docker Native Orchestration is fairly bare bones at the moment but is getting new features at a rapid clip. Since it is part of the official Docker system, it will be the default choice for many developers and hence will likely have good tooling and community support. Kubernetes is among the most widely used container orchestration systems today and has the support of Google. Lastly, Mesos with Marathon (the open-source orchestration framework from Mesosphere) takes a much more compartmentalized approach to service management, where many features are left to independent plug-ins and applications. This makes it easier to customize a deployment, as individual parts can be swapped out or customized. However, this also means more tinkering is required to get a working setup. Kubernetes is more opinionated about how to build clusters and ships with integrated systems for many common use cases.

Docker Native Orchestration

Basic Architecture

Docker Engine 1.12 shipped with Native Orchestration, which is a replacement for standalone Docker Swarm. The Docker native cluster (swarm) consists of a set of nodes (Docker Engines/daemons) which can be either managers or workers. Workers run the containers you launch, while managers maintain cluster state. You can have multiple managers for high availability, but no more than seven are recommended. The managers maintain consensus using an internal implementation of the Raft algorithm. As with all consensus algorithms, having more managers has performance implications. The fact that managers maintain consensus internally means that there are no external dependencies for Docker Native Orchestration, which makes cluster management much easier.


Usability

Docker native uses concepts from single-node Docker and extends them to the Swarm. If you are up to date on Docker concepts, the learning curve is fairly gradual. The setup for a swarm is trivial once you have Docker running on the various nodes you want to add to your swarm: you just call docker swarm init on one node and docker swarm join on any other nodes you want to add. You can use the same Docker Compose templates and the same Docker CLI command set as with standalone Docker.
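For example, a minimal sketch of that bootstrap (the manager IP and join token below are placeholders; the actual token is printed by the init command):

# On the first node: initialize the swarm and make this node a manager
docker swarm init --advertise-addr <MANAGER-IP>

# On every other node: join the swarm using the token printed by init
docker swarm join --token <WORKER-TOKEN> <MANAGER-IP>:2377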

Feature Set

Docker Native Orchestration uses the same primitives as Docker Engine and Docker Compose to support orchestration. You can still link services, create volumes and expose ports. All of these operations apply on a single node. In addition to these, there are two new concepts: services and networks.

A Docker service is a set of containers that are launched on your nodes, with a certain number of containers kept running at all times. If one of the containers dies, it is replaced automatically. There are two types of services: replicated and global. Replicated services maintain a specified number of containers across the cluster, whereas global services run one instance of a container on each of your swarm nodes. To create a replicated service, use the command shown below.

docker service create      \
   --name frontend         \
   --replicas 5            \
   --network my-network    \
   -p 80:80/tcp nginx:latest

You can create named overlay networks using docker network create --driver overlay NETWORK_NAME. Using the named overlay network you can create isolated, flat, encrypted virtual networks across your set of nodes to launch your containers into.
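For instance, the my-network overlay referenced in the service commands in this section could be created as follows; the --opt encrypted flag turns on encryption of overlay traffic between nodes:

# Create an encrypted overlay network spanning the swarm
docker network create --driver overlay --opt encrypted my-network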

You can use constraints and labels to do some very basic scheduling of containers. Using constraints you can add an affinity to a service and it will try to launch containers only on nodes which have the specified labels.

docker service create                       \
   --name frontend                          \
   --replicas 5                             \
   --network my-network                     \
   --constraint engine.labels.cloud==aws    \
   --constraint node.role==manager          \
   -p 80:80/tcp nginx:latest
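How the labels referenced by these constraints get onto the nodes is a separate step. As a hedged sketch (the node name and label values here are arbitrary): node labels are added through the swarm API and referenced as node.labels.<key>, while engine labels are set on the Docker daemon itself and referenced as engine.labels.<key>.

# Add a label to a swarm node (referenced in constraints as node.labels.zone)
docker node update --label-add zone=us-east-1 node-1

# Engine labels are configured on each daemon, e.g. in /etc/docker/daemon.json:
# { "labels": ["cloud=aws"] }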

Furthermore, you can use the --reserve-cpu and --reserve-memory flags to define the resources consumed by each container of the service, so that when multiple services are launched on a swarm, the containers can be placed to minimize resource contention.
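For instance, a hedged sketch of such a reservation (the values here are arbitrary):

docker service create        \
   --name frontend           \
   --replicas 5              \
   --reserve-cpu 0.5         \
   --reserve-memory 256MB    \
   -p 80:80/tcp nginx:latest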

You can do rudimentary rolling deployments using the command below. This will update the container image for the service, but will do so two containers at a time with a 10 second interval between each set of two. However, health checks and automatic rollbacks are not supported.

docker service update            \
   --image nginx:other-version   \
   --update-delay 10s            \
   --update-parallelism 2        \
   frontend

Docker supports persistent external volumes using volume drivers, and Native Orchestration extends these with the --mount option to the service create command. Adding the following snippet to the command above will mount an NFS share into your container. Note that this requires NFS to be set up on your underlying host, external to Docker; some of the other drivers, such as those adding support for Amazon EBS or Google Container Engine volumes, are able to work without host support. Also, this feature is not yet well documented and may require a bit of testing (and filing GitHub issues on the Docker project) to get working.

    --mount type=volume,src=/path/on/host,volume-driver=local,\
    dst=/path/in/container,volume-opt=type=nfs,\
    volume-opt=device=192.168.1.1:/your/nfs/path

Kubernetes

Basic Architecture

Conceptually, Kubernetes is somewhat similar to Swarm in that it uses a manager (master) node with Raft for consensus. However, that is where the similarities end: Kubernetes uses an external etcd cluster for this purpose. In addition, you will need a network layer external to Kubernetes; this can be an overlay network such as flannel or weave. With these external tools in place, you can launch the Kubernetes master components: the API Server, Controller Manager and Scheduler. These normally run as Kubernetes pods on the master node. In addition, every node needs to run the kubelet and kube-proxy; worker nodes run only these, along with a network layer provider such as flanneld if needed.

In this setup, the kubelet controls the containers (or pods) on a given node in conjunction with the Controller Manager on the master. The Scheduler on the master takes care of resource allocation and balancing, and will help place containers on the worker node with the most available resources. The API Server is where your local kubectl CLI issues commands to the cluster. Lastly, kube-proxy is used to provide load balancing and high availability for services defined in Kubernetes.

Usability

Setting up Kubernetes from scratch is a non-trivial endeavor as it requires setting up etcd, networking plugins, DNS servers and certificate authorities. Details of setting up Kubernetes from scratch are available here but luckily Rancher does all of this setup for us. We have covered how to setup a Kubernetes cluster in an earlier article.

Beyond initial setup, Kubernetes still has a somewhat steep learning curve, as it uses its own terminology and concepts. Kubernetes uses resource types such as Pods, Deployments, Replication Controllers, Services, Daemon Sets and so on to define deployments. These concepts are not part of the Docker lexicon, and hence you will need to get familiar with them before you start creating your first deployment. In addition, some of the nomenclature conflicts with Docker. For example, Kubernetes services are not Docker services and are also conceptually different (Docker services map more closely to Deployments in the Kubernetes world). Furthermore, you interact with the cluster using kubectl instead of the Docker CLI, and you must use Kubernetes configuration files instead of Docker Compose files.

The fact that Kubernetes has such a detailed set of concepts independent of core Docker is not in itself a bad thing; Kubernetes offers a much richer feature set than core Docker. However, as Docker adds more features to compete with Kubernetes, the two will end up with divergent implementations and divergent or conflicting concepts. This will almost surely repeat the CoreOS/rkt situation, with large portions of the community working on similar but competing solutions. Today, Docker Swarm and Kubernetes target very different use cases (Kubernetes is much more suitable for large production deployments of service-oriented architectures with dedicated cluster-management teams); however, as Docker Native Orchestration matures, it will move into this space.

Feature Set

The full feature set of Kubernetes is much too large to cover in this article, but we will go over some basic concepts and some interesting differentiators. Firstly, Kubernetes uses the concept of Pods as its basic unit of scaling instead of single containers. Each pod is a set of containers (possibly just one) which are always launched on the same node, share the same volumes and are assigned a virtual IP (VIP) so they can be addressed within the cluster. A Kubernetes spec file for a single pod may look like the following.

apiVersion: v1
kind: Pod
metadata:
  name: mywebservice
spec:
  containers:
  - name: web-1-10
    image: nginx:1.10
    ports:
    - containerPort: 80

Next you have deployments; these loosely map to what services are in Docker Native Orchestration. You can scale a deployment much like a service in Docker Native, and the deployment will ensure the requisite number of containers is running. It is important to note that deployments are only analogous to replicated services in Docker Native; Kubernetes uses the Daemon Set concept to support its equivalent of globally scheduled services. Deployments also support health checks, which use HTTP or TCP reachability or custom exec commands to determine if a container/pod is healthy, as well as rolling deployments with automatic rollback, using the health check to determine whether each pod deployment is successful.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: mywebservice-deployment
spec:
  replicas: 2 # We want two pods for this deployment
  template:
    metadata:
      labels:
        app: mywebservice
    spec:
      containers:
      - name: web-1-10
        image: nginx:1.10
        ports:
        - containerPort: 80
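The health checks mentioned above are expressed as probes on the pod template. A minimal sketch, assuming the nginx container answers HTTP on / at port 80 (the timing values are arbitrary), adds the following to the container entry in the deployment above:

        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10

If a rollout goes bad, it can also be reverted with kubectl rollout undo deployment/mywebservice-deployment.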

Next you have Kubernetes Services, which provide simple load balancing for a deployment. All pods in a deployment are registered with a service as they come and go. Services also abstract away multiple deployments: if you want to run a rolling deployment, you can register two Kubernetes deployments with the same service, then gradually add pods to one while reducing pods in the other. You can even do blue-green deployments, where you point the service at a new Kubernetes deployment in one go. Lastly, services are also useful for service discovery within your Kubernetes cluster: all services in the cluster get a VIP and are exposed to all pods in the cluster as Docker link-style environment variables, as well as through the integrated DNS server.
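As a hedged illustration, a Service fronting the pods from the deployment above might look like the following (the name and port are assumptions):

apiVersion: v1
kind: Service
metadata:
  name: mywebservice
spec:
  selector:
    app: mywebservice   # matches the label on the deployment's pod template
  ports:
  - port: 80            # VIP port exposed inside the cluster
    targetPort: 80      # container port to forward to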

In addition to basic services, Kubernetes supports Jobs, Scheduled Jobs and Pet Sets. Jobs create one or more pods and wait until they terminate; a job makes sure that the specified number of pods terminate successfully. For example, you may start a job to process one hour of business intelligence data for each hour of the last day. You would launch a job with 24 pods for the previous day, and once they all run to completion, the job is done. A scheduled job, as the name suggests, is a job that is automatically run on a given schedule. In our example, we would probably make our BI processor a daily scheduled job. Jobs are great for issuing batch-style workloads to your cluster: tasks that do not need to be always-on services, but instead need to run to completion and then be cleaned up.
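As a rough sketch of the BI example above (the image name is hypothetical), a Job that runs 24 pods to completion, a few at a time, might look like this:

apiVersion: batch/v1
kind: Job
metadata:
  name: bi-daily-processor
spec:
  completions: 24      # 24 pods must terminate successfully, one per hour of data
  parallelism: 4       # run at most 4 pods at a time
  template:
    spec:
      containers:
      - name: bi-processor
        image: example/bi-processor:latest  # hypothetical image
      restartPolicy: Never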

Another extension that Kubernetes provides to basic services is Pet Sets. Pet Sets support stateful service workloads that are normally very difficult to containerize. This includes databases and real-time connected applications. Pet Sets provide stable hostnames for each "pet" in the set. Pets are indexed; for example, pet5 will be addressable independently of pet3, and if the third pet container/pod dies, it will be relaunched on a new host with the same index and hostname.

Pet Sets also provide stable storage using persistent volumes, i.e., if pet1 dies and is relaunched on another node, it will get its volumes remounted with the original data. Furthermore, you can also use NFS or other network file systems to share volumes between containers, even if they are launched on different hosts. This addresses one of the most problematic issues when transitioning from single-host to distributed Docker environments.

Pet Sets also provide peer discovery. With normal services you can discover other services (through Docker linking, etc.); however, discovering other containers within a service is not possible. This makes gossip-protocol-based services such as Cassandra and Zookeeper very difficult to launch.

Lastly, Pet Sets provide startup and tear-down ordering, which is essential for persistent, scalable services such as Cassandra. Cassandra relies on a set of seed nodes, and when you scale your service up and down you must ensure the seed nodes are the first ones to be launched and the last to be torn down. At the time of writing, Pet Sets are one of the big differentiators for Kubernetes, as persistent stateful workloads are almost impossible to run at production scale on Docker without this support.
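As a rough, hedged sketch of what a Pet Set definition looked like with the alpha API at the time (the names, image and storage size here are assumptions), stable identity plus per-pet storage is declared roughly like this:

apiVersion: apps/v1alpha1
kind: PetSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra      # headless service that provides the stable hostnames
  replicas: 3
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: cassandra:3.9
        volumeMounts:
        - name: data
          mountPath: /var/lib/cassandra
  volumeClaimTemplates:       # each pet gets its own persistent volume claim
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi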

Kubernetes also provides namespaces to isolate workloads on a cluster, secrets management and auto-scaling support. All these features and more mean that Kubernetes is able to support large, diverse workloads in a way that Docker Swarm is just not ready for at the moment.
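For example (a hedged sketch with arbitrary names and thresholds), namespaces and autoscaling can be driven with simple kubectl commands:

# Create an isolated namespace for a team or environment
kubectl create namespace staging

# Autoscale the deployment between 2 and 10 replicas based on CPU usage
kubectl autoscale deployment mywebservice-deployment --min=2 --max=10 --cpu-percent=80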

Marathon

Basic Architecture

Another common orchestration setup for large-scale clusters is to run Marathon on top of Apache Mesos. Mesos is an open source cluster management system that supports a diverse array of workloads. Mesos is composed of a Mesos agent running on each host in the cluster, which reports its available resources to the master. There can be one or more Mesos masters, which coordinate using a Zookeeper cluster. At any given time, one of the master nodes is active as the result of a master election process. The master can issue tasks to any of the Mesos agents, and the agents report back on the status of those tasks. Although you can issue tasks through the API, the normal approach is to use a framework on top of Mesos. Marathon is one such framework, and it provides support for running Docker containers (as well as native Mesos containers).

Usability

Again, compared to Swarm, Marathon has a fairly steep learning curve as it does not share most of its concepts and terminology with Docker. However, Marathon is not as feature rich as Kubernetes, and is thus easier to learn. Instead, the complexity of managing a Marathon deployment comes from the fact that it is layered on top of Mesos, and hence there are two layers of tools to manage. Furthermore, some of the more advanced features of Marathon, such as load balancing, are only available as additional frameworks that run on top of Marathon. Some features, such as authentication, are only available if you run Marathon on top of DC/OS, which in turn runs on top of Mesos, adding yet another layer of abstraction to the stack.

Feature Set

To define services in Marathon, you need to use its internal JSON format. A simple definition like the one below will create a service with two instances, each running an nginx container.

{
  "id": "MyService"
  "instances": 2,
  "container": {
    "type": "DOCKER",
    "docker": {
      "network": "BRIDGE",
      "image": "nginx:latest"
    }
  }
}

A slightly more complete version of the above definition is shown below; we now add port mappings and a health check. In the port mapping, we specify a container port, which is the port exposed by the Docker container. The host port defines which port on the public interface of the host is mapped to the container port. If you specify 0 for the host port, then a random port is assigned at run time. Similarly, we may optionally specify a service port. The service port is used for service discovery and load balancing, as described later in this section. Using the health check, we can now do both rolling (the default) and blue-green deployments.

{
  "id": "MyService"
  "instances": 2,
  "container": {
    "type": "DOCKER",
    "docker": {
      "network": "BRIDGE",
      "image": "nginx:latest"
      "portMappings": [
        { "containerPort": 8080, "hostPort": 0, "servicePort": 9000, "protocol": "tcp" },
      ]
    }
  },
  "healthChecks": [
    {
      "protocol": "HTTP",
      "portIndex": 0,
      "path": "/",
      "gracePeriodSeconds": 5,
      "intervalSeconds": 20,
      "maxConsecutiveFailures": 3
    }
  ]
}

In addition to single services, you can define Marathon Application Groups, with a nested tree structure of services. The benefit of defining applications in groups is the ability to scale the entire group together. This can be very useful in microservice stacks where tuning individual services can be difficult. As of now, the scaling assumes that all services will scale at the same rate, so if you require ‘n’ instances of one service, you will get ‘n’ instances of all services.

{
  "id": "/product",
  "groups": [
    {
      "id": "/product/database",
      "apps": [
         { "id": "/product/mongo", ... },
         { "id": "/product/mysql", ... }
       ]
    },{
      "id": "/product/service",
      "dependencies": ["/product/database"],
      "apps": [
         { "id": "/product/rails-app", ... },
         { "id": "/product/play-app", ... }
      ]
    }
  ]
}

In addition to being able to define basic services, Marathon can also schedule containers based on specified constraints, as detailed here, including specifying that each instance of the service must be on a different physical host: "constraints": [["hostname", "UNIQUE"]]. You can use the cpus and mem tags to specify the resource utilization of that container. Each Mesos agent reports its total resource availability, hence the scheduler can place workloads on hosts in an intelligent fashion.
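A hedged sketch combining these fields with the earlier service definition (the values are arbitrary):

{
  "id": "MyService",
  "instances": 2,
  "cpus": 0.5,
  "mem": 256,
  "constraints": [["hostname", "UNIQUE"]],
  "container": {
    "type": "DOCKER",
    "docker": {
      "network": "BRIDGE",
      "image": "nginx:latest"
    }
  }
}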

By default, Mesos relies on the traditional Docker port mapping and on external service discovery and load balancing mechanisms. However, recent beta features add support for DNS-based service discovery using Mesos-DNS, and for load balancing using Marathon-LB.

Mesos-DNS is an application that runs on top of Mesos and queries the Mesos API for a list of all running tasks and applications. It then creates DNS records for the nodes running those tasks. All Mesos agents then need to be manually updated to use the Mesos-DNS service as their primary DNS server. Mesos-DNS uses the hostname or IP address with which each Mesos agent registered with the master, and port mappings can be queried as SRV records. Since Mesos-DNS works on agent hostnames, the host network ports must be exposed and hence must not collide. Mesos-DNS does provide a way to refer to individual containers persistently for stateful workloads, similar to what Kubernetes Pet Sets allow. In addition, unlike Kubernetes VIPs, which are addressable from any container in the cluster, we must manually update /etc/resolv.conf to point to the set of Mesos-DNS servers, and update the configuration if the DNS servers change. Marathon-LB uses the Marathon event bus to keep track of all service launches and tear-downs. It then launches an HAProxy instance on agent nodes to relay traffic to the requisite service node.
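Assuming Mesos-DNS is running with its default settings (the marathon.mesos domain is the default, and application ids are lowercased), the MyService application above would be resolvable roughly like this:

# A records for the agents running MyService tasks
dig myservice.marathon.mesos

# SRV records exposing the host ports assigned to those tasks
dig _myservice._tcp.marathon.mesos SRV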

Marathon also has beta support for persistent volumes as well as external persistent volumes. However, both of these features are in a very raw state. Persistent volumes are only persistent on a single node across container restarts, and volumes are deleted if the application using them is deleted (the actual data on disk is not deleted, however, and must be removed manually). External volumes require DC/OS and currently only allow your service to scale to a single instance.

Final Verdict

Today we have looked at three options for Docker container orchestration: Docker Native (Swarm), Kubernetes and Mesos/Marathon. It is difficult to pick a system to recommend because the best system is highly dependent on your use case, scale and history. Furthermore, all three systems are under heavy development and some of the features covered are in beta and may be changed, removed or replaced very soon.

Docker Native gives you the quickest ramp-up with little to no vendor lock-in beyond the dependence on Docker. The dependence on Docker is not a big issue, since it has become the de facto container standard. Given the lack of a clear winner in the orchestration wars and the fact that Docker Native is the most flexible approach, it is a good choice for simple web/stateless applications. However, Docker Native is very bare bones at the moment, and if you need to get complicated, larger-scale applications to production, you will need to choose either Mesos/Marathon or Kubernetes.

Choosing between Mesos/Marathon and Kubernetes is also not easy, as both have their pros and cons. Kubernetes is certainly the more feature rich and mature of the two, but it is also a very opinionated piece of software. We think a lot of those opinions make sense, but Kubernetes does not have the flexibility of Marathon. This makes sense when you consider the rich history of non-Docker, non-containerized applications that can run on Mesos in addition to Marathon (e.g. Hadoop clusters). If you are doing a greenfield implementation and either don’t have strong opinions about how to lay out clusters, or your opinions agree with those of Google, then Kubernetes is a better choice. Conversely, if you have large, complicated legacy workloads that will gradually shift over to containers, then Mesos/Marathon is the way to go.

Another concern is scale: Kubernetes has been tested to thousands of nodes, whereas Mesos has been tested to tens of thousands of nodes. If you are launching clusters with tens of thousands of nodes, you’ll want to use Mesos for the scalability of the underlying infrastructure - but note that scaling advanced features such as load balancing to that range will still be left to you. However, at that scale, few (if any) off-the-shelf solutions work as advertised without careful tuning and monkey patching.

Usman is a server and infrastructure engineer, with experience in building large scale distributed services on top of various cloud platforms. You can read more of his work at techtraits.com, or follow him on Twitter @usman_ismail or on GitHub.
