Reducing Deployment Downtime

This guide describes strategies for reducing the downtime clients might experience when Docker services are updated.

Overview

MedStack Control supports two essential service configurations to help reduce the downtime clients might experience when a service is updated.

  1. Load balancer healthchecks
  2. Update strategy

🚧

Configure with caution

Load balancer healthchecks and the update strategy use time-dependent parameters when rolling out service updates. Misconfiguring these values can increase the downtime incurred during service updates compared to the default values.

It is strongly recommended to test the configuration of these parameters in development and staging environments before making the changes in production environments.

Configuring Healthchecks

When a load balancer healthcheck is configured for a service, MedStack's managed load balancer service uses it to determine whether a container is ready to receive traffic. Suitable healthcheck parameters can be determined by:

  1. Defining an endpoint in the service that responds with a 2xx or 3xx HTTP response code when the container is ready to receive traffic. This endpoint does not need to be publicly available for Traefik to request it, because the request is made over the internal Docker network.
  2. Beginning with an initial assumption (or the default value) for the interval, based on the estimated time the container takes to initialize when it starts.
  3. Beginning with an initial assumption (or the default value) for the timeout, based on the estimated time the container may take to respond to the request.
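
As a rough illustration of step 1, the sketch below serves a readiness endpoint using only Python's standard library. The `/healthcheck` path matches the example later in this guide. The `READY` flag, the `status_for` helper, the 503 response while the service is still starting, and port 8080 are assumptions made for illustration, not MedStack requirements.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical readiness flag: set it once the application has finished initializing.
READY = threading.Event()


def status_for(path, ready):
    """Return the HTTP status code for a request path given the readiness state."""
    if path != "/healthcheck":
        return 404
    # 200 signals "ready for traffic"; 503 is an assumed "not ready yet" response.
    return 200 if ready else 503


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(status_for(self.path, READY.is_set()))
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        # Silence per-request logging so frequent healthchecks do not flood stdout.
        pass


def make_server(port=8080):
    """Build (but do not start) a server listening on all interfaces."""
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

A service would call `READY.set()` only after its startup work completes and then run `make_server().serve_forever()`. Until then the endpoint returns a non-2xx/3xx code, so the healthcheck fails.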

Once a healthcheck is configured, the pass/fail output will appear in the load balancer logs. See the section on Maintain > Healthchecks to learn about accessing load balancer logs to inspect healthchecks. Monitoring the healthcheck logs and adjusting the service's healthcheck configuration is an iterative process for optimizing when services become available in the cluster.

Setting the interval to a very short period (e.g., 1 second) can produce very noisy load balancer logs and increase network traffic and CPU load on the Manager node, especially if many services use healthchecks with this short period.
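
To get a feel for the scale involved, the following back-of-the-envelope sketch estimates how many healthcheck requests the load balancer issues per minute. It assumes one request per container per interval. The function name and the figure of 20 containers are hypothetical.

```python
def healthcheck_requests_per_minute(interval_seconds, containers):
    """Approximate healthcheck requests per minute across all containers."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return containers * 60 / interval_seconds


# 20 containers is an assumed cluster size for illustration.
print(healthcheck_requests_per_minute(1, 20))   # 1200.0
print(healthcheck_requests_per_minute(10, 20))  # 120.0
```

Under these assumptions, moving from a 1-second to a 10-second interval cuts the request volume, and the corresponding log output, by a factor of ten.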

To reduce deployment downtime for services with more than one replica, the healthcheck interval and the update strategy delay need to be chosen together. See the next section on configuring the update strategy, and the example below, to learn more about choosing these values.

Configuring Update Strategy

The update strategy controls how new containers are started and old containers are stopped during an update. The default update strategy configuration is designed to mitigate issues that could arise when a breaking change is introduced to a service dependency. Suitable update strategy parameters can be determined by:

  1. Defining the most suitable order depending on the service's ability to run different versions simultaneously without adversely impacting other services. For services that are configured to handle this gracefully, it may be suitable to use the start-first value.
  2. Setting the parallelism value to the number of containers updated in each batch, with batches separated by the delay value. This can work well for services with more than one replica.
  3. Setting the delay value to a period long enough for a new container to start up and pass a healthcheck before the next batch of container updates is rolled out.
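
The relationship between replicas, parallelism, and delay can be sketched with a little arithmetic. This is a simplified model, not a description of the scheduler's exact behaviour: it assumes the delay elapses once between each pair of consecutive batches and ignores the time containers take to start. The function names are hypothetical.

```python
import math


def rollout_batches(replicas, parallelism):
    """Number of update batches needed to replace every replica."""
    if replicas < 1 or parallelism < 1:
        raise ValueError("replicas and parallelism must be at least 1")
    return math.ceil(replicas / parallelism)


def minimum_rollout_seconds(replicas, parallelism, delay_seconds):
    """Lower bound on rollout time: only the delays between batches are counted."""
    return (rollout_batches(replicas, parallelism) - 1) * delay_seconds


print(rollout_batches(4, 2))              # 2
print(minimum_rollout_seconds(4, 2, 40))  # 40
print(minimum_rollout_seconds(4, 1, 40))  # 120
```

In this model a smaller parallelism value keeps more old containers serving traffic at any moment but lengthens the rollout, which is the trade-off to weigh when choosing these values.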

Example

Let's say I want to reduce the downtime when deploying updates to Service A. This service:

  • Is stateless, i.e., it does not depend on stateful data (data stored on disk that is changed and retrieved again by the application).
  • Is configured to run four replicas.
  • Takes 20 seconds to start up.
  • Responds with 200 at /healthcheck when it starts up correctly.

To reduce deployment downtime, here are some values I might consider for the load balancer healthchecks and the update strategy, along with the justification for each.

| Section | Field | Value | Justification |
| --- | --- | --- | --- |
| Healthchecks | path | /healthcheck | This is what has been set in my application layer as a testable endpoint reflecting the readiness of my container. |
| Healthchecks | interval | 10 | I would expect the first healthcheck to fail because the container takes 20 seconds to start up. However, the second or third healthcheck should pass. |
| Healthchecks | timeout | 5 | The default value is sufficient here since I don't know of any reason why the healthcheck endpoint would take more than 5 seconds to respond. |
| Update Strategy | order | start-first | No concerns with state mismatch. I don't know of any reason why the application couldn't handle simultaneous versions with a typical update. |
| Update Strategy | parallelism | 2 | Given four replicas, the update is broken up into two batches of two containers, separated by the delay value. |
| Update Strategy | delay | 40 | This allows enough time between the update batches for the containers to start up and pass their second or third healthcheck before the next batch of updates is rolled out. |
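
One way to sanity-check the delay against the healthcheck settings is to estimate the worst-case time for a new container to pass its first healthcheck. The sketch below assumes checks fire once per interval with an unknown offset relative to container start, so a check may land just before the container is ready and the next one may take up to the full timeout to return. The function names are hypothetical, and the values are the ones from this example (20-second startup, interval 10, timeout 5, delay 40).

```python
def worst_case_ready_seconds(startup_seconds, interval_seconds, timeout_seconds):
    """Upper bound on the time for a new container to pass a healthcheck.

    A check that starts just before the container is ready fails, the next
    check starts up to one interval later, and it may take up to the timeout.
    """
    return startup_seconds + interval_seconds + timeout_seconds


def delay_is_sufficient(delay_seconds, startup_seconds, interval_seconds, timeout_seconds):
    """True if the delay covers the worst-case time to a passing healthcheck."""
    worst_case = worst_case_ready_seconds(startup_seconds, interval_seconds, timeout_seconds)
    return delay_seconds >= worst_case


print(worst_case_ready_seconds(20, 10, 5))  # 35
print(delay_is_sufficient(40, 20, 10, 5))   # True
print(delay_is_sufficient(30, 20, 10, 5))   # False
```

Under these assumptions a delay of 40 leaves a few seconds of headroom, while a delay of 30 could start the second batch before the first batch is confirmed healthy.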