Reducing Deployment Downtime
This guide describes strategies to reduce the downtime clients might experience when updating Docker services.
Overview
MedStack Control supports two service configurations that help reduce the downtime clients might experience when a service is updated: load balancer healthchecks and the update strategy.
Configure with caution
Load balancer healthchecks and the update strategy use time-dependent parameters when rolling out service updates. Misconfiguring these values may increase the downtime incurred during service updates compared to the default values.
It is strongly recommended to test the configuration of these parameters in development and staging environments before making the changes in production environments.
Configuring Healthchecks
When a load balancer healthcheck is configured for a service, MedStack's managed load balancer service determines whether a container is suitable for receiving traffic. Suitable healthcheck parameters can be determined by:
- Defining an endpoint in the service that should respond with a `2xx` or `3xx` HTTP response code when the container is ready to receive traffic. This endpoint does not need to be publicly available in order for Traefik to request it because this request is made over the internal Docker network.
- Beginning with an initial assumption (or default value) for the interval, starting with the estimated time taken to initialize the container upon runtime.
- Beginning with an initial assumption (or default value) for the timeout, starting with the estimated time it may take for the container to respond to the request.
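As an illustration of the first point, here is a minimal sketch of a readiness endpoint using only the Python standard library. The `READY` flag, the `serve` helper, the port `8080`, and the `503` response for a container that is not yet ready are all assumptions made for this example; the `/healthcheck` path matches the example later in this guide. A real service would typically expose this endpoint through its own web framework.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag. A real service would set this to True
# once its own initialization has finished.
READY = {"value": False}


def healthcheck_status(ready: bool) -> int:
    """Return 200 when the container is ready, 503 (an assumed choice) otherwise."""
    return 200 if ready else 503


class HealthcheckHandler(BaseHTTPRequestHandler):
    """Answer GET /healthcheck with a status reflecting readiness."""

    def do_GET(self):
        if self.path == "/healthcheck":
            self.send_response(healthcheck_status(READY["value"]))
        else:
            self.send_response(404)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Serve the endpoint on all interfaces; the port is an assumption."""
    HTTPServer(("0.0.0.0", port), HealthcheckHandler).serve_forever()
```

In this sketch the application would set `READY["value"] = True` once initialization completes and call `serve()` from its startup code. Until then the endpoint returns a non-`2xx`/`3xx` code, so the healthcheck fails.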
Once a healthcheck is configured, the pass/fail output will appear in the load balancer logs. See the section on Maintain > Healthchecks to learn about accessing load balancer logs to inspect healthchecks. Monitoring the healthcheck logs and adjusting the service's healthcheck configuration is an iterative process for optimizing when services become available in the cluster.
Setting the interval to a very short period (e.g., 1 second) may produce very noisy load balancer logs and increase network bandwidth usage and CPU load on the Manager node, especially if many services use healthchecks with this short period.
To reduce deployment downtime for services with more than one replica, the time-dependent parameters of the healthcheck `interval` and the update strategy `delay` require thoughtful configuration. See the next section on configuring the update strategy, and the example below, to learn more about configuring these parameters.
Configuring Update Strategy
The mechanics of starting new containers and stopping old containers during an update are configured by the update strategy. The default update strategy configuration is designed to mitigate issues that could arise when a breaking change is introduced to a service dependency. Suitable update strategy parameters can be determined by:
- Defining the most suitable `order` depending on the service's ability to run different versions simultaneously without adversely impacting other services. For services that are configured to handle this gracefully, it may be suitable to use the `start-first` value.
- Setting the `parallelism` value to represent the batch of containers updated at the rate defined by the `delay` value. This can work well for services with more than one replica.
- Setting the `delay` value to a long enough period that allows for a new container to start up and pass a healthcheck before rolling out the next batch of container updates.
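To make the interaction between `parallelism` and `delay` concrete, the following sketch lists which replicas would be updated in each batch and roughly when each batch would begin. The `plan_batches` function is hypothetical and deliberately simplified: it assumes `parallelism` is at least 1 and ignores the time each batch itself takes to update, so the offsets are approximations rather than what the orchestrator will do exactly.

```python
import math


def plan_batches(replicas: int, parallelism: int, delay: int):
    """Return a list of (start_offset_seconds, replica_numbers) per batch.

    Simplified model: each batch starts `delay` seconds after the
    previous one, ignoring how long the batch itself takes to update.
    """
    if replicas < 1 or parallelism < 1 or delay < 0:
        raise ValueError("replicas and parallelism must be >= 1, delay >= 0")

    batch_count = math.ceil(replicas / parallelism)
    plan = []
    for index in range(batch_count):
        first = index * parallelism + 1
        last = min(first + parallelism - 1, replicas)
        plan.append((index * delay, list(range(first, last + 1))))
    return plan


# Four replicas updated two at a time, 40 seconds apart.
for offset, batch in plan_batches(replicas=4, parallelism=2, delay=40):
    print(f"t+{offset}s: update replicas {batch}")
```

Under these assumptions, four replicas with a `parallelism` of 2 are updated in two batches, so half of the replicas keep serving traffic while the other half restarts.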
Example
Let's say I want to reduce the downtime when deploying updates to Service A. This service:
- Is stateless, i.e., it does not depend on stateful data (data stored on disk that is changed and retrieved again by the application).
- Is configured to run four replicas.
- Takes 20 seconds to start up.
- Responds with `200` at `/healthcheck` when it starts up correctly.
To reduce deployment downtime, here are some values I might consider for the load balancer healthchecks and the update strategy, along with the justification for each.
| Section | Field | Value | Justification |
|---|---|---|---|
| Healthchecks | path | /healthcheck | This is what has been set in my application layer as a testable endpoint reflecting the readiness of my container. |
| Healthchecks | interval | 10 | We would expect the first healthcheck to fail because the container takes 20 seconds to start up. However, the second or third healthcheck should pass. |
| Healthchecks | timeout | 5 | The default value is sufficient here since we don't know of any reason why the healthcheck endpoint would take more than 5 seconds to respond. |
| Update Strategy | order | start-first | No concerns with state mismatch. We don't know of any reason why the application couldn't handle simultaneous versions with a typical update. |
| Update Strategy | parallelism | 2 | Given four replicas, we're breaking up the update mechanism to deploy two batches of two containers, separated by the delay value. |
| Update Strategy | delay | 40 | This allows enough time between the update batches for the containers to start up and pass their second or third healthcheck before rolling out the next batch of updates. |
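The reasoning behind the delay value can be checked with a little arithmetic. The sketch below assumes a pessimistic model that is not taken from MedStack's or Traefik's documentation: the container becomes ready just after a healthcheck has fired, so the next check arrives one full interval later, and the response takes the full timeout. Both function names are hypothetical.

```python
def worst_case_ready_seconds(startup: int, interval: int, timeout: int) -> int:
    """Pessimistic estimate of when a new container first passes a healthcheck.

    Assumed model: readiness is reached just after a check, the next
    check fires one full interval later, and it takes the full timeout.
    """
    return startup + interval + timeout


def delay_is_sufficient(startup: int, interval: int, timeout: int, delay: int) -> bool:
    """Return True if the delay covers the pessimistic readiness estimate."""
    return delay >= worst_case_ready_seconds(startup, interval, timeout)


# Values from the example: 20 s startup, 10 s interval, 5 s timeout, 40 s delay.
estimate = worst_case_ready_seconds(startup=20, interval=10, timeout=5)
print(f"worst-case readiness: {estimate}s")
print(f"delay of 40s sufficient: {delay_is_sufficient(20, 10, 5, 40)}")
```

With the example values the estimate is 35 seconds, which leaves a small margin under the 40-second delay. If startup time varies in practice, the estimate should be recomputed with the slowest observed startup.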