Canary Deployment Explained: What It Is, Its Pros and Cons
- Max S.
- 9min read
Software delivery is a complex process with multiple interconnected workflows. Releasing a new feature or software version is always a challenge, and canary deployment is meant to improve the process and minimize the potential risks. However, several questions arise when considering a canary deployment strategy:
- What is canary deployment?
- What are its benefits and flaws?
- How to perform a canary deployment with AWS and other DevOps tools?
Today, DeployPlace answers these questions. Continue reading to get use cases and step-by-step guides on canary deployment implementation with various platforms — and how implementing canary deployment could have saved a game development company $100,000 a day.
A canary deployment strategy is an approach to software delivery in which a new feature or product version is initially rolled out to only a small portion of users, so it can be tested in a live production environment while minimizing the potential impact of bugs.
The term comes from the practice of bringing caged canaries into coal and ore mines. Canaries are more sensitive to dangerous gases than humans, so the bird would fall silent or die while the concentration was still below a level lethal to people. While the canary sang, the miners worked; if it stopped singing, they evacuated immediately.
But how is this term related to software delivery? Nowadays, canary deployments in software delivery help businesses save money and provide an uninterrupted positive experience to their customers.
What is a deployment strategy and why do you need it?
We will not describe the horrors of manual deployments here; every developer knows them by heart. A deployment strategy is a method of minimizing the manual labour required, removing risks, and reducing the potential for human error. Modern DevOps tools that enable Continuous Integration and Continuous Delivery pipelines (like Jenkins, CircleCI, Travis CI, DeployPlace, etc.) help automate a wide variety of operations that form the release cycle, thus increasing predictability and reducing the risk of financial losses due to post-release crashes.
However, CI/CD pipelines do not operate in a vacuum: they are designed and implemented according to some deployment strategy, a way of updating the application so that users barely notice the changes and experience no downtime. There are various types of deployment strategies, like canary deployments, feature toggles, Blue-Green deployments, etc.
What is a feature toggle deployment strategy?
A feature toggle (also called a feature flag or conditional feature) is a software deployment technique that serves as an alternative to maintaining multiple code branches and merging them before release. A feature currently in development can be deployed to a production environment for a limited group of users by wrapping it in conditional statements. If the feature works as intended for 10% of users, it is toggled on for the next 10%, then the next, and so on. This incremental testing can be stopped and reverted at any time by hiding or disabling the feature, simply by toggling the flags off. It is convenient for testing separate features or updating them in production without stopping the entire application.
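As a concrete illustration, here is a minimal sketch of percentage-based toggling in Python. The flag store, flag name, and page strings are hypothetical; real systems keep flags in a config service or database so they can be toggled without a redeploy:

```python
import hashlib

# Hypothetical in-memory flag store; in practice this lives in a config
# service or database so flags can be flipped without a redeploy.
FLAGS = {"new_checkout": {"enabled": True, "rollout_percent": 10}}

def is_enabled(flag_name: str, user_id: str) -> bool:
    """Return True if the feature is toggled on for this user.

    Hashing the user id gives each user a stable bucket in [0, 100), so the
    same user keeps seeing the same variant as the rollout percentage grows.
    """
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_percent"]

def render_checkout(user_id: str) -> str:
    # The conditional statement guarding the in-development feature.
    if is_enabled("new_checkout", user_id):
        return "new checkout page"
    return "old checkout page"
```

Raising `rollout_percent` from 10 to 20 enables the feature for the next 10% of users, while setting `enabled` to `False` reverts everyone at once.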
Why should you use a Canary Deployment strategy?
In software development, the canary deployment strategy involves exposing new code in the production environment only to a small controlled group of users by rerouting a portion of traffic from the previous stable version of the software to the canary nodes. If everything works fine, all the traffic is rerouted to the canary version, effectively making it the new stable version. If some issues arise, the users are rerouted to the stable version and the canary nodes are rolled back to the previous code state.
The prerequisites to using canary releases are as follows:
- Several versions of your app can work in parallel with live traffic.
- Some sort of sticky token must be in place to ensure every visitor is served by the same app version during their session (rather than bouncing between the production and canary versions in rotation).
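The sticky-token requirement can be sketched as follows. This toy Python router assigns each visitor to a version once and then keeps serving the same one; a real setup would persist the choice in a session cookie or a load-balancer affinity table rather than an in-process dict:

```python
import random

CANARY_WEIGHT = 0.1  # share of new sessions sent to the canary version

# Stands in for a sticky session cookie or load-balancer affinity table.
_assignments: dict[str, str] = {}

def route_for(session_id: str) -> str:
    """Assign a visitor to a version once, then keep serving the same one.

    Without this stickiness a visitor could bounce between the stable and
    canary versions on every request within a single session.
    """
    if session_id not in _assignments:
        version = "canary" if random.random() < CANARY_WEIGHT else "stable"
        _assignments[session_id] = version
    return _assignments[session_id]
```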
There also are certain liabilities when using canary deployments:
- Manual releases are impractical with the canary technique, as the process involves too many error-prone steps. The full release cycle must be automated to use it efficiently.
- Canary releases have value only in systems with in-depth monitoring that enables rapid assessment of a variety of parameters in a short time and on a limited number of users.
- Major application updates (like the ones involving API structure or database schema changes) can be efficient only if a detailed migration/testing strategy is in place.
Benefits of Canary Deployment
There are several cases when using canary deployment is beneficial or inevitable:
- Your app consists of separate microservices that are updated independently and performance consistency must be checked in live production.
- Your platform or service serves hundreds of thousands of visitors daily (online gaming, online banking, eCommerce, social media, etc), so a faulty update can result in huge financial losses.
- Your product depends on some legacy or a third-party system that cannot be replicated within the testing or staging environment, so testing new functionality is possible only on live production servers.
In these cases, canary deployment is the only way to check if everything is alright before performing a full-scale update of your systems. It is very cost-efficient, too, as you don’t have to use two production environments in parallel, like when performing Blue-Green deployment.
Disadvantages of Canary Deployment
There are also several cases when using canary deployment can bring disadvantages or is simply prohibited:
- Updates to complex software that serves mission-critical or life-supporting systems. Applications controlling energy grids, life support systems in hospitals, nuclear reactors — all of these cannot be updated with canary releases.
- Updates where failure could be catastrophic economically or politically, even if not directly life-threatening, such as updates to government record storage systems.
- The update requires adjusting some backend functionality or data storage, making the new version incompatible with the rest of the production environment.
Real-life use case when using Canary Deployment could have prevented the loss of $100,000 a day
A European company developing online games with hundreds of thousands of players worldwide prepared a new update for one of their popular products, a browser MMORPG. The update performed well during testing and on the staging server and was rolled out successfully to CIS, Asian, EU and US servers.
After the update, the company noticed a sudden decrease in revenue from the EU and US servers, as their online traffic dropped significantly, while the CIS and Asian servers worked as usual. The company could not find what was wrong: conversions were unchanged, users left no complaints with technical support, and the system generated no performance incident reports.
The company spent a lot of resources trying to locate the problem and did not succeed. The QA and testing departments worked in shifts 24 hours a day for several days in a row, but to no avail. The code worked fine, yet the players were not logging in. To make things worse, this was a long-awaited update introducing new gameplay events the players had anticipated, so it could not be rolled back without losing face.
Fortunately, a couple of days later, during a routine exchange with the Google Ads team, the company found out that its service could not be accessed from IPv6 addresses, while it worked perfectly well for IPv4. US and EU users on IPv6-only connections received an endless redirect, but since they had been warned about the update, they simply waited for it to be over. This was impossible to predict or test on the staging server, as the QA and testing specialists accessed the application via IPv4 subnets; only in production did users connecting via IPv6 subnets encounter the issue.
As soon as the appropriate corrections were made and the app became accessible through IPv6, traffic recovered, but the company lost more than $100,000 a day for several days on end.
Canary deployment vs Blue-Green deployment
As we mentioned above, the canary deployment strategy is a worthy alternative to the Blue-Green deployment process. The main difference between them is that with the former you use load balancing to redirect traffic between the nodes of a single production environment, while with the latter you use load balancing with a router to redirect traffic between two complete production environments. That said, Blue-Green deployment requires more resources but is simpler to implement.
Canary Deployment implementation guide
The canary deployment technique can be used with several DevOps tools and approaches to ensure successful and error-proof automated deployments on various platforms. Below we list the steps of using canary updates in general and the technology required.
Canary Deployment steps
A canary deployment implementation includes the following steps:
- Prepare everything for deployment on the staging server (deployment manifests, configuration files, build artifacts, testing scripts).
- Exclude canary nodes from production with load balancing.
- Deploy new app version to canary nodes.
- Test new application version against automated test scripts.
- Connect the canary nodes to the traffic using load balancing, check for consistency, performance, system health, etc.
- Roll out the update to the rest of the production nodes if everything runs well. If the app crashes — roll the changes back and revert the traffic to the production nodes.
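The steps above can be sketched as a single orchestration function. The load balancer client and the deploy/test/health hooks here are hypothetical stand-ins for your actual tooling:

```python
def run_canary_deployment(lb, deploy, run_smoke_tests, healthy):
    """Orchestrate the canary steps listed above.

    lb              -- load balancer client with detach/attach/shift_all_to methods
    deploy          -- callable deploying a version to a node group
    run_smoke_tests -- callable returning True if automated tests pass on a group
    healthy         -- callable returning True while live metrics look good
    """
    lb.detach("canary")                   # exclude canary nodes from production
    deploy("canary", version="new")       # deploy the new app version
    if not run_smoke_tests("canary"):     # run the automated test scripts
        deploy("canary", version="stable")
        lb.attach("canary")
        return "rolled back: tests failed"
    lb.attach("canary")                   # route live traffic to the canary
    if not healthy():                     # watch consistency, performance, health
        lb.shift_all_to("stable")         # revert traffic to the stable nodes
        deploy("canary", version="stable")
        return "rolled back: unhealthy"
    deploy("production", version="new")   # roll out to the rest of production
    return "promoted"
```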
Partitioning strategies or how to choose your canary
When your product or service spans several regions, it might be easier to change your partitioning strategy: instead of using the load balancer to split the traffic, roll the update out to one of the minor regions first. This usually provides enough user diversity to test against all the possible scenarios, and isolating the traffic of one country is fairly simple.
Alternatively, you can split your production environment by instances. By default, Kubernetes spreads traffic evenly across pods, so if you want to divert 10% of the traffic to a canary node, you need nine nodes running the stable production version (one canary out of ten in total). The same arithmetic applies to 25% or 50% of the workload.
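A quick sanity check of that arithmetic, assuming traffic is spread evenly across all nodes:

```python
def stable_nodes_needed(canary_nodes: int, canary_percent: float) -> int:
    """Number of stable nodes required so the canary receives the target
    traffic share, given that traffic is spread evenly across all nodes."""
    total_nodes = canary_nodes / (canary_percent / 100)
    return round(total_nodes) - canary_nodes

# One canary node taking 10% of traffic needs nine stable nodes;
# at 25% it needs three, and at 50% just one.
```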
How to know if your canary is dead?
What metrics can show that there is an issue with your canary version and that you need to start to roll back the changes?
There are various metrics you need to monitor to ensure the health of your application. Business metrics include conversion rates, session length, etc., while technical metrics include CPU load, RAM usage, network load, etc. You have to monitor them all to get a complete picture of your IT operations.
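A minimal sketch of such a health check in Python follows. The metric names and the flat 5% tolerance are illustrative; dedicated analysis tools like Kayenta use proper statistical comparison instead of a fixed threshold:

```python
def canary_looks_healthy(canary: dict, baseline: dict, tolerance: float = 0.05) -> bool:
    """Compare canary metrics against the stable baseline.

    Lower is better for the technical metrics, higher for the business one;
    the canary passes only if every metric stays within the tolerance band.
    """
    return all([
        canary["error_rate"] <= baseline["error_rate"] * (1 + tolerance),
        canary["p95_latency_ms"] <= baseline["p95_latency_ms"] * (1 + tolerance),
        canary["conversion_rate"] >= baseline["conversion_rate"] * (1 - tolerance),
    ])
```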
DeployPlace allows adding various types of metrics to your dashboard and gathering the metrics for all your apps in one place. This greatly simplifies deployment monitoring and management in production for your applications and services.
Other tools, like Spinnaker, require deploying various agents to gather live metrics and obtain the data through APIs.
When should the canary become the production?
If the canary update is successful, how do you know when to roll it out to the rest of the users? Have a roadmap with periodic feedback collection and analysis, and a recovery plan for every stage:
- Launch the canary for 5% or 10% of users. Test that everything is working correctly by assessing your set of metrics using Spinnaker or any tool of your choice.
- Double or triple the size of the canary version. Test again, gather feedback from the metrics.
- Update 70% of the production environment using the same procedure.
- Roll out to 100% of users and continue monitoring.
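The staged rollout above can be expressed as a simple ramp schedule. The percentages follow the list above, while `metrics_ok` stands in for whatever metric and feedback assessment you run at each stage:

```python
RAMP_STEPS = [5, 10, 25, 70, 100]  # percent of users on the new version

def run_ramp(metrics_ok) -> int:
    """Walk through the rollout stages, stopping at the first bad signal.

    metrics_ok(percent) represents the metric/feedback assessment performed
    at each stage. Returns the highest percentage safely reached; 0 means
    the very first canary stage failed and everything rolls back.
    """
    reached = 0
    for percent in RAMP_STEPS:
        if not metrics_ok(percent):
            return reached  # recovery plan: hold at the last good stage
        reached = percent
    return reached
```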
As for the time frames for this testing: you should have one or two north-star metrics, so you can confirm the update was successful within a couple of hours. Otherwise you would have to monitor retention for weeks or months, which is impractical.
Canary deployment with popular tools
Let’s take a closer look at how to perform canary deployments using various DevOps tools.
Canary deployments with DeployPlace
While DeployPlace is currently in active development and canary deployments are not yet supported out of the box, we value this technique highly, and the feature will be available in one of our upcoming product versions. If you want this feature added sooner, please let us know and we will raise its priority.
How to enable canary with GitLab
GitLab supports canary deployments once Deploy Boards are configured correctly. You just need to add the track: canary label to your Kubernetes pods and deployments; the Auto Deploy configuration will help you with that.
A blank or stable track label denotes your live environment, while canary (or any other track value) stands for your temporary canary deployments. Once everything is configured correctly and has run at least once, canary deployments are highlighted by a yellow dot inside the pod square, which allows quick and easy assessment of the current canary deployment stage.
Performing canary deployment with Octopus
The Octopus platform provides two major approaches to canary releases. The easier approach is to enable the feature called “Deploy to a subset of deployment targets”. It allows you to easily deploy new updates to only some production instances; after disabling it, you can deploy to all the instances again. This is convenient if you do not have to deploy new code several times a day. The second approach involves assigning the canary role to some of your servers and enabling a manual intervention step to limit the deployment to these servers only. Once the new code is tested, you can continue deploying to the rest of your instances.
Yet another approach is to configure a canary environment with the deployment target being the same as the production target. This way, once you’re done with testing, you can easily deploy to production.
Canary deployment with AWS
Amazon Web Services provides a simple step-by-step guide to canary updates using their Lambda serverless computing feature.
The basic approach relies on the https://github.com/awslabs/aws-lambda-deploy script, which executes a Lambda function that gradually shifts alias traffic weights from the old version to the new one. The function accepts the following parameters:
- function-name: The name of the Lambda function to deploy.
- alias-name: The name of the alias used to invoke the Lambda function.
- new-version: The version identifier for the new version to deploy.
- steps: The number of times the new version weight is increased.
- interval: The amount of time (in seconds) to wait between weight updates.
- type: The function to use to generate the weights. Supported values: “linear”.
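To make the steps/interval/type semantics concrete, here is a sketch (not the actual script's code) of the weight schedule a “linear” configuration implies:

```python
def linear_weights(steps: int) -> list:
    """Traffic weights for the new version under the "linear" type: the
    share grows by 1/steps on each update until it reaches 100%."""
    return [round((i + 1) / steps, 4) for i in range(steps)]

def schedule(steps: int, interval: int) -> list:
    """(seconds since start, new-version weight) pairs the deployer applies."""
    return [(i * interval, w) for i, w in enumerate(linear_weights(steps))]

# schedule(5, 120) shifts another 20% of alias traffic every two minutes:
# [(0, 0.2), (120, 0.4), (240, 0.6), (360, 0.8), (480, 1.0)]
```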
The problem here is that such a Lambda function can run for five minutes at most, so longer deployments require another solution: a Step Functions workflow.
With it, the process described above becomes just one step in a workflow, and the AWS Step Functions state machine can execute it for as long as needed, up to a year instead of five minutes.
As you can see, canary deployment with AWS provides a reliable and error-proof way to update your software using Lambda functions.
Canary deployment with Istio
Istio is a configurable service mesh for establishing custom parameters (like traffic percentage per instance) for your Kubernetes clusters. Therefore, you can implement any scenario you need, regardless of your actual production environment configuration.
In general, using Istio ensures flexible traffic balancing on any number of production pods.
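For example, a 90/10 split between stable and canary versions can be declared in an Istio VirtualService like this. The myapp names are placeholders, and the stable/canary subsets must be defined in a matching DestinationRule:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp.example.com
  http:
    - route:
        - destination:
            host: myapp
            subset: stable
          weight: 90
        - destination:
            host: myapp
            subset: canary
          weight: 10
```

Shifting traffic toward the canary is then just a matter of adjusting the two weights and reapplying the resource.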
Canary deployment with OpenShift
OpenShift provides canary updates as the default technique for all updates under its rolling deployment strategy. In short, the OpenShift platform performs readiness checks to determine whether a pod is operational and whether it can be safely deleted, which ensures no downtime during deployment. The rolling strategy accepts the following parameters:
- updatePeriodSeconds — The time to wait between individual pod updates. If unspecified, this value defaults to 1.
- intervalSeconds — The time to wait between polling the deployment status after update. If unspecified, this value defaults to 1.
- timeoutSeconds — The time to wait for a scaling event before giving up. Optional; the default is 600. Here, giving up means automatically rolling back to the previous complete deployment.
- maxSurge is optional and defaults to 25% if not specified.
- maxUnavailable is optional and defaults to 25% if not specified.
- pre and post are both lifecycle hooks.
The rolling update strategy works as follows:
- Perform any pre hook needed.
- Scale up the new replication controller according to the maxSurge count parameter specified.
- Scale down the existing replication controller based on the maxUnavailable count.
- Repeat this scaling until the new replication controller has reached the desired replica count and the old replication controller has been fully replaced.
- Execute any post hook.
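The scaling loop above can be simulated in a few lines. This sketch treats maxSurge and maxUnavailable as absolute pod counts rather than percentages for simplicity, and omits the pre/post hooks:

```python
def rolling_update(desired: int, max_surge: int, max_unavailable: int) -> list:
    """Simulate the rolling strategy: returns (new, old) replica counts per iteration.

    Scale the new replication controller up within the surge budget, scale the
    old one down within the unavailability budget, repeat until fully replaced.
    """
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("at least one of maxSurge/maxUnavailable must be non-zero")
    new, old = 0, desired
    states = [(new, old)]
    while old > 0 or new < desired:
        # Total pods may exceed `desired` by at most max_surge.
        new = min(desired, desired + max_surge - old)
        # Available pods may drop below `desired` by at most max_unavailable;
        # once all new replicas are up, the old controller can be fully removed.
        old = 0 if new == desired else max(0, desired - max_unavailable - new)
        states.append((new, old))
    return states
```

Running `rolling_update(4, 1, 1)` walks the cluster from four old replicas to four new ones while never exceeding five total pods or dropping below three available.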
Companies that implement canary deployment strategy
Many industry-leading companies have already integrated canary updates into their workflows:
- Microsoft: Microsoft has used its own staff as a canary user base for product updates since the days of Windows Vista and the inception of MS Azure. The Windows 10 upgrade was rolled out as a canary update to Microsoft staff first.
- Instagram: Mike Krieger, the co-founder and CTO of Instagram, explains that canary releases help him ensure that bugs cannot do much damage, affecting only a very small percentage of users.
- Google: All the latest Chrome features are available through Google Chrome Canary. This is more of a developer’s gimmick, but if you want to experience the cutting edge of technology — feel free to enjoy yourself!
- Netflix: Netflix used Kayenta for years as part of Automated Canary Analysis, enabling rapid delivery of updates into the Netflix production environment (like optimizations to the Netflix API), before open-sourcing it in collaboration with Google.
- Facebook: Facebook has optimized its continuous delivery processes immensely using canary deployment at scale, moving from releasing its web app three times a day to enabling nearly instant updates for its mobile apps.
How do we plan to use canary deployment strategy at DeployPlace?
As you can see, canary releases are a popular, useful and very convenient tool that helps to greatly reduce the risks associated with updating applications in production. We will launch DeployPlace without this feature, but it is included in our roadmap and will surely be added in the future. If you want to have this technique available with DeployPlace quickly — let us know and we will increase the priority for canary deployment implementation.
If you have further questions on canary update best practices, we are glad to answer them! Let us know in the comments below and we will reply promptly!
What is DeployPlace?
DeployPlace is a team of DevOps specialists and software development enthusiasts that strives to deliver a convenient and powerful tool for software delivery automation. We build a platform that seamlessly integrates with more than 100 popular DevOps tools to help your business simplify the software delivery process and make it more manageable and predictable.
If this sounds like a tool you would like to use — order early access and tell us what tools YOU want us to integrate with DeployPlace!