At Northpass we use GitOps to handle all of our Kubernetes configuration. In general this means that we store the entire cluster state in a Git repository and use a tool such as Flux CD or Argo CD to pull the repository and apply the configuration to the cluster. In our case that happens in 5-minute intervals.
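We use Flux; assuming Flux v2 (the resource names, repository URL and path below are made up for illustration), that 5-minute pull-and-apply loop boils down to two resources like these:

```yaml
# Illustrative only - names, URL and path are placeholders, not our real setup.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: gitops
  namespace: flux-system
spec:
  interval: 5m              # how often Flux fetches the repository
  url: https://github.com/example/gitops
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster
  namespace: flux-system
spec:
  interval: 5m              # how often Flux re-applies the manifests
  prune: true
  sourceRef:
    kind: GitRepository
    name: gitops
  path: ./clusters/production
```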
We are also using Flagger to handle Blue/Green deployments.
To configure Flagger you create a `Canary` resource where:
- you provide a reference to your Deployment,
- you provide a specification for the Service,
- you define what kind of strategy you want to use (Blue/Green, Canary, etc.) - see the sketch below.
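A minimal Blue/Green `Canary` could look roughly like this (the name, namespace, ports and analysis settings are illustrative, not our actual configuration):

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: app                  # illustrative name
  namespace: default
spec:
  targetRef:                 # reference to your Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: app
  service:                   # specification for the Service Flagger will create
    port: 80
    targetPort: 8080
  analysis:                  # Blue/Green: fixed number of checks, no gradual traffic shifting
    interval: 30s
    threshold: 2
    iterations: 5
```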
What Flagger does with it is:
- it copies your original Deployment (let's call it `app`) and adds the `-primary` suffix to the copy (so we have `app-primary`),
- it creates a `Service` for `app` and another one for `app-primary` (see the sketch below),
- it upscales the `app-primary` pods and - if they booted up properly - it switches the traffic to them,
- it downscales the `app` Deployment to 0.
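The detail that matters for the rest of this story is that the `app` Service ends up selecting the `app-primary` pods, not the original ones. Roughly like this (selector and ports simplified for illustration):

```yaml
# Roughly what Flagger keeps the app Service looking like.
apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  selector:
    app: app-primary        # traffic goes to the copied Deployment's pods
  ports:
    - port: 80
      targetPort: 8080
```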
When a new version of the application is deployed, it again:
- scales up `app`,
- checks if everything is OK,
- switches traffic to `app`,
- updates `app-primary` with the new pods,
- switches the traffic back to `app-primary` and downscales `app` again.
That’s the full cycle.
The Update Workaround
I wanted to switch Flagger from a manual installation (we had all the Flagger installation resources copied into our GitOps repository) to a Helm installation, in order to get rid of code that someone else can maintain for us.
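Since everything goes through the GitOps repository, "installing with Helm" here means describing the release declaratively. A sketch of what that can look like, assuming Flux's Helm controller - the namespace, intervals and values below are illustrative:

```yaml
# Illustrative only - not our real configuration.
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: flagger
  namespace: flagger-system
spec:
  interval: 1h
  url: https://flagger.app
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: flagger
  namespace: flagger-system
spec:
  interval: 5m
  chart:
    spec:
      chart: flagger
      sourceRef:
        kind: HelmRepository
        name: flagger
  values:
    meshProvider: kubernetes   # plain Kubernetes Blue/Green, no service mesh
```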
The problem that came up was that when you uninstall Flagger you also remove the `Canary` CRD, and deleting a `Canary` means also deleting everything that was defined by it - the `Service` included. For those who don't know Kubernetes: if there is no `Service`, no traffic can reach your pods.
I came up with a simple workaround for that:
- scale the `app` Deployment replicas back up - because Flagger downscales it to 0,
- switch the `app` ingress (it is a kind of routing definition) to point to a temporary `app-tmp` Service (sketched below),
- remove Flagger,
- point the ingress back to the `app` Service,
- install Flagger with Helm.
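The temporary Service is nothing fancy - just a copy of the application Service that is not owned by the `Canary`, so removing Flagger cannot take it down with it. Something along these lines (selector and ports illustrative):

```yaml
# Temporary Service defined by us (not owned by the Canary),
# so it survives the Flagger and Canary removal.
apiVersion: v1
kind: Service
metadata:
  name: app-tmp
spec:
  selector:
    app: app                # the original Deployment's pods, scaled back up
  ports:
    - port: 80
      targetPort: 8080
```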
I tested it on staging - worked like a charm!
Time for production! Worked beautifully as well… until our team reported that they were getting 503s rather often (5-8% of all the traffic).
At first I blamed our gateway (NGINX Ingress Controller) for having stale configuration - I thought that maybe all those afternoon changes, switching traffic to `app-tmp` and then back to `app-primary`, were to blame. I restarted all the controller pods, but it didn't help.
After debugging it for another two hours I accidentally noticed that the endpoints of the `app` Service changed for a brief moment and then switched back to the original configuration. What was strange was that it happened fairly regularly - in 5-minute intervals…
It turned out that I was missing one step in my reinstallation workaround - after recreating the `Canary` resource I should have removed the `app` Service from the GitOps repository. Because I didn't, Flagger and Flux were overwriting its configuration in a loop. Every five minutes Flux applied the configuration from our GitOps repository (where the `app` Service pointed to the `app` Deployment pods - a Deployment that Flagger had meanwhile scaled down to 0). When Flagger detected that the `app` Service had changed, it changed it back to the configuration it needs, i.e. the `app` Service should point to the `app-primary` Deployment pods. For the brief moment between those two writes the Service had no endpoints behind it - hence the intermittent 503s.
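Put differently, two controllers were fighting over the same selector. A simplified picture of the two states the `app` Service kept bouncing between (labels and ports illustrative):

```yaml
# What Flux applied every 5 minutes (the definition left in the GitOps repo):
apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  selector:
    app: app                # app Deployment is scaled to 0 -> no endpoints -> 503s
  ports:
    - port: 80
      targetPort: 8080
---
# What Flagger immediately changed it back to:
apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  selector:
    app: app-primary        # the pods that are actually running
  ports:
    - port: 80
      targetPort: 8080
```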