Flagger Reinstall Misadventure

The configuration

At Northpass we use GitOps to manage all of our Kubernetes configuration. In general this means that we store the whole cluster state in a Git repository and use a tool such as Flux CD or Argo CD to pull the repository and apply the configuration to the cluster. In our case Flux does that at 5-minute intervals.

We are also using Flagger to handle Blue/Green deployments.
To configure Flagger you create a Canary resource in which:

  • you provide a reference to your Deployment resource
  • you provide a specification for the Service
  • you define what kind of strategy you want to use (Blue/Green, Canary, etc.)

That's it! A minimal example is sketched below.
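
For illustration, a Canary for a Blue/Green setup could look roughly like this (the name app, the ports and the analysis values are hypothetical placeholders, not our actual configuration):

    apiVersion: flagger.app/v1beta1
    kind: Canary
    metadata:
      name: app
    spec:
      # reference to the Deployment that Flagger should manage
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: app
      # specification for the Service that Flagger will create
      service:
        port: 80
        targetPort: 8080
      # iterations without stepWeight/maxWeight roughly means Blue/Green:
      # a fixed number of checks, then a full traffic switch
      analysis:
        interval: 1m
        threshold: 5
        iterations: 10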

What Flagger does with it is:

  • it copies your original Deployment (let's call it app) and adds a -primary suffix to it (so we have app-primary)
  • it creates a Service for app and another one for app-primary (see the sketch after this list)
  • it scales up the app-primary pods and, if they booted up properly, switches the traffic to them
  • it scales down the app pods.
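
The detail that matters later in this story is where those Services point. Sketched with illustrative labels (Flagger derives the real selectors from your Deployment), the app Service that receives the live traffic actually selects the app-primary pods:

    apiVersion: v1
    kind: Service
    metadata:
      name: app
    spec:
      # live traffic enters through "app", but it selects the app-primary pods
      selector:
        app: app-primary
      ports:
        - port: 80
          targetPort: 8080

Keep that selector in mind - it comes back at the end of this story.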

When a new version of the application is deployed, it again:

  • scales up app,
  • checks if everything is OK,
  • switches traffic to app,
  • replaces the app-primary pods with the new version,
  • switches the traffic back to app-primary,
  • scales down app.

That's the full cycle.

The Update Workaround

I wanted to switch Flagger from a manual installation (we had all the Flagger installation resources copied into our GitOps repository) to a Helm installation, in order to get rid of code that someone else can maintain for us.
The problem that came up was that uninstalling Flagger also removes the Canary CRD, and deleting the Canary means deleting everything that was defined by it, the Service included. For those who don't know Kubernetes: if there is no Service, no traffic can reach your pods.
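
The cascade works through standard Kubernetes garbage collection: as far as I can tell, the Services Flagger generates carry an ownerReference pointing at the Canary, so once the Canary is gone they go with it. Roughly (metadata trimmed, values are placeholders):

    apiVersion: v1
    kind: Service
    metadata:
      name: app
      ownerReferences:
        # deleting the owner (the Canary) makes Kubernetes delete this Service too
        - apiVersion: flagger.app/v1beta1
          kind: Canary
          name: app
          uid: 00000000-0000-0000-0000-000000000000   # placeholder
          controller: true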

I came up with a simple workaround for that:

  • create an app-tmp Service
  • increase the app Deployment replicas - because Flagger scales it down to 0
  • change the app Ingress (a kind of routing definition) to point to the app-tmp Service (sketched after this list)
  • remove the Canary
  • remove Flagger
  • create the app Service
  • point the Ingress to the app Service
  • remove the app-tmp Service
  • install Flagger with Helm
  • recreate the Canary
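
As a sketch of the first and third steps (names, labels, host and ports are hypothetical), the temporary Service selects the app Deployment's pods directly, and the Ingress is repointed at it:

    # temporary Service that selects the app Deployment's pods directly
    apiVersion: v1
    kind: Service
    metadata:
      name: app-tmp
    spec:
      selector:
        app: app              # same labels as the app pods
      ports:
        - port: 80
          targetPort: 8080
    ---
    # Ingress repointed from the Flagger-managed "app" Service to "app-tmp"
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: app
    spec:
      rules:
        - host: app.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: app-tmp
                    port:
                      number: 80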

503s

I tested it on staging - it worked like a charm!
Time for production! It worked beautifully as well… until our team reported that they were getting 503s rather often (5-8% of all traffic).

At first I blamed our gateway (NGINX Ingress Controller) for having stale configuration - I thought that maybe all those afternoon changes from app to app-tmp, then to app-primary, were to blame. I restarted all the controller pods, but it didn't help.
After debugging it for another two hours I accidentally noticed that the endpoints of the app Service changed for a brief moment and then switched back to the original configuration. What was strange was that it happened fairly regularly - at 5-minute intervals…

Solution

It turned out that I was missing one step in my reinstallation workaround - after recreating the Canary resource I should have removed the app Service definition from our GitOps repository. Because I didn't, Flagger and Flux were overwriting its configuration in a loop. Every five minutes Flux applied the configuration from our GitOps repository (where the app Service pointed to the app Deployment pods). When Flagger detected that the app Service had changed, it changed it back to the configuration that it needs, i.e. the app Service pointing to the app-primary Deployment pods.
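
In other words, two controllers kept fighting over the same object. The selector difference between what Flux applied and what Flagger wanted looked roughly like this (labels illustrative):

    # what Flux kept applying - the manifest still sitting in the GitOps repository
    apiVersion: v1
    kind: Service
    metadata:
      name: app
    spec:
      selector:
        app: app              # the app Deployment pods, scaled down to 0 - hence the 503s
      ports:
        - port: 80
          targetPort: 8080

    # what Flagger kept restoring a moment later
    apiVersion: v1
    kind: Service
    metadata:
      name: app
    spec:
      selector:
        app: app-primary      # the app-primary Deployment pods that actually serve traffic
      ports:
        - port: 80
          targetPort: 8080

Removing the app Service definition from the repository and leaving the object entirely to Flagger ends the loop.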