Our infrastructure runs on Kubernetes with the following setup:
- NGINX Ingress Controller as the gateway
- Rails application served by the Puma application server
- Horizontal Pod Autoscaler
The Horizontal Pod Autoscaler scales the application pods based on CPU usage. We try to keep average CPU utilization at 60%: when it rises above that threshold, the application is scaled up; when it drops below, the application is scaled down.
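For illustration, an HPA with this 60% CPU target could look roughly like the sketch below (the resource names, namespace defaults, and replica bounds are hypothetical, not taken from our actual manifests):

```yaml
# Sketch of a Horizontal Pod Autoscaler targeting 60% average CPU
# utilization across pods (names and replica counts are illustrative).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rails-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rails-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```

With this configuration the autoscaler compares the pods' average CPU utilization against the 60% target and adds or removes replicas to keep it near that level.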
The problem we had was that, when the application was downscaled, the NGINX Ingress Controller would sometimes proxy a connection to a pod that was being removed due to low CPU usage. NGINX then raised one of the following errors:

- connect() failed (111: Connection refused) while connecting to upstream
- connect() failed (113: Host is unreachable) while connecting to upstream
- upstream timed out (110: Operation timed out) while connecting to upstream

This caused errors on around 1% of the traffic, especially during traffic spikes, because spikes triggered a lot of downscaling and upscaling.
The solution I found was to use a preStop hook in the Kubernetes pod lifecycle with the command `sh -c "sleep 5"`. This caused the pod to be marked as Not Ready, which meant that the NGINX Ingress Controller stopped proxying new requests to it, letting all incoming requests finish during those 5 seconds before the application pod was shut down.
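In the Deployment's pod template, the hook could be wired up roughly as in the sketch below (the container name and image are hypothetical; only the `lifecycle` section reflects the fix described above):

```yaml
# Sketch of the preStop hook inside a Deployment's pod spec.
# The sleep gives the Ingress Controller time to stop routing
# traffic to the pod before the container receives SIGTERM.
spec:
  containers:
    - name: rails-app            # hypothetical container name
      image: example/rails:latest  # hypothetical image
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 5"]
```

Note that the pod's `terminationGracePeriodSeconds` (30 seconds by default) must be longer than the sleep, since the preStop hook runs within that grace period before the container is asked to shut down.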
Below you can see the change in errors after applying this fix. The blue bars represent all failed requests caused by the errors mentioned above.