How we reduced 502 errors by caring about PID 1 in Kubernetes
This blog post and linked pages contain information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. As with all projects, the items mentioned in this blog post and linked pages are subject to change or delay. The development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.

Our SRE on call was getting paged daily because one of our SLIs for the GitLab Pages service was burning through its SLO error budget. The errors were intermittent and short-lived, but they caused enough user-facing impact that we weren’t comfortable with them. The pages turned into alert fatigue: the SRE on call didn’t have enough time to investigate the issue, and the alert wasn’t actionable because the service recovered on its own.

We decided to open an investigation issue for these alerts. We were showing 502 errors to our users, so we had to find out what the issue was, and we needed a DRI who wasn’t on call to investigate.

What is even going on?

As an SRE at GitLab, you get to touch a lot of services that you didn’t build yourself and interact with system dependencies that you might not have touched before. There’s always detective work to do!

When we looked at the GitLab Pages logs, we found that it was always returning ErrDomainDoesNotExist errors, which resulted in a 502 error for our users. GitLab Pages sends a request to GitLab Workhorse, specifically to the /api/v4/internal/pages route.

GitLab Workhorse is a Go service in front of our Ruby on Rails monolith. It’s deployed as a sidecar inside the webservice pod, which runs Ruby on Rails using the Puma web server. We used the internal IP to correlate the GitLab Pages requests with GitLab Workhorse containers.
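
To make the sidecar arrangement concrete, here is a minimal Go sketch of a proxy sitting in front of a local backend. It is not GitLab Workhorse's actual code: it uses the standard library's httputil.ReverseProxy, an in-process test server stands in for Puma, and the names newSidecarProxy, statusThroughProxy, and runDemo are hypothetical. It only illustrates that a proxy of this kind answers 502 when nothing is listening on the backend address.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// newSidecarProxy returns a reverse proxy that forwards every request to
// backendAddr (host:port). Hypothetical helper, not Workhorse's implementation.
func newSidecarProxy(backendAddr string) *httputil.ReverseProxy {
	target := &url.URL{Scheme: "http", Host: backendAddr}
	proxy := httputil.NewSingleHostReverseProxy(target)
	// When the backend cannot be reached (for example, the TCP dial is
	// refused), answer 502 and include the underlying error in the body.
	proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, err error) {
		http.Error(w, "502 Bad Gateway with "+err.Error(), http.StatusBadGateway)
	}
	return proxy
}

// statusThroughProxy sends one GET request for path through a proxy pointed
// at backendAddr and returns the status code the client would see.
func statusThroughProxy(backendAddr, path string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	newSidecarProxy(backendAddr).ServeHTTP(rec, req)
	return rec.Code
}

// runDemo proxies the same route twice: once while the stand-in backend is
// running, and once after it has been stopped.
func runDemo() (up, down int) {
	const route = "/api/v4/internal/pages"

	// Stand-in for the application server behind the proxy.
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	addr := backend.Listener.Addr().String()

	up = statusThroughProxy(addr, route)

	// Stop the backend so nothing is listening on addr any more.
	backend.Close()
	down = statusThroughProxy(addr, route)
	return up, down
}

func main() {
	up, down := runDemo()
	fmt.Println("backend running:", up)
	fmt.Println("backend stopped:", down)
}
```

Running this should report a 200 while the stand-in backend is up and a 502 once it has been stopped. The proxy process itself stays healthy throughout, which is why a failure in the process behind it shows up to callers as a gateway error rather than as a failed connection to the proxy.
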
We looked at multiple requests and found that all the 502 requests had the following error attached to them: 502 Bad Gateway with dial tcp 127.0.0.1:8080: connect: connection refused. This means that GitLab Workhorse couldn’t connect to the Puma web server, so we needed to go another layer deeper.

The Puma web server runs the Ruby on Rails monolith, which serves the internal API endpoint, but Puma was never getting these requests because it wasn’t running. This told us that Kubernetes kept our pod in the service even when Puma wasn’t responding, despite having readiness probes configured.

[Figure: request flow between GitLab Pages, GitLab Workhorse, and Puma/Webservice]

Attempt 1: Red herring

We shifted our focus to GitLab Workhorse and Puma to understand how GitLab Workhorse was returning 502 errors in the first place. We found some 502 Bad Gateway with dial tcp 127.0.0.1:8080: connect: connection refused errors during container startup. How could this be? With the readiness probe, the pod shouldn’t be added to the Endpoint until all readiness probes pass.

We later found out that these errors came from a polling mechanism that we have for Geo, which runs in the background in a goroutine in GitLab Workhorse and pings Puma for Geo information. We don’t have Geo enabled on GitLab.com so we simply disabled it to reduce […]
