How we reduced 502 errors by caring about PID 1 in Kubernetes


Our SRE on call was getting paged daily because one of our SLIs was burning
through our SLOs for the GitLab Pages service. The errors were intermittent
and short-lived, but enough to cause user-facing impact that we weren't
comfortable with. This also turned into alert fatigue: there wasn't enough
time for the SRE on call to investigate the issue, and the alerts weren't
actionable since the errors recovered on their own.

We decided to open an investigation issue for these alerts. We had to find
out what the issue was, since we were showing 502 errors to our users, and we
needed a DRI who wasn't on call to investigate.

What is even going on?

As an SRE at GitLab, you get to touch a lot of services that you didn't build
yourself and interact with system dependencies that you might not have touched
before. There's always detective work to do!

When we looked at the GitLab Pages logs, we found that it was consistently
returning ErrDomainDoesNotExist errors, which result in a 502 error for our
users. GitLab Pages sends a request to GitLab Workhorse, specifically to the
/api/v4/internal/pages route.

GitLab Workhorse is a Go service in front of our Ruby on Rails monolith and
it’s deployed as a sidecar
inside of the webservice pod, which runs Ruby on Rails using the Puma web
server.

We used the internal IP to correlate the GitLab Pages requests with GitLab Workhorse
containers. We looked at multiple requests and found that all the 502 requests
had the following error attached to them: 502 Bad Gateway with dial tcp 127.0.0.1:8080: connect: connection refused.
This means that GitLab Workhorse couldn’t connect to the Puma web server. So we
needed to go another layer deeper.

The Puma web server runs the Ruby on Rails monolith, which has the internal
API endpoint, but Puma was never getting these requests since it wasn't
running. This told us that Kubernetes kept our pod in the service even when
Puma wasn't responding, despite the readiness probes we had configured.

Below is the request flow between GitLab Pages, GitLab Workhorse, and
Puma/webservice to make it clearer:

[Diagram: overview of the request flow]

Attempt 1: Red herring

We shifted our focus to GitLab Workhorse and Puma to try to understand how
GitLab Workhorse was returning 502 errors in the first place. We found some
502 Bad Gateway with dial tcp 127.0.0.1:8080: connect: connection refused
errors during container startup. How could this be? With the readiness probe,
the pod shouldn't be added to the Endpoint until all readiness probes pass.
We later found out that it was because of a polling mechanism that we have for
Geo, which runs in the background as a Goroutine in GitLab Workhorse and pings
Puma for Geo information. We don't have Geo enabled on GitLab.com, so we
simply disabled it to reduce the noise.
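
As a rough illustration of what such a background poller looks like (the
function name and URL below are made up for this sketch, not Workhorse's
actual code), note how it hits Puma on a timer, independent of readiness
probes or user traffic:

package main

import (
    "context"
    "log"
    "net/http"
    "time"
)

// pollGeo runs in a goroutine and pings the Rails (Puma) internal API on an
// interval. While Puma isn't listening yet (for example, during container
// startup), every attempt fails with "connection refused" and gets logged,
// even though no user request was involved.
func pollGeo(ctx context.Context, client *http.Client, url string) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            resp, err := client.Get(url)
            if err != nil {
                // e.g. dial tcp 127.0.0.1:8080: connect: connection refused
                log.Printf("geo poll failed: %v", err)
                continue
            }
            resp.Body.Close()
        }
    }
}

func main() {
    // The URL here is illustrative, not the real Geo route.
    go pollGeo(context.Background(), http.DefaultClient, "http://127.0.0.1:8080/api/v4/internal/geo-status")
    select {} // the rest of the service would run here
}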

This removed some 502 errors, but not the ones we were after. It was just a red herring.

Attempt 2: Close but not quite

At this time, we were still burning through our SLO from time to time, so this
was still an urgent thing that we needed to fix. Now that we had cleaner logs
for 502 errors, it started to become clearer that this was happening on pod
termination:

2022-04-05 06:03:49.000 UTC: Readiness probe failed
2022-04-05 06:03:51.000 UTC: Puma (127.0.0.1:8080) started shutdown.
2022-04-05 06:04:04.526 UTC: Puma shutdown finished.
2022-04-05 06:04:04.000 UTC - 2022-04-05 06:04:46.000 UTC: workhorse started serving 502 constantly.  42 seconds of serving 502 requests for any request that comes in apart from /api/v4/jobs/request

In the timeline shown above, we see that we kept serving requests well after
the first readiness probe failed and our Puma/webservice container exited.
If we look at the readiness probes we had on that pod, we see the following:

$ kubectl -n gitlab get po gitlab-webservice-api-785cb54bbd-xpln2 -o jsonpath='{range .spec.containers[*]} {@.name}{":\n\tliveness:"} {@.livenessProbe} {"\n\treadiness:"} {@.readinessProbe} {"\n"} {end}'
 webservice:
        liveness: {"failureThreshold":3,"httpGet":{"path":"/-/liveness","port":8080,"scheme":"HTTP"},"initialDelaySeconds":20,"periodSeconds":60,"successThreshold":1,"timeoutSeconds":30}
        readiness: {"failureThreshold":3,"httpGet":{"path":"/-/readiness","port":8080,"scheme":"HTTP"},"initialDelaySeconds":60,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":2}
  gitlab-workhorse:
        liveness: {"exec":{"command":["/scripts/healthcheck"]},"failureThreshold":3,"initialDelaySeconds":20,"periodSeconds":60,"successThreshold":1,"timeoutSeconds":30}
        readiness: {"exec":{"command":["/scripts/healthcheck"]},"failureThreshold":3,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":2}

This meant that for the webservice pod to be marked unhealthy and removed
from the endpoints, Kubernetes had to get 3 consecutive failures with an
interval of 10 seconds, so in total that’s 30 seconds. That seems a bit slow.

Our next logical step was to reduce the periodSeconds for the readiness probe
for the webservice pod so we don’t wait 30 seconds before removing the pod
from the service when it becomes unhealthy.

Before doing so, we had to understand whether sending more requests to the
/-/readiness endpoint would have any knock-on effects, such as increased
memory usage. We had to understand what the /-/readiness endpoint was doing
and whether it was safe to increase the frequency at which we send requests.
We decided it was safe, and after enabling it on staging and canary we didn't
see any increase in CPU/memory usage, as expected, and we saw a reduction in
502 errors, which made us more confident that this was the issue. We rolled
this out to production with high hopes.

As usual, production is a different story from staging or canary, and it
showed that the change didn't remove all the 502 errors, just enough to stop
triggering the SLO alert every day, but at least we removed the alert fatigue
for the SRE on call. We were close, but not quite there.

Attempt 3: All gone!

At this point, we were a bit lost and weren't sure what to look at next. We
had a bit of tunnel vision and kept blaming the fact that we weren't removing
the pod from the Endpoint quickly enough. We even looked at Google Cloud
Platform NEGs to see if we could have faster readiness probes and remove the
pod quicker. However, this wasn't ideal because it wouldn't have solved the
problem for our self-hosted customers, who seemed to be facing the same issue.

While researching, we also came across a known problem with running Puma in
Kubernetes, and thought that might be the solution. However, we had already
implemented a blackout window for this specific reason, so it couldn't be
that either. In other words, it was another dead end.

We took a step back and looked at the timelines one more time, and then it
hit us. The Puma/webservice container was terminating within a few seconds,
but the GitLab Workhorse one was always taking 30 seconds. Was it because of
the long polling from GitLab Runner? 30 seconds is a "special" number for
Kubernetes pod termination. When Kubernetes deletes a pod, it first sends the
TERM signal to the container and waits 30 seconds; if the container hasn't
exited by then, it sends a KILL signal. This indicated that maybe GitLab
Workhorse was never shutting down and Kubernetes had to kill it.
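
To make that expectation concrete, here is a minimal Go sketch of the
graceful shutdown pattern we were hoping to see: catch SIGTERM, stop
accepting new connections, and drain in-flight requests before the grace
period runs out. This is the general pattern, not GitLab Workhorse's actual
code:

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    srv := &http.Server{Addr: ":8181"}

    // Serve in the background so main can wait for signals.
    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatalf("listen: %v", err)
        }
    }()

    // Wait for the TERM signal, which Kubernetes sends first when deleting a pod.
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM)
    <-sigCh

    // Stop accepting new connections and drain in-flight requests, staying
    // within the pod's termination grace period (30 seconds by default).
    log.Println("shutdown initiated")
    ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
    defer cancel()
    if err := srv.Shutdown(ctx); err != nil {
        log.Printf("graceful shutdown failed: %v", err)
    }
}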

Once more we looked at the GitLab Workhorse source code and searched for the
SIGTERM usage. It did seem to support graceful termination, and it also had
explicit logic for long polling requests, so was this just another dead end?
Luckily, when the TERM signal is sent, Workhorse logs a message that it's
shutting down. We looked at our logs for this specific message and didn't see
anything. Was this it? We weren't gracefully shutting down? But how? Why did
it result in 502 errors? Why did GitLab Pages keep using the same pod that
was terminating?

We know that the TERM signal is sent to PID 1 inside the container, and that
process should handle it to shut down gracefully. We looked at the GitLab
Workhorse process tree and this is what we found:

git@gitlab-webservice-default-5d85b6854c-sbx2z:/$ ps faux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        1015  0.0  0.0 805036  4588 ?        Rsl  13:12   0:00 runc init
git         1005  0.3  0.0   5992  3784 pts/0    Ss   13:12   0:00 bash
git         1014  0.0  0.0   8592  3364 pts/0    R+   13:12   0:00  _ ps faux
git            1  0.0  0.0   2420   532 ?        Ss   12:52   0:00 /bin/sh -c /scripts/start-workhorse
git           16  0.0  0.0   5728  3408 ?        S    12:52   0:00 /bin/bash /scripts/start-workhorse
git           19  0.0  0.3 1328480 33080 ?       Sl   12:52   0:00  _ gitlab-workhorse -logFile stdout -logFormat json -listenAddr 0.0.0.0:8181 -documentRoot /srv/gitlab/public -secretPath /etc/gitlab/gitlab-workhorse/secret -config /srv/gitlab/config/workhorse-config.toml

Bingo! gitlab-workhorse is PID 19 in this case, and a child process of a
script that we invoke. Taking a closer look at that script, we checked whether
it handles TERM, and it doesn't! So far everything indicated that GitLab
Workhorse was never getting the TERM signal, which meant it received KILL
after 30 seconds. We updated our scripts/start-workhorse to use exec(1) so
that gitlab-workhorse replaces the process of our bash script. That should
have worked, right? When we tested this locally we saw the following process
tree:

git@gitlab-webservice-default-84c68fc9c9-xcsnm:/$ ps faux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
git          167  0.0  0.0   5992  3856 pts/0    Ss   14:27   0:00 bash
git          181  0.0  0.0   8592  3220 pts/0    R+   14:27   0:00  _ ps faux
git            1  0.0  0.0   2420   520 ?        Ss   14:24   0:00 /bin/sh -c /scripts/start-workhorse
git           17  0.0  0.3 1328228 32800 ?       Sl   14:24   0:00 gitlab-workhorse -logFile stdout -logFormat json -listenAddr 0.0.0.0:8181 -documentRoot /srv/gitlab/public -secretPath /etc/gitlab/gitlab-workhorse/secret -config /srv/gitlab/config/workhorse-config.toml

This changed things a bit: gitlab-workhorse was no longer a child process of
/scripts/start-workhorse, but /bin/sh was still PID 1. What was even invoking
/bin/sh, which we didn't see anywhere in our Dockerfile? After some
thumb-twiddling, we suspected that the container runtime was invoking
/bin/sh. We went back to basics and looked at the CMD documentation to see if
we were missing something, and we were. We read the following:

If you use the shell form of the CMD, then the command will execute in /bin/sh -c:

FROM ubuntu
CMD echo "This is a test." | wc -

If you want to run your command without a shell then you must express the command as a JSON array and give the full path to the executable. This array form is the preferred format of CMD. Any additional parameters must be individually expressed as strings in the array:

FROM ubuntu
CMD ["/usr/bin/wc","--help"]

This was exactly what we were doing! We weren't using CMD in exec form, but
in shell form. After changing this and testing it locally, we confirmed that
gitlab-workhorse is now PID 1 and that it also receives the termination
signal:

git@gitlab-webservice-default-84c68fc9c9-lzwmp:/$ ps faux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
git           65  1.0  0.0   5992  3704 pts/0    Ss   15:25   0:00 bash
git           73  0.0  0.0   8592  3256 pts/0    R+   15:25   0:00  _ ps faux
git            1  0.2  0.3 1328228 32288 ?       Ssl  15:24   0:00 gitlab-workhorse -logFile stdout -logFormat json -listenAddr 0.0.0.0:8181 -documentRoot /srv/gitlab/public -secretPath /etc/gitlab/gitlab-workhorse/secret -config /srv/gitlab/config/workhorse-config.toml
{"level":"info","msg":"shutdown initiated","shutdown_timeout_s":61,"signal":"terminated","time":"2022-04-13T15:27:57Z"}
{"level":"info","msg":"keywatcher: shutting down","time":"2022-04-13T15:27:57Z"}
{"error":null,"level":"fatal","msg":"shutting down","time":"2022-04-13T15:27:57Z"}

Ok, then we just needed to update the script to use exec and switch to the
exec form of CMD, and we would have been done, right? Almost. GitLab Workhorse
proxies all of the API, web, and Git requests, so we couldn't just make a big
change and expect everything to be OK. We had to roll this out progressively
to make sure we didn't break any existing behavior, since it affects all the
requests we get on GitLab.com. To do this, we hid it behind a feature flag so
that GitLab Workhorse is only PID 1 when the GITLAB_WORKHORSE_EXEC environment
variable is set. This allowed us to deploy the change and only enable it on a
small part of our fleet to see if any problems showed up. We were a bit more
careful here and rolled it out zone by zone in production, since we run in 3
zones. When we rolled it out in the first zone we saw all 502 errors
disappear! After fully rolling it out, we saw that the problem was fixed and
that it had no negative side effects. Hurray!

We still had one question unanswered: why was GitLab Pages still trying to
use the same connection even after the pod was removed from the Service
because it was scheduled for deletion? When we looked at Go internals, we saw
that Go reuses TCP connections if we close the response body. So even though
the pod was no longer part of the Service, we could still keep the TCP
connection open and send requests. This explains why we kept seeing 502s
while a pod was being terminated, and always from the same GitLab Pages pod.
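
As a rough sketch of that behavior (illustrative only, with a made-up address
rather than GitLab Pages' actual client code), Go's default Transport returns
a connection to its idle pool whenever the response body is fully read and
closed, and reuses it for the next request:

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    client := &http.Client{} // the default Transport keeps idle connections alive

    for i := 0; i < 3; i++ {
        // Hypothetical address for Workhorse's internal API route.
        resp, err := client.Get("http://10.0.0.5:8181/api/v4/internal/pages?host=example.io")
        if err != nil {
            fmt.Println("request failed:", err)
            continue
        }
        // Fully reading and closing the body returns the TCP connection to
        // the idle pool, so the next request reuses the same connection, even
        // if the pod behind it has since been removed from the Service.
        io.Copy(io.Discard, resp.Body)
        resp.Body.Close()
    }

    // Dropping idle connections would force a fresh dial on the next request.
    client.CloseIdleConnections()
}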

Now it’s all gone!

More things that we can explore

  1. We've made graceful termination the default for GitLab Workhorse.
  2. Audit all of our Dockerfiles that use the CMD command and fix them. We've found 10, and fixed all of them.
  3. Better readiness probe defaults for the webservice pod.
  4. Add linting for Dockerfiles.
  5. See if any of our child processes need zombie process reaping.

Takeaways

  1. We should always care about what is PID 1 in a container.
  2. Always try to use CMD ["executable","param1","param2"] in your Dockerfile.
  3. Pods are removed from the Service/Endpoint asynchronously.
  4. If you are on GKE, NEGs might be a better option for readiness probes.
  5. By default, there is a 30-second grace period between the TERM signal and the KILL signal when pods terminate. You can change the time between the signals with terminationGracePeriodSeconds.
  6. The Go http.Client will reuse the TCP connection if the response body is closed, which in this case made the issue worse.

Thank you to @igorwwwwwwwwwwwwwwwwwwww, @gsgl, @jarv, and @cmcfarland for helping me debug this problem!