Summary
This article will guide you through how to:
1. Add a tcpdump sidecar to a Dremio cluster statefulset for troubleshooting purposes
2. Run an ephemeral debug container for on-the-fly troubleshooting
Reported Issue
When network communication between a pod and a source fails, or inter-pod communication becomes an issue, it is useful to analyse pod traffic using tcpdump, which is not installed by default in the Dremio container image.
Relevant Versions
All versions.
Steps to Resolve
Note: The following procedure is intrusive: it requires a restart of all pods in the statefulset, so it can impact running workloads.
For the first example we will add the tcpdump container to executors using a rollout method.
1. Confirm all pods are operating as expected:
$ kubectl get pods -n <namespace>
2. Generate a patch file to create the tcpdump sidecar:
$ cat <<EOF >patch.yaml
spec:
  template:
    spec:
      containers:
      - name: tcpdumper
        image: docker.io/dockersec/tcpdump
EOF
3. Apply the patch file to the relevant statefulset:
$ kubectl patch statefulset dremio-executor -n <namespace> --patch "$(cat patch.yaml)"
$ kubectl rollout status statefulset/dremio-executor -n <namespace>
partitioned roll out complete: 2 new pods have been updated...
You will see each executor pod restart in turn. After the restart, run a describe against a pod to confirm the container has been added:
$ kubectl describe pod dremio-executor-0 -n <namespace> | more
Containers:
  tcpdumper:
    Container ID: containerd://3fba9fa94fd08c05a94c0a5df69767b71ccad89599cb0a4c99499e70fd7062f9
    Image: docker.io/dockersec/tcpdump
    Image ID: docker.io/dockersec/tcpdump@sha256:aaf093185359e2fc0f04002e0cf8dfa34d71c2bc2120ef550833fe882783284e
    Port: <none>
    Host Port: <none>
    State: Running
4. At this point you are ready to start reviewing tcpdump output. The container writes to STDOUT, so you can simply tail the logs for that container:
$ kubectl logs -n <namespace> pod/dremio-executor-1 -c tcpdumper -f
ptions [nop,nop,TS val 3706115649 ecr 725671183], length 47
15:08:15.135188 IP dremio-executor-0.dremio-cluster-pod.test1.svc.cluster.local.54562 > dremio-executor-1.dremio-cluster-pod.test1.svc.cluster.local.45678: Flags [P.], seq 3203020:3203041, ack 3080716, win 52899, options [nop,nop,TS val 725671186 ecr 3706115648], length 21
15:08:15.135504 IP dremio-executor-0.dremio-cluster-pod.test1.svc.cluster.local.54562 > dremio-executor-1.dremio-cluster-pod.test1.svc.cluster.local.45678: Flags [P.], seq 3203041:3203062, ack 3080716, win 52899, options [nop,nop,TS val 725671186 ecr 3706115648], length 21
15:08:15.135967 IP dremio-executor-0.dremio-cluster-pod.test1.svc.cluster.local.54562 > dremio-executor-1.dremio-cluster-pod.test1.svc.cluster.local.45678: Flags [P.], seq 3203062:3203083, ack 3080716, win 52899, options [nop,nop,TS val 725671187 ecr 3706115648], length 21
15:08:15.136263 IP dremio-executor-1.dremio-cluster-pod.test1.svc.cluster.local.45678 > dremio-executor-0.dremio-cluster-pod.test1.svc.cluster.local.54562: Flags [.], ack 3203083, win 52896, options [nop,nop,TS val 3706115650 ecr 725671186], length 0
You can use all the usual command-line tools to filter the output, or simply redirect or tee it to a file.
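As a rough sketch of that kind of filtering, the helper below keeps only the lines where a given port appears as the source or destination port. The function name filter_port and the file name capture.txt are assumptions made for illustration, and the pattern assumes the default tcpdump text format shown above (host.port > host.port:).

```shell
# filter_port PORT: read tcpdump text output on stdin and print only the lines
# where PORT is the source port (followed by " >") or the destination port
# (followed by ":"). Hypothetical helper, not part of kubectl or tcpdump.
filter_port() {
  grep -E "\.$1( >|:)"
}

# Example usage (not run here; substitute your own namespace):
#   kubectl logs -n NAMESPACE pod/dremio-executor-1 -c tcpdumper -f \
#     | filter_port 45678 | tee capture.txt
```

When following logs with -f, grep may buffer its output while writing to a pipe, so matches can appear in bursts. If your grep supports it, adding --line-buffered should make the output appear line by line.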
5. When you are done, remove the sidecar by rolling back the patch. Be aware that this will restart every pod in the statefulset:
$ kubectl rollout undo statefulset/dremio-executor -n <namespace>
statefulset.apps/dremio-executor rolled back
$ kubectl rollout status -n <namespace> statefulset/dremio-executor
Waiting for partitioned roll out to finish: 1 out of 2 new pods have been updated...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
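partitioned roll out complete: 2 new pods have been updated...
For the second example, attach an ephemeral debug container to a running pod. This does not require a restart of the pod. To open a shell in the debug container:
$ kubectl debug -n <namespace> -it pod/dremio-executor-0 --image=dockersec/tcpdump --target dremio-executor -- sh
Alternatively, run tcpdump directly, for example to capture traffic on port 2181:
$ kubectl debug -n <namespace> -it pod/dremio-executor-0 --image=dockersec/tcpdump --target dremio-executor -- tcpdump -n -i any -s0 -v port 2181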
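If you run these captures often, a small wrapper can keep the options consistent. The sketch below is a hypothetical helper (the name debug_tcpdump and the DRY_RUN variable are assumptions, not part of kubectl) that assembles the same kubectl debug command shown above from a namespace, a pod name, a target container, and a tcpdump filter expression. With DRY_RUN=1 it only prints the command so you can review it before running it.

```shell
# debug_tcpdump NAMESPACE POD TARGET_CONTAINER [tcpdump filter...]
# Hypothetical wrapper around the kubectl debug command used in this article.
# Set DRY_RUN=1 to print the command instead of executing it.
debug_tcpdump() {
  ns="$1"
  pod="$2"
  target="$3"
  shift 3
  # Rebuild the positional parameters as the full command line; the remaining
  # arguments ("$@") are passed through as the tcpdump filter expression.
  set -- kubectl debug -n "$ns" -it "pod/$pod" \
    --image=dockersec/tcpdump --target "$target" \
    -- tcpdump -n -i any -s0 -v "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example usage (not run here):
#   DRY_RUN=1 debug_tcpdump test1 dremio-executor-0 dremio-executor port 2181
```

Keeping the filter expression as trailing arguments means any valid tcpdump filter can be passed through unchanged.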
Common Challenges
Note: You may find that the rollback leaves pods stuck in Terminating. This appears to be caused by the tcpdump process not exiting when it is signalled to terminate (you will see the tcpdump sidecar continue to log as before). In this event, scale the statefulset to 0 and then back up to cleanly finish the rollback, as in the examples below.
$ kubectl scale statefulset/dremio-executor --replicas=0 -n <namespace>
statefulset.apps/dremio-executor scaled
$ kubectl scale statefulset/dremio-executor --replicas=2 -n <namespace>
statefulset.apps/dremio-executor scaled
Additional Resources
https://kubernetes.io/docs/concepts/workloads/pods/ephemeral-containers/
https://github.com/nicolaka/netshoot