Summary
On Kubernetes deployments, DNS service name resolution can fail at the cluster level, leading to connectivity issues between the ZooKeeper (ZK) pods and the Dremio pods.
This article discusses troubleshooting steps to verify whether any issues are present.
Note: this is an issue at the Kubernetes layer and is external to the Dremio product suite.
Reported Issue
If a customer experiences a down cluster where the master and executor pods either (a) cannot connect to ZooKeeper, or (b) are stuck in the wait-for-zookeeper init container during startup, there may be a DNS resolution issue on the underlying Kubernetes cluster. Example scenarios include a newly initialised cluster, or a cloud Kubernetes vendor migrating the cluster to new worker nodes.
Relevant Versions
All Dremio clusters on Kubernetes where the Dremio pods are unable to reach the ZK pods.
Troubleshooting Steps
1. For running ZK pods, you can use netcat (nc) to verify connectivity to the ZK services. First list the services to confirm their names...
root@ubuntu:~# kubectl get service -n inst1
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
dremio-cluster-pod ClusterIP None <none> 9999/TCP 158d
zk-cs ClusterIP 10.152.183.223 <none> 2181/TCP 158d
dremio-client LoadBalancer 10.152.183.154 10.0.0.200 31010:30468/TCP,9047:31638/TCP,32010:31544/TCP 158d
zk-hs ClusterIP None <none> 2181/TCP,2888/TCP,3888/TCP 158d
root@ubuntu:~# kubectl exec -n <namespace> -it zk-0 -- bash
zookeeper@zk-0:/apache-zookeeper-3.8.0-bin$ nc -v zk-hs 2181
Connection to zk-hs (10.1.243.228) 2181 port [tcp/*] succeeded!
^C
zookeeper@zk-0:/apache-zookeeper-3.8.0-bin$ nc -v zk-cs 2181
Connection to zk-cs (10.152.183.223) 2181 port [tcp/*] succeeded!
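The nc checks above can be scripted so each ZooKeeper service is tested in one pass. A minimal sketch, assuming the service names zk-hs and zk-cs from the output above and an nc that supports the -z (scan) and -w (timeout) options, as in the BusyBox and OpenBSD variants (the helper name is hypothetical):

```shell
# Hypothetical helper: verify a TCP port is reachable via netcat.
# -z performs a connect scan without sending data; -w bounds the attempt.
zk_reachable() {
  host="$1"
  port="${2:-2181}"
  if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
    echo "OK: ${host}:${port} reachable"
  else
    echo "FAIL: ${host}:${port} unreachable"
    return 1
  fi
}

# Example usage (run inside a zk pod):
# for svc in zk-hs zk-cs; do zk_reachable "$svc" 2181; done
```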
2. For running pods, you can also test from the coordinator/executor pods using curl:
root@ubuntu:~# kubectl exec -n <namespace> -it dremio-master-0 -- bash
Defaulted container "dremio-master-coordinator" out of: dremio-master-coordinator, start-only-one-dremio-master (init), wait-for-zookeeper (init), chown-data-directory (init), upgrade-task (init), generate-ui-keystore (init)
dremio@dremio-master-0:/opt/dremio$ curl -v telnet://zk-hs:2181
* Trying 10.1.243.228:2181...
* Connected to zk-hs (10.1.243.228) port 2181 (#0)
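Because curl exits non-zero when the TCP connection fails, the same check can run non-interactively. A sketch under the same assumptions as above (hypothetical helper name; assumes curl is present in the Dremio image, as in the session shown):

```shell
# Hypothetical helper: use curl's telnet mode purely as a TCP connect test.
# --max-time bounds the attempt so the check cannot hang on a dead address.
check_zk_from_dremio() {
  host="${1:-zk-hs}"
  port="${2:-2181}"
  if curl -s --max-time 5 "telnet://${host}:${port}" </dev/null; then
    echo "OK: connected to ${host}:${port}"
  else
    echo "FAIL: could not connect to ${host}:${port}"
    return 1
  fi
}
```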
3. For starting pods stuck in the wait-for-zookeeper init container phase, check that container's logs...
root@ubuntu:~# kubectl logs -n <namespace> dremio-master-0 -c wait-for-zookeeper
ping: bad address 'zk-hs'
Waiting for Zookeeper to be ready
ping: bad address 'zk-hs'
Waiting for Zookeeper to be ready
ping: bad address 'zk-hs'
Waiting for Zookeeper to be ready
ping: bad address 'zk-hs'
Waiting for Zookeeper to be ready
ping: bad address 'zk-hs'
Waiting for Zookeeper to be ready
ping: bad address 'zk-hs'
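The repeating log lines suggest the init container polls in a loop until the ZooKeeper service name responds. A hypothetical reconstruction of such a wait loop (the actual Helm chart script may differ; "ping: bad address" is BusyBox ping failing DNS resolution, so the loop never exits while cluster DNS is broken):

```shell
# Hypothetical reconstruction of a wait-for-zookeeper style init loop.
# While 'zk-hs' does not resolve, ping fails and the loop repeats forever.
wait_for_zookeeper() {
  host="${1:-zk-hs}"
  until ping -c 1 "$host" >/dev/null 2>&1; do
    echo "Waiting for Zookeeper to be ready"
    sleep 2
  done
  echo "Zookeeper is reachable"
}
```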
4. For starting pods, you may also see one ZK pod initialise while the remaining ZK pods are stuck restarting. Run a describe on an affected pod:
root@ubuntu:~# kubectl describe pod zk-1 -n inst1
Sample output where the issue is seen:
At this point you have identified that the issue is with hostname resolution. As a quick final check, ensure you see running DNS pods in the kube-system namespace...
root@ubuntu:~# kubectl get pods -n kube-system|grep -i dns
coredns-autoscaler-569f6ff56-tvqxv 1/1 Running 0 20d
coredns-fb6b9d95f-hllq2 1/1 Running 0 20d
coredns-fb6b9d95f-xsftw 1/1 Running 0 20d
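Rather than eyeballing the grep output, you can count the Running CoreDNS pods and fail fast when none are healthy. A sketch (hypothetical helper; assumes the conventional coredns name prefix shown above):

```shell
# Hypothetical helper: count Running coredns pods in kube-system.
# Zero Running DNS pods is a strong sign cluster DNS itself is down.
count_running_dns_pods() {
  kubectl get pods -n kube-system 2>/dev/null \
    | awk '/^coredns/ && $3 == "Running" { n++ } END { print n + 0 }'
}

# Example usage:
# [ "$(count_running_dns_pods)" -gt 0 ] || echo "No Running DNS pods found"
```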
Cause
The problem is with the Kubernetes cluster's internal hostname (DNS) resolution, not with Dremio itself.
Steps to Resolve
At this point the customer should open a case with their cloud hosting provider, since the fix lies in the Kubernetes/DNS layer.
Tips & Tricks
The Kubernetes documentation has an excellent DNS troubleshooting guide, with steps to launch a non-intrusive dnsutils pod on the cluster. This pod provides the tools needed to check resolution further. All steps are on the page; to install the pod, apply the YAML directly from its URL:
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
Once the pod has launched, open a shell in the usual way:
kubectl exec -it dnsutils -n <namespace> -- bash
You then have both nslookup and dig available. To check external resolution:
root@dnsutils:/# nslookup google.com
Server: 10.0.0.10
Address: 10.0.0.10#53
Non-authoritative answer:
Name: google.com
Address: 172.217.168.238
root@dnsutils:/# dig google.com
; <<>> DiG 9.9.5-9+deb8u19-Debian <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5334
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com. IN A
;; ANSWER SECTION:
google.com. 25 IN A 172.217.168.238
;; Query time: 20 msec
;; SERVER: 10.0.0.10#53(10.0.0.10)
;; WHEN: Tue Mar 12 11:09:29 UTC 2024
;; MSG SIZE rcvd: 65
To check internal resolution:
root@dnsutils:/# nslookup zk-hs
Server: 10.0.0.10
Address: 10.0.0.10#53
Name: zk-hs.default.svc.cluster.local
Address: 10.244.3.10
root@dnsutils:/# dig zk-hs
; <<>> DiG 9.9.5-9+deb8u19-Debian <<>> zk-hs
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 62213
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;zk-hs. IN A
;; AUTHORITY SECTION:
. 30 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2024031200 1800 900 604800 86400
;; Query time: 13 msec
;; SERVER: 10.0.0.10#53(10.0.0.10)
;; WHEN: Tue Mar 12 11:10:37 UTC 2024
;; MSG SIZE rcvd: 109
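From the dnsutils pod, the external and internal checks above can be combined into one pass that flags exactly which names fail to resolve. A minimal sketch, assuming nslookup behaves as in the sessions above (helper name is hypothetical):

```shell
# Hypothetical helper: run nslookup for each name and report pass/fail.
# An external name (google.com) failing points at upstream DNS;
# an internal service name (zk-hs) failing points at cluster DNS/search domains.
check_resolution() {
  rc=0
  for name in "$@"; do
    if nslookup "$name" >/dev/null 2>&1; then
      echo "RESOLVES: $name"
    else
      echo "NXDOMAIN/FAIL: $name"
      rc=1
    fi
  done
  return $rc
}

# Example usage (inside dnsutils):
# check_resolution google.com zk-hs zk-hs.<namespace>.svc.cluster.local
```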
Again, in the examples above resolution is working: nslookup resolves both names, and the NXDOMAIN from dig for the short name zk-hs is expected, because dig does not apply the resolv.conf search domains by default (use dig +search zk-hs, or the FQDN zk-hs.<namespace>.svc.cluster.local). Genuine DNS problems will also surface as errors on the nslookup checks. When you have completed these steps, don't forget to delete the dnsutils pod:
kubectl delete pod dnsutils -n <namespace>
Additional Resources
netshoot - Kubernetes/Docker troubleshooting pod
nettools - Kubernetes network troubleshooting pod