Summary
This article explains how to recover when a Dremio master pod in a Kubernetes deployment loses its persistent volume claim (PVC) mapping and restarts with a newly generated persistent volume (PV) and PVC.
Reported Issue
In a Kubernetes deployment, the Dremio master pod may lose its correct PV/PVC mapping. This can occur due to incorrect human interaction at the storage layer. The master pod will restart and attempt to mount its PVC, but if no pre-existing PVC with the correct label is found, the StatefulSet will auto-generate a new PV and PVC pair. As a result, the master pod will be reinitialized with an empty disk, rather than mounting all existing KVStore data.
Note: this is an issue at the Kubernetes layer and is external to the Dremio product suite.
Relevant Versions
All Kubernetes Dremio clusters where the Dremio master loses the correct PV/PVC mapping.
Troubleshooting Steps
When the PVC mapping has been lost, a restarted dremio-master pod will initialize with a blank KVStore and configuration.
You can validate that the old PV still exists by listing the PVs:
kubectl get pv
You should see two PVs: the original PV with status "Released" and the newly generated volume with status "Bound".
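If the cluster has many volumes, you can narrow the listing to the volumes associated with the master's claim (the claim name below is the default used elsewhere in this article; adjust it if your deployment differs):
kubectl get pv | grep dremio-master-volume-dremio-master-0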
Cause
The issue occurs when the Dremio master pod loses its correct PV/PVC mapping, often due to incorrect human interaction at the storage layer.
Steps to Resolve
1. Ensure the original PV still exists and its reclaim policy is set to "Retain". To change the reclaim policy:
kubectl patch pv <pv name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
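To confirm the change took effect, you can read the policy back (substitute the original PV's name for the placeholder):
kubectl get pv <pv name> -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'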
2. Export the new PVC and original PV definitions to YAML files.
kubectl get pvc dremio-master-volume-dremio-master-0 -o yaml > /tmp/pvc.yaml
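The original PV definition can be exported in the same way as a backup; the PV name below is the example name used in step 3, so substitute your own:
kubectl get pv pvc-ca6f2a7b-4c6c-43be-bdd1-3ecfd51d3727 -o yaml > /tmp/pv.yaml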
3. Edit the original PV:
kubectl edit pv pvc-ca6f2a7b-4c6c-43be-bdd1-3ecfd51d3727 -o yaml
Remove the claimRef block and ensure the reclaim policy is set to "Retain".
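For reference, the relevant part of the PV definition looks roughly like the sketch below (the claim name and namespace are illustrative and will vary with your deployment):
spec:
  persistentVolumeReclaimPolicy: Retain    # keep or set this to Retain
  claimRef:                                # remove this entire block
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: dremio-master-volume-dremio-master-0
    namespace: default
    uid: <uid of the old claim>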
You should see the PV STATUS switch to "Available":
kubectl get pv
4. Edit the PVC YAML exported at step 2: set spec.volumeName to the name of the original PV, and remove the metadata "uid" property.
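A minimal sketch of the edited PVC, assuming the default claim name, an illustrative namespace and size, and the example PV name from step 3; keep all other fields exactly as they appear in your export:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dremio-master-volume-dremio-master-0
  namespace: default                        # adjust to your namespace
  # metadata.uid removed
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi                        # keep the size from your export
  volumeName: pvc-ca6f2a7b-4c6c-43be-bdd1-3ecfd51d3727   # name of the original PV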
5. Scale down the Dremio master StatefulSet to 0 replicas.
kubectl scale statefulsets dremio-master --replicas=0
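Before deleting the PVC in the next step, wait for the master pod to terminate; you can watch the pods with:
kubectl get pods -w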
6. Delete the new PVC:
kubectl delete pvc dremio-master-volume-dremio-master-0
7. Apply the PVC file you updated at step 4 to redefine the PVC with the correct volume. The original PV's STATUS should then change to "Bound".
$ kubectl apply -f /tmp/pvc.yaml
persistentvolumeclaim/dremio-master-volume-dremio-master-0 created
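To confirm the claim has bound to the original volume rather than a newly provisioned one, check the VOLUME column of:
kubectl get pvc dremio-master-volume-dremio-master-0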
8. Scale the dremio-master StatefulSet back to 1 replica. Once the pod is online, exec onto the pod and confirm you can see the original log files, archived logs, etc. You should also see the original KVStore files under the data/db/catalog location.
kubectl scale statefulsets dremio-master --replicas=1
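For example, assuming the default data mount path used by the Dremio Helm charts (adjust the path if your deployment uses a different location):
kubectl exec -it dremio-master-0 -- ls -l /opt/dremio/data/db/catalog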
9. Once you are happy the procedure has completed successfully, scale the executor StatefulSet down and back up; otherwise you will receive the error "Default engine is not online" when submitting any queries.
kubectl scale statefulsets dremio-executor --replicas=0
....
kubectl scale statefulsets dremio-executor --replicas=<previous count>
Note that you will need to repeat this step for all executor engines.
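If you are unsure of an executor StatefulSet's previous replica count, you can read it before scaling down; for example, for the default executor StatefulSet:
kubectl get statefulsets dremio-executor -o jsonpath='{.spec.replicas}'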
10. Once you are happy your cluster is back and stable, delete the new blank PV that was orphaned when the new PVC was removed at step 6.
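For example, substituting the name of the auto-generated PV:
kubectl delete pv <new pv name>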