Summary
This article explains how to recover when the Dremio service fails to start because it hangs while connecting to a Zookeeper server whose database is corrupted. It applies to deployments where Dremio uses external Zookeeper servers.
Reported Issue
The Dremio service uses the Zookeeper database to store service states. If the Zookeeper server is unhealthy, the Dremio Zookeeper client retries indefinitely (note ZK_RETRY_UNLIMITED:true in the line below), so startup never completes. The coordinator's server.log stops at the line: "INFO c.d.s.coordinator.zk.ZKClusterClient - Starting ZKClusterClient, ZK_TIMEOUT: 5000, ZK_SESSION_TIMEOUT:90000, ZK_RETRY_MAX_DELAY:300000, ZK_RETRY_UNLIMITED:true, ZK_RETRY_LIMIT:-1, CONNECTION_HANDLE_ENABLED:false"
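To confirm this symptom quickly, one option is to check whether the last line of the coordinator log is the ZKClusterClient startup message. The following is a minimal sketch, not an official Dremio tool: the function name is_stuck_at_zk is hypothetical, and the log path in the usage comment is an assumption that depends on your installation.

```shell
# Hypothetical helper: returns 0 if the last non-empty line of the given
# coordinator log is the ZKClusterClient startup message, 1 otherwise.
is_stuck_at_zk() {
  log_file="$1"
  # Take the last non-empty line of the log.
  last_line=$(grep -v '^[[:space:]]*$' "$log_file" | tail -n 1)
  case "$last_line" in
    *"Starting ZKClusterClient"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage (the path is an assumption; use your coordinator's log location):
# is_stuck_at_zk /var/log/dremio/server.log && echo "Coordinator is waiting on Zookeeper"
```

If the function keeps returning 0 over several minutes, the coordinator has not progressed past the Zookeeper connection.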
Relevant Versions
Any Zookeeper version
Troubleshooting Steps
- SSH into the Zookeeper server
- Check the Zookeeper service status with "ps -ef | grep -i zookeeper". If the PID changes frequently, the service is restarting repeatedly, so check the service logs.
- Go to the Zookeeper log directory and, in the zookeeper.log file, look for the error message: "ERROR [main:Util@239] - Last transaction was partial."
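The checks above can be sketched as two small shell helpers. This is an illustrative sketch only: the function names sample_zk_pids and has_partial_txn_error are hypothetical, and the log path in the usage comment is an assumption that depends on how Zookeeper was installed.

```shell
# Hypothetical helper: print the PID(s) of Zookeeper processes several times.
# If the printed PID differs between samples, the service is restarting.
# Usage: sample_zk_pids [samples] [interval_seconds]
sample_zk_pids() {
  samples="${1:-3}"
  interval="${2:-5}"
  i=0
  while [ "$i" -lt "$samples" ]; do
    # Same check as the ps command above; '[z]' keeps grep from matching itself.
    # Column 2 of "ps -ef" output is the PID.
    ps -ef | grep -i '[z]ookeeper' | awk '{print $2}'
    i=$((i + 1))
    sleep "$interval"
  done
}

# Hypothetical helper: returns 0 if the given Zookeeper log file contains
# the partial-transaction error, 1 otherwise.
has_partial_txn_error() {
  grep -q "Last transaction was partial" "$1"
}

# Example usage (the log path is an assumption):
# sample_zk_pids 3 5
# has_partial_txn_error /var/log/zookeeper/zookeeper.log && echo "Transaction log is corrupted"
```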
Cause
The error message "ERROR [main:Util@239] - Last transaction was partial" indicates that the last transaction in the Zookeeper database transaction log file is incomplete. As a result, the Zookeeper service crashes repeatedly, which affects any application that tries to store znodes in it.
Steps to Resolve
- Back up the Zookeeper DATA directory (the dataDir configured in zoo.cfg)
- Delete the DATA directory
- Restart the Zookeeper service
- Once the Zookeeper service is stable, restart the Dremio service
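The backup and delete steps can be sketched as a single shell function that removes the directory only after the archive has been written. This is a minimal sketch under stated assumptions: the function name backup_and_clear_zk_data and the paths in the usage comment are hypothetical, and the Zookeeper process is assumed to be stopped before it runs. In a multi-server ensemble, the myid file also lives in the data directory, so it would need to be restored from the backup before restarting.

```shell
# Hypothetical helper: archive a Zookeeper data directory, then remove it.
# Prints the path of the archive on success.
# Usage: backup_and_clear_zk_data <data_dir> <backup_dir>
backup_and_clear_zk_data() {
  data_dir="$1"
  backup_dir="$2"
  [ -d "$data_dir" ] || { echo "no such directory: $data_dir" >&2; return 1; }
  mkdir -p "$backup_dir" || return 1
  archive="$backup_dir/zk-data-$(date +%Y%m%d%H%M%S).tar.gz"
  # Archive first; delete only if the archive was written successfully.
  tar -czf "$archive" -C "$(dirname "$data_dir")" "$(basename "$data_dir")" || return 1
  rm -rf "$data_dir"
  echo "$archive"
}

# Example usage (both paths are assumptions; use the dataDir from your zoo.cfg):
# backup_and_clear_zk_data /var/lib/zookeeper /var/backups/zookeeper
# Afterwards, restart Zookeeper and then Dremio using whatever service
# manager your installation uses.
```

Keeping the archive outside the data directory means the original files can be restored if the restart does not resolve the problem.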
Additional Resources
https://stackoverflow.com/questions/44155045/unable-to-start-zookeeper-last-transaction-was-partial