Summary
A "502 Bad Gateway" error in Dremio often indicates a communication failure between Dremio and its dependencies. In this case, the root filesystem filled up, which corrupted the ZooKeeper transaction logs. Freeing disk space and then removing the corrupted ZooKeeper data restored access to the Dremio UI.
Reported Issue
- Users cannot access the Dremio UI and receive a "502 Bad Gateway" error.
- The Dremio UI remains unresponsive even after the dremio service is restarted.
Relevant Versions
Dremio AWSE
Troubleshooting Steps
1. Identify and resolve the full filesystem issue
- Check disk usage with df -h
- Locate and delete large temporary files (.pcap files in this case). A full filesystem prevents services such as ZooKeeper (ZK) from writing transaction logs.
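The disk check in step 1 can be sketched as a small shell helper. This is an illustrative sketch rather than part of the original procedure: the function name largest_files is hypothetical, and it assumes GNU find (for -printf), as found on typical Linux hosts.

```shell
# Print the N largest regular files under a directory as "<bytes><TAB><path>",
# largest first. -xdev keeps the search on one filesystem, so scanning /
# does not wander into other mounts.
# $1: directory to scan, $2: number of files to list (default 10)
largest_files() {
  dir="$1"
  count="${2:-10}"
  find "$dir" -xdev -type f -printf '%s\t%p\n' 2>/dev/null \
    | sort -rn \
    | head -n "$count"
}
```

For example, running df -h / and then largest_files / 20 as root lists the 20 largest files on the root filesystem. Review the list before deleting anything.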
2. Restart services
- Restart the dremio service and check its status
- If the dremio service is active but the Dremio UI is still unresponsive, investigate ZooKeeper logs (critical for cluster coordination)
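The decision in step 2 can be sketched as a helper that combines the service state with the HTTP status returned by the UI. This is a sketch under assumptions: the function name classify_ui_state is hypothetical, the systemd unit is assumed to be named dremio, and the UI is assumed to listen on port 9047 of the coordinator (Dremio's default web port). Adjust both to your deployment.

```shell
# Suggest the next step from the service state and the UI's HTTP status.
# $1: output of `systemctl is-active <unit>` (e.g. active, inactive, failed)
# $2: HTTP status code from the UI; curl reports 000 when nothing answers
classify_ui_state() {
  case "$1:$2" in
    active:200) echo "ok" ;;
    active:*)   echo "service active, UI unhealthy: check ZooKeeper logs" ;;
    *)          echo "service not active: check dremio service logs" ;;
  esac
}

# Usage on the coordinator (unit name and URL are assumptions):
#   state=$(systemctl is-active dremio)
#   code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:9047/)
#   classify_ui_state "$state" "$code"
```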
3. Investigate ZooKeeper issues
- Check the logs to understand why ZooKeeper was failing. For an AWSE deployment, check server.log and the journalctl output.
- In this case, the logs showed a partial transaction error, as shown below.
server.log output:
2025-05-16 06:55:19,380 [main] INFO c.d.dac.resource.AwsBackupService - Scheduling auto backup, every 1440 minutes and retain for 720 hours
2025-05-16 06:55:19,585 [main] INFO c.d.s.coordinator.zk.ZKClusterClient - Connect: 10.25.66.252:2181, zkRoot: , clusterId: dremio
2025-05-16 06:55:19,682 [main] INFO c.d.s.coordinator.zk.ZKClusterClient - Creating new Zookeeper client with arguments: 10.25.66.252:2181, 90000, false.
2025-05-16 06:55:19,721 [main] INFO c.d.s.coordinator.zk.ZKClusterClient - Starting ZKClusterClient, ZK_TIMEOUT:5000, ZK_SESSION_TIMEOUT:90000, ZK_RETRY_MAX_DELAY:300000, ZK_RETRY_UNLIMITED:true, ZK_RETRY_LIMIT:-1, CONNECTION_HANDLE_ENABLED:false, SUPERVISOR_INTERVAL:30000, SUPERVISOR_READ_TIMEOUT:10000, SUPERVISOR_MAX_FAILURES:5
ZK service failing:
partial transaction error message
This error indicated ZooKeeper data corruption.
4. Fix ZooKeeper data corruption
- Back up and delete the corrupted files in ZooKeeper’s data directory.
5. Restart ZooKeeper and Dremio
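The backup step can be sketched as follows. This is a minimal sketch, not the exact commands used in this case: the function name backup_zk_data and both paths are hypothetical, and the location of the ZooKeeper data directory depends on the deployment (for a standalone ZooKeeper it is the dataDir setting in zoo.cfg). Write the backup to a filesystem that has free space, not to the one that filled up.

```shell
# Copy a ZooKeeper data directory into a timestamped backup directory and
# print the backup location. The source directory is left untouched.
# $1: ZooKeeper data directory (hypothetical; confirm it for your deployment)
# $2: parent directory for backups, on a filesystem with free space
backup_zk_data() {
  src="$1"
  dest="$2/zk-backup-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest" && cp -a "$src/." "$dest/" && echo "$dest"
}

# Only after verifying the backup, remove the files identified as corrupted.
# ZooKeeper keeps snapshot.* and log.* files under the version-2 subdirectory
# of its data directory.
```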
- Restart both the ZooKeeper and Dremio services in the following order so that they come back online cleanly:
1. Stop the Dremio service
2. Stop the ZK service
3. After deleting the corrupted ZK files, start the ZK service first and confirm that it is running correctly.
4. Finally, start the Dremio service to restore access to the UI.
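The stop and start order above can be sketched with systemctl. The unit names are assumptions (they are passed in as arguments because the article does not name them), and the function names stop_stack and start_stack are hypothetical. Run as root or adapt for sudo.

```shell
# Stop Dremio before ZooKeeper so Dremio is not left running without its
# coordination service.
# $1: Dremio unit name, $2: ZooKeeper unit name (both assumptions)
stop_stack() {
  systemctl stop "$1" || return 1
  systemctl stop "$2"
}

# Start ZooKeeper first and start Dremio only if ZooKeeper reports active.
# $1: Dremio unit name, $2: ZooKeeper unit name (both assumptions)
start_stack() {
  systemctl start "$2" || return 1
  systemctl is-active --quiet "$2" || return 1
  systemctl start "$1"
}

# Usage (unit names are placeholders):
#   stop_stack dremio zookeeper
#   ... back up and remove the corrupted ZooKeeper files ...
#   start_stack dremio zookeeper
```

Splitting the sequence into two functions leaves room for the file cleanup between stopping and starting. If start_stack returns non-zero, check the ZooKeeper logs before retrying.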
Cause
The root filesystem was completely full. As a result, ZooKeeper could not reliably read snapshot data or manage transactions, which produced the partial transaction error and ultimately made the Dremio UI inaccessible.
Resolution
1. Delete large temporary files to free up disk space and resolve the full filesystem issue.
2. Back up and remove corrupted files from ZooKeeper’s data directory.
3. Stop the Dremio service.
4. Stop the ZooKeeper service.
5. Start the ZooKeeper service first and confirm that it has started successfully.
6. Start the Dremio service to bring the system back online and restore UI access.