Summary
This article helps troubleshoot the "java.lang.RuntimeException: org.rocksdb.RocksDBException: IOError(StaleFile)" exception reported in the Dremio logs.
Reported Issue
One of the following errors is displayed in the Dremio logs:
[dremio-general-*] ERROR c.d.s.reflection.ReflectionManager - Reflection manager failed
java.lang.RuntimeException: org.rocksdb.RocksDBException: IOError(StaleFile)
at com.dremio.datastore.RocksDBStore.put(RocksDBStore.java)
at com.dremio.datastore.RocksDBStore.put(RocksDBStore.java)
at com.dremio.datastore.ByteStoreManager$StoreMetadataManagerImpl.setLatestTransactionNumber(ByteStoreManager.java)
at com.dremio.datastore.ByteStoreManager$StoreMetadataManagerImpl.setLatestTransactionNumber(ByteStoreManager.java)
at com.dremio.datastore.CoreStoreProviderImpl$1.onClose(CoreStoreProviderImpl.java)
at com.dremio.datastore.indexed.CommitWrapper$CommitCloser.close(CommitWrapper.java)
at com.dremio.datastore.indexed.LuceneSearchIndex.commit(LuceneSearchIndex.java)
at com.dremio.datastore.indexed.LuceneSearchIndex$CommitterThread.commitLoop(LuceneSearchIndex.java)
at com.dremio.datastore.indexed.LuceneSearchIndex$CommitterThread.access$000(LuceneSearchIndex.java)
at com.dremio.datastore.indexed.LuceneSearchIndex$CommitterThread$1.run(LuceneSearchIndex.java)
at java.lang.Thread.run(Thread.java)
Caused by: org.rocksdb.RocksDBException: IOError(StaleFile)
at org.rocksdb.RocksDB.put(Native Method)
at org.rocksdb.RocksDB.put(RocksDB.java)
at com.dremio.datastore.RocksDBStore.put(RocksDBStore.java)
... 10 common frames omitted
[scheduler-*] WARN c.d.s.s.LocalSchedulerService - Execution of task com.dremio.service.jobs.LocalJobsService$JobResultsCleanupTask@* failed
java.lang.RuntimeException: org.rocksdb.RocksDBException: IOError(StaleFile)...
[scheduler-*] WARN c.d.s.s.LocalSchedulerService - Execution of task com.dremio.service.reflection.voting.VotingServiceImpl$AutomaticReflectionsEnabler@* failed
java.lang.RuntimeException: org.rocksdb.RocksDBException: IOError(StaleFile)...
[QUERY_UUID:job-submission] ERROR c.d.s.commandpool.CommandWrapper - command QUERY_UUID:job-submission failed
java.lang.RuntimeException: org.rocksdb.RocksDBException: IOError(StaleFile)...
Relevant Versions
This can happen in all Dremio releases.
Troubleshooting Steps
1. Check the KV Store (RocksDB) directory (the "data" directory; the exact path is configured in the dremio.conf file) and make sure it is accessible to the user that runs Dremio.
2. Check that the disk is not full on the master coordinator.
3. Check the OS logs (/var/log/messages or syslog) for any connectivity issues with the storage.
4. Check the disk that stores the "data" directory for bad blocks.
5. Check if there are too many open files on the master coordinator.
6. Check if another instance of Dremio is running, or if dremio-admin is running at the same time.
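Steps 1 and 2 can be partly automated. The following is a minimal Python sketch, not a Dremio tool: the function name check_data_dir and the 10% free-space threshold are assumptions chosen for illustration, and the path passed in must be the "data" directory configured in your own dremio.conf. Run it as the same OS user that runs Dremio so that the permission check is meaningful.

```python
import os
import shutil


def check_data_dir(path, min_free_fraction=0.10):
    """Return a list of problems found for the given directory (empty if none).

    min_free_fraction is an illustrative threshold, not a Dremio requirement.
    """
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]

    problems = []

    # Step 1: the current user needs read, write and traverse access.
    if not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        problems.append(f"{path} is not fully accessible to the current user")

    # Step 2: compare free space on the underlying file system to the threshold.
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total if usage.total else 0.0
    if free_fraction < min_free_fraction:
        problems.append(
            f"only {free_fraction:.1%} free on the file system holding {path}"
        )

    return problems
```

Calling check_data_dir("/path/to/data") with the real directory returns an empty list when both checks pass. The sketch does not cover bad blocks, open file counts, or competing processes (steps 4 to 6), which still need to be checked with OS tools.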
Cause
If you are using a POSIX file system (e.g. NFS), the error means that the file system returned ESTALE (stale file handle): https://man7.org/linux/man-pages/man3/errno.3.html
There are two likely causes. The first is a conflict over the lock on the KV Store: only one process can hold it at a time, so multiple processes running against the same store will conflict. The second is a problem accessing the files in the "data" directory, which is most common with NFS file systems.
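To see whether files in the "data" directory currently return ESTALE, you can stat each one and inspect the error code. The sketch below is illustrative only: find_stale_files is a hypothetical helper, and the stat parameter exists solely so that the check can be exercised without a real stale NFS handle. Note that os.walk skips directories it cannot list, so a stale directory handle may not be reported.

```python
import errno
import os


def find_stale_files(root, stat=os.stat):
    """Return the paths under root whose stat call fails with ESTALE."""
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                stat(full_path)
            except OSError as exc:
                # Only report stale handles; other errors are ignored here.
                if exc.errno == errno.ESTALE:
                    stale.append(full_path)
    return stale
```

An empty result does not rule out the problem, because stale handles on NFS can be intermittent. A non-empty result points to the storage layer rather than to a lock conflict.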
Steps to Resolve
Follow the troubleshooting steps above.
Additional Resources
https://www.dremio.com/wp-content/uploads/2023/12/Cleanup-KV-Store.pdf
https://docs.dremio.com/current/get-started/cluster-deployments/customizing-configuration/dremio-conf/dist-store-config/