Summary
This document covers a situation where Dremio is running normally and then fails without warning, leaving the UI inaccessible.
Relevant Versions
All versions of Dremio software
Reported Issue
As the number of objects in the KVStore grows, such as datasets and sources, so does the number of files the KVStore creates. Each open file must be tracked by the host OS, and the number of files a process can have open at any one time is capped by the "open files" ulimit. If this limit is set too low, Dremio cannot open all of the files backing the KVStore and will fail.
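To get a rough sense of how many files back the KVStore, a minimal sketch such as the following can help. It assumes the KVStore lives under /opt/dremio/db, which is the path seen in the log excerpt in this article; the DREMIO_DB variable and the count_files helper are hypothetical names used only for illustration.

```shell
# Count regular files under a directory (e.g. the KVStore data directory).
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Assumption: the KVStore lives under /opt/dremio/db, as in the log paths
# in this article; set DREMIO_DB if your installation differs.
DB_DIR="${DREMIO_DB:-/opt/dremio/db}"
if [ -d "$DB_DIR" ]; then
    echo "files under $DB_DIR: $(count_files "$DB_DIR")"
fi
```

Not every file is necessarily open at once, so this count is only an indication of how the store is growing relative to the configured limit.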
Troubleshooting Steps
Checking the server.log shows the following as the fatal call stack:
2024-02-14 04:44:10,212 [1a33bae4-88ac-4bbb-3434-cc0ac8ca6b00:job-submission] ERROR c.d.s.commandpool.CommandWrapper - command 1a33bae4-88ac-4bbb-3434-cc0ac8ca6b00:job-submission failed
java.lang.RuntimeException: org.rocksdb.RocksDBException: While open a file for random read: /opt/dremio/db/catalog/012345.sst: Too many open files
...
Caused by: org.rocksdb.RocksDBException: While open a file for random read: /opt/dremio/db/catalog/012345.sst: Too many open files
...
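To confirm that this error is present, the log can be searched for the message text. The sketch below is illustrative only: the log location /var/log/dremio/server.log and the DREMIO_LOG variable are assumptions, and the count_open_file_errors helper is a hypothetical name.

```shell
# Count log lines containing the "Too many open files" error.
count_open_file_errors() {
    grep -c 'Too many open files' "$1"
}

# Assumption: server.log is at /var/log/dremio/server.log; set DREMIO_LOG
# to the actual location for your installation.
LOG="${DREMIO_LOG:-/var/log/dremio/server.log}"
if [ -r "$LOG" ]; then
    echo "matching lines in $LOG: $(count_open_file_errors "$LOG")"
fi
```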
Cause
One possible cause is that the open files limit for the user running Dremio is too low. Run `ulimit -a` as that user and check the current limit for open files, also known as file descriptors. If it is still at a low default such as the 4096 shown below, it is too low and needs to be raised.
% ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-v: address space (kbytes) unlimited
-l: locked-in-memory size (kbytes) unlimited
-u: processes 2784
-n: file descriptors 4096
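The limit reported by `ulimit` in an interactive shell can differ from the limit applied to the running Dremio process. On Linux, the effective limit of a process can be read from /proc/<pid>/limits. The sketch below assumes the Dremio JVM can be located by matching "DremioDaemon" in its command line; that pattern and the soft_nofile_limit helper are assumptions for illustration.

```shell
# Print the soft "Max open files" value from a /proc/<pid>/limits file.
soft_nofile_limit() {
    awk '/^Max open files/ {print $4}' "$1"
}

# Assumption: the Dremio JVM can be found by matching "DremioDaemon"
# in its command line; adjust the pattern for your deployment.
pid="$(pgrep -f DremioDaemon 2>/dev/null | head -n 1)"
if [ -n "$pid" ] && [ -r "/proc/$pid/limits" ]; then
    echo "soft open files limit: $(soft_nofile_limit "/proc/$pid/limits")"
    # Listing /proc/<pid>/fd needs the same user as the process, or root.
    echo "descriptors in use:    $(ls "/proc/$pid/fd" 2>/dev/null | wc -l)"
fi
```

If the number of descriptors in use is close to the soft limit, the process is at risk of hitting the error shown above.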
Steps to Resolve
Per Dremio's guidance under System Requirements, please increase the open file limit on the node to 65536. Once updated, restart Dremio for the change to take effect.
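How the limit is raised depends on how Dremio is started. The sketch below lists two common Linux mechanisms in comments and then checks the limit of the current shell. The "dremio" user name is an assumption, and limit_ok is a hypothetical helper; consult your distribution's documentation for the mechanism that applies to your node.

```shell
# Assumed example entries for /etc/security/limits.conf (PAM-based logins),
# where "dremio" is the assumed service user:
#   dremio soft nofile 65536
#   dremio hard nofile 65536
# For a systemd-managed service, the equivalent unit setting is:
#   [Service]
#   LimitNOFILE=65536

# Succeed if a limit value meets the required minimum (default 65536).
limit_ok() {
    value="$1"
    required="${2:-65536}"
    [ "$value" = "unlimited" ] || [ "$value" -ge "$required" ]
}

current="$(ulimit -n)"
if limit_ok "$current"; then
    echo "open files limit $current is sufficient"
else
    echo "open files limit $current is below 65536"
fi
```

Run the check as the user that starts Dremio, since limits are applied per user session or per service.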
Next Steps
A restart of Dremio is required for the ulimit change to take effect.