Summary
This article helps troubleshoot the "waited too long in the command pool" warning reported in the Dremio logs.
Reported Issue
Under high workload, Dremio logs can report warnings like:
[QUERY_UUID:job-submission] WARN c.d.service.jobs.LocalJobsService - Job submission QUERY_UUID waited too long in the command pool: wait was 123456ms
[QUERY_UUID:job-submission] WARN c.d.service.jobs.LocalJobsService - Work submission QUERY_UUID waited too long in the command pool: wait was 123456ms
Relevant Versions
This can happen in all Dremio releases.
Troubleshooting Steps
1. Check the queries.json and the metadata refresh log for the query IDs reported in the warning.
2. Check if those jobs are for inline metadata refreshes. The metadata inline refresh (which is an expensive operation) takes place when the metadata for the datasets becomes stale. If multiple datasets have expired in the same time, then all the command pool slots could get occupied by queries doing inline refreshes. If only one of them is causing the issue, consider raising the metadata expiry time for that dataset. For bigger tables, one could move the refresh to ALTER PDS instead, to reduce the background refresh time.
3. Check if those jobs have excessive planning times (generally over one minute). If they do, then that's the new issue that needs to be troubleshooted. Planning time is significantly impacted by thread context switching. UI threads, background tasks (ex. reflection management and cache maintenance), bulk catalog API threads, etc., will for certain result in context switching on the planning threads.
4. Check the Disk I/O activity. An active KVStore (for example when a backup, cleanup or reindexing takes place) can cause other queries to fight for the same command pool. Also, a slow Disk, like a NAS, is a common cause for such issues.
5. Check if all the query that wait in the command pool have something in common (e.g. same source, storage type, etc.). If they do, troubleshoot the common item for performance issues.
6. Check if an increase in workload took place when these warnings were reported. Jobs waiting in the command pool can also indicate there are too many jobs being submitted in a given period of time for the coordinator to keep all the jobs progressing.
7. Is the support key "dremio.exec.rpc.user.server.threads" configured to a different value than the default one? If it's higher than 1 (the default), that will mean fewer threads available for the regular tasks that run on coordinator, like the command pool, which in turn can generate the warning discussed here. An increased command pool size, on the other hand, does not increase query throughput. Even if we plan more queries in parallel than CPUs, each query ends up taking longer to plan.
8. Use the JMX metrics to monitor the active threads in the pool. Additionally, one can use Prometheus to create a graph and alerts for high number of threads in the pool.
The JMX metrics are documented here: https://docs.dremio.com/software/advanced-administration/monitoring/#monitoring-jmx-metrics
After doing so, one can use Jconsole (which ships with java and is run from the command line) and attach to the coordinator <host/ip>:<port> (the port is the one specified for jxmremote when configuring the JMX metrics). Once connected, one can inspect the individual mbean counters to identify the current value. Jconsole is great to get a quick look at the metrics at a given point in time, but ideally though you would want to to export these metrics to a 3rd party tool such as Prometheus or Newrelic, to graph the metrics over time and potentially alert on them.
Cause
The "waited too long in the command pool" warning message is an indication that the command pool (threads, num CPUVPs -1) were exhausted and there were none available to process new work.
Steps to Resolve
Identify the root cause using the steps described above, then work on addressing the underlying problem.
Tips & Tricks / Best Practices / Recommendations / FAQ
Dremio can only plan a limited number of queries concurrently at any one time ( number of cores -1). So if you have 50 concurrent queries and 8 cpu cores, the coordinator will do the planning for 7 queries and the remaining 43 queries will wait in the command pool. If queries that are currently being planned are slow for different reasons, one will start see a long "Command Pool Wait" time. If the queries spend too long in the command pool, they will eventually time out.
The coordinator uses a bounded thread pool called command pool to limit the number of concurrent cpu/heap intensive tasks during planning, that's why queries may have to wait to get scheduled on the command pool if there are many concurrent queries. This wait time is captured in poolWaitTime (in Milliseconds)
The command pool is used in the following stages of the query flow:
METADATA RETRIEVAL
PLANNING
EXECUTION PLANNING