Summary
This article describes a known issue that can cause the Dremio coordinator to become unresponsive.
Reported Issue
The coordinator becomes unresponsive, and calls to the web UI either return a 403 error or time out.
Relevant Versions
Dremio Enterprise versions up to and including:
- 22.2.3
- 23.2.4
- 24.2.6
Troubleshooting Steps
There is a known issue, tracked in internal Jira ticket DX-83120, in which an infinite loop can occur, spawning additional threads until the process runs out of resources.
The following warning may be repeated in the logs over and over for the same job ID:
2023-09-29 12:50:15,141 [grpc-default-executor-116593] WARN c.d.service.jobs.LocalJobsService - Unable to fetch query profile for JobId{id=1ae95180-58fd-1810-a051-0477bfba2000} on Node NodeEndpoint{address=dremio-master-0.dremio-cluster-pod.dp-dremio.svc.cluster.local, userPort=31010, fabricPort=45678, roles=Roles{sqlQuery=true, logicalPlan=true, physicalPlan=true, javaExecutor=false, distributedCache=true, master=true}, startTime=1695990799352, provisionId=null, maxDirectMemory=111585263616, availableCores=16, nodeTag=, conduitPort=45679, engineId=null, subEngineId=null, dremioVersion=24.2.1-202309200227450010-7d1592a0}
This issue usually occurs only when the conduit port is explicitly set (by default, it is a random port).
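To check whether the warning is repeating for the same job ID, the log can be scanned and the matches counted per job. The following is a minimal Python sketch, not a Dremio tool: the function names and the threshold of 100 are illustrative assumptions, and the regular expression is based only on the log line quoted above.

```python
import re
from collections import Counter

# Matches the warning quoted above and captures the job ID.
PATTERN = re.compile(
    r"Unable to fetch query profile for JobId\{id=([0-9a-f-]+)\}"
)


def count_profile_fetch_warnings(lines):
    """Return a Counter mapping job ID -> number of matching warnings."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suspicious_jobs(lines, threshold=100):
    """Return job IDs whose warning count reaches the threshold.

    The default threshold is an arbitrary assumption; tune it as needed.
    """
    counts = count_profile_fetch_warnings(lines)
    return {job: n for job, n in counts.items() if n >= threshold}
```

Both functions accept any iterable of lines, so an open log file handle can be passed directly. A single job ID with a very large count is consistent with the loop described above, although it is not proof on its own.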
Steps to Resolve
The user can remove the following conduit port setting from their dremio.conf file or the Helm templates:
services.conduit.port=<port num>
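Before editing, it can help to confirm where the setting appears. The sketch below is a heuristic Python check, assuming the setting is written in the flat form shown above (with `=` or `:` as the separator) and that comment lines start with `#` or `//`. The function name is hypothetical, and a fully nested form such as `conduit { port: ... }` would not be detected.

```python
import re

# Heuristic: flat key form, e.g. services.conduit.port=45679 (or with ":").
CONDUIT_PORT = re.compile(r"conduit\.port\s*[:=]\s*(\d+)")


def find_conduit_port_settings(text):
    """Return (line_number, port) pairs for uncommented conduit port settings."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip whole-line comments.
        if stripped.startswith("#") or stripped.startswith("//"):
            continue
        match = CONDUIT_PORT.search(stripped)
        if match:
            hits.append((lineno, int(match.group(1))))
    return hits
```

Pass the contents of dremio.conf (or a rendered Helm template) as a string. An empty result suggests the port is not pinned in that file, within the limits of the heuristic.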
Next Steps
Upgrade to one of the following Dremio versions or later:
- 23.2.4
- 22.2.3
- 24.2.7
- 24.3.0
Additional Resources