Coordinator spawning multiple threads and becoming unresponsive – Dremio Support

Summary

This article deals with a known issue that can render the coordinator unresponsive.

Reported Issue

The user may see the coordinator become unresponsive, with calls to the web UI either returning a 403 error or timing out.

Relevant Versions

Dremio Enterprise versions up to and including:

22.2.3
23.2.4
24.2.6

Troubleshooting Steps

There is a known issue, tracked in internal Jira ticket DX-83120, where an infinite loop can occur that keeps spawning additional threads until the process runs out of resources.

The following warning may be repeated in the logs over and over for the same job ID:

2023-09-29 12:50:15,141 [grpc-default-executor-116593] WARN  c.d.service.jobs.LocalJobsService - Unable to fetch query profile for JobId{id=1ae95180-58fd-1810-a051-0477bfba2000} 
    on Node NodeEndpoint{address=dremio-master-0.dremio-cluster-pod.dp-dremio.svc.cluster.local, userPort=31010, fabricPort=45678, roles=Roles{sqlQuery=true, logicalPlan=true, physicalPlan=true, javaExecutor=false, distributedCache=true, master=true},
    startTime=1695990799352, provisionId=null, maxDirectMemory=111585263616, availableCores=16, nodeTag=, conduitPort=45679, engineId=null, subEngineId=null, dremioVersion=24.2.1-202309200227450010-7d1592a0}
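
To check whether a log shows this pattern, the warning can be counted per job ID. The following is a minimal Python sketch, not an official Dremio tool. It assumes the warning text matches the excerpt above; the function names and the threshold value are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Matches the job ID inside the warning shown in the log excerpt above.
JOB_ID_PATTERN = re.compile(
    r"Unable to fetch query profile for JobId\{id=([0-9a-fA-F-]+)\}"
)


def count_profile_fetch_warnings(lines):
    """Return a Counter mapping job ID -> number of matching warning lines."""
    counts = Counter()
    for line in lines:
        match = JOB_ID_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def repeated_job_ids(lines, threshold=100):
    """Return the job IDs whose warning count reaches the threshold.

    The default threshold is an arbitrary illustrative value, not a
    Dremio-recommended limit.
    """
    counts = count_profile_fetch_warnings(lines)
    return {job_id: n for job_id, n in counts.items() if n >= threshold}
```

To use it, open the coordinator log file and pass the file object to repeated_job_ids; a single job ID with a very high count is consistent with the behavior described above.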

Usually this issue only occurs if the conduit port is set explicitly (by default it is a random port).

Steps to Resolve

The user can remove the following conduit port setting from their dremio.conf file or from the Helm templates:

services.conduit.port=<port num>
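
Before editing, it can help to confirm whether the setting is actually present. The following is a minimal Python sketch that looks for the flat form of the key shown above in the text of a dremio.conf file. The function name is hypothetical, and the check is an assumption-laden shortcut: it does not parse HOCON, so a nested form such as a conduit block inside a services block would not be detected.

```python
import re

# Flat form of the setting from this article: services.conduit.port=<port num>
# HOCON accepts both '=' and ':' as key/value separators, so both are matched.
# Lines starting with a comment marker do not match, because the key must be
# the first non-whitespace text on the line.
CONDUIT_PORT_PATTERN = re.compile(
    r"^\s*services\.conduit\.port\s*[=:]\s*(\S+)", re.MULTILINE
)


def find_conduit_port(conf_text):
    """Return the configured conduit port as a string, or None if the
    flat-form key is not present in the given configuration text."""
    match = CONDUIT_PORT_PATTERN.search(conf_text)
    return match.group(1) if match else None
```

If find_conduit_port returns a value for the contents of dremio.conf, the port is pinned and the line is a candidate for removal; a result of None means the flat form is absent, though the setting may still be supplied in nested form or through the Helm templates.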

Next Steps

Upgrade to one of the following Dremio versions or later:

23.2.4
22.2.3
24.2.7
24.3.0

Additional Resources