Summary
This article helps troubleshoot the "CANCELLED, transitioned taskState from DONE to BLOCKED_ON_DOWNSTREAM" message reported in the Dremio logs.
Reported Issue
[e* - QUERY_UUID:frag:*:*] INFO c.d.s.exec.fragment.FragmentExecutor - retire() state: CANCELLED, transitioned taskState from DONE to BLOCKED_ON_DOWNSTREAM since there are * messages to flush for fragment QUERY_UUID:*:*
[e* - QUERY_UUID:frag:*:*] INFO c.d.s.exec.fragment.FragmentExecutor - retire() state: FINISHED, transitioned taskState from DONE to BLOCKED_ON_DOWNSTREAM since there are * messages to flush for fragment QUERY_UUID:*:*
Relevant Versions
This can happen in all Dremio releases.
Troubleshooting Steps
1. Check the logs of the executors that fail to reply in time, and make sure they are up and running properly.
2. Check whether the message is logged over a long time window (this behaviour is normally brief).
3. Look at queries.json to see which queues were under heavy load when the message was logged, how many jobs were running concurrently, and so on. Cancellations that occur over a long period suggest extended load saturation.
4. Check the logs from the ODBC/JDBC client for the same time as the failures. Look for log entries that could indicate:
- data taking so long to download that the query is cancelled from the client side;
- network timeouts (especially when they coincide with the same type of error from the client);
- other reasons for cancellation on the client;
- cancellation of long-running sessions.
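For step 2, a minimal Python sketch can show how long the message persists by counting occurrences per minute. It assumes each log line begins with a timestamp of the form "2024-01-31 10:15:42,123", which is not shown in the excerpts above; adjust the TIMESTAMP pattern if your log layout differs. The function names are illustrative only.

```python
import re
from collections import Counter

# Assumption: each log line starts with a timestamp such as
# "2024-01-31 10:15:42,123". Only the part up to the minute is captured.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}")

# Matches both variants of the message quoted under "Reported Issue".
MESSAGE = re.compile(
    r"retire\(\) state: (CANCELLED|FINISHED), transitioned taskState "
    r"from DONE to BLOCKED_ON_DOWNSTREAM"
)


def count_per_minute(lines):
    """Return a Counter mapping (minute, state) to the number of matching lines."""
    counts = Counter()
    for line in lines:
        match = MESSAGE.search(line)
        if not match:
            continue
        stamp = TIMESTAMP.match(line)
        # Lines without a recognisable timestamp are grouped under "unknown".
        minute = stamp.group(1) if stamp else "unknown"
        counts[(minute, match.group(1))] += 1
    return counts


def report(path):
    """Print one row per minute and state for the log file at the given path."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        counts = count_per_minute(handle)
    for (minute, state), n in sorted(counts.items()):
        print(f"{minute}  {state:<9}  {n}")
```

Calling report() with the path to an executor log prints the per-minute counts. A handful of rows clustered in one or two minutes points to a brief event, while rows spread over many minutes suggest sustained saturation.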
Cause
This message indicates that, for a particular job, the completed work fragments have not been received by the next node in the workflow, probably another executor. BLOCKED_ON_DOWNSTREAM has exactly the same meaning as in job profiles: the local node has finished its work but the downstream node has not responded, which in this case appears to cause the query to time out and fail.
Steps to Resolve
Follow the troubleshooting steps.
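For step 3 of the troubleshooting steps, the following Python sketch counts jobs per queue and outcome within a chosen time window. It assumes queries.json holds one JSON object per line and that each record carries "queueName", "outcome" and "start" (epoch milliseconds) fields; these names are assumptions, so verify them against a record from your own file before relying on the output.

```python
import json
from collections import Counter


def jobs_per_queue(lines, window_start_ms, window_end_ms):
    """Count jobs started in [window_start_ms, window_end_ms) by (queue, outcome).

    Assumed record fields (check against your own queries.json):
      "start"     - job start time in epoch milliseconds
      "queueName" - name of the queue the job ran in
      "outcome"   - final job state
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if not isinstance(record, dict):
            continue
        start = record.get("start")
        if not isinstance(start, (int, float)):
            continue
        if window_start_ms <= start < window_end_ms:
            queue = record.get("queueName", "unknown")
            outcome = record.get("outcome", "unknown")
            counts[(queue, outcome)] += 1
    return counts
```

Pass an open file handle for queries.json together with a window that brackets the time the message was logged. A queue with a much higher job count than the others during that window is a reasonable first place to look for load saturation.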