ReflectionManager reports "DATA_READ ERROR: Failed to load results for job" for completed REFRESH REFLECTION job

Summary/Reported Issue

A REFRESH REFLECTION job is reported as COMPLETED in Job History, but the reflection's materialization is not written to distributed storage. This prevents the reflection from being used in queries.

The master/coordinator log reports a "Failed to load results for job" error.

An example of the error is as follows:

2024-12-01 09:32:35,244 [grpc-default-executor-28563] INFO  c.d.service.jobs.JobResultsStore - User Error Occurred [ErrorId: a9c71988-2aee-40a3-bf2e-42fe4219c91f]
com.dremio.common.exceptions.UserException: Failed to load results for job 19044327-c04d-42db-cb59-b6a10f58c200
        at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:926)
        at com.dremio.service.jobs.JobResultsStore.loadJobData(JobResultsStore.java:232)
        ...
        ...
 Caused by: java.io.IOException: Timeout occurred during I/O request for sabot://dremio-executor-1.dremio-cluster-pod.svc.cluster.local:45678
        at com.dremio.exec.store.dfs.RemoteNodeFileSystem$RemoteNodeInputStream.getData(RemoteNodeFileSystem.java:426)
        at com.dremio.exec.store.dfs.RemoteNodeFileSystem$RemoteNodeInputStream.seek(RemoteNodeFileSystem.java:329)
        ...
        ...
Caused by: java.util.concurrent.TimeoutException: Waited 30000 milliseconds (plus 95034 nanoseconds delay) for SettableFuture@1faf696e[status=PENDING]
        at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:531)
        at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:119) 
        
               
2024-12-01 09:32:35,247 [dremio-general-577] WARN  c.d.s.reflection.ReflectionManager - Failed to handle done job for reflection 9a2cc8c9-6235-4df9-84c8-210b19dc0ca0[Raw Reflection]
com.dremio.common.exceptions.UserRemoteException: DATA_READ ERROR: Failed to load results for job 19044327-c04d-42db-cb59-b6a10f58c200


[Error Id: a9c71988-2aee-40a3-bf2e-42fe4219c91f ]

  (java.io.IOException) Timeout occurred during I/O request for sabot://dremio-executor-6.dremio-cluster-pod.svc.cluster.local:45678
    com.dremio.exec.store.dfs.RemoteNodeFileSystem$RemoteNodeInputStream.getData():426
    com.dremio.exec.store.dfs.RemoteNodeFileSystem$RemoteNodeInputStream.seek():329     
    ..
    ..
    
  Caused By (java.util.concurrent.TimeoutException) Waited 30000 milliseconds (plus 95034 nanoseconds delay) for SettableFuture@1faf696e[status=PENDING]

Relevant Versions

All versions

Troubleshooting Steps

1. Search the master/coordinator and executor server.log files for references to the REFRESH REFLECTION job ID.

2. The executor server.log will show the job as FINISHED and the master/coordinator server.log will initially show the job as COMPLETED.

3. If you are experiencing this problem, the master/coordinator server.log reports the following INFO message 30 seconds after the REFRESH REFLECTION job is marked as COMPLETED:

2024-12-01 09:32:35,244 [grpc-default-executor-28563] INFO  c.d.service.jobs.JobResultsStore - User Error Occurred [ErrorId: a9c71988-2aee-40a3-bf2e-42fe4219c91f]
com.dremio.common.exceptions.UserException: Failed to load results for job 19044327-c04d-42db-cb59-b6a10f58c200
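
The log search in the steps above can be sketched as a small shell helper. This is a minimal sketch, not an official Dremio tool: the function name find_job_lines is hypothetical, and the log file paths you pass to it depend on your deployment.

```shell
# Sketch: list every log line that mentions a given job ID.
# find_job_lines is a hypothetical helper name; pass it the job ID followed by
# one or more server.log files collected from the coordinator and executors.
find_job_lines() {
  local job_id="$1"
  shift
  # -H prints the file name, -n the line number,
  # -F treats the job ID as a literal string rather than a regex.
  grep -HnF -- "$job_id" "$@"
}
```

For example, find_job_lines 19044327-c04d-42db-cb59-b6a10f58c200 coordinator/server.log executor-1/server.log (both paths are placeholders) prints each matching line prefixed with its file and line number, which makes it easier to compare the COMPLETED timestamp with the later "Failed to load results for job" message.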

Cause

The "Failed to load results for job" error indicates that communication between the master/coordinator or the executors and the distributed storage location that holds the job results is slow. In the example above, the read timed out after 30 seconds (30000 milliseconds).

Steps to Resolve

1. Identify the location of the results store. This is configured in dremio.conf using the "paths.dist" parameter or the "paths.results" parameter.

2. Test the speed of transfer between the master/coordinator and the "results" distributed storage location and between your executors and the "results" distributed storage location.

Various tools such as "iperf" are available for this purpose. Speak to your network administrator for advice on the tool to use.

If such tools are not available, the Linux "dd" command can be used to transfer a file between two machines to determine throughput. For example:

dd if=/dev/zero of=/<path>/<to>/<file>/<new filename> bs=1G count=1

Transfer speeds in the low double digits of MB/s may indicate that the network or disk is not fast enough.

3. Ensure that your network bandwidth meets the minimum recommended throughput to overcome this error. The minimum recommended disk throughput is documented here:

https://docs.dremio.com/current/get-started/cluster-deployments/architecture/metadata-storage/
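
The first two steps above can be sketched as shell helpers. This is a rough sketch under stated assumptions, not a Dremio-provided utility: the function names show_result_paths and measure_write_mbps are hypothetical, the location of dremio.conf varies by installation, and the throughput helper assumes GNU dd and GNU date and a results location that is reachable as a mounted file system path.

```shell
# Sketch: print the lines in a dremio.conf file that set "dist" or "results",
# written either as nested keys or as "paths.dist" / "paths.results".
# Assumption: each key starts its own line; other layouts need a manual check.
show_result_paths() {
  local conf_file="$1"
  grep -nE '^[[:space:]]*(paths\.)?(dist|results)[[:space:]]*[:=]' "$conf_file"
}

# Sketch: write SIZE_MB mebibytes of zeros into TARGET_DIR with dd and print
# the approximate write throughput in MB/s as a whole number.
# Assumes GNU date (for %N) and that TARGET_DIR is writable.
measure_write_mbps() {
  local target_dir="$1"
  local size_mb="${2:-1024}"
  local test_file="$target_dir/dd_throughput_test.$$"
  local start_ns end_ns elapsed_ms

  start_ns=$(date +%s%N)
  # conv=fsync makes dd flush the data to storage before it exits, so the
  # page cache does not inflate the measurement.
  dd if=/dev/zero of="$test_file" bs=1M count="$size_mb" conv=fsync 2>/dev/null \
    || { rm -f -- "$test_file"; return 1; }
  end_ns=$(date +%s%N)
  rm -f -- "$test_file"

  elapsed_ms=$(( (end_ns - start_ns) / 1000000 ))
  # Guard against division by zero on very small, very fast writes.
  [ "$elapsed_ms" -gt 0 ] || elapsed_ms=1
  echo $(( size_mb * 1000 / elapsed_ms ))
}
```

For example, show_result_paths /opt/dremio/conf/dremio.conf (the path is an assumption) lists the configured locations, and measure_write_mbps /<path>/<to>/<results mount> 1024 writes a 1 GiB test file there and prints a single number. Run it from the master/coordinator and from each executor, and compare the results against the recommended throughput.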
