Summary
After an environment is migrated from IMDSv1 to IMDSv2, Dremio fails to start.
Reported Issue
The following is an example of the error that appears in Dremio's logs when startup fails in this way:
2024-07-05 12:47:38,976 [main] ERROR c.dremio.exec.catalog.PluginsManager - Exception while creating source.
com.dremio.common.exceptions.UserException: Source is not currently available.
    at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:979)
    at com.dremio.exec.catalog.ManagedStoragePlugin.createOrUpdateSource(ManagedStoragePlugin.java:555)
    at com.dremio.exec.catalog.ManagedStoragePlugin.createSource(ManagedStoragePlugin.java:415)
Caused by: java.util.concurrent.ExecutionException: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Credential Verification failed.
    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2028)
    at com.dremio.exec.catalog.ManagedStoragePlugin.replacePlugin(ManagedStoragePlugin.java:1376)
    at com.dremio.exec.catalog.ManagedStoragePlugin.createOrUpdateSource(ManagedStoragePlugin.java:466)
    ... 16 common frames omitted
Relevant Versions
Dremio v25.0.4 has been identified as the impacted version.
Troubleshooting Steps
Review the Dremio logs and look for the error shown in the Reported Issue section.
Cause
The AWS SDKs use IMDSv2 calls by default. If an IMDSv2 call receives no response, the SDK retries the call and, if it is still unsuccessful, falls back to IMDSv1. This fallback can cause a delay, especially in a container environment. With a hop limit of 1, the IMDSv2 response never reaches a container, because the path to the container counts as an additional network hop. To avoid the fallback to IMDSv1 and the resulting delay, we recommend setting the hop limit to 2 in container environments.
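One way to check whether the hop limit is the problem is to request an IMDSv2 session token from inside the affected container. The following is a minimal Python sketch, not part of Dremio. The function names and the 2-second timeout are illustrative assumptions. The token endpoint and TTL header are the standard IMDSv2 ones.

```python
import urllib.error
import urllib.request

# Standard IMDSv2 session-token endpoint (link-local address).
IMDS_TOKEN_URL = "http://169.254.169.254/latest/api/token"


def build_token_request(ttl_seconds=60):
    """Build the IMDSv2 token request: an HTTP PUT carrying a TTL header."""
    return urllib.request.Request(
        IMDS_TOKEN_URL,
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def imdsv2_token_reachable(timeout=2.0):
    """Return True if a session token is returned before the timeout.

    The timeout value is an arbitrary choice for this sketch.
    """
    try:
        with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
            return resp.status == 200 and bool(resp.read())
    except (urllib.error.URLError, OSError):
        # No response (for example, the PUT response was dropped at the hop limit).
        return False
```

Calling `imdsv2_token_reachable()` inside the container and getting `False`, while the same call on the host returns `True`, would be consistent with a hop limit of 1.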
Steps to Resolve
To resolve the issue, increase the instance metadata hop limit from the default value of 1 to 2.
Add the following parameter to your cluster's build files (in Terraform, for example, this argument belongs in the metadata_options block of the instance or launch template definition):
http_put_response_hop_limit = 2
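After the change is rolled out, it can help to confirm that the instances actually report the new hop limit. The sketch below assumes the JSON output of `aws ec2 describe-instances` has been captured as a string. The helper names `hop_limits` and `instances_below` are hypothetical and not part of any Dremio or AWS tooling.

```python
import json


def hop_limits(describe_instances_output):
    """Map each instance ID to its HttpPutResponseHopLimit.

    Expects the JSON text printed by `aws ec2 describe-instances`.
    """
    data = json.loads(describe_instances_output)
    result = {}
    for reservation in data.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            options = instance.get("MetadataOptions", {})
            result[instance["InstanceId"]] = options.get("HttpPutResponseHopLimit")
    return result


def instances_below(limits, required=2):
    """Return the IDs whose hop limit is missing or lower than `required`."""
    return sorted(i for i, hop in limits.items() if hop is None or hop < required)
```

Any instance ID returned by `instances_below(hop_limits(output))` still has a hop limit below 2 and has probably not picked up the updated build configuration.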
Additional Resources
A future code change to the product will set the default hop limit to the value recommended for IMDSv2. Internal Reference: DX-92734