Summary
When a data source's configuration changes, the create time (ctime) of that configuration can become out of sync across the cluster.
Reported issue
Because Dremio runs as a distributed environment, you may see errors similar to the following:
java.util.ConcurrentModificationException: Source [SOURCE_NAME] was updated, and the given configuration has older ctime (current: 1687363560000, given: 1683123793333)
The error indicates that the create time (ctime) of the source configuration is out of sync, either between a data source and Dremio's nodes, or between Dremio's coordinator and executors. This can happen in distributed environments due to config options, config changes (e.g. changes made to the source or to keep_metadata_on_replace), system restarts, or outages.
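The two numbers in the error are Unix epoch timestamps in milliseconds. If you want to see how far apart the two configurations are, one way to decode them (assuming Dremio's TO_TIMESTAMP function, which takes epoch seconds) is:
SELECT TO_TIMESTAMP(1687363560000 / 1000) AS current_ctime, -- ~2023-06-21, from the sample error above
       TO_TIMESTAMP(1683123793333 / 1000) AS given_ctime    -- ~2023-05-03, roughly seven weeks older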
Relevant Versions
This article is relevant to all Dremio versions.
Troubleshooting Steps
If the error persists, perform the checks and workarounds listed below under Steps to Resolve.
Cause
This is often a transient error: the source's configuration has changed and the Dremio executors have not yet received the latest configuration for the source from the coordinator.
Steps to Resolve
Follow the steps below, which are listed starting with the option that is least intrusive for your environment:
- Log out, then log back into Dremio as an admin and retry the failing query.
- If the error occurs when removing a data source, try refreshing the source by running:
ALTER SOURCE <source-name> REFRESH STATUS
If refreshing doesn't work because the source no longer exists (the error still occurs because the ctime was out of sync when the original data source was decommissioned), edit the source settings and point them at the connection details of another, similar, working data source; you should then be able to remove the data source from the UI.
- Refresh the metadata for the data source indicated in the error message by running:
ALTER PDS <source_name> REFRESH METADATA
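ALTER PDS also accepts the full path down to a specific physical dataset. For example, with hypothetical names (a source s3source containing a dataset sales.orders):
ALTER PDS "s3source"."sales"."orders" REFRESH METADATA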
- This could also be caused by the cached plan not being invalidated: the cached plan continues to refer to the source ctime from before a change, leading to the error. To work around it (a SQL sketch follows these steps):
* Disable the plan cache: set the support key planner.query_plan_cache_enabled to false.
* Run the failing query once -- it will pick up the correct version of the source configuration and should run successfully.
* Re-enable the plan cache: set the support key planner.query_plan_cache_enabled back to true.
NOTE: Queries run from the UI use a different cache. As a result, when testing from the UI, queries may not hit the ctime error.
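If you prefer to toggle the key with SQL instead of the Support settings screen, a minimal sketch (assuming your Dremio version accepts support keys through ALTER SYSTEM SET):
ALTER SYSTEM SET "planner.query_plan_cache_enabled" = false
-- run the failing query here; it is planned fresh and picks up the new source ctime
ALTER SYSTEM SET "planner.query_plan_cache_enabled" = true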
- If disabling the plan cache is not an option or doesn't help, a similar workaround is to modify a planner support key (e.g. planner.broadcast_factor: change it from the default of 2 to 1.9), run the failing query, then reset the support key to its original value. Since a planner support key changed, the cached plan is invalidated and a new plan is built, which may resolve the issue.
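A minimal sketch of that sequence, under the same assumption that ALTER SYSTEM SET works for support keys in your version:
ALTER SYSTEM SET "planner.broadcast_factor" = 1.9
-- run the failing query once; the changed key forces a fresh plan
ALTER SYSTEM SET "planner.broadcast_factor" = 2.0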
- If the error occurs during a reflection refresh, the mismatch is probably between the reflection's create time on some executors and its create time on the coordinator. The resolution is a full cluster restart (a cluster restart is also recommended if none of the other options work):
* Stop the executor processes (dremio stop) on all executors.
* Stop the coordinator process (if there are multiple coordinators in use, stop the node 2 (standby) coordinator first to prevent a failover from occurring).
* Start the coordinator (and then restart the standby).
* Start the executors -- this should force a synchronization.
* Retry the reflection refresh that was failing.
- If you know the error is caused by a misconfigured timezone setting across the Dremio executors, fix it by updating dremio-env on all the executors so they read the same timezone setting, for example:
DREMIO_JAVA_EXTRA_OPTS="-Duser.timezone=US/Pacific"
Then restart Dremio on the executors for the change to apply.
- Open the job profiles of the failed queries and check whether the same executor is causing the failures by clicking on the Error tab and looking at the host name. If only one or a few executors are causing the failures, you have the option to blacklist those executors. If a significant number of executors hit the issue, such that the cluster can't function properly with them blacklisted, then restarting the entire cluster is indicated instead of blacklisting (follow the full cluster restart steps described above).
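To cross-check the host names from the failed job profiles against the executors currently registered with the coordinator, you can query Dremio's sys.nodes system table:
SELECT * FROM sys.nodes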
- If only the executors of an existing Elastic Engine are hitting the issue, you can stop the engine in question, stop all the nodes of that engine, restart the coordinator, then start the nodes of that engine again.
- Try to delete and re-create the data source. Deleting and re-creating a source should not disturb the VDSs of that source.
- Make sure there are no inconsistencies in how the coordinator and the executors are restarted, as such inconsistencies could cause the issue to re-occur.
- If you're using a YARN deployment, log in to the Hadoop cluster and make sure there are no zombie Dremio processes running on the data nodes. If there are, check whether any zombie YARN applications could be spawning those processes; kill any you find, and then you may have to restart the cluster.