Overview
After enabling Apache Iceberg for the unlimited splits feature, some users have encountered errors following a subsequent upgrade. In such cases the physical dataset (PDS) metadata may need to be rebuilt.
Applies To
Dremio versions 19.0 and later.
Details
A number of different errors can occur depending on the source type and metadata. Example snippets:
2022-01-24 16:38:17,901 [1e112a05-c98e-723a-23a2-7b2ffdec5000/0:foreman-planning] ERROR c.d.s.commandpool.CommandWrapper - command 1e112a05-c98e-723a-23a2-7b2ffdec5000/0:foreman-planning failed
com.dremio.common.exceptions.UserException: Bad Request (HTTP/400): Unknown type ICEBERG_METADATA_POINTER
2022-01-25 15:20:33,120 [grpc-default-executor-384] ERROR c.d.s.nessie.ContentsApiService - GetContents failed with a NessieNotFoundException.
org.projectnessie.error.NessieContentsNotFoundException: Could not find contents for key 'dremio.internal./dremio-me-2f120e18-880d-44b5-b2cf-555e644eb91d-8a1b96622bc96392/dremio/metadata/338e27db-960d-4797-bc19-1eb768680fba' in reference 'main'.
2022-01-25 15:20:33,120 [1e0feabe-882b-1536-4377-ed7e27e3be00/0:foreman-planning] ERROR c.d.s.commandpool.CommandWrapper - command 1e0feabe-882b-1536-4377-ed7e27e3be00/0:foreman-planning failed
com.dremio.common.exceptions.UserException: Failed to get iceberg metadata
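To check whether a log file contains any of the messages shown above, a small scan along the following lines can help. This is a minimal Python sketch, not a Dremio tool: the function name find_metadata_errors is hypothetical, and the signature strings are simply copied from the example snippets, so other variants of these errors may not be matched.

```python
# Substrings copied from the example log snippets in this article.
# Assumption: matching on these fragments is enough to flag the problem;
# other error variants may exist and would not be detected.
SIGNATURES = (
    "Unknown type ICEBERG_METADATA_POINTER",
    "NessieContentsNotFoundException",
    "Failed to get iceberg metadata",
)


def find_metadata_errors(lines):
    """Return (line_number, signature) pairs for lines containing a known signature."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for signature in SIGNATURES:
            if signature in line:
                hits.append((number, signature))
                break  # report each log line at most once
    return hits
```

It accepts any iterable of lines, so an open file object can be passed directly, for example find_metadata_errors(open(path_to_log)), where path_to_log is whatever log file is being inspected.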
Cause
Apache Iceberg support can be enabled with the following support keys (from release 21.0 onwards, these are enabled by default):
dremio.iceberg.enabled = true
dremio.execution.support_unlimited_splits = true
Once this feature is enabled, the metadata for physical datasets is stored in a different path and format. Known issues when upgrading between major releases (version 19 onwards) can leave this metadata corrupted or lost, in which case it has to be rebuilt.
Solution
If there are only a handful of datasets, running the following two SQL statements for each one will suffice:
ALTER PDS <name> FORGET METADATA
ALTER PDS <name> REFRESH METADATA
If there is a large number of datasets, the following procedure streamlines the process:
1 - Create a virtual dataset (VDS) somewhere, for example as pds in the Shared space, using the following query:
SELECT TABLE_SCHEMA || '.' || TABLE_NAME from INFORMATION_SCHEMA."TABLES" where TABLE_TYPE ='TABLE'
2 - Run the script attached below. Note that it uses Shared.pds as the VDS; this can be changed by altering this line:
SQL_QUERY="select * from Shared.pds"
3 - Run the output that the script writes to refresh_info.out from a JDBC / ODBC client that allows multiple SQL statements (from Dremio 21.0 onwards, multiline scripts can also be run in the UI).
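To illustrate the kind of statements involved, here is a minimal Python sketch that turns a list of dataset paths (such as the rows returned by the VDS in step 1) into FORGET / REFRESH pairs. It is not the attached script: the function name build_refresh_statements and the sample dataset paths are hypothetical, and it assumes the paths can be used in ALTER PDS as-is, without additional identifier quoting.

```python
def build_refresh_statements(dataset_names):
    """Return a FORGET/REFRESH statement pair for each non-empty dataset path.

    Assumption: each path (e.g. "source.folder.table") is already valid in
    an ALTER PDS statement; names with special characters may need quoting.
    """
    statements = []
    for name in dataset_names:
        name = name.strip()
        if not name:
            continue  # skip blank lines in the input
        statements.append(f"ALTER PDS {name} FORGET METADATA;")
        statements.append(f"ALTER PDS {name} REFRESH METADATA;")
    return statements


if __name__ == "__main__":
    # Hypothetical dataset paths, standing in for the rows of the VDS.
    for statement in build_refresh_statements(["s3src.sales.orders", "hdfs.logs"]):
        print(statement)
```

Emitting FORGET before REFRESH for each dataset mirrors the order of the two statements given for the small-scale case, so the generated output can be reviewed before it is run.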
Further Reading
Apache Iceberg in Dremio - https://docs.dremio.com/software/data-formats/apache-iceberg/
Metadata refreshing - https://docs.dremio.com/software/advanced-administration/metadata-caching/