Identifying full metadata refresh and incremental metadata refresh – Dremio Support

Summary

If REFRESH DATASET jobs for a particular table are taking longer than usual to complete, inspect the query plan in the job's profile to see whether a full or an incremental metadata refresh is occurring. A full metadata refresh query plan has no joins and takes much longer to complete than an incremental refresh because it has to process metadata for every file in the dataset.

Reported Issue

For some tables, REFRESH DATASET jobs suddenly take much longer to complete than previously observed.

Overview

Physical datasets for which Dremio maintains Iceberg metadata (often called "unlimited splits" tables) need to have that metadata refreshed when the table changes in some way, for example when data files are added or removed, or when the table schema is changed. This is done through a REFRESH DATASET job, which is either scheduled through the metadata settings of the data source that contains the table or triggered by a user through an ALTER TABLE <physical dataset> REFRESH METADATA command.

Most of the time, metadata refresh is performed incrementally. In an incremental metadata refresh, the job does a set difference operation between the list of files in the dataset and the files for which Dremio already has information in the Iceberg metadata. Using this difference, the query determines the added files and the removed files. Iceberg APIs are then used to atomically update the Iceberg metadata.
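The set difference described above can be illustrated with a minimal Python sketch. This is not Dremio's implementation; the function name diff_file_listing and the sample file paths are hypothetical and exist only to show how added and removed files fall out of comparing the two lists.

```python
def diff_file_listing(listed_files, known_files):
    """Compare the files currently in the dataset with the files already tracked.

    listed_files: paths found by listing the dataset (hypothetical input).
    known_files:  paths already recorded in the Iceberg metadata (hypothetical input).
    Returns a tuple (added, removed) of sorted lists.
    """
    listed = set(listed_files)
    known = set(known_files)
    added = sorted(listed - known)    # present in the dataset, not yet in the metadata
    removed = sorted(known - listed)  # recorded in the metadata, no longer in the dataset
    return added, removed


# Example: one file was added and one was deleted since the last refresh.
known = ["orders/0.parquet", "orders/1.parquet", "orders/2.parquet"]
listed = ["orders/0.parquet", "orders/2.parquet", "orders/3.parquet"]
added, removed = diff_file_listing(listed, known)
print("added:", added)
print("removed:", removed)
```

Only the files in the added and removed lists need further processing, which is why an incremental refresh is usually much cheaper than a full one.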

A full metadata refresh occurs when a table is first added or when its metadata has been forgotten. A full metadata refresh for a table can take significantly longer to complete than an incremental one, sometimes 10x or 100x longer. If you find that a table's metadata is suddenly taking much longer to refresh, it's important to determine which type of refresh is being performed before coming to any conclusion about why it's taking so long.

Relevant Versions

"Unlimited splits" tables were introduced in Dremio 18.

Steps to Resolve

The REFRESH DATASET jobs do not explicitly indicate which type of metadata refresh is being performed, but that can be inferred from the structure of their query plans.

Incremental metadata refresh jobs have a join which implements the set difference operation described above. In contrast, full metadata refresh queries have just a single branch of operators, with no join.
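If you have already extracted the operator names from a job profile, the rule above (a join means incremental, no join means full) can be written as a small heuristic. The sketch below is only an illustration: classify_refresh is a hypothetical helper, the operator names in the sample lists are assumptions rather than an exact transcript of a Dremio plan, and how you obtain the names from the profile is left out.

```python
def classify_refresh(operator_names):
    """Guess the refresh type from the operator names in a REFRESH DATASET plan.

    Heuristic taken from the text above: a plan that contains a join is treated
    as incremental; a plan without any join is treated as full.
    """
    has_join = any("JOIN" in name.upper() for name in operator_names)
    return "incremental" if has_join else "full"


# Illustrative operator lists (assumed names, not copied from a real profile).
full_plan = ["TABLE_FUNCTION", "PROJECT", "WRITER"]
incremental_plan = ["TABLE_FUNCTION", "HASH_JOIN", "FILTER", "WRITER"]

print(classify_refresh(full_plan))
print(classify_refresh(incremental_plan))
```

Treat the result as a hint to confirm by looking at the plan itself, since the check is only a substring match on operator names.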

For example, here are the first few operators of a full metadata refresh for a table called "S3-source"."tpch-orders". The table initially contains 5 Parquet data files and you can see that a metadata record is emitted from the TABLE_FUNCTION for each. Downstream of this (not pictured), there is another operator that reads the footer of every Parquet file in order to create Iceberg metadata for each.

[Screenshot: first operators of the full metadata refresh query plan]

After another Parquet data file is added to the table, an incremental refresh is performed. You can observe the join and filtering that implements the set difference. Only a single metadata record for the added Parquet file is sent downstream for its footer to be read and processed.

[Screenshot: incremental metadata refresh query plan showing the join and filter]

For a dataset with many files, a full refresh takes much longer because every Parquet file must have its footer read and metadata written for it.

Next Steps

If you find that a REFRESH DATASET job is suddenly taking much longer for a particular table and the steps above show that a full metadata refresh is being performed, check your job history to see whether a user recently issued a FORGET METADATA command on the same table. As the SQL suggests, this command removes the table from the Dremio catalog and discards its metadata. A subsequent formatting ("promotion") of the table root directory would then initiate the full refresh you observe.
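If the job history has been exported to a list of records, the check for a recent FORGET METADATA command can be sketched as below. The record layout (dictionaries with "id", "user" and "sql" keys) and the helper name find_forget_metadata_jobs are assumptions made for this example; adapt them to however you actually retrieve job history.

```python
import re

# Matches "FORGET METADATA" in any letter case, with any whitespace between the words.
FORGET_PATTERN = re.compile(r"\bFORGET\s+METADATA\b", re.IGNORECASE)


def find_forget_metadata_jobs(jobs, table_name):
    """Return the job records whose SQL forgets the metadata of the given table.

    jobs:       list of dicts with at least a "sql" key (assumed export format).
    table_name: table name as it appears in the SQL text.
    """
    hits = []
    for job in jobs:
        sql = job.get("sql", "")
        if FORGET_PATTERN.search(sql) and table_name.lower() in sql.lower():
            hits.append(job)
    return hits


# Hypothetical job history records.
jobs = [
    {"id": "job-1", "user": "alice",
     "sql": 'ALTER TABLE "S3-source"."tpch-orders" REFRESH METADATA'},
    {"id": "job-2", "user": "bob",
     "sql": 'ALTER TABLE "S3-source"."tpch-orders" FORGET METADATA'},
]

for job in find_forget_metadata_jobs(jobs, '"S3-source"."tpch-orders"'):
    print(job["id"], job["user"])
```

The match on the table name is a plain substring comparison, so a table referenced with different quoting or through a different path would need a looser comparison.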

Additional Resources

Dremio docs - Refreshing Metadata

Dremio docs - Optimize Metadata Refresh Frequency