Summary
Partition pruning is extremely important to ensure good performance when working with large datasets to both improve query performance and reduce resource consumption. This article is written with Apache Iceberg in mind. As unlimited split uses Apache Iceberg for queries, most of it applies to Unlimited Split. In addition DeltaLake is similar, but there might be some slight differences.
Reported Issue
Some queries may be slow because partition pruning doesn't take place.
Overview
This article assumes basic understanding of Apache Iceberg, Partition transforms supported by it, Manifest files and min-max values. One can use the Dremio doc and blog articles to get familiar with these.
Relevant Versions, Tools & Integrations
This applies all Dremio releases newer than v23.
Steps to Resolve
Look at the profile of your query, in particular “Raw Profile”->”Planning”->”Final Physical Transformation” section.
The following applies for Iceberg datasets and Unlimited Split (which uses an Iceberg dataset internally). Delta Lake function names might be slightly different.
Look for a TableFunction with Table Function Type=[DATA_FILE_SCAN] and table =[the_table_we_are_interested_in]. Under it there might be some exchanges, you can ignore those for now.
Find IcebergManifestList operator in the very bottom of the query. Does it have any “ManifestList Filter Expression”?
Find Table Function operator with Table Function Type=[SPLIT_GEN_MANIFEST_SCAN]. Does it have a “ManifestFile Filter AnyColExpression”?
If there are filters in IcebergManifestList and SPLIT_GEN_MANIFEST_SCAN table function we are doing partition pruning and if your filter is restrictive enough we will not scan the whole dataset.
In addition, sometimes you will see that IcebergManifestList does not have ManifestList Filter Expression, however Table Function Type=[SPLIT_GEN_MANIFEST_SCAN] does have a ManifestFile Filter AnyColExpression. This means that files are getting pruned too, however, it is on a non-partitioned column. In the manifest file, Iceberg stores the min and max value for each field, and it can be used to eliminate whole files, even if the filter is not on a partition column. We prefer partition based pruning, as it usually allows more files to be eliminated and not read, but non-partition based pruning using min-max fields can help eliminate files to scan too.
If one goes to Visual Profile, Sort by "Record Processed", Operators, one will see how filtering took place through operators by the way the number of records being scanned decreases.
Additional Resources
https://www.dremio.com/wp-content/uploads/2024/01/Query-Performance-Analysis-and-Improvement.pdf