Promoted Parquet files being Processed with ICEBERG_SUB_SCAN instead of PARQUET_ROW_GROUP_SCAN – Dremio Support

Summary

This article explains why Parquet files (from a NAS, S3, or any other object storage source) that were previously read with PARQUET_ROW_GROUP_SCAN as the initial leaf scan operator are now read with the ICEBERG_SUB_SCAN operator.

Reported Issue

After upgrading from an earlier version of Dremio (such as v22.x, which may have had the unlimited splits functionality explicitly disabled) to v25.x, Parquet files in S3 storage are processed with ICEBERG_SUB_SCAN instead of the expected PARQUET_ROW_GROUP_SCAN operator, as shown in the screenshot below.

[Screenshot: Visual Profile.JPG]

Relevant Versions

This issue applies to Dremio versions 25.x and later.

Troubleshooting Steps

1. Confirm that you have promoted a Parquet file from a NAS, S3, or any other object storage source.

2. Query the promoted dataset

3. Open the job profile by following the steps in the documentation: https://docs.dremio.com/current/sonar/monitoring/jobs/raw-profile/

4. After navigating to the "Raw Profile", click on the "Planning" tab of the profile.

5. Confirm that the bottom-most operator in the Final Physical Transformation section is the following operator:

IcebergManifestList

e.g.

....
00-06 TableFunction(columns=[`MYCOL_ID`, `MYCOL`, `CATEGORY`, `ANOTHER_COL`], Table Function Type=[DATA_FILE_SCAN], table=[repo.mys3."mytest.parquet"]) : rowType = RecordType(INTEGER MYCOL_ID, VARCHAR(65536) MYCOL, VARCHAR(65536) CATEGORY, VARCHAR(65536) ANOTHER_COL): rowcount = 34.0, cumulative cost = {36.0 rows, 36.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1375
00-07 TableFunction(columns=[`splitsIdentity`, `splits`, `colIds`], Table Function Type=[SPLIT_GEN_MANIFEST_SCAN])
00-08 IcebergManifestList(table=[repo.mys3."mytest.parquet"],

6. Carry out the "Steps to Resolve" described below.
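
The check on the bottom-most operator can also be done on plan text copied out of the "Planning" tab. The following is a minimal sketch, not a Dremio tool: it assumes each operator line starts with an identifier of the form "00-08" followed by the operator name, as in the example above. The function names and the shortened sample plan are hypothetical and exist only for illustration.

```python
import re

# Assumption: each operator line in the copied plan starts with
# "<fragment>-<operator> <OperatorName>", e.g. "00-08 IcebergManifestList(...".
_OPERATOR_LINE = re.compile(r"^[ \t]*(\d+)-(\d+)[ \t]+(\w+)", re.MULTILINE)


def operators(plan_text):
    """Return operator names in the order they appear in the plan text."""
    return [match.group(3) for match in _OPERATOR_LINE.finditer(plan_text)]


def lowest_operator(plan_text):
    """Return the name of the last (bottom-most) operator, or None if none found."""
    names = operators(plan_text)
    return names[-1] if names else None


def starts_from_iceberg_manifest_list(plan_text):
    """True when the bottom-most operator is IcebergManifestList."""
    return lowest_operator(plan_text) == "IcebergManifestList"


# Shortened copy of the example plan from this article (columns abbreviated).
sample_plan = """\
00-06 TableFunction(columns=[`MYCOL_ID`], Table Function Type=[DATA_FILE_SCAN])
00-07 TableFunction(columns=[`splitsIdentity`, `splits`, `colIds`], Table Function Type=[SPLIT_GEN_MANIFEST_SCAN])
00-08 IcebergManifestList(table=[repo.mys3."mytest.parquet"],
"""

if __name__ == "__main__":
    print(lowest_operator(sample_plan))
    print(starts_from_iceberg_manifest_list(sample_plan))
```

If the last operator found is IcebergManifestList, the plan was generated through the unlimited splits path described under "Cause".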

Cause

ICEBERG_SUB_SCAN is used as the initial scan on Parquet files, whether they are plain Parquet files or part of an Apache Iceberg table, because of the unlimited splits feature. With unlimited splits enabled, query plans are always generated with ICEBERG_SUB_SCAN, so this is expected behaviour.

The support key for turning off unlimited splits was removed in v25, per the release note below. If unlimited splits had been disabled in an older version of Dremio such as v22.x, Parquet files would have been read with PARQUET_ROW_GROUP_SCAN, so the use of ICEBERG_SUB_SCAN as the initial scan only becomes apparent after upgrading to v25.x.

Removed the following support keys because they were enabled by default over several major releases:

dremio.execution.support_unlimited_splits (introduced as enabled by default in 21.0)
dremio.iceberg.enabled (introduced in 11.0, enabled by default in 21.0)
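
The reasoning above can be summarised as a small decision function. This is an illustrative sketch only, based solely on the statements in this article: the function name and parameters are hypothetical, and versions before 21.0 are deliberately excluded because this article does not describe their defaults.

```python
def expected_leaf_scan(major_version, unlimited_splits_disabled=False):
    """Return the leaf scan operator this article leads one to expect.

    major_version: Dremio major version as an integer, e.g. 22 or 25.
    unlimited_splits_disabled: whether the support key
        dremio.execution.support_unlimited_splits had been turned off.
    """
    if major_version < 21:
        # The article only covers 21.0 onwards, so this sketch refuses to guess.
        raise ValueError("versions before 21 are not covered by this sketch")
    if major_version >= 25:
        # The support key was removed in v25, so unlimited splits cannot be disabled.
        return "ICEBERG_SUB_SCAN"
    # 21.x to 24.x: unlimited splits is on by default but could be switched off.
    if unlimited_splits_disabled:
        return "PARQUET_ROW_GROUP_SCAN"
    return "ICEBERG_SUB_SCAN"


if __name__ == "__main__":
    print(expected_leaf_scan(22, unlimited_splits_disabled=True))
    print(expected_leaf_scan(25, unlimited_splits_disabled=True))
```

The second call shows the situation in this article: a setting that applied in v22.x no longer has any effect after the upgrade to v25.x.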

Steps to Resolve

No further action is needed, as this is expected behaviour.

Additional Resources

Viewing a Raw Profile: https://docs.dremio.com/current/sonar/monitoring/jobs/raw-profile/
v25.0.0 Release Notes: https://docs.dremio.com/current/release-notes/version-250-release/#whats-new-8