Summary
When accessing an Iceberg table from Dremio that contains Parquet files with missing or invalid field IDs in the Parquet schema, all columns will show NULL values. This is due to how Dremio maps field names (column names) to field IDs based on the Iceberg specification.
Reported Issue
An Iceberg table has Parquet files added that do not contain field IDs in the schema or have invalid field IDs. When queried from Dremio, all columns in the table show NULL values.
Relevant Versions
Dremio v21+
Troubleshooting Steps
Examine the Parquet schema in one or more of the files in the table for invalid or absent field IDs. The pyarrow libraries provide straightforward functions for reading Parquet data and metadata.
Cause
When querying an Iceberg table, Dremio uses the mapping between field names in the table's schema and the field IDs in the data files' schema to project the correct columns. If field IDs are missing or invalid, the Iceberg format specifies returning NULL for each row in that column.
As an example, consider the Parquet schema for the data files from our example table:
required group field_id=-1 spark_schema {
  optional int64 field_id=-1 customer_id;
  optional int64 field_id=-1 purchase_count;
  optional binary field_id=-1 source_tag (String);
  optional int32 field_id=-1 subscription_init_date (Date);
  optional int96 field_id=-1 last_login_timestamp;
}
All of the columns have the same invalid field ID (field_id) value of -1, so a query engine strictly adhering to the Iceberg specification will render the values for each as null.
Steps to Resolve
Option 1 - Enable the dremio.iceberg.fallback_to_name_based_reader support key at the session or system level to bypass strict field-ID-based column mapping and fall back to mapping by field name:
ALTER SESSION set "dremio.iceberg.fallback_to_name_based_reader"=true;
-- or
ALTER SYSTEM SET "dremio.iceberg.fallback_to_name_based_reader"=true;
If you set the key at the session level, your queries against the table must be run in the same session after this command.
Option 2 - If you are using Dremio v24.3.10+ or v25.1+, you can set a default mapping between field names and IDs in an Iceberg table's metadata JSON file with the schema.name-mapping.default property. This mapping is used to assign IDs to column names when a data file's schema does not contain field IDs. For our example file, the mapping would look like:
[
{
"field-id": 1,
"names": [
"customer_id"
]
},
{
"field-id": 2,
"names": [
"purchase_count"
]
},
{
"field-id": 3,
"names": [
"source_tag"
]
},
{
"field-id": 4,
"names": [
"subscription_init_date"
]
},
{
"field-id": 5,
"names": [
"last_login_timestamp"
]
}
]
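With an engine such as Spark, this property can be set on the table with a standard ALTER TABLE ... SET TBLPROPERTIES statement; the catalog, database, and table names below are illustrative:

```sql
ALTER TABLE my_catalog.db.customers SET TBLPROPERTIES (
  'schema.name-mapping.default' =
    '[{"field-id": 1, "names": ["customer_id"]},
      {"field-id": 2, "names": ["purchase_count"]},
      {"field-id": 3, "names": ["source_tag"]},
      {"field-id": 4, "names": ["subscription_init_date"]},
      {"field-id": 5, "names": ["last_login_timestamp"]}]'
);
```

Engines that honor the Iceberg name-mapping specification will then resolve columns by name in files that lack field IDs.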
Best Practices
Files generated outside of the Iceberg context may have absent or invalid field IDs, but can still be imported into an Iceberg table via functions in some catalogs, such as the Spark add_files procedure. In these cases, you should ensure there are valid field IDs in the Parquet file schemas.
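For reference, a sketch of the Spark add_files procedure mentioned above; the catalog, table, and source path are illustrative:

```sql
CALL my_catalog.system.add_files(
  table => 'db.customers',
  source_table => '`parquet`.`s3://bucket/path/to/files`'
);
```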
If you use Iceberg DML operations to add data to your table, the resulting files will have valid field IDs because they will be written to conform to the Iceberg specification.