Summary
When accessing an Iceberg table from Dremio that contains Parquet files with missing or invalid field IDs in the Parquet schema, all columns will show NULL values. This is due to how Dremio maps field names (column names) to field IDs based on the Iceberg specification.
Reported Issue
An Iceberg table has Parquet files added that do not contain field IDs in the schema or have invalid field IDs. When queried from Dremio, all columns in the table show NULL values.
Relevant Versions
Dremio v21+
Troubleshooting Steps
Examine the Parquet schema in one or more of the files in the table for invalid or absent field IDs. The pyarrow libraries provide straightforward functions for reading Parquet data and metadata.
Cause
When querying an Iceberg table, Dremio uses the mapping between field names in the table's schema and the field IDs in the data files' schema to project the correct columns. If field IDs are missing or invalid, the Iceberg format specifies returning NULL for each row in that column.
As an example, consider the Parquet schema for the data files from our example table:
required group field_id=-1 spark_schema {
  optional int64 field_id=-1 customer_id;
  optional int64 field_id=-1 purchase_count;
  optional binary field_id=-1 source_tag (String);
  optional int32 field_id=-1 subscription_init_date (Date);
  optional int96 field_id=-1 last_login_timestamp;
}
All of the columns have the same invalid field ID (field_id) value of -1, so a query engine strictly adhering to the Iceberg specification will render the values for each as null.
Steps to Resolve
Option 1 - Enable the dremio.iceberg.fallback_to_name_based_reader support key at the session or system level to bypass strict field-ID-based column mapping and fall back to mapping by field name:
ALTER SESSION set "dremio.iceberg.fallback_to_name_based_reader"=true;
-- or
ALTER SYSTEM SET "dremio.iceberg.fallback_to_name_based_reader"=true;
If you set the key at the session level, your queries against the table must be run in the same session after this command.
Option 2 - If you are using Dremio v24.3.10+ or v25.1+, you can set a default mapping between field names and IDs in an Iceberg table's metadata JSON file with the schema.name-mapping.default property. This mapping is used to assign IDs to column names when a data file's schema does not contain field IDs. For our example file, the mapping would look like:
[
{
"field-id": 1,
"names": [
"customer_id"
]
},
{
"field-id": 2,
"names": [
"purchase_count"
]
},
{
"field-id": 3,
"names": [
"source_tag"
]
},
{
"field-id": 4,
"names": [
"subscription_init_date"
]
},
{
"field-id": 5,
"names": [
"last_login_timestamp"
]
}
]
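With an engine such as Spark, this property can be set on the table with a standard ALTER TABLE ... SET TBLPROPERTIES statement; the catalog, database, and table names below are illustrative:

```sql
ALTER TABLE my_catalog.db.customers SET TBLPROPERTIES (
  'schema.name-mapping.default' =
    '[{"field-id": 1, "names": ["customer_id"]},
      {"field-id": 2, "names": ["purchase_count"]},
      {"field-id": 3, "names": ["source_tag"]},
      {"field-id": 4, "names": ["subscription_init_date"]},
      {"field-id": 5, "names": ["last_login_timestamp"]}]'
);
```

Engines that honor the Iceberg name-mapping specification will then resolve columns by name in files that lack field IDs.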
Best Practices
Files generated outside of the Iceberg context may have absent or invalid field IDs, but can still be imported into an Iceberg table via functions in some catalogs, such as the Spark add_files procedure. In these cases, you should ensure there are valid field IDs in the Parquet file schemas.
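For reference, a sketch of the Spark add_files procedure mentioned above; the catalog, table, and source path are illustrative:

```sql
CALL my_catalog.system.add_files(
  table => 'db.customers',
  source_table => '`parquet`.`s3://bucket/path/to/files`'
);
```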
If you use Iceberg DML operations to add data to your table, the resulting files will have valid field IDs because they will be written to conform to the Iceberg specification.