Summary
Individual parquet files dictate their own schema, this means that the schema can be different for individual files for a single PDS. When Dremio reads the PDS and finds discrepancies of field/column data types, Dremio will try to perform an implicit conversion where possible.
Reported Issue
If conversion is not possible, the query will fail with an error:
UNSUPPORTED_OPERATION ERROR: Field [field1] has incompatible types in file and table. Type in fileschema: [field1: VARCHAR], type in tableschema: [field1: INTEGER]
If the data type found within a source file does not match the data type for the table, one of the following actions is taken:
- During a read, Dremio performs an implicit coercion/conversion from the source file data type to the table data type so it matches the data type defined in Dremio.
- After coercion all SQL processing is performed using the table’s defined schema and data types.
- Dremio automatically updates the table’s data type to incorporate the new data types found and then reruns the query, a process called schema learning.
- If the data types cannot be coerced due to belonging to a different data type family, the query fails to run and you see an error message.
Relevant Versions
All versions of Dremio.
Troubleshooting Steps
Using Parquet Tools
The more efficient way to establish the deviating parquet file would be to review the files using parquet-tools, and the schema option: parquet-tools schema <parquet file name>
Using Dremio
If the dataset is partitioned, run the SELECT MAX ...
query with a filter on partition. Once you have the partition, copy that subset of files to a new location and then promote them one by one, checking the schema for deviations with a DESCRIBE <pds name>
Steps to Resolve
Once you have identified the problem parquet file, this should be removed and recreated using the tools already in place, making sure to correct the schema.
Additional Resources
There are several sources for parquet-tools; here are a couple of references.
https://github.com/apache/parquet-mr
https://pypi.org/project/parquet-tools/