Individual parquet files dictate their own schema, this means that the schema can be different for individual files for a single PDS. When Dremio reads the PDS and finds discrepancies of field/column data types, Dremio will try to perform an implicit conversion where possible.
List of supported implicit conversions: https://docs.dremio.com/software/sql-reference/data-types/#supported-implicit-coercionsconversions-for-file-formatted-data-types
If conversion is not possible, the query will fail with an error:
UNSUPPORTED_OPERATION ERROR: Field [field1] has incompatible types in file and table.
Type in fileschema: [field1: VARCHAR], type in tableschema: [field1: INTEGER]
Supported Implicit Coercions/Conversions for File-formatted Data Types
If the data type found within a source file does not match the data type for the table, one of the following actions is taken:
- During a read, Dremio performs an implicit coercion/conversion from the source file data type to the table data type so it matches the data type defined in Dremio.
- After coercion all SQL processing is performed using the table’s defined schema and data types.
- Dremio automatically updates the table’s data type to incorporate the new data types found and then reruns the query, a process called schema learning.
- If the data types cannot be coerced due to belonging to a different data type family, the query fails to run and you see an error message.
Using Parquet Tools
The more efficient way to establish the deviating parquet file would be to review the files using parquet-tools, and the schema option:
parquet-tools schema <parquet file name>
If the dataset is partitioned, run the
SELECT MAX ... query with a filter on partition. Once you have the partition, copy that subset of files to a new location and then promote them one by one, checking the schema for deviations with a
DESCRIBE <pds name>
There are several sources for parquet-tools; here are a couple of references.