Summary
During dataset promotion or while querying an established Parquet dataset, an error is encountered with a message saying "Unable to coerce from the file's data type 'int64' to the column's data type 'timestamp'". This indicates that one of the files in the dataset has a microsecond or nanosecond precision timestamp column, which Dremio does not currently read.
Reported Issue
After having formatted (promoted) a Parquet dataset containing multiple files, some or all queries fail with an error message similar to the following example:
Unable to coerce from the file's data type "int64" to the column's data type "timestamp" in table "Lakehouse_S3.user_stats", column "JoinDate" and file "/lake_bucket/user_stats/data_file_1.parquet"
The identified column and file will be different for your instance of the problem, but the messaging about the type mismatch should be the same.
If the dataset has not been previously formatted, an initial promotion attempt may also fail with this error.
Relevant Versions
All currently supported versions of Dremio Software (24+) and Dremio Cloud.
Troubleshooting Steps
Check the Parquet schema of the file identified in the error message. There will be Parquet logical type annotations that indicate the field should be interpreted as a microsecond or nanosecond precision timestamp.
For the example described here, the "user_stats" table has just 2 files; data_file_0.parquet and data_file_1.parquet. Their schemas, as reported by the pyarrow.parquet Python module, are shown below:
// schema for data_file_0.parquet
required group field_id=-1 schema {
optional int32 field_id=-1 UserID;
optional int64 field_id=-1 JoinDate (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
optional boolean field_id=-1 IsActiveUser;
optional int64 field_id=-1 BlockDate (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
optional int64 field_id=-1 UnsubscribeDate (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
optional int32 field_id=-1 UserType;
optional int64 field_id=-1 Last_Email_Sent_Date (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
}
// schema for data_file_1.parquet
required group field_id=-1 schema {
optional int32 field_id=-1 UserID;
optional int64 field_id=-1 JoinDate (Timestamp(isAdjustedToUTC=true, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));
optional boolean field_id=-1 IsActiveUser;
optional int64 field_id=-1 BlockDate (Timestamp(isAdjustedToUTC=true, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));
optional int64 field_id=-1 UnsubscribeDate (Timestamp(isAdjustedToUTC=true, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));
optional int32 field_id=-1 UserType;
optional int64 field_id=-1 Last_Email_Sent_Date (Timestamp(isAdjustedToUTC=true, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));
}
In both schemas, the column "BlockDate" has a primitive type of int64 and a logical type of Timestamp. However, this column's annotations have timeUnit=milliseconds in the schema of for data_file_0.parquet and timeUnit=nanoseconds in the schema for data_file_1.parquet, the file reported in the error.
Cause
Dremio does not currently read microsecond precision or nanosecond precision timestamps. When initially promoting the dataset, Dremio samples one of the Parquet files to determine the schema and assumes this is (more-or-less) uniform across the whole dataset. In our example, if it sampled data_file_0.parquet, then "BlockDate" along with the other timestamp columns have millisecond precision, which Dremio can read, then the promotion is successful. Later, when more of the files in the dataset are read during a query, those files with nanosecond precision cause the error.
Steps to Resolve
There is no truncation or casting workaround within Dremio that addresses this problem, so the data files will have to be re-rewritten with with millisecond precision timestamps for the identified columns.
Support for microsecond and nanosecond precision timestamps is scheduled for Dremio 26