Summary
Avro schema changes on external Hive tables do not propagate to Dremio.
Reported Issue
Changes to an Avro schema on a Hive external table will not be synced with Dremio.
Relevant Versions
This affects all Dremio versions with a Hive 2 source.
Troubleshooting Steps
Changes to an Avro schema on a Hive external table will not be synced with Dremio.
Cause
This occurs because updates to schemas via TBLPROPERTIES do not automatically update the Hive Metastore which is used by Dremio for schema discovery. This is a documented bug in Hive. Details can be found here: HIVE-15995
For example, if we create an external table in Hive like so:
CREATE EXTERNAL TABLE tab1
PARTITIONED BY (part1 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://<hostname>:8020/hivedata/testdb.db/tab1'
TBLPROPERTIES (
'avro.schema.url'='hdfs://<hostname>:8020/hivedata/schema/tab1_schema.avsc',
'orc.compress'='SNAPPY');
where the tab1_schema.avsc avro schema looks like so:
% cat tab1.avsc
{
"type": "record",
"name": "topLevelRecord",
"fields": [
{"name":"COL1","type":["string","null"],"default":"null"}
,{"name":"COL2","type":["string","null"],"default":"null"}
,{"name":"COL3","type":["string","null"],"default":"null"}
,{"name":"COL4","type":["string","null"],"default":"null"}]
}
Dremio will, when data is loaded and selected, return columns COL1, COL2, COL3 and COL4
So far so good.
If we then amend the Avro schema file to add an additional column "COL5":
% cat tab1_schema_amended.avsc
{
"type": "record",
"name": "topLevelRecord",
"fields": [
{"name":"COL1","type":["string","null"],"default":"null"}
,{"name":"COL2","type":["string","null"],"default":"null"}
,{"name":"COL3","type":["string","null"],"default":"null"}
,{"name":"COL4","type":["string","null"],"default":"null"}
,{"name":"COL5","type":["string","null"],"default":"null"}]
}
and run ALTER TABLE in hive to update the schema:
hive> ALTER TABLE tab1 SET TBLPROPERTIES ('avro.schema.url' = 'hdfs://<hostname>:8020/hivedata/schema/tab1_schema_amended.avsc');
The new column will be visible when queried in Hive and Beeline but will NOT be visible in Dremio. Dremio will continue to return data for columns COL1 to COL4 but will not return data for COL5.
Steps to Resolve
Upgrade to Hive 3 or a release containing the fix for HIVE-15995. Once upgraded, issue the following command in Hive AFTER making changes to an Avro schema file:
ALTER TABLE <table name> UPDATE COLUMNS
This will sync the schema changes with the Hive Metastore. Then in Dremio, run:
ALTER PDS <table_name> REFRESH METADATA
At this point Dremio will return data against the amended schema.
Additional Resources
https://issues.apache.org/jira/browse/HIVE-15995 - Syncing Metastore table with serde schema.