Summary
This article explains why Dremio cannot read Iceberg tables managed by Google BigQuery through a filesystem source, and describes a workaround.
Reported Issue
Users are unable to read Iceberg tables generated via Google BigQuery when using Dremio.
Overview
When you try to format a folder as an Iceberg table in Dremio, you may see the following error message:
“This folder does not contain a filesystem-based Iceberg table. If the table in this folder is managed via a catalog such as Hive, Glue, or Nessie, please use a data source configured for that catalog to connect to this table.”
Relevant Versions, Tools, and Integrations
This issue applies to all versions of Dremio and all deployment types.
Steps to Resolve
N/A
Details
To maintain consistency, Dremio does not allow filesystem source access to Iceberg tables managed by an external catalog. For filesystem sources, Dremio uses HadoopCatalog, which is strictly disk-based. This catalog requires a specific table layout to work properly:
- An Iceberg table must have a directory structure that includes a metadata directory.
- The metadata directory must contain the version-hint.text file, along with at least one metadata file whose name ends in .metadata.json.
Dremio verifies that the provided filesystem and file structure adhere to this Iceberg table format. Specifically, Dremio checks for the existence of both the metadata directory and the version hint file.
Because BigQuery-managed Iceberg tables do not follow this layout, Dremio cannot recognize or read them.
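The layout check described above can be approximated outside Dremio to see why a given folder is rejected. The following is a minimal sketch, not Dremio's actual implementation. It assumes the table folder is reachable as a local path (for example, a copy downloaded from GCS), and the function name `looks_like_hadoop_iceberg_table` is hypothetical. The hint file name is a parameter that defaults to `version-hint.text`, the name used by Apache Iceberg's Hadoop tables.

```python
from pathlib import Path


def looks_like_hadoop_iceberg_table(table_dir, hint_name="version-hint.text"):
    """Check a local folder for the filesystem-based Iceberg layout.

    Returns (ok, problems), where problems lists each missing piece.
    This is an illustrative approximation, not Dremio's own check.
    """
    metadata_dir = Path(table_dir) / "metadata"
    problems = []

    # Without a metadata directory there is nothing else to inspect.
    if not metadata_dir.is_dir():
        problems.append("missing metadata directory")
        return False, problems

    # The version hint file points at the current metadata version.
    if not (metadata_dir / hint_name).is_file():
        problems.append(f"missing {hint_name}")

    # At least one table metadata file must be present.
    if not any(metadata_dir.glob("*.metadata.json")):
        problems.append("no *.metadata.json file")

    return not problems, problems
```

Running this against a local copy of a BigQuery-managed table folder would be expected to report which piece of the layout is absent, which can help confirm that the error comes from the layout rather than from permissions.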
Workaround
If the table's Parquet data files are already stored in Google Cloud Storage (GCS) and Dremio has access to them, you can convert the Parquet folder into an Iceberg table in Dremio using a CTAS (Create Table As Select) query.
To do this, ensure that the default CTAS format is set to ICEBERG under the Advanced Options of your source.
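As an illustration of the CTAS query, the sketch below builds the statement text in Python. The source, folder, and table names (`gcs_source`, `bq_exports`, `iceberg_copies`, `orders`) and the helper `build_ctas` are hypothetical. The sketch also assumes the Parquet folder is already queryable as a dataset in Dremio, and it does not handle identifiers that contain double quotes. The optional `where` argument is one possible way to restrict the copied rows.

```python
def quote_ident(part):
    """Wrap one path component in double quotes, as Dremio SQL expects."""
    return '"' + part + '"'


def build_ctas(target_path, source_path, where=None):
    """Build a CTAS statement from paths given as lists of components.

    target_path and source_path are hypothetical example paths; where is
    an optional filter expression appended as-is.
    """
    target = ".".join(quote_ident(p) for p in target_path)
    source = ".".join(quote_ident(p) for p in source_path)
    sql = f"CREATE TABLE {target} AS SELECT * FROM {source}"
    if where:
        sql += f" WHERE {where}"
    return sql


statement = build_ctas(
    ["gcs_source", "iceberg_copies", "orders"],  # new Iceberg table
    ["gcs_source", "bq_exports", "orders"],      # existing Parquet folder
)
print(statement)
```

The resulting statement can be run in the Dremio SQL editor or submitted through whichever client you already use. With the default CTAS format set to ICEBERG, the new table is written as an Iceberg table.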
Note:
- The workaround creates a new Iceberg table in Dremio, so future updates to the original BigQuery-managed table won’t be reflected. To get the latest data, you’ll need to refresh or recreate the Iceberg table in Dremio. Automating the CTAS operation on a schedule can help keep the table updated.
- Because a CTAS operation copies the data as it exists when the query runs, the new table may include data from older snapshots. To avoid this, use the most current snapshot or filter the query to the desired timeframe.
Common Challenges
N/A
Additional Resources
https://docs.dremio.com/current/sonar/query-manage/data-formats/apache-iceberg/