Summary
Some queries with hash joins and runtime filtering were giving incorrect and inconsistent results. The root cause is a defect in the runtime filtering that effects Parquet datasets with files that have missing column statistics. The defect is fixed in Dremio 24.3.9, 24.4.1, 25.0.8 and 25.1.0
Reported Issue
Multiple runs of the same query in relatively short succession were giving different results. Usually precise information about which records were different between the runs was not discernible because of the large size of the result sets, but the COUNT aggregate variations of the same queries (i.e. replacing SELECT * <SQL> with SELECT COUNT(*) FROM <SQL>) showed fluctuating record counts.
The types of queries showing this problem all had one or more hash joins directly in their SQL or somewhere in their query plans after view expansion.
Relevant Versions
The defect was reported in 24.3.4 but could possibly effect any of the minor versions leading up to the fix in 24.3.9, 24.4.1, 25.0.8 and 25.1.0.
Troubleshooting Steps
After identifying the query with this problem try the following:
1. Inspect the query plan, either in the visual profile or with the results of from running EXPLAIN PLAN FOR <your query SQL>. Does the query have one or more hash joins? If it does, you can proceed to the next troubleshooting step. If it does not, the problem is not caused by this defect.
2. Try running the query multiple times with runtime filtering disabled for non-partition columns at the query session level:
ALTER SESSION SET "exec.non_partitioned_parquet.enable_runtime_filter"=false;
Do the results still change between runs? If the results remain unchanged between runs with the runtime filtering disabled, then query is likely affected by the defect.
If the problem persists and results are still changing between runs, create a ticket with Dremio Support with the query profiles.
The next steps are optional but can help you narrow down the problem to specific tables.
3. Review the query plans of the problematic jobs as pictured in the visual profile. The plan branches at every hash join. Follow the left hand side branches, called the "probe" side, to the TABLE_FUNCTION operators with dataset names below them. These are the actual scans of the data files where the runtime filtering is applied. Note the operator numbers of these scans. There may be multiple across the profile.
4. Compare the same operators between the profiles. Do the number of records differ noticeably between them? The visual profile does round the output of the operators when the sizes get large, so you may not be able to discern small differences in record counts.
5. Review the final physical plan for the query as reported by the EXPLAIN PLAN FOR <your query SQL> or as printed in the raw profile under "Planning"->"Final Physical Transformation". Search for the keyword "runtimeFilters". Do you find them in the data file scan operators identified in the last step?
6. If possible collect some files from the probe side tables you identified.
Cause
Parquet files usually have column-level statistics within their rowgroup metadata, but sometimes they do not. The defect causes runtime filters to not match the rows in the columns that lack statistics, thus "filtering them out" when they may actually match the filter condition.
Dremio Jira reference number: DX-90854
Steps to Resolve
You can disable runtime filtering for non-partition columns across the entire Dremio application to be sure the defect does not effect any query. However, this can degrade performance for many queries.
The recommended solution is to upgrade to at least 24.3.9, 24.4.1, 25.0.8, or 25.1.0, depending on your current Dremio version and upgrade policy.
Additional Resources
Dremio docs on runtime filtering