Summary
This article discusses the "OUT_OF_MEMORY ERROR: Query was cancelled because it exceeded the memory limits set by the administrator" error that occurs during the execution planning stage in Dremio versions prior to 22.x. The error is more likely to occur when running memory-intensive SQL queries involving operations like UNION or COUNT(DISTINCT).
Reported Issue
When attempting to run queries that make use of memory-intensive SQL functions such as UNION, COUNT(DISTINCT col), or similar, the query may be canceled with the "OUT_OF_MEMORY ERROR: Query was cancelled because it exceeded the memory limits set by the administrator. Expected at least XXX GB bytes, but only had XXX GB available" error.
Relevant Versions
This issue applies to Dremio versions earlier than 22.x.
Troubleshooting Steps
1. Navigate to the failing job profile as per the following documentation link steps: https://docs.dremio.com/current/sonar/monitoring/jobs/raw-profile/
2. After navigating to the Raw Profile, click on the Error tab of the profile.
3. Confirm the Error is the same as mentioned in this article.
As an example (with specific memory requirements), the following error was seen in the job profile:
OUT_OF_MEMORY ERROR: Query was cancelled because it exceeded the memory limits set by the administrator. Expected at least 539.95GB bytes, but only had 343.75GB available.
with the below corresponding error being seen in the server.log
of the coordinator node
2023-02-14 12:27:03,471 [1c1481fe-c87c-e4f0-65f0-911bd47b2100:execution-planning] ERROR c.d.e.util.MemoryAllocationUtilities - Query was cancelled because it exceeded the memory limits set by the administrator. Expected at least 539.95GB bytes, but only had 343.75GB available.
Size requirement for memory intensive ops is 91.69GB bytes.
Missing memory = 196.20GB bytes, Number of Memory intensive ops = 278849, Other ops = 3461401, Endpoint = xxxx
com.dremio.common.exceptions.UserException: Query was cancelled because it exceeded the memory limits set by the administrator. Expected at least 539.95GB bytes, but only had 343.75GB available.
Size requirement for memory intensive ops is 91.69GB bytes.
Missing memory = 196.20GB bytes, Number of Memory intensive ops = 278849, Other ops = 3461401, Endpoint = xxxx
at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:885)
at com.dremio.exec.util.MemoryAllocationUtilities.setMemory(MemoryAllocationUtilities.java:165)
at com.dremio.exec.util.MemoryAllocationUtilities.setupBoundedMemoryAllocations(MemoryAllocationUtilities.java:104)
at com.dremio.exec.maestro.planner.ExecutionPlanCreator.getExecutionPlan(ExecutionPlanCreator.java:112)
at com.dremio.exec.maestro.QueryTrackerImpl.planExecution(QueryTrackerImpl.java:122)
at com.dremio.exec.maestro.MaestroServiceImpl.lambda$executeQuery$2(MaestroServiceImpl.java:167)
at com.dremio.service.commandpool.ReleasableBoundCommandPool.lambda$getWrappedCommand$3(ReleasableBoundCommandPool.java:138)
at com.dremio.service.commandpool.CommandWrapper.run(CommandWrapper.java:62)
at com.dremio.context.RequestContext.run(RequestContext.java:95)
at com.dremio.common.concurrent.ContextMigratingExecutorService.lambda$decorate$3(ContextMigratingExecutorService.java:199)
at com.dremio.common.concurrent.ContextMigratingExecutorService$ComparableRunnable.run(ContextMigratingExecutorService.java:180)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
4. Carry out the steps to resolve below.
Cause
After Dremio has created a query plan, it estimates how much memory each operator in that plan may use during query execution, and adds them up. It may overestimate this memory requirement and prevent the query from executing if the memory required for execution is more than the direct memory available on each executor.
This OOM error is more likely to occur when the query includes memory intensive operators such as UNION and COUNT(DISTINCT col). It is possible to verify this within the raw query profile in the "Planning" tab under the "Convert To Rel" phase, where lots of LogicalUnion and LogicalAggregate relational operators will be seen.
Steps to Resolve
Set the support key planner.memory.aggressive and retry the query. This key allows the planner to make a more optimistic estimate of the required memory, potentially allowing the query to execute successfully.
Tips & Tricks
It is important to note this particular OOM error occurs at the "Execution Planning" stage (before execution) of the query and as such, should not be confused with the similar error message:
OUT_OF_MEMORY ERROR: Query was cancelled because it exceeded the memory limits set by the administrator. This message occurs during execution and does not include the "Expected at least" and "but only had" section.
This works in general cases and may not be ideal for scenarios where there are other underlying issues with the query or cluster. For queries where this key does not make an improvement, this may indicate the query itself needs to be optimized or re-worked, or may in fact, require more memory to execute successfully.
FAQ
Q. Do I need to restart Dremio for the support key change to take effect?
A. A restart of Dremio is not required to apply this change, and this setting is enabled by default within Dremio versions 22.x and onwards.
Additional Resources
Viewing a Raw Profile: https://docs.dremio.com/current/sonar/monitoring/jobs/raw-profile/
Heap Monitor and Out of Memory Errors: https://docs.dremio.com/current/get-started/cluster-deployments/customizing-configuration/dremio-env/memory-config/#heap-monitor