What is an OOM Killer?
When the system is running out of memory, the kernel selects a process to kill and does so by sending a signal (either a SIGTERM followed by a SIGKILL, or directly a SIGKILL). SIGKILL is the hard OOM Killer signal we address in this article. Note that in the case of an OOM Killer event (as opposed to a soft OOM, where the JVM itself runs out of memory) no heap dump is generated, because the process is killed before the JVM can react.
How to identify that an OOM Killer took place?
If the Dremio process stops running and doesn't log any clues in the server.log files regarding the reasons why, then check the system error messages (generally in /var/log/messages*) for any signs of an OOM Killer. The line in question would look like this:
Day 01 01:23:45 hostname kernel: Out of memory: Kill process 1234 (java) score 789 or sacrifice child
Day 01 01:23:45 hostname kernel: Killed process 1234, UID 567, (java) total-vm:30122500kB, anon-rss:28457764kB, file-rss:171432kB
A score of 789 means the process was using roughly 78.9% of the memory when it was killed.
The most common command to list the latest OOM Killer events is:
grep -i 'killed process' /var/log/messages
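If your distribution logs kernel messages to the systemd journal rather than /var/log/messages, a similar check (a minimal sketch, assuming the journal or the kernel ring buffer still holds the messages from the event) is:
dmesg -T | grep -i 'killed process'
journalctl -k | grep -i 'killed process'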
How to configure an OOM Killer
The Dremio nodes can be configured so the system exhibits a specific behavior when an OOM Killer event takes place.
Automatically rebooting the system after an OOM Killer event may be preferable to waiting for administrator intervention. A reboot will restore the Dremio service if it is configured to start automatically after a reboot. Configuring alerts so an administrator is automatically notified when a Dremio node goes down is strongly recommended. Finding the root cause of the OOM Killer is important, as it can happen again.
The following settings will cause the system to panic and reboot in an out-of-memory condition. The sysctl commands set this in real time, and appending the settings to /etc/sysctl.conf allows them to survive reboots. The X for kernel.panic is the number of seconds to wait before the system reboots after the panic. Adjust this value to meet the needs of your environment.
sysctl vm.panic_on_oom=1
sysctl kernel.panic=X
echo "vm.panic_on_oom=1" >> /etc/sysctl.conf
echo "kernel.panic=X" >> /etc/sysctl.conf
One can also tune the system so a given process is more or less likely to be selected by the OOM Killer. If multiple processes run on the same machine, you may prefer that one of them is killed before the others.
To make a process less likely to be killed first, run:
echo -15 > /proc/<pid>/oom_adj
To make a process more likely to be killed first, run:
echo 10 > /proc/<pid>/oom_adj
Replace <pid> with the ID of the process you want to affect.
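On newer kernels oom_adj is deprecated in favour of oom_score_adj, which accepts values from -1000 (never kill) to 1000 (kill first). A minimal sketch of making the Dremio process less likely to be killed, assuming its PID can be found with pgrep:
DREMIO_PID=$(pgrep -f dremio | head -1)
echo -500 | sudo tee /proc/$DREMIO_PID/oom_score_adj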
Common causes of OOM Killer
- The heap size is too large (see the quick check after this list).
- The node doesn't have enough memory.
- Improper kernel settings, in particular the swap and Java HugePages settings.
- Memory fragmentation, in general caused by either a Java bug or a product misconfiguration (for example, using native memory allocation (malloc) instead of jemalloc; jemalloc results in less memory fragmentation than malloc).
- Other processes running on the machine are using more memory.
- An Operating System memory leak bug. A known example on the supported Dremio platforms: https://bugs.centos.org/view.php?id=14303
- Improper manual setting of the Direct Memory.
- A possible product issue (for existing product bugs, check the release notes).
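A quick sanity check for the heap and Direct Memory causes above is to compare the configured values against the node's physical RAM. A minimal sketch, assuming a default install with the configuration in /opt/dremio/conf/dremio-env and the standard DREMIO_MAX_HEAP_MEMORY_SIZE_MB and DREMIO_MAX_DIRECT_MEMORY_SIZE_MB settings:
grep -E 'DREMIO_MAX_(HEAP|DIRECT)_MEMORY_SIZE_MB' /opt/dremio/conf/dremio-env
free -m    ## the sum of heap + direct memory should leave headroom for the OS and any other processes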
Troubleshooting an OOM Killer
Follow these common troubleshooting steps for OOM Killer events:
1. Check in the messages logs if there are any pauses or memory pressure events (controlled by /sys/fs/cgroup/memory/.../memory.pressure_level - see https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt) prior to the OOM Killer.
2. Check the score of the Dremio process. For example:
Day 01 01:23:45 hostname kernel: Out of memory: Kill process 1234 (java) score 789 or sacrifice child
Day 01 01:23:45 hostname kernel: Killed process 1234, UID 567, (java) total-vm:30122500kB, anon-rss:28457764kB, file-rss:171432kB
A score of 789 means the process was using roughly 78.9% of the RAM when it was killed. The higher the score, the more likely the issue is related to capacity planning. The lower the score, the more likely the issue is caused by memory fragmentation or a sub-optimal kernel setting. (For more details, see Common causes of OOM Killer above.)
3. Check if there are large partitions being compacted prior to the OOM Killer, as they will have to be loaded into memory and this can push up the JVM heap usage.
4. Check in the server.gc log if there were large GC pauses (over 500ms) prior to the OOM Killer; such pauses can indicate an overloaded system and require further troubleshooting. For example:
2020-01-00 01:23:34,001 GCInspector.java:282 - G1 Old Generation GC in 25206ms. G1 Old Gen: 4955501376 -> 43558558312;
2020-01-00 01:23:46,001 GCInspector.java:282 - G1 Old Generation GC in 1511ms. G1 Old Gen: 48438337984 -> 47990282352;
2020-01-00 01:23:57,001 GCInspector.java:282 - G1 Old Generation GC in 7980ms. G1 Old Gen: 48447153264 -> 48138959480;
5. Check if the process memory usage is growing over time by monitoring free -m and/or sssd_nss over a 24-48 hour period (you can use a command like ps -C sssd_nss -o size,pid,user,args). You can include the Dremio process too: ps -C java -o size,pid,user,args
When the OOM Killer fires, the kernel also dumps a per-process memory table to the messages log; if you save that table to a file (for example oom), the following command converts the RSS column from pages to bytes (assuming 4 kB pages) and totals it:
cat oom | cut -d\] -f2,3 | sort -k4 -nr | awk '{rrss=$4*4096; total+=rrss; print $0, "Real RSS = " rrss} END {print "TOTAL: " total}'
6. You can monitor total memory usage on the node for all processes (ps, top, or dstat) and the Dremio process over time. If you identify that the Dremio process is growing, generate a heap dump for further investigation. For example, you can add the following to cron on all nodes to run every fifteen minutes (writing to one output file per node, for example /tmp/ps-NODE-IP_ADDRESS.out):
date >> /tmp/ps-NODE-IP_ADDRESS.out; ps -eo pid,cmd,%mem,%cpu --sort=-%mem | head >> /tmp/ps-NODE-IP_ADDRESS.out
The output will look something like this:
Thu Jan 01 00:00:00 EST 2020
PID CMD %MEM %CPU
18416 /usr/lib/chromium-browser/c 5.9 2.3
6781 /usr/lib/firefox/firefox -c 5.6 9.6
32759 /usr/local/idea/jre64/bin/j 5.2 4.3
7235 /usr/lib/firefox/firefox -c 4.7 36.7
29528 /usr/lib/slack/slack --type 1.9 8.3
Each run adds only about 470 bytes, or roughly 45 kB per day. When you set up the cron job, make sure you use >> to redirect the output of each command to the file, not a single >. A single > would overwrite the file each time, while >> appends to the existing file, which is what we want so we can capture a historical record.
7. There are a number of other tools available for monitoring memory and system performance when investigating issues of this nature. Tools such as sar (System Activity Reporter) and dtrace (Dynamic Tracing) are useful for collecting specific data about system performance over time. For even more visibility, dtrace probes can be configured to fire when the kernel kills a process due to an OOM condition.
8. One can also set up Java Flight Recorder (via Oracle Java Mission Control / JDK Mission Control) to record JVM activity while the problem is occurring. The resulting .jfr recording file can then be analysed in Mission Control. Details about Flight Recorder can be found here:
JDK Mission Control / Repository for OpenJDK Mission Control
You will need a JDK installed on the machine where you run Mission Control (this does not need to be a Dremio node; it can be your workstation).
Enable flight recorder
Java Flight Recorder gives an in-depth look at what is happening inside the JVM and provides further information to analyse. Running JFR is fairly straightforward. To start a JFR recording, do the following on the server in question.
First, make sure that ps -ef | grep [d]remio returns only one result, the Dremio process itself. For instance, make sure you are not editing a file with dremio in its filename.
Then you can run:
export DREMIO_PID=$(ps ax | grep dremio | grep -v grep | awk '{print $1}')
sudo -u dremio jcmd $DREMIO_PID VM.unlock_commercial_features ## Only needed on older Oracle JDK versions where Flight Recorder is still a commercial feature
sudo -u dremio jcmd $DREMIO_PID JFR.start name="DR_JFR" settings=profile maxage=3600s filename=<FULL_PATH_TO_FILE>/coordinator.jfr dumponexit=true
For reference, Oracle cites the performance overhead of Flight Recorder at around 1%.
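Once the problem has reproduced, the recording can be dumped and stopped with jcmd as well. A minimal sketch, reusing the DR_JFR recording name and the placeholder path from above (the dump filename here is just an example):
sudo -u dremio jcmd $DREMIO_PID JFR.dump name="DR_JFR" filename=<FULL_PATH_TO_FILE>/coordinator-dump.jfr
sudo -u dremio jcmd $DREMIO_PID JFR.stop name="DR_JFR"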
OOM-Killer in Kubernetes
The OOMKilled error, also indicated by exit code 137, means that a container or pod was terminated because it used more memory than allowed. Kubernetes allows pods to limit the resources their containers may use on the host machine. Once memory usage reaches the limit (either overall on the node, or per pod), the pod can be terminated with an OOMKilled status.
One can identify the error by running the kubectl get pods command; the pod status will appear as OOMKilled:
NAME READY STATUS RESTARTS AGE
my-pod-1 0/1 OOMKilled 0 1m2s
The Linux kernel maintains an oom_score for each process running on the host. The higher this score, the greater the chance that the process will be killed. Another value, called oom_score_adj, allows users to customize the OOM process and define when processes should be terminated.
Kubernetes uses the oom_score_adj value when defining a Quality of Service (QoS) class for a pod. There are three QoS classes that may be assigned to a pod: Guaranteed, Burstable, BestEffort.
Each QoS class has a matching value for oom_score_adj:
QUALITY OF SERVICE OOM_SCORE_ADJ
Guaranteed -997
BestEffort 1000
Burstable min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)
Because “Guaranteed” pods have a lower value, they are the last to be killed on a node that is running out of memory. “BestEffort” pods are the first to be killed.
A pod that is killed due to a memory issue is not necessarily evicted from a node: if the pod's restart policy is set to “Always”, the kubelet will try to restart it.
To see the QoS class of a pod, run the following command:
kubectl get pod <POD_NAME> -o jsonpath='{.status.qosClass}'
To see the oom_score of a pod:
Run kubectl exec -it <POD_NAME> -- /bin/bash to open a shell inside the pod.
To see the oom_score, run cat /proc/<pid>/oom_score
To see the oom_score_adj, run cat /proc/<pid>/oom_score_adj
Replace <pid> with the ID of the process inside the container (often 1 for the container's main process).
The pod with the highest oom_score is the first to be killed when the node runs out of memory.
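To investigate an OOMKilled pod further, it helps to compare the container's configured memory requests and limits with the reason recorded for its last termination. A minimal sketch using kubectl, assuming <POD_NAME> is the affected pod:
kubectl get pod <POD_NAME> -o jsonpath='{.spec.containers[*].resources}'
kubectl get pod <POD_NAME> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
kubectl describe pod <POD_NAME> | grep -A 6 'Last State'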