Why does the YARN Node Manager die when running a Spark application?
I am running a Spark Java application on YARN with dynamic allocation enabled. The YARN Node Manager halts, and I see java.lang.OutOfMemoryError: GC overhead limit exceeded in the Node Manager logs.
Naturally, I increased the memory for the Node Manager from 1G to 2G and then to 4G, and I still see the same issues.
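(For reference, the Node Manager daemon heap is typically controlled by YARN_NODEMANAGER_HEAPSIZE in yarn-env.sh, or the equivalent NodeManager heap setting in Ambari on HDP; a minimal sketch of the 4G bump described above, assuming a stock yarn-env.sh:)

```sh
# yarn-env.sh -- heap for the NodeManager daemon itself, in MB
# (4096 mirrors the 4G attempt described above; adjust for your cluster)
export YARN_NODEMANAGER_HEAPSIZE=4096
```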
The strange thing is that this app used to work well on our Cloudera cluster; now that we have switched to Hortonworks, I see these issues.
When looking at the Grafana charts for the Node Manager, I can see that the node that died was using only 60% of its heap.
One side question: is it normal for Spark to use Netty and NIO simultaneously? I ask because I see things like:
ERROR server.TransportRequestHandler (TransportRequestHandler.java:lambda$respond$2(226)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=2003440799655, chunkIndex=197}, buffer=FileSegmentManagedBuffer{file=/folder/application_1643748238319_0002/blockmgr-70443374-6f01-4960-90f9-b045f87798af/0f/shuffle_0_516_0.data, offset=55455, length=1320}} to /xxx.xxx.xxx.xxx:xxxx; closing connection
java.nio.channels.ClosedChannelException
at org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.close(...)(Unknown Source)
Anyway, I see the OutOfMemoryError in several scenarios.
YarnUncaughtExceptionHandler
yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(51)) - Thread Thread[Container Monitor,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.io.BufferedReader.<init>(BufferedReader.java:105)
at java.io.BufferedReader.<init>(BufferedReader.java:116)
at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:528)
at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:457)
TransportRequestHandler Error
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.network.util.ByteArrayWritableChannel.<init>(ByteArrayWritableChannel.java:32)
at org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.<init>(SaslEncryption.java:160)
at org.apache.spark.network.sasl.SaslEncryption$EncryptionHandler.write(SaslEncryption.java:87)
and
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.sun.crypto.provider.CipherCore.update(CipherCore.java:666)
at com.sun.crypto.provider.DESedeCipher.engineUpdate(DESedeCipher.java:228)
at javax.crypto.Cipher.update(Cipher.java:1795)
Long Pause
util.JvmPauseMonitor (JvmPauseMonitor.java:run(205)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1326ms
Solution 1:
The main reason for this issue is that your containers are configured with more memory than the machine physically has. Another thing is that the number of vcores should be aligned with the actual hardware (vcores = CPUs × cores per CPU). If you configure 16 GB but your physical machine has only 8 GB, the container will try to allocate 16 GB and YARN will kill the container due to OOM.
Check these settings in YARN, as sketched below: yarn.nodemanager.resource.memory-mb (the memory of a single machine, not the sum across all machines) and yarn.nodemanager.resource.cpu-vcores (CPUs × cores), along with all other vcore-related parameters.
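As a rough sketch (the values below assume a hypothetical worker with 8 cores and 32 GB of RAM; they are placeholders, not recommendations), these properties are set per node in yarn-site.xml, leaving headroom for the OS and the Hadoop daemons themselves:

```xml
<!-- yarn-site.xml: per-node resources YARN may hand out to containers.
     Illustrative values for an assumed 8-core / 32 GB worker. -->
<property>
  <!-- memory of this single machine available for containers, in MB -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value>
</property>
<property>
  <!-- vcores aligned with the physical CPU count of this machine -->
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>
```

With values like these, YARN never promises containers more memory than the node physically has, which is exactly the over-commitment this answer warns about.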
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Dharman |
