Summary
This article helps determine whether the "java.io.IOException: Connection reset by peer" exception reported in the Dremio logs points to a real problem, and how to troubleshoot it if it does.
Reported Issue
Users may notice warning messages in the logs about a connection being closed unexpectedly, with a stack trace similar to the one below:
[UserServer-*] WARN c.d.exec.rpc.RpcExceptionHandler - Exception occurred with closed channel. Connection: /SERVER_IP:31010 <--> /CLIENT_IP:port_number (user client)
java.io.IOException: Connection reset by peer
at java.base/sun.nio.ch.FileDispatcherImpl.readv0(Native Method)
at java.base/sun.nio.ch.SocketDispatcher.readv(SocketDispatcher.java)
at java.base/sun.nio.ch.IOUtil.read(IOUtil.java)
at java.base/sun.nio.ch.IOUtil.read(IOUtil.java)
at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java)
at java.base/java.nio.channels.SocketChannel.read(SocketChannel.java)
at io.netty.buffer.NettyArrowBuf.setBytes(NettyArrowBuf.java)
at io.netty.buffer.MutableWrappedByteBuf.setBytes(MutableWrappedByteBuf.java)
at io.netty.buffer.ExpandableByteBuf.setBytes(ExpandableByteBuf.java)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java)
at java.base/java.lang.Thread.run(Thread.java)
Relevant Versions
This can occur in any Dremio release.
Troubleshooting Steps
1. Find out who the CLIENT_IP belongs to. The client can be another Dremio node, ZooKeeper, or an actual client application. Check the client logs for a pattern in which the server connections are closed after a fixed amount of time, and try to identify whether a timeout is configured for that duration.
2. Does this happen every time a certain query runs? If yes, does that query fail every time after the same number of seconds and on the same client?
3. Check the network connectivity between the server and the client named in the warning message, as any firewall, proxy, client gateway or load balancer can close a connection. If there is an SSL problem, it should be displayed after the stack trace.
4. Is TCP keepalive configured on the client? Try increasing its value and see if that helps.
5. Troubleshoot the network. If in doing so you see a "Broken pipe" error, that can indicate that you have reproduced the problem outside the product.
6. If queries fail when this warning is logged, check the query profiles to see which connector they use. Try running the queries from the UI/REST API and see if the problem reproduces.
7. Is the connection successfully re-established after it is terminated, or is a "Connection refused" error returned? If "Connection refused" is returned, that becomes the new error to troubleshoot.
8. Are there any errors in the OS logs (/var/log/messages or syslog)?
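To look for the per-client and timing patterns mentioned in the first steps, the warnings can be grouped by client address. The following is a minimal Python sketch, not a Dremio tool. It assumes that each warning line starts with a timestamp of the form "YYYY-MM-DD HH:MM:SS,mmm" and that addresses are IPv4 or host names; the function names are hypothetical, and the pattern should be adjusted to the actual log layout.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout: "<timestamp> [thread] WARN ... Connection: /server:port <--> /client:port (...)"
WARNING_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s.*"
    r"Exception occurred with closed channel\.\s+Connection:\s+"
    r"/(?P<server>[^\s:]+):(?P<server_port>\d+)\s+<-->\s+"
    r"/(?P<client>[^\s:]+):(?P<client_port>\d+)"
)


def resets_by_client(lines):
    """Map each client address to the sorted timestamps of its closed-channel warnings."""
    seen = defaultdict(list)
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            seen[match.group("client")].append(stamp)
    return {client: sorted(stamps) for client, stamps in seen.items()}


def gaps_in_seconds(stamps):
    """Seconds between consecutive warnings; near-constant gaps hint at a timeout."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]
```

Passing an open log file to resets_by_client and then calling gaps_in_seconds on each client's timestamps shows whether one client accounts for most warnings and whether they recur at a regular interval, which would suggest an idle timeout on the client or on a device in between.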
Cause
The warning means that the client (CLIENT_IP in the warning message) abruptly aborted the connection in the middle of a transaction. This can have many causes that cannot be controlled from the server side. For example, the end user shut down the client or changed servers abruptly while still interacting with the server, the client program crashed, the end user's internet connection went down, the end user's machine crashed, or a Java component experienced issues (e.g. full or long GC pauses, or the heap monitor cancelling a query).
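The mechanism can be reproduced outside Dremio with plain sockets. The sketch below is illustrative only and assumes a POSIX system: a local client closes its connection with SO_LINGER set to a zero timeout, which makes the operating system send a TCP RST instead of a normal close, and the server side then sees a reset on its next read. The function name provoke_reset is hypothetical.

```python
import socket
import struct


def provoke_reset():
    """Return True if the server side observes a connection reset."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.settimeout(5)

    # Linger enabled with a zero timeout: close() aborts the connection (RST)
    # instead of performing the normal FIN handshake. POSIX struct layout assumed.
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    client.close()

    try:
        server_side.recv(1024)  # a graceful close would return b"" here
        return False
    except ConnectionResetError:
        return True
    finally:
        server_side.close()
        listener.close()
```

Python reports this condition as ConnectionResetError; on the Java side the same operating system error surfaces as "java.io.IOException: Connection reset by peer".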
Steps to Resolve
Most of the time it does not matter that the connection gets closed, as it is normal for such connections to be closed, and servers and clients should handle this gracefully and open new connections when needed. However, if this warning is reported when queries are failing, the troubleshooting steps above can help identify the root cause. Such warnings can also mean that an executor was brought down while live queries were running, or that certain timeouts were reached.
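As an illustration of a client re-opening a connection gracefully, here is a minimal Python sketch. It is not part of any Dremio client library: run_with_reconnect, connect and operation are hypothetical names, and the connection object is only assumed to have a close() method.

```python
import time

# Errors that usually mean the existing connection is gone and a new one is needed.
RETRYABLE = (ConnectionResetError, BrokenPipeError, ConnectionAbortedError)


def run_with_reconnect(connect, operation, attempts=3, delay_seconds=1.0):
    """Run operation(conn) on a fresh connection, reconnecting after a reset."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        conn = None
        try:
            conn = connect()
            return operation(conn)
        except RETRYABLE as error:
            # ConnectionRefusedError is deliberately not retried: it is a
            # different problem and propagates to the caller.
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
        finally:
            if conn is not None:
                try:
                    conn.close()
                except OSError:
                    pass  # the connection may already be dead
    raise last_error
```

Retrying this way is only appropriate for operations that are safe to repeat, such as read-only queries. If every attempt fails after roughly the same delay, that points back to a timeout rather than a transient reset.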