Caused by: org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException: Connection for partition d87c5c2f1872716d009307feb03a35eb#80@3eedaab6997deb0832eb197938159478 not reachable.
at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:183)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.internalRequestPartitions(SingleInputGate.java:322)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:291)
at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.requestPartitions(InputGateWithMetrics.java:94)
at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:101)
at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:48)
at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)
at org.apache.flink.runtime.operators.FlatMapDriver.run(FlatMapDriver.java:111)
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:519)
at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:360)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:758)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:573)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager '/192.168.1.*:*' has failed. This might indicate that the remote task manager has been lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connect(PartitionRequestClientFactory.java:145)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connectWithRetries(PartitionRequestClientFactory.java:114)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:81)
at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:70)
at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:179)
... 12 more
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /192.168.1.*:*
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
at org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
at org.apache.flink.shaded.netty4.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
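The error message is shown above.
Environment: Flink 1.12.7, standalone cluster.
The error occurs intermittently when running batch jobs over a large data volume (1 TB).
According to the stack trace, taskmanager-a tried to connect to taskmanager-b and the connection was refused.
I guessed at three possible causes: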
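1. The taskmanager's heartbeat to the jobmanager timed out, and the jobmanager shut the process down remotely.
2. The taskmanager process was killed by Linux.
3. A long GC pause in the taskmanager caused a response timeout, which triggered the exception?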
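I then checked the host running taskmanager-b; jps showed that the taskmanager process no longer existed.
I went through the logs of the jobmanager and taskmanager-b and found no sign that the jobmanager had shut the process down remotely because of a timeout.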
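The log check above can be sketched as a small script. This is a minimal sketch, not Flink tooling: the pattern `heartbeat.*timed out` is my assumption about how a heartbeat timeout is worded in the logs, the function name `find_heartbeat_timeouts` is hypothetical, and the sample lines are made up for illustration rather than copied from a real Flink log.

```python
import re

# Assumption: a heartbeat timeout is logged with the words "heartbeat" and
# "timed out" on the same line. Adjust the pattern to your Flink version.
HEARTBEAT_TIMEOUT = re.compile(r"heartbeat.*timed out", re.IGNORECASE)


def find_heartbeat_timeouts(log_text):
    """Return every log line that looks like a heartbeat timeout."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if HEARTBEAT_TIMEOUT.search(line)
    ]


# Made-up sample lines (hypothetical, not real Flink output).
SAMPLE_LOG = """\
10:00:01 INFO  job started
10:05:42 WARN  Heartbeat of TaskManager with id tm-b timed out.
10:05:43 INFO  restarting job
"""

if __name__ == "__main__":
    for hit in find_heartbeat_timeouts(SAMPLE_LOG):
        print(hit)
```

An empty result for both the jobmanager log and the taskmanager-b log is what rules out cause 1 in my case.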
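Next I looked in the tmp directory that taskmanager-b has configured in bin/config.sh
and found the pid the process had at the time.
I ran dmesg |grep [pid]
The output contained "Out of memory" and "Killed process".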
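The same dmesg check can be done with a short script that filters kernel log text for OOM-killer lines mentioning a given pid. This is a minimal sketch under some assumptions: the regular expression covers the kernel wordings I know of ("Kill process N (name)" and "Killed process N (name)"), the function name `find_oom_kills` is hypothetical, and the sample text and pid 4242 are made up for illustration.

```python
import re

# Matches "Out of memory: Kill process N (name) ...",
# "Killed process N (name) ..." and
# "Out of memory: Killed process N (name) ...".
OOM_LINE = re.compile(
    r"(?:Out of memory: )?Kill(?:ed)? process (\d+) \(([^)]+)\)"
)


def find_oom_kills(dmesg_text, pid=None):
    """Return (pid, process_name, line) for each OOM-killer line.

    If pid is given, keep only the lines that mention that pid.
    """
    hits = []
    for line in dmesg_text.splitlines():
        match = OOM_LINE.search(line)
        if match is None:
            continue
        killed_pid = int(match.group(1))
        if pid is None or killed_pid == pid:
            hits.append((killed_pid, match.group(2), line.strip()))
    return hits


# Made-up kernel log lines for illustration (hypothetical pid and sizes).
SAMPLE_DMESG = """\
[100.000001] Out of memory: Kill process 4242 (java) score 905 or sacrifice child
[100.000002] Killed process 4242 (java) total-vm:1000kB, anon-rss:900kB, file-rss:0kB
[200.000001] Killed process 777 (python) total-vm:100kB, anon-rss:50kB, file-rss:0kB
"""

if __name__ == "__main__":
    for killed_pid, name, line in find_oom_kills(SAMPLE_DMESG, pid=4242):
        print(killed_pid, name, line)
```

On a real host the input would be the output of dmesg (which may require root), for example read via `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`.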
Culprit found: the host ran out of memory and Linux killed the taskmanager process (cause 2).