Caused by: org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException: Connection for partition d87c5c2f1872716d009307feb03a35eb#80@3eedaab6997deb0832eb197938159478 not reachable.

at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:183)

at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.internalRequestPartitions(SingleInputGate.java:322)

at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:291)

at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.requestPartitions(InputGateWithMetrics.java:94)

at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:101)

at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:48)

at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)

at org.apache.flink.runtime.operators.FlatMapDriver.run(FlatMapDriver.java:111)

at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:519)

at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:360)

at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:758)

at org.apache.flink.runtime.taskmanager.Task.run(Task.java:573)

at java.lang.Thread.run(Thread.java:748)

Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager '/192.168.1.*:*' has failed. This might indicate that the remote task manager has been lost.

at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connect(PartitionRequestClientFactory.java:145)

at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connectWithRetries(PartitionRequestClientFactory.java:114)

at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:81)

at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:70)

at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:179)

... 12 more

Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /192.168.1.*:*

Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused

at org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)

at org.apache.flink.shaded.netty4.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)

at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672)

at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649)

at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)

at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)

at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)

at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)

at java.lang.Thread.run(Thread.java:748)

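The error message is shown above.

Environment: Flink 1.12.7, standalone cluster.

The error occurs intermittently when running a batch job over a large data volume (1 TB).

According to the stack trace, taskmanager-a tried to connect to taskmanager-b and the connection was refused.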

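I suspected one of the following three causes:

1. The TaskManager's heartbeat to the JobManager timed out, and the JobManager shut the process down remotely.

2. The TaskManager process was killed by Linux.

3. A long GC pause on the TaskManager made it respond too slowly, and the resulting timeout triggered the exception.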

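I then checked the host running taskmanager-b: jps showed that the TaskManager process no longer existed.

The JobManager and taskmanager-b logs showed no sign that the JobManager had shut the process down remotely because of a timeout.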
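The log check for cause 1 can be sketched as a small filter. This is only a rough sketch: the function name heartbeat_timeouts is made up for illustration, and the pattern assumes the timeout message contains both "heartbeat" and "timed out", so verify the wording against your own Flink version's logs.

```shell
#!/usr/bin/env bash
# Print log lines that mention a heartbeat timeout.
# Reads log text on stdin so it works with any log file or stream.
# ASSUMPTION: the message contains "heartbeat" followed by "timed out";
# check the wording in your own JobManager and TaskManager logs.
heartbeat_timeouts() {
    grep -iE 'heartbeat.*timed out'
}

# Usage (FLINK_HOME is a placeholder for the Flink install directory):
#   cat "$FLINK_HOME"/log/*.log | heartbeat_timeouts
```

No output means no matching line was found, which fits what I saw here. It does not prove a timeout never happened if the message wording differs.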

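Next I looked in the tmp directory that bin/config.sh configures for taskmanager-b

and found the PID the process had at the time.

Then I ran dmesg |grep [pid], replacing [pid] with that PID.

The output contained "Out of memory" and "Killed process".

Culprit found: the Linux kernel's OOM killer had killed the TaskManager process, which is cause 2 above.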
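Grepping the kernel log for a bare PID can also match unrelated numbers, so the dmesg check can be tightened a little. The sketch below assumes the kernel prints OOM kills as "Kill process <pid> (...)" or "Killed process <pid> (...)"; the exact wording varies between kernel versions. The function name oom_lines_for_pid and the PID 12345 are placeholders of mine.

```shell
#!/usr/bin/env bash
# Filter kernel log lines that report an OOM kill for one specific PID.
# Reads kernel log text on stdin, so it can be fed from dmesg or a saved file.
# ASSUMPTION: OOM kill lines look like "Kill process <pid> (...)" or
# "Killed process <pid> (...)"; the wording differs between kernel versions.
oom_lines_for_pid() {
    local pid="$1"
    # The trailing [^0-9] stops PID 12345 from also matching 123456.
    grep -E "Kill(ed)? process ${pid}[^0-9]"
}

# Usage (12345 is a placeholder PID; dmesg may require root):
#   dmesg | oom_lines_for_pid 12345
```

If this prints nothing although the process is gone, the kernel ring buffer may already have rotated past the event, so an empty result does not rule out an OOM kill.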
