[GH-ISSUE #1367] Health check takes a backend offline after a failure but does not bring it back online after it recovers #1086

Closed
opened 2026-05-05 12:41:45 -06:00 by gitea-mirror · 5 comments

Originally created by @luyaotang on GitHub (Aug 9, 2019).
Original GitHub issue: https://github.com/fatedier/frp/issues/1367

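Description:
I am running the frpc_386 build on Windows 7 with health checking enabled, using the tcp check type. When the backend node failed, frpc detected the failure and took it offline as expected. A few hours later the backend node recovered, but frpc never detected the change and never brought it back online.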

Environment:
frp version: 0.22.0
go : go1.11.5 linux/amd64

frpc configuration:
[common]
server_addr = 192.168.21.254
server_port = 7100
log_level = info
log_max_days = 7
protocol = tcp
token = xxxxx
login_fail_exit = false
[zzz]
type = tcp
local_ip = 127.0.0.1
local_port = 3389
health_check_type = tcp
health_check_interval_s = 10
health_check_max_failed = 2
health_check_timeout_s = 3
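
As a rough illustration of what a tcp health check with these settings amounts to, the sketch below opens a TCP connection to local_ip:local_port and treats a successful connect within the timeout as healthy. The type healthCheckConfig and the function tcpCheck are hypothetical names made up for this example; they are not frp's own types or API.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// healthCheckConfig mirrors the health_check_* keys of the [zzz] proxy.
// The struct and its field names are hypothetical, not frp's own types.
type healthCheckConfig struct {
	Addr      string        // local_ip:local_port
	Interval  time.Duration // health_check_interval_s
	Timeout   time.Duration // health_check_timeout_s
	MaxFailed int           // health_check_max_failed
}

// tcpCheck performs a single TCP probe: it succeeds if a connection to addr
// can be opened within timeout, and returns the dial error otherwise.
func tcpCheck(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	cfg := healthCheckConfig{
		Addr:      net.JoinHostPort("127.0.0.1", "3389"),
		Interval:  10 * time.Second,
		Timeout:   3 * time.Second,
		MaxFailed: 2,
	}
	// One probe only; a real monitor would repeat this every cfg.Interval.
	if err := tcpCheck(cfg.Addr, cfg.Timeout); err != nil {
		fmt.Println("do one health check failed:", err)
		return
	}
	fmt.Println("health check success")
}
```

With nothing listening on the port, the probe returns the kind of "connection refused" error seen in the logs of this report.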

Analysis:
1) At 04:30 the backend node was detected as offline. From then on, frpc logged the following check result at each check interval.
2019/08/09 04:30:47 [W] [011f4ad2ffa5591f] [9001E855D9DE93B5DA100230BDEE1F61.zzz] do one health check failed: dial tcp xx.xx.xxx.xxx:xxxx: connectex: No connection could be made because the target machine actively refused it.
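2) After 08:54:59, however, no further health check results were logged, as if the checking thread had exited. Normally it should have detected that the backend service had recovered and changed the status.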
2019/08/09 08:54:59 [W] [011f4ad2ffa5591f] [9001E855D9DE93B5DA100230BDEE1F61.zzz] do one health check failed: dial tcp xx.xx.xxx.xxx:xxxx: connectex: No connection could be made because the target machine actively refused it.
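
The expected behavior described above can be sketched as a small state tracker: after health_check_max_failed consecutive failures the backend is considered offline, and the first successful check afterwards brings it back. This only works if checks keep running, which is what appears to have stopped here. statusTracker, observe, and errRefused are hypothetical names for illustration, not frp's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// errRefused stands in for a failed probe such as "connection refused".
var errRefused = errors.New("connection refused")

// statusTracker is a hypothetical helper, not frp's implementation. It marks
// a backend offline after maxFailed consecutive failed checks and online
// again on the first successful one.
type statusTracker struct {
	maxFailed int
	failed    int  // consecutive failed checks so far
	down      bool // true while the backend is considered offline
}

// observe records one check result and reports whether the status changed.
func (s *statusTracker) observe(err error) bool {
	if err == nil {
		s.failed = 0
		if s.down {
			s.down = false
			return true
		}
		return false
	}
	s.failed++
	if !s.down && s.failed >= s.maxFailed {
		s.down = true
		return true
	}
	return false
}

func main() {
	tracker := &statusTracker{maxFailed: 2}
	// Three failed checks followed by one successful check.
	for i, result := range []error{errRefused, errRefused, errRefused, nil} {
		if tracker.observe(result) {
			fmt.Printf("check %d: status changed, offline=%v\n", i+1, tracker.down)
		}
	}
}
```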

gitea-mirror 2026-05-05 12:41:45 -06:00
  • closed this issue
  • added the bug label

@fatedier commented on GitHub (Aug 9, 2019):

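Can you reproduce it?
Do you have the network connection state from that time, and the continuous logs from the period when it was in the abnormal state?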


@luyaotang commented on GitHub (Aug 9, 2019):

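It is probably not reproducible. The same situation has occurred before, and after the backend node recovered frpc logged output like the following: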
2019/08/08 12:57:23 [I] [011f4ad2ffa5591f] [9001E855D9DE93B5DA100230BDEE1F61.zzz] health check status change to success
2019/08/08 12:57:23 [I] [011f4ad2ffa5591f] [9001E855D9DE93B5DA100230BDEE1F61.zzz] health check success
2019/08/08 12:57:24 [I] [9001E855D9DE93B5DA100230BDEE1F61.zzz] start proxy success.
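This time there was no such output. It looks as if the checking thread is no longer running, so the proxy has no chance of coming back online.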


@luyaotang commented on GitHub (Aug 9, 2019):

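1) The program is definitely still running: the system information output I added, which is written every 30 minutes, still appears in the log.
2) Restarting the service restored normal operation.
3) I suspect that this part exited for some unknown reason: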
func (monitor *HealthCheckMonitor) checkWorker() {
	for {
		ctx, cancel := context.WithDeadline(monitor.ctx, time.Now().Add(monitor.timeout))
		err := monitor.doCheck(ctx)

		// check if this monitor has been closed
		select {
		case <-ctx.Done():
			cancel()
			return
		default:
			cancel()
		}
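
One possible reading of the excerpt, offered as an assumption rather than a confirmed diagnosis: the select watches the per-check context ctx, which carries its own deadline. If a check ever uses up its whole timeout, ctx is done because it expired, not because the monitor was closed, and the loop would return. The sketch below only demonstrates the underlying Go behavior, namely that a context derived with a deadline can be done while its parent is still live. slowCheck, isDone, and expiredCheckStates are hypothetical helpers, not frp code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// slowCheck stands in for a probe that uses up its whole time budget: it
// returns only once ctx is done. Hypothetical helper for illustration.
func slowCheck(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// isDone reports whether ctx has been cancelled or has expired.
func isDone(ctx context.Context) bool {
	select {
	case <-ctx.Done():
		return true
	default:
		return false
	}
}

// expiredCheckStates creates a long-lived "monitor" context, derives a
// per-check context with a deadline from it, runs one slow check, and
// reports whether each of the two contexts is done afterwards.
func expiredCheckStates(timeout time.Duration) (checkDone, monitorDone bool) {
	monitorCtx, stop := context.WithCancel(context.Background())
	defer stop()

	ctx, cancel := context.WithDeadline(monitorCtx, time.Now().Add(timeout))
	defer cancel()

	_ = slowCheck(ctx) // returns context.DeadlineExceeded once the timeout passes
	return isDone(ctx), isDone(monitorCtx)
}

func main() {
	checkDone, monitorDone := expiredCheckStates(50 * time.Millisecond)
	fmt.Println("per-check context done:", checkDone)
	fmt.Println("monitor context done:", monitorDone)
}
```

Under this reading, a loop that wants to stop only when the monitor is closed would watch the parent context (monitor.ctx in the excerpt) rather than the per-check one. Whether this is what actually happened in this report is not established by the thread.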

@luyaotang commented on GitHub (Aug 9, 2019):

The log is too large to paste, so I have shared it here: https://c-t.work/s/71d3c3b652d948


@fatedier commented on GitHub (Aug 9, 2019):

That is indeed possible. I will fix it.

Reference: github-starred/frp#1086