Symptoms:
Around 09:00 the data center reported a switch failure that caused network problems.
The application team reported that one of the service interfaces was timing out.
Initial check: Reviewing the application logs showed that the errors occurred when accessing one particular MongoDB collection. The write operation failed with:
Mongo::Error::OperationFailure - no progress was made executing batch write op in jdb3.images after 5 rounds (0 ops completed in 6 rounds total) (82):
So the problem was initially narrowed down to the MongoDB sharded cluster.
Logging in to a mongos node and running a findOne returned the following:
"errmsg" : "None of the hosts for replica set configReplSet could be contacted."
Checking the shard information with sh.status() on mongos:
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("58c99a8257905f85f1828f52")
  }
  shards:
    { "_id" : "shard01", "host" : "shard01/100.106.23.22:27017,100.106.23.32:27017,100.111.9.19:27017" }
    { "_id" : "shard02", "host" : "shard02/100.106.23.23:27017,100.106.23.33:27017,100.111.9.20:27017" }
    { "_id" : "shard03", "host" : "shard03/100.106.23.24:27017,100.106.23.34:27017,100.111.17.3:27017" }
    { "_id" : "shard04", "host" : "shard04/100.106.23.25:27017,100.106.23.35:27017,100.111.17.4:27017" }
  active mongoses:
    "3.2.7" : 6
  balancer:
    Currently enabled: yes
    Currently running: no
    Balancer active window is set between 2:00 and 6:00 server local time
    Failed balancer rounds in last 5 attempts: 0
    Migration Results for the last 24 hours:
      9 : Success
  databases:
    { "_id" : "jdb3", "primary" : "shard01", "partitioned" : true }
      jdb3.images
        shard key: { "uuid" : 1 }
        unique: false
        balancing: true
        chunks:
          shard01 41109
          shard02 41109
          shard03 41108
          shard04 41108
        too many chunks to print, use verbose if you want to force print
    { "_id" : "gongan", "primary" : "shard02", "partitioned" : true }
    { "_id" : "tmp", "primary" : "shard03", "partitioned" : false }
    { "_id" : "1_n", "primary" : "shard04", "partitioned" : true }
    { "_id" : "upload", "primary" : "shard04", "partitioned" : true }
      upload.images
        shard key: { "uuid" : 1 }
        unique: false
        balancing: true
        chunks:
          shard01 259
          shard02 258
          shard03 258
          shard04 259
        too many chunks to print, use verbose if you want to force print
    { "_id" : "test", "primary" : "shard03", "partitioned" : false }
No abnormality was found there, so the logs on each shard node were checked one by one.
On the 100.106.23.25 replica of shard04, no master could be found, so the error log on the shard04 master was checked next.
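As a side note, which member of shard04 holds the PRIMARY role can be checked with rs.status() on any member (a minimal sketch, using the shard04 hosts listed in the sharding status above):

// connect to any shard04 member, e.g. mongo --host 100.106.23.25:27017
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr);   // normally exactly one member reports PRIMARY
});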
Error from the 100.106.23.25 log:
2018-12-10T11:40:53.546+0800 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ReplicaSetNotFound: None of the hosts for replica set configReplSet could be contacted.
Error from the master's (100.106.23.35) log:
2018-12-10T09:12:02.282+0800 W SHARDING [conn7204619] could not remotely refresh metadata for jdb3.images :: caused by :: None of the hosts for replica set configReplSet could be contacted.
Moreover, queries run directly on the .35 server failed with the same error as on mongos:
"errmsg" : "None of the hosts for replica set configReplSet could be contacted."
Locating the problem:
A document was fetched from each of the other shards (shard01-shard03) and then looked up again through the mongos node via the index; all of these lookups returned data. Anything fetched from the shard04 node, however, produced the same error when looked up through mongos.
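The cross-check can be sketched as follows (the exact queries are not recorded here, so this is only an illustration; uuid is the shard key of jdb3.images, so a lookup by uuid through mongos is routed to a single shard):

// step 1, on a shard member: note the shard-key value of any locally stored document
db.getSiblingDB("jdb3").images.findOne({}, { uuid: 1, _id: 0 })

// step 2, on mongos: look up that same uuid (a targeted query, since uuid is the shard key)
db.getSiblingDB("jdb3").images.findOne({ uuid: "<uuid value copied from step 1>" })
// documents living on shard01-shard03 came back fine;
// documents on shard04 hit the configReplSet error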
Resolution: The slave (.25) was restarted first; watching its log, the error no longer appeared.
The master (.35) was then restarted; the error disappeared there as well, and a status check showed that the master role had switched over to the .25 server.
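The switch can be verified from any shard04 member with a one-liner (a minimal sketch):

db.isMaster().primary    // now expected to report 100.106.23.25:27017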
The application team confirmed that the fault was resolved.
Open questions:
1. The incident was triggered by a network problem, so why did the following error keep appearing even after the network recovered?
2018-12-10T11:40:53.546+0800 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ReplicaSetNotFound: None of the hosts for replica set configReplSet could be contacted.
Could it be that the MongoDB shards use long-lived (persistent) connections to mongos?
If anyone knows the answer, please share. Many thanks!