Java自学者论坛 (Java Self-Learners Forum)

A record of a MongoDB shard node failure caused by a network problem

    Posted on 2021-5-22 15:23:59

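    Symptoms:

    The data center reported that a switch failed at around 9:00, causing network problems.

    Business staff reported that one API endpoint was timing out.

    Initial check: the application logs showed the following error when connecting to one of the MongoDB collections.

    The error was raised while writing data: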

    Mongo::Error::OperationFailure - no progress was made executing batch write op in jdb3.images after 5 rounds (0 ops completed in 6 rounds total) (82):

    This initially pointed to the MongoDB sharded cluster.

    On a mongos node, running a findOne operation returned:

    "errmsg" : "None of the hosts for replica set configReplSet could be contacted."

    Sharding status:

    --- Sharding Status ---
      sharding version: {
     "_id" : 1,
     "minCompatibleVersion" : 5,
     "currentVersion" : 6,
     "clusterId" : ObjectId("58c99a8257905f85f1828f52")
    }
      shards:
     {  "_id" : "shard01",  "host" : "shard01/100.106.23.22:27017,100.106.23.32:27017,100.111.9.19:27017" }
     {  "_id" : "shard02",  "host" : "shard02/100.106.23.23:27017,100.106.23.33:27017,100.111.9.20:27017" }
     {  "_id" : "shard03",  "host" : "shard03/100.106.23.24:27017,100.106.23.34:27017,100.111.17.3:27017" }
     {  "_id" : "shard04",  "host" : "shard04/100.106.23.25:27017,100.106.23.35:27017,100.111.17.4:27017" }
      active mongoses:
     "3.2.7" : 6
      balancer:
     Currently enabled:  yes
     Currently running:  no
      Balancer active window is set between 2:00 and 6:00 server local time
     Failed balancer rounds in last 5 attempts:  0
     Migration Results for the last 24 hours:
      9 : Success
      databases:
     {  "_id" : "jdb3",  "primary" : "shard01",  "partitioned" : true }
      jdb3.images
       shard key: { "uuid" : 1 }
       unique: false
       balancing: true
       chunks:
        shard01 41109
        shard02 41109
        shard03 41108
        shard04 41108
       too many chunks to print, use verbose if you want to force print
     {  "_id" : "gongan",  "primary" : "shard02",  "partitioned" : true }
     {  "_id" : "tmp",  "primary" : "shard03",  "partitioned" : false }
     {  "_id" : "1_n",  "primary" : "shard04",  "partitioned" : true }
     {  "_id" : "upload",  "primary" : "shard04",  "partitioned" : true }
      upload.images
       shard key: { "uuid" : 1 }
       unique: false
       balancing: true
       chunks:
        shard01 259
        shard02 258
        shard03 258
        shard04 259
       too many chunks to print, use verbose if you want to force print
     {  "_id" : "test",  "primary" : "shard03",  "partitioned" : false }

    Nothing abnormal showed up in the sharding status, so I checked the logs of the shard nodes one by one.
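To walk through the shard members one by one, the `host` strings from the sharding status can be split into a replica set name and a member list. The following is a minimal sketch, assuming the `rsName/host:port,host:port` layout shown in the status output; the helper name `parse_shard_host` is hypothetical and is not part of any MongoDB tool.

```python
def parse_shard_host(host):
    """Split 'rsName/host1:port,host2:port' into (rsName, [(host, port), ...])."""
    rs_name, _, member_list = host.partition("/")
    members = []
    for entry in member_list.split(","):
        # rpartition keeps everything before the last ':' as the address
        address, _, port = entry.strip().rpartition(":")
        members.append((address, int(port)))
    return rs_name, members


# Host strings copied from the sharding status in this post
shards = {
    "shard01": "shard01/100.106.23.22:27017,100.106.23.32:27017,100.111.9.19:27017",
    "shard02": "shard02/100.106.23.23:27017,100.106.23.33:27017,100.111.9.20:27017",
    "shard03": "shard03/100.106.23.24:27017,100.106.23.34:27017,100.111.17.3:27017",
    "shard04": "shard04/100.106.23.25:27017,100.106.23.35:27017,100.111.17.4:27017",
}

for shard_id, host in shards.items():
    rs_name, members = parse_shard_host(host)
    print(shard_id, rs_name, members)
```

Each `(address, port)` pair is one mongod whose log needs to be inspected.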

    On the shard04 replica 100.106.23.25, the log showed that the master could not be found, so I then checked the error log on the shard04 master.

    Error in the log on 100.106.23.25:

    2018-12-10T11:40:53.546+0800 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ReplicaSetNotFound: None of the hosts for replica set configReplSet could be contacted.

    Error in the log on the master, 100.106.23.35:

    2018-12-10T09:12:02.282+0800 W SHARDING [conn7204619] could not remotely refresh metadata for jdb3.images :: caused by :: None of the hosts for replica set configReplSet could be contacted.

    Queries run directly on server 35 also returned the same error as queries through mongos:

    "errmsg" : "None of the hosts for replica set configReplSet could be contacted."
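Checking every member's log by hand is tedious, so the search can be scripted. This is a rough sketch, assuming the log line layout seen above (timestamp, severity letter, component, bracketed context, message); the function name `find_unreachable` and the regular expressions are illustrative assumptions, not an official log parser.

```python
import re

# Layout assumed from the log lines quoted in this post:
# <timestamp> <severity> <component> [<context>] <message>
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<severity>[FEWID])\s+(?P<component>\S+)\s+"
    r"\[(?P<context>[^\]]+)\]\s+(?P<message>.*)$"
)
UNREACHABLE = re.compile(
    r"None of the hosts for replica set (?P<rs>\S+) could be contacted"
)


def find_unreachable(lines):
    """Return (timestamp, context, replica set name) for each matching line."""
    hits = []
    for line in lines:
        parsed = LOG_LINE.match(line.strip())
        if not parsed:
            continue  # skip lines that do not follow the assumed layout
        found = UNREACHABLE.search(parsed.group("message"))
        if found:
            hits.append((parsed.group("ts"), parsed.group("context"), found.group("rs")))
    return hits


# The two log lines quoted in this post
sample_lines = [
    "2018-12-10T11:40:53.546+0800 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ReplicaSetNotFound: None of the hosts for replica set configReplSet could be contacted.",
    "2018-12-10T09:12:02.282+0800 W SHARDING [conn7204619] could not remotely refresh metadata for jdb3.images :: caused by :: None of the hosts for replica set configReplSet could be contacted.",
]

for hit in find_unreachable(sample_lines):
    print(hit)
```

In practice the lines would be read from each mongod log file; the timestamps show whether a member kept reporting the error after the network had recovered.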

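    Locating the problem:

    I picked one document on each of shard01 through shard03 and then looked it up by its indexed field through mongos; every one of them was returned. Everything picked from shard04, however, produced an error when queried through mongos.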

     

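    Fix: restarted the slave (server 25) and watched its log; the errors were gone.

    Then restarted the master (server 35); the errors disappeared, and the replica set status showed that the master had switched to server 25.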
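Which member currently holds the master role can be read from the `members` array of the `replSetGetStatus` result, where each entry carries a `name` and a `stateStr`. The sketch below assumes that document shape; the helper `current_primary` and the trimmed sample document are hypothetical illustrations (the states of the non-master members are assumed), not output captured from this cluster.

```python
def current_primary(status):
    """Return the name of the member whose stateStr is PRIMARY, or None."""
    for member in status.get("members", []):
        if member.get("stateStr") == "PRIMARY":
            return member.get("name")
    return None


# Hypothetical, heavily trimmed status document for shard04 after the restarts.
# With pymongo, the real document would come from
# client.admin.command("replSetGetStatus") run against a shard member.
sample_status = {
    "set": "shard04",
    "members": [
        {"name": "100.106.23.25:27017", "stateStr": "PRIMARY"},
        {"name": "100.106.23.35:27017", "stateStr": "SECONDARY"},
        {"name": "100.111.17.4:27017", "stateStr": "SECONDARY"},
    ],
}

print(current_primary(sample_status))
```

Returning `None` when no member reports `PRIMARY` makes the "no master found" case explicit rather than raising an exception.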

     

    The business side confirmed that the fault was resolved.

     

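    Open questions:

    1. The fault was triggered by a network problem, so why did the following error keep appearing after the network had recovered?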

    2018-12-10T11:40:53.546+0800 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ReplicaSetNotFound: None of the hosts for replica set configReplSet could be contacted.

    Could it be that a mongo shard uses long-lived connections to mongos?

    If anyone knows the answer, please let me know. Many thanks!
