Production alert: master-slave replication interrupted. Replication info from the production instance:
# Replication
role:slave
master_host:master_host
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:1
slave_repl_offset:1
master_sync_left_bytes:713983940
master_sync_last_io_seconds_ago:0
master_link_down_since_seconds:248
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
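A check like this can be automated. The following is a minimal sketch, assuming the `INFO replication` output is available as text with one field per line; the function names `parse_info` and `replication_problem` are made up for illustration and are not part of any Redis client.

```python
def parse_info(text):
    """Parse `INFO replication` style output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replication_problem(fields):
    """Return a short description if the slave link looks unhealthy, else None."""
    if fields.get("role") != "slave":
        return None
    if fields.get("master_link_status") == "up":
        return None
    down_for = fields.get("master_link_down_since_seconds", "unknown")
    syncing = fields.get("master_sync_in_progress") == "1"
    return "link down for %ss, full sync in progress: %s" % (down_for, syncing)


# A few of the fields from the output above.
sample = """# Replication
role:slave
master_link_status:down
master_sync_in_progress:1
master_link_down_since_seconds:248
"""
print(replication_problem(parse_info(sample)))
# prints: link down for 248s, full sync in progress: True
```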
master_link_status is down, so replication has failed. Next, check the master's log:
[374] 15 Oct 16:41:28.146 # Connection with slave 10.72.26.55:6379 lost.
[374] 15 Oct 16:41:28.999 * Slave asks for synchronization
[374] 15 Oct 16:41:28.999 * Unable to partial resync with the slave for lack of backlog (Slave request was: 152340118946214).
[374] 15 Oct 16:41:28.999 * Starting BGSAVE for SYNC
[374] 15 Oct 16:41:29.447 * Background saving started by pid 11357
[11357] 15 Oct 16:41:57.325 * DB saved on disk
[11357] 15 Oct 16:41:57.555 * RDB: 231 MB of memory used by copy-on-write
[374] 15 Oct 16:41:57.980 * Background saving terminated with success
[374] 15 Oct 16:42:31.739 * Synchronization with slave succeeded
[374] 15 Oct 16:43:01.021 # Client id=6082455 addr=slave_host:55308 fd=329 name= age=93 idle=1 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=10657 omem=2504780296 events=rw cmd=replconf scheduled to be closed ASAP for overcoming of output buffer limits.
Then check the slave's log:
[372] 15 Oct 16:43:01.141 # Connection with master lost.
[372] 15 Oct 16:43:01.141 * Caching the disconnected master state.
[372] 15 Oct 16:43:01.213 * Connecting to MASTER masterhost:6379
[372] 15 Oct 16:43:01.213 * MASTER <-> SLAVE sync started
[372] 15 Oct 16:43:01.213 * Non blocking connect for SYNC fired the event.
[372] 15 Oct 16:43:01.572 * Master replied to PING, replication can continue...
[372] 15 Oct 16:43:01.599 * Trying a partial resynchronization (request cbc213a279fde141211f65d436595e4ed64198fa:152342150944513).
[372] 15 Oct 16:43:01.602 * Full resync from master: cbc213a279fde141211f65d436595e4ed64198fa:152344338348685
[372] 15 Oct 16:43:01.602 * Discarding previously cached master state.
[372] 15 Oct 16:43:30.326 * MASTER <-> SLAVE sync: receiving 1308737462 bytes from master
[372] 15 Oct 16:43:59.846 * MASTER <-> SLAVE sync: Flushing old data
[372] 15 Oct 16:44:01.534 * MASTER <-> SLAVE sync: Loading DB in memory
[372] 15 Oct 16:44:22.590 * MASTER <-> SLAVE sync: Finished with success
[372] 15 Oct 16:44:22.600 # Connection with master lost.
[372] 15 Oct 16:44:22.600 * Caching the disconnected master state.
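The master's log shows that the slave's connection was forcibly closed because it exceeded the configured output buffer limits. Let's look at the self-documented configuration file (redis.conf) of Redis 2.8.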
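The part that matters here is the limit for slaves, whose default is:
client-output-buffer-limit slave 256mb 64mb 60
256mb is the hard limit: as soon as the output buffer grows beyond 256mb, the connection is closed.
64mb 60 is the soft limit: if the output buffer stays above 64mb for more than 60 seconds, the connection is closed.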
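The hard/soft rule can be modeled in a few lines. This is a simplified sketch of the behavior as described here, not Redis's actual implementation, and boundary details (for example `>=` versus `>`) may differ; the class name `OutputBufferLimit` and its method are invented for illustration.

```python
MB = 1024 * 1024


class OutputBufferLimit:
    """Simplified model of one client-output-buffer-limit class."""

    def __init__(self, hard_bytes=256 * MB, soft_bytes=64 * MB, soft_seconds=60):
        self.hard_bytes = hard_bytes
        self.soft_bytes = soft_bytes
        self.soft_seconds = soft_seconds
        self._soft_since = None  # time at which the soft limit was first exceeded

    def should_close(self, used_bytes, now):
        """Return True if a client using `used_bytes` at time `now` should be closed."""
        # Hard limit: close immediately (a limit of 0 means "disabled").
        if self.hard_bytes and used_bytes >= self.hard_bytes:
            return True
        # Soft limit: close only if it has been exceeded continuously for too long.
        if self.soft_bytes and used_bytes >= self.soft_bytes:
            if self._soft_since is None:
                self._soft_since = now
                return False
            return now - self._soft_since > self.soft_seconds
        # Back under the soft limit: reset the timer.
        self._soft_since = None
        return False


limit = OutputBufferLimit()
print(limit.should_close(2504780296, now=0))   # omem from the master's log: True
print(limit.should_close(100 * MB, now=0))     # above soft limit, timer starts: False
print(limit.should_close(100 * MB, now=61))    # still above after 61 seconds: True
```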
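When connections surge and the data volume is large, the default values are no longer enough for replication to complete. The slave keeps requesting a sync from the master, and the master keeps running BGSAVE and sending the RDB file to the slave, only for the connection to be dropped again, which results in an endless loop. The value of client-output-buffer-limit slave has to be tuned to the slave's actual workload. Once it was adjusted, replication completed normally.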
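One way to pick a new value is to estimate how much write traffic piles up on the master while a full sync is running. The sketch below is only a rough sizing aid under stated assumptions: the helper name `suggest_slave_limit`, the 1.5x headroom, and keeping the default 4:1 hard-to-soft ratio are all illustrative choices, not values from this incident.

```python
import math

MB = 1024 * 1024


def suggest_slave_limit(write_bytes_per_sec, sync_seconds, headroom=1.5, soft_seconds=60):
    """Build a value for `client-output-buffer-limit` (slave class).

    write_bytes_per_sec: estimated rate at which replication data accumulates
    sync_seconds:        estimated duration of a full sync (BGSAVE + transfer + load)
    headroom:            safety factor (assumption, tune to taste)
    """
    hard_mb = math.ceil(write_bytes_per_sec * sync_seconds * headroom / MB)
    # Assumption: keep the same 4:1 ratio as the default 256mb / 64mb.
    soft_mb = math.ceil(hard_mb / 4)
    return "slave %dmb %dmb %d" % (hard_mb, soft_mb, soft_seconds)


# Example with made-up inputs: 1 MB/s of writes, 100 second full sync.
print(suggest_slave_limit(1 * MB, 100))
# prints: slave 150mb 38mb 60
```

The resulting string goes after `client-output-buffer-limit` in redis.conf; it can typically also be applied at runtime with `CONFIG SET client-output-buffer-limit`. As a rough input, dividing `omem` by `age` from the master's log line gives an approximate accumulation rate for that window.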