  1. Couchbase Server
  2. MB-35326

Rebalance of delta recovered node fails: snapshot_range_t(a,b) requires start <= end


Details

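    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: 6.5.0
    • Fix Version/s: 6.5.0
    • Component/s: couchbase-bucket
    • Environment: 6.5.0-3883-enterprise
    • Triage: Triaged
    • Operating System: Centos 64-bit
    • Is this a Regression?: Yes
    • Sprint: KV-Engine MH Beta part 2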

    Description

      Script to Repro

      ./testrunner -i /tmp/testexec.3494.ini -p get-cbcollect-info=True,flusher_batch_split_trigger=10 -t rebalance.rebalance_high_ops_pillowfight.RebalanceHighOpsWithPillowFight.test_graceful_failover_addback,node_out=3,replicas=2,nodes_init=4,items=2000000,batch_size=1000,rate_limit=100000,recovery_type=delta,instances=2,threads=5,loader=high_ops,flusher_batch_split_trigger=1
      

      Steps

      1. Create a 4-node cluster with 2 replicas and set flusher_batch_split_trigger=1.
      2. Load data with the high-ops data loader.
      3. Gracefully fail over a node.
      4. Start the high-ops data loader again.
      5. Perform a delta recovery of the failed-over node.
      6. Start a rebalance.
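
      For anyone reproducing steps 3, 5 and 6 by hand instead of through testrunner, the sequence can be sketched as three REST calls against the cluster manager. This is only a sketch: it builds the requests without sending them, the function name and the node names are hypothetical placeholders (the report does not list all four node addresses), and it assumes the standard ns_server endpoints /controller/startGracefulFailover, /controller/setRecoveryType and /controller/rebalance.

      ```python
      from urllib.parse import urlencode


      def delta_recovery_plan(failed_node, known_nodes):
          """Return ordered (path, form_body) pairs for steps 3, 5 and 6.

          Nothing is sent here; each pair would be POSTed to the cluster
          manager (port 8091 by default) with admin credentials.
          """
          return [
              # Step 3: graceful failover of one node.
              ("/controller/startGracefulFailover",
               urlencode({"otpNode": failed_node})),
              # Step 5: mark the failed-over node for delta recovery.
              ("/controller/setRecoveryType",
               urlencode({"otpNode": failed_node, "recoveryType": "delta"})),
              # Step 6: rebalance with every node kept in the cluster.
              ("/controller/rebalance",
               urlencode({"knownNodes": ",".join(known_nodes),
                          "ejectedNodes": ""})),
          ]


      # Hypothetical node names, for illustration only.
      nodes = ["ns_1@10.0.0.%d" % i for i in range(1, 5)]
      for path, body in delta_recovery_plan(nodes[3], nodes):
          print("POST", path, body)
      ```

      The graceful failover has to finish before the recovery type is set, and the data loader from step 4 runs in between; both are left out of the sketch.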

      Rebalance fails as shown below.

      {u'node': u'ns_1@172.23.105.105', u'code': 0, u'text': u'Rebalance exited with reason {{badmatch,\n                                  {error,\n                                      {failed_nodes,[\'ns_1@172.23.105.47\']}}},\n                              [{ns_janitor,cleanup_apply_config_body,4,\n                                   [{file,"src/ns_janitor.erl"},{line,286}]},\n                               {ns_janitor,\'-cleanup_apply_config/4-fun-0-\',\n                                   4,\n                                   [{file,"src/ns_janitor.erl"},{line,209}]},\n                               {async,\'-async_init/4-fun-2-\',3,\n                                   [{file,"src/async.erl"},{line,211}]}]}.\nRebalance Operation Id = 28ffeff813a1d2e394ea0f10d72cbccf', u'shortText': u'message', u'serverTime': u'2019-07-27T23:42:38.878Z', u'module': u'ns_orchestrator', u'tstamp': 1564296158878, u'type': u'critical'}
      [2019-07-27 23:42:48,906] - [rest_client:3250] ERROR - {u'node': u'ns_1@172.23.105.47', u'code': 0, u'text': u'Control connection to memcached on \'ns_1@172.23.105.47\' disconnected: {lost_connection,\n                                                                       [{ns_memcached,\n                                                                         worker_loop,\n                                                                         3,\n                                                                         [{file,\n                                                                           "src/ns_memcached.erl"},\n                                                                          {line,\n                                                                           231}]},\n                                                                        {proc_lib,\n                                                                         init_p_do_apply,\n                                                                         3,\n                                                                         [{file,\n                                                                           "proc_lib.erl"},\n                                                                          {line,\n                                                                           247}]}]}', u'shortText': u'message', u'serverTime': u'2019-07-27T23:42:38.844Z', u'module': u'ns_memcached', u'tstamp': 1564296158844, u'type': u'info'}
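
      The entries above are Python dict reprs with the Erlang term embedded as an escaped string, which makes them hard to read. A minimal sketch for unescaping one entry is below; render_log_entry is a hypothetical helper name, and the sample is a shortened copy of the first entry (its text field is cut after the failed_nodes term). Any "[timestamp] - [rest_client:NNNN] ERROR - " prefix has to be stripped before parsing.

      ```python
      import ast


      def render_log_entry(raw):
          """Parse one dict-repr log entry and return it with real newlines."""
          # literal_eval accepts the u'...' string prefix and decodes \n and \'.
          entry = ast.literal_eval(raw)
          return "{node} [{module}/{type}]\n{text}".format(**entry)


      # Shortened copy of the ns_orchestrator entry from this report.
      sample = (
          r"{u'node': u'ns_1@172.23.105.105', u'code': 0, "
          r"u'text': u'Rebalance exited with reason {{badmatch,\n"
          r" {error,\n {failed_nodes,[\'ns_1@172.23.105.47\']}}}', "
          r"u'module': u'ns_orchestrator', u'type': u'critical'}"
      )

      print(render_log_entry(sample))
      ```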
      

      I also see a memcached crash on 172.23.105.47.

       {u'node': u'ns_1@172.23.105.47', u'code': 0, u'text': u"Service 'memcached' exited with status 134. Restarting. Messages:\n2019-07-27T23:42:38.784342-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f4c6e1e2000+0x8f213]\n2019-07-27T23:42:38.784356-07:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f4c68f6c000+0x70842]\n2019-07-27T23:42:38.784366-07:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f4c68f6c000+0xee6eb]\n2019-07-27T23:42:38.784378-07:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f4c68f6c000+0x13ca45]\n2019-07-27T23:42:38.784392-07:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f4c68f6c000+0x13cf0d]\n2019-07-27T23:42:38.784399-07:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f4c68f6c000+0x1362ef]\n2019-07-27T23:42:38.784404-07:00 CRITICAL     /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f4c7007d000+0x8f27]\n2019-07-27T23:42:38.784410-07:00 CRITICAL     /lib64/libpthread.so.0() [0x7f4c6daad000+0x7dd5]\n2019-07-27T23:42:38.784443-07:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7f4c6d6e0000+0xfdead]\n[*** LOG ERROR ***] [2019-07-27 23:42:38] [spdlog_file_logger] async log: thread pool doesn't exist anymore", u'shortText': u'message', u'serverTime': u'2019-07-27T23:42:38.838Z', u'module': u'ns_log', u'tstamp': 1564296158838, u'type': u'info'}
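
      Two things can be read from the crash message above without the cbcollect. Assuming the reported status follows the usual shell convention of 128 plus the signal number, 134 corresponds to signal 6 (SIGABRT on Linux). Each backtrace frame is printed as [load base + offset], and the offset is what a symbolizer would need together with matching debug symbols. The sketch below extracts both; the function names are hypothetical and the sample keeps only two of the frames.

      ```python
      import re
      import signal

      STATUS_RE = re.compile(r"exited with status (\d+)")
      FRAME_RE = re.compile(
          r"CRITICAL\s+(\S+?)\((.*?)\) \[(0x[0-9a-f]+)\+(0x[0-9a-f]+)\]")


      def signal_from_status(status):
          """Map a 128+N exit status to a signal name, else None."""
          return signal.Signals(status - 128).name if status > 128 else None


      def parse_frames(text):
          """Return (library, symbol or None, load_base, offset) per frame."""
          return [(lib, sym or None, int(base, 16), int(off, 16))
                  for lib, sym, base, off in FRAME_RE.findall(text)]


      # Two frames copied from the message above; the rest are omitted.
      sample = (
          "Service 'memcached' exited with status 134. Restarting. Messages:\n"
          "2019-07-27T23:42:38.784356-07:00 CRITICAL     "
          "/opt/couchbase/bin/../lib/../lib/ep.so() [0x7f4c68f6c000+0x70842]\n"
          "2019-07-27T23:42:38.784443-07:00 CRITICAL     "
          "/lib64/libc.so.6(clone+0x6d) [0x7f4c6d6e0000+0xfdead]\n"
      )

      status = int(STATUS_RE.search(sample).group(1))
      print(status, signal_from_status(status))
      for lib, sym, base, off in parse_frames(sample):
          print(lib, sym, hex(off))
      ```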
      

      cbcollect_info logs from all nodes in the cluster are attached.
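
      The message in the issue title describes an invariant on a snapshot range: the start seqno must not exceed the end seqno. The Python model below is purely illustrative and is not the kv_engine implementation (snapshot_range_t itself is C++); the class name and field names are assumptions. It only shows the kind of constructor check that produces such a message. In a C++ process an uncaught exception from a check like this would end in an abort, which would be consistent with the exit status 134 reported above.

      ```python
      from dataclasses import dataclass


      @dataclass(frozen=True)
      class SnapshotRange:
          """Illustrative stand-in for a range of sequence numbers."""
          start: int
          end: int

          def __post_init__(self):
              # Reject inverted ranges at construction time.
              if self.start > self.end:
                  raise ValueError(
                      f"snapshot_range_t({self.start},{self.end}) "
                      "requires start <= end")


      SnapshotRange(5, 5)      # accepted: an empty-width range is valid
      try:
          SnapshotRange(6, 5)  # rejected: start is past end
      except ValueError as exc:
          print(exc)
      ```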

            People

              Balakumaran.Gopal Balakumaran Gopal
              Balakumaran.Gopal Balakumaran Gopal