Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-23161

Bucket deletion timeout issue followed by rebalance failures found in XDCR tests

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Test Blocker
    • Resolution: Fixed
    • 5.0.0
    • 5.0.0
    • memcached, XDCR
    • Triaged
    • Yes

    Description

      [2017-03-05 11:23:17,744] - [cluster_helper:78] INFO - waiting for ns_server @ 172.23.105.181:8091
      [2017-03-05 11:23:17,750] - [cluster_helper:80] INFO - ns_server @ 172.23.105.181:8091 is running
      Cluster instance shutdown with force
      [2017-03-05 11:23:17,750] - [xdcrnewbasetests:1149] INFO - removing xdcr/nodes settings
      [2017-03-05 11:23:17,813] - [rest_client:1025] INFO - removing remote cluster name:remote_cluster_C2-C1
      [2017-03-05 11:23:17,848] - [xdcrnewbasetests:1159] INFO - cleanup [ip:172.23.107.120 port:8091 ssh_username:root, ip:172.23.107.169 port:8091 ssh_username:root, ip:172.23.105.180 port:8091 ssh_username:root]
      [2017-03-05 11:23:17,886] - [bucket_helper:138] INFO - deleting existing buckets [u'default'] on 172.23.107.120
      [2017-03-05 11:23:17,886] - [bucket_helper:140] INFO - remove bucket default ...
      [2017-03-05 11:23:47,906] - [rest_client:790] ERROR - http://172.23.107.120:8091/pools/default/buckets/default error 500 reason: unknown {"_":"Bucket deletion not yet complete, but will continue.\r\n"}
      [2017-03-05 11:23:47,906] - [rest_client:1898] WARNING - Bucket deletion timed out waiting for all nodes
      [2017-03-05 11:23:47,931] - [bucket_helper:153] ERROR - Unable to get timings for bucket
      [2017-03-05 11:23:47,931] - [bucket_helper:154] INFO - deleted bucket : default from 172.23.107.120
      [2017-03-05 11:23:47,931] - [bucket_helper:230] INFO - waiting for bucket deletion to complete....
      [2017-03-05 11:23:47,955] - [rest_client:134] INFO - node 172.23.107.120 existing buckets : []
      [2017-03-05 11:23:48,005] - [cluster_helper:255] INFO - rebalancing all nodes in order to remove nodes
      [2017-03-05 11:23:48,010] - [rest_client:1322] INFO - rebalance params : password=welcome&ejectedNodes=ns_1%40172.23.107.169%2Cns_1%40172.23.105.180&user=Administrator&knownNodes=ns_1%40172.23.107.169%2Cns_1%40172.23.105.180%2Cns_1%40172.23.107.120
      [2017-03-05 11:23:48,017] - [rest_client:1326] INFO - rebalance operation started
      [2017-03-05 11:23:48,047] - [rest_client:1457] INFO - rebalance percentage : 0.00 %
      [2017-03-05 11:23:58,064] - [rest_client:1457] INFO - rebalance percentage : 0.00 %
      [2017-03-05 11:24:08,083] - [rest_client:1440] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2017-03-05 11:24:08,208] - [rest_client:2726] INFO - Latest logs from UI on 172.23.107.120:
      [2017-03-05 11:24:08,208] - [rest_client:2727] ERROR - {u'node': u'ns_1@172.23.107.120', u'code': 0
      

      Bucket deletion issue found in XDCR test cleanup - same as MB-16745 - attaching logs and console output

      Attachments

        1. 172.23.105.102.zip
          2.67 MB
        2. 172.23.105.178-20170305-1116-diag.zip
          4.92 MB
        3. 172.23.105.179-20170305-1117-diag.zip
          913 kB
        4. 172.23.105.180-20170305-1118-diag.zip
          3.63 MB
        5. 172.23.105.181-20170305-1120-diag.zip
          4.67 MB
        6. 172.23.105.20.zip
          6.53 MB
        7. 172.23.105.3.zip
          6.15 MB
        8. 172.23.105.4.zip
          4.24 MB
        9. 172.23.105.67.zip
          2.64 MB
        10. 172.23.107.120-20170305-1111-diag.zip
          8.73 MB
        11. 172.23.107.169-20170305-1114-diag.zip
          5.69 MB
        12. 172.23.107.171-20170305-1109-diag.zip
          6.68 MB
        13. 172.23.107.198-20170305-1106-diag.zip
          10.12 MB
        14. 172.23.107.53.zip
          2.27 MB
        15. consoleFull.rtf
          91 kB

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            Issue found again in 5.0.0-2564 - logs are here:
            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/collectinfo-2017-04-10T040044-ns_1%40172.23.105.27.zip
            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/collectinfo-2017-04-10T040044-ns_1%40172.23.106.251.zip
            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/collectinfo-2017-04-10T040044-ns_1%40172.23.108.169.zip
             
            Bucket deletion happened at test_39:
             
            [2017-04-08 17:32:05,084] - [rest_client:784] ERROR - DELETE http://172.23.105.27:8091/pools/default/buckets/default
            body: headers: {'Content-Type': 'application/x-www-form-urlencoded', 'Accept': '/', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==\n'} error: 500 reason: unknown {"_":"Bucket deletion not yet complete, but will continue.\r\n"} auth: Administrator:password
             
            All the following testcases failed with rebalance issues - logs were collected at the end - job ca be found here:
             
            http://qa.sc.couchbase.com/job/cen006-p1-xxdcr-vset07-00-goxdcr-rebalance/565/consoleFull
             
             

            arunkumar Arunkumar Senthilnathan (Inactive) added a comment - Issue found again in 5.0.0-2564 - logs are here: https://s3.amazonaws.com/bugdb/jira/MB23161/23161/collectinfo-2017-04-10T040044-ns_1%40172.23.105.27.zip https://s3.amazonaws.com/bugdb/jira/MB23161/23161/collectinfo-2017-04-10T040044-ns_1%40172.23.106.251.zip https://s3.amazonaws.com/bugdb/jira/MB23161/23161/collectinfo-2017-04-10T040044-ns_1%40172.23.108.169.zip   Bucket deletion happened at test_39:   [2017-04-08 17:32:05,084] - [rest_client:784] ERROR - DELETE http://172.23.105.27:8091/pools/default/buckets/default body: headers: {'Content-Type': 'application/x-www-form-urlencoded', 'Accept': ' / ', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==\n'} error: 500 reason: unknown {"_":"Bucket deletion not yet complete, but will continue.\r\n"} auth: Administrator:password   All the following testcases failed with rebalance issues - logs were collected at the end - job ca be found here:   http://qa.sc.couchbase.com/job/cen006-p1-xxdcr-vset07-00-goxdcr-rebalance/565/consoleFull    

            There are still two offending/waiting connections. The connections are no longer the ones set up by pipeline supervisors, though. The "cmd" field of the connection shows "DCP_MUTATION", which indicates that the connection is a dcp connection. In comparison, the "cmd" field was "STAT" for the connection set up by pipeline supervisor.

            Will dig some more.

            2017-04-09T21:01:12.891631-07:00 NOTICE Worker thread 3: {"connection":"0x7f18d375a180","socket":50,"protocol":"memcached","peername":"172.23.105.27:46401","sockname":"172.23.105.27:11210","parent_port":11210,"bucket_index":1,"internal":true,"username":"@goxdcr","sasl_conn":"0x7f18e15e0100","nodelay":true,"refcount":2,"features":{"mutation_extras":false,"xerror":false},"engine_storage":"0x7f18c588c900","next":"0x0","thread":"0x7f1927483390","priority":"Medium","clustermap_revno":"unknown","sasl_disabled":false,"tap":false,"dcp":true,"dcp_xattr_aware":false,"dcp_no_value":false,"opaque":"0x44002a68","max_reqs_per_event":5,"nevents":0,"state":"conn_mwrite","cmd":"DCP_MUTATION","libevent":{"registered":true,"ev_flags":{"raw":"0x14","decoded":["write","persist"]},"which":{"raw":"0x4","decoded":["write"]}},"read":{"buf":"0x7f18c5815000","curr":"0x7f18c5815000","size":8192,"bytes":0},"write"{"buf":"0x7f18d1e96000","curr":"0x7f18d1e96037","size":2048,"bytes":55},"write_and_go":"conn_ship_log","ritem":"0x7f18c5815048","rlbytes":0,"item":"0x0","iov"{"size":10,"used":3,"vector":[\{"base":"0x7f18d1e96000","len":"0x37"},\{"base":"0x7f18e0c516f9","len":"0xb"},\{"base":"0x7f18e13f402b","len":"0x161"}]},"msglist":{"size":5,"used":1,"curr":0,"bytes":578},"itemlist":{"size":1},"temp_alloc_list":{"size":0},"noreply":(false,"DynamicBuffer":{"buffer":"0x0","size":0,"offset":0},"cas":"0x0","aiostat":0,"ewouldblock":false,"ssl":{"enabled":false},"total_recv":10496,"total_send":1063612,"datatype":"snappy,json"}

            yu Yu Sui (Inactive) added a comment - There are still two offending/waiting connections. The connections are no longer the ones set up by pipeline supervisors, though. The "cmd" field of the connection shows "DCP_MUTATION", which indicates that the connection is a dcp connection. In comparison, the "cmd" field was "STAT" for the connection set up by pipeline supervisor. Will dig some more. 2017-04-09T21:01:12.891631-07:00 NOTICE Worker thread 3: {"connection":"0x7f18d375a180","socket":50,"protocol":"memcached","peername":"172.23.105.27:46401","sockname":"172.23.105.27:11210","parent_port":11210,"bucket_index":1,"internal":true,"username":"@goxdcr","sasl_conn":"0x7f18e15e0100","nodelay":true,"refcount":2,"features":{"mutation_extras":false,"xerror":false},"engine_storage":"0x7f18c588c900","next":"0x0","thread":"0x7f1927483390","priority":"Medium","clustermap_revno":"unknown","sasl_disabled":false,"tap":false,"dcp":true,"dcp_xattr_aware":false,"dcp_no_value":false,"opaque":"0x44002a68","max_reqs_per_event":5,"nevents":0,"state":"conn_mwrite","cmd":"DCP_MUTATION","libevent":{"registered":true,"ev_flags":{"raw":"0x14","decoded": ["write","persist"] },"which":{"raw":"0x4","decoded": ["write"] }},"read":{"buf":"0x7f18c5815000","curr":"0x7f18c5815000","size":8192,"bytes":0},"write" {"buf":"0x7f18d1e96000","curr":"0x7f18d1e96037","size":2048,"bytes":55},"write_and_go":"conn_ship_log","ritem":"0x7f18c5815048","rlbytes":0,"item":"0x0","iov" {"size":10,"used":3,"vector": [\{"base":"0x7f18d1e96000","len":"0x37"},\{"base":"0x7f18e0c516f9","len":"0xb"},\{"base":"0x7f18e13f402b","len":"0x161"}] },"msglist":{"size":5,"used":1,"curr":0,"bytes":578},"itemlist":{"size":1},"temp_alloc_list":{"size":0},"noreply":(false,"DynamicBuffer":{"buffer":"0x0","size":0,"offset":0},"cas":"0x0","aiostat":0,"ewouldblock":false,"ssl":{"enabled":false},"total_recv":10496,"total_send":1063612,"datatype":"snappy,json"}

            There were two replication instances with the same replication id running at the time of problem. When test completed and a command was issued to stop the replication, only one replication instance was stopped and the other was left running. This should be the reason that some dcp connections were left open at the end of test run.

            Will look for ways to prevent multiple instances from occuring.

            yu Yu Sui (Inactive) added a comment - There were two replication instances with the same replication id running at the time of problem. When test completed and a command was issued to stop the replication, only one replication instance was stopped and the other was left running. This should be the reason that some dcp connections were left open at the end of test run. Will look for ways to prevent multiple instances from occuring.

            Build 5.0.0-2629 contains goxdcr commit ac368d248c34024045f4c5a2fc64393e2a9fcbdc with commit message:
            MB-23161 fix replication stop sequence when replication spec is deleted
            https://github.com/couchbase/goxdcr/commit/ac368d248c34024045f4c5a2fc64393e2a9fcbdc

            build-team Couchbase Build Team added a comment - Build 5.0.0-2629 contains goxdcr commit ac368d248c34024045f4c5a2fc64393e2a9fcbdc with commit message: MB-23161 fix replication stop sequence when replication spec is deleted https://github.com/couchbase/goxdcr/commit/ac368d248c34024045f4c5a2fc64393e2a9fcbdc

            Issue not seen in 5.0.0-2634 regression - closing it out

            arunkumar Arunkumar Senthilnathan (Inactive) added a comment - Issue not seen in 5.0.0-2634 regression - closing it out

            People

              yu Yu Sui (Inactive)
              arunkumar Arunkumar Senthilnathan (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty