Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.1
    • Fix Version/s: 1.2.5
    • Component/s: library
    • Labels:
      None

      Description

      Our integration testing shows intermittent operation failures during tests in which a node is failed over, then added back and rebalanced. This is unexpected: there should be no failures during a rebalance.

      Assigning to Saakshi to further fill out the description.


        Activity

        saakshi.manocha Saakshi Manocha added a comment -

        The required changes for this issue were already released with NCBC-228, so I'm closing this one out.
        No further similar issues have been reported.

        ingenthr Matt Ingenthron added a comment -

        Note that we ran into this in a Java deployment today. There may be something odd happening here.

        Is it possible to get a packet capture of ports 8091, 8092 and 11210 from the client system while running this against a 2.0.0 server on Linux? This would allow us to see whether the cluster is behaving as expected.
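A minimal sketch of how the requested capture could be set up, assuming tcpdump is available on the client system (on a Windows client a different capture tool would be needed). The helper names, the `any` interface and the `rebalance.pcap` output file are illustrative assumptions, not values from this ticket; only the three ports come from the request above.

```python
import shlex

# Ports named in the request above: 8091, 8092 and 11210.
CLUSTER_PORTS = (8091, 8092, 11210)


def build_capture_filter(ports):
    """Return a pcap filter expression matching traffic on any of the ports."""
    return " or ".join(f"port {p}" for p in ports)


def build_tcpdump_command(ports, interface="any", outfile="rebalance.pcap"):
    """Return the argv list for a tcpdump run that writes raw packets to outfile.

    interface and outfile are illustrative defaults, not values from the ticket.
    """
    return ["tcpdump", "-i", interface, "-w", outfile, build_capture_filter(ports)]


if __name__ == "__main__":
    # Print the command for review; running tcpdump usually needs root privileges.
    print(shlex.join(build_tcpdump_command(CLUSTER_PORTS)))
```

Writing the raw packets to a file with `-w` keeps the capture available for later inspection rather than relying on tcpdump's live text output.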

        saakshi.manocha Saakshi Manocha added a comment - - edited

        The report (sdkd-reports -> nosdk-nocluster-3d_AT-2013-02-24T22-21-32) shows that the error messages occur in debug mode during rebalance, but the error rate does not increase. During and after the rebound phase the errors disappear and the cluster recovers fully.
        As long as there are no errors after the rebalance operation is complete, the report is good.
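The pass criterion stated above can be sketched as a small check. This is an illustration only: the function name, the dictionary layout and the sample counts are hypothetical, and the phase names CHANGE and REBOUND are borrowed from the test observations elsewhere in this ticket.

```python
def report_is_good(errors_by_phase, final_phase="REBOUND"):
    """Return True when no errors were recorded once the rebalance had completed.

    errors_by_phase maps a phase name to its error count; a phase that is
    missing from the mapping is treated as having zero errors.
    """
    return errors_by_phase.get(final_phase, 0) == 0


if __name__ == "__main__":
    # Hypothetical counts: errors while the topology changes, none afterwards.
    sample = {"CHANGE": 37, "REBOUND": 0}
    print(report_is_good(sample))
```

Errors during the CHANGE phase do not affect the result here, which mirrors the comment's reasoning that only errors after the rebalance completes make a report bad.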

        saakshi.manocha Saakshi Manocha added a comment -

        Ran a full suite of hybrid test scenarios using sdkd and the latest enyim.caching changes (made by John for issue CBSE-396).
        The report is ready with comments and shared through Google Docs:
        sdkd-reports -> nosdk-nocluster-3d_AT-2013-02-24T22-21-32

        The report has better grades than last month's report, which is good.

        mnunberg Mark Nunberg added a comment -

        Interesting to note that there are NOT_MY_VBUCKET errors well after the rebalance that follows the readd.

        saakshi.manocha Saakshi Manocha added a comment - - edited
        • Reproduced the brun test lists again to include the newly added reAdd test.
        • Ran the command:
          python .\brun -C Sdkd.args -S dotnet-1.2-release -V 2.0.0-1976 -i cluster_config.ini -T HYBRID_readd-2
          (This command fails two nodes, adds them back, and then rebalances.)
        • cluster_config.ini comprises 4 nodes:
          10.3.121.134 10.3.121.135 10.3.121.136 10.3.3.206
        • Output is here:
          http://sdk-testresults.couchbase.com.s3.amazonaws.com/sdkd/HWIN-335SPEPOCGT-IHYBRID_readd-2-Sdotnet-1.2-release-T2013-02-14-03.49.12-LV_CB_BASIC.txt
          http://sdk-testresults.couchbase.com.s3.amazonaws.com/sdkd/HWIN-335SPEPOCGT-IHYBRID_readd-2-Sdotnet-1.2-release-T2013-02-14-03.49.12-LV_MC_BASIC.txt
          http://sdk-testresults.couchbase.com.s3.amazonaws.com/sdkd/HWIN-335SPEPOCGT-IHYBRID_readd-2-Sdotnet-1.2-release-T2013-02-14-03.49.12-LV_HTTP_BASIC.txt
        • Observations:
          (a) The following errors occur continuously during the CHANGE phase, while the rebalance operation is in progress:
          [Enyim.Caching.Memcached.MemcachedNode|Error] System.IO.IOException: Failed to read from the socket '10.3.121.136:11210'. Error: SocketError value was Success, but 0 bytes were received
          [Enyim.Caching.Memcached.MemcachedNode.InternalPoolImpl|Error] Could not init pool. System.NullReferenceException Object reference not set to an instance of an object.
          [Sdkd.ViewQuery|Warn] Unrecognized error System.Net.WebException The operation has timed out
          (b) No errors occur during the REBOUND phase, which is a good sign. This is the period when the rebalance operation is complete and no more topology changes occur.
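The error lines quoted in the observations share a `[Logger|Level] message` shape, so a short script can tally them by source when comparing runs. This is a sketch under that assumption only: the regular expression, the function names and the idea of tallying per logger are illustrative and are not part of sdkd or brun. The sample lines are copied from the observations above.

```python
import re
from collections import Counter

# Matches lines shaped like "[Logger.Name|Level] message", as in the excerpts above.
LOG_LINE = re.compile(r"^\[(?P<logger>[^|\]]+)\|(?P<level>[^\]]+)\]\s*(?P<message>.*)$")


def parse_log_line(line):
    """Return (logger, level, message) for a matching line, else None."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("logger"), match.group("level"), match.group("message")


def tally(lines):
    """Count parsed lines per (logger, level); lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is not None:
            logger, level, _ = parsed
            counts[(logger, level)] += 1
    return counts


SAMPLE = [
    "[Enyim.Caching.Memcached.MemcachedNode|Error] System.IO.IOException: Failed to read from the socket '10.3.121.136:11210'. Error: SocketError value was Success, but 0 bytes were received",
    "[Enyim.Caching.Memcached.MemcachedNode.InternalPoolImpl|Error] Could not init pool. System.NullReferenceException Object reference not set to an instance of an object.",
    "[Sdkd.ViewQuery|Warn] Unrecognized error System.Net.WebException The operation has timed out",
]

if __name__ == "__main__":
    for (logger, level), count in sorted(tally(SAMPLE).items()):
        print(f"{level:5} {count} {logger}")
```

Keying the tally on both logger and level keeps socket and pool errors from the memcached client separate from view-query timeouts, which are reported by a different component.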

          People

          • Assignee:
            saakshi.manocha Saakshi Manocha
            Reporter:
            ingenthr Matt Ingenthron
          • Votes:
            0
            Watchers:
            3
