  1. Couchbase .NET client library
  2. NCBC-1545

Adding nodes to a 5.0.0 cluster can return "None" as error, never recovering


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.5.1
    • Fix Version/s: 2.5.1
    • Component/s: None
    • Labels:
      None

      Description

      Without SSL, adding 2 nodes to a Spock GA cluster causes KV async operations to report "None" as the error, and throughput drops slightly instead of rising as expected after the 2 nodes are added.

      Steps to reproduce:

      1. Start a Spock cluster with 2 nodes (1 kv, 1 kv/index/query/fts) and 2 buckets. (The test ran against the first bucket.)
      2. Keep KV operations running against the Spock cluster, then add 2 nodes and rebalance.

      Expected: no errors, and throughput increases.

      Actual: "None" error messages, and throughput decreases slightly.

       

      http://sdk-testresults.couchbase.com.s3.amazonaws.com/SDK-SDK/CB-5.0.0-xxxx/Rb2In-HYBRID/10-03-17/071054/99763117665d02947674a88d7c6ac0f3-MC.html

        Attachments

        1. couchbase_packet_while_None.pcapng.zip
          531 kB
        2. log_fromBeginningToHalfwayRebalance.zip
          583 kB
        3. log.zip
          200 kB
        4. log2.zip
          3.49 MB


            Activity

            jmorris Jeff Morris added a comment -

            I see no errors or indication of failure in the logs provided. ResponseStatus.None generally means that the Status property has not been set, so perhaps it's failing but the IResult.Status field is never getting set?

            jmorris Jeff Morris added a comment -

            There is one view failure in log2 and almost 140K successes - "Persisted and replicated on first" (137916 hits in 1 file)

            Jae Park [X] - can you add code in SDKD to log the contents of the Exception and Message fields when ResponseStatus.None is encountered?
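
            The kind of logging requested above could look roughly like the sketch below. It is an illustration only: `OpStatus`, `OpResult`, and `ResultDiagnostics.Describe` are hypothetical stand-ins, not SDK or SDKD types, and the only assumption carried over from the thread is that a result exposes a status plus Message and Exception fields.

            ```csharp
            using System;

            // Hypothetical stand-in for the status enum; not the SDK's ResponseStatus.
            public enum OpStatus
            {
                None = 0,
                Success = 1
            }

            // Hypothetical minimal result shape with the three fields discussed above.
            public sealed class OpResult
            {
                public OpStatus Status { get; set; }
                public string Message { get; set; }
                public Exception Exception { get; set; }
            }

            public static class ResultDiagnostics
            {
                // Returns a log line when the status was never set (None), otherwise null.
                public static string Describe(OpResult result)
                {
                    if (result.Status != OpStatus.None)
                    {
                        return null;
                    }

                    return string.Format(
                        "Status=None Message={0} Exception={1}",
                        result.Message ?? "<null>",
                        result.Exception == null ? "<null>" : result.Exception.ToString());
                }
            }
            ```

            A caller would pass each failed result to `ResultDiagnostics.Describe` and write any non-null return value to the test log.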

            jaekwon.park Jae Park [X] (Inactive) added a comment -

            As we discussed offline while I shared my screen through GoToMeeting, I think that was a better view than copying and pasting here.

            Also, as you requested, I've attached log_fromBeginningToHalfwayRebalance.zip. As the file name suggests, it covers the period from the beginning of the test until a couple of seconds after the 'None' errors started (around 80% of the rebalance), so you can avoid the noise.

            Here is the result that maps to log_fromBeginningToHalfwayRebalance.zip:

            http://sdk-testresults.couchbase.com.s3.amazonaws.com/SDK-SDK/CB-5.0.0-3519/Rb2In-HYBRID/10-04-17/081349/0628e2aa2d7b62384596de1c6b5f870c-MC.html

            Let me know if you need further info.

            jaekwon.park Jae Park [X] (Inactive) added a comment -

            Uploaded couchbase_packet_while_None.pcapng.zip.

            It was captured from around the start of the rebalance until after a couple of 'None' errors had occurred.

            jaekwon.park Jae Park [X] (Inactive) added a comment -

            As I traced the code, the server response carried status 0x08, which the SDK translates to UnknownError, and result.Success was not true. As a result, CompletedFuncWithRetryForCouchbase<T>() calls SetException, and because the Status is not one of the known ResponseStatus values, it throws ArgumentOutOfRangeException.
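
            The failure pattern described above can be sketched as follows. This is an illustration only, not the SDK's source: the `Status` enum, its values, and both `StatusMapping` helpers are hypothetical. The sketch contrasts a mapping that throws ArgumentOutOfRangeException for any status it does not list with one whose default branch returns an exception naming the unexpected status.

            ```csharp
            using System;

            // Hypothetical stand-in for the SDK's status enum; values are illustrative only.
            public enum Status
            {
                None = 0,
                Success = 1,
                KeyNotFound = 2,
                UnknownError = 3
            }

            public static class StatusMapping
            {
                // Strict mapping: any status not listed throws ArgumentOutOfRangeException
                // instead of returning an exception that describes the failed operation.
                public static Exception ToExceptionStrict(Status status)
                {
                    switch (status)
                    {
                        case Status.KeyNotFound:
                            return new InvalidOperationException("Key not found");
                        default:
                            throw new ArgumentOutOfRangeException(
                                nameof(status), status, "Unhandled status");
                    }
                }

                // Tolerant mapping: the default branch returns an exception that names
                // the unexpected status, so the caller can still report it.
                public static Exception ToExceptionTolerant(Status status)
                {
                    switch (status)
                    {
                        case Status.KeyNotFound:
                            return new InvalidOperationException("Key not found");
                        default:
                            return new InvalidOperationException("Unexpected status: " + status);
                    }
                }
            }
            ```

            Under these assumptions, `StatusMapping.ToExceptionStrict(Status.UnknownError)` throws, while `StatusMapping.ToExceptionTolerant(Status.UnknownError)` returns an exception whose message includes the status name.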

            jmorris Jeff Morris added a comment -

            Michael Goldsmith -

            When you get online, we need to figure out why the server is returning "UnknownError". FWIW I couldn't replicate exactly, but did get into a state where I received timeouts even after rebalance completed. It could be a server bug, or a client bug... perhaps related or similar to NCBC-1517.

            Note: we believe we have isolated it to the case where we have two or more buckets on Spock GA; start with a two-node cluster, add two more nodes, and rebalance under a consistent load.

            -Jeff

            jaekwon.park Jae Park [X] (Inactive) added a comment -

            Hmm, the sync mode symptom was hidden behind the SSL issue.

            I tested without SSL in sync mode, and it never recovers.

            http://sdk-testresults.couchbase.com.s3.amazonaws.com/SDK-SDK/CB-5.0.0-3519/Rb2In-HYBRID/10-05-17/060275/af80aa2b46f510f4dc28c6d6c5c663e5-MC.html

            I will have to run with no SSL to check this.

            jmorris Jeff Morris added a comment -

            I am quite sure the problem is either how the client is handling the creation of new connections and enabling features or the server itself. I am tending towards the former...


              People

              • Assignee:
                mike.goldsmith Michael Goldsmith
                Reporter:
                jaekwon.park Jae Park [X] (Inactive)
              • Votes:
                0
                Watchers:
                3

                Dates

                • Created:
                  Updated:
                  Resolved:
