Couchbase .NET client library
NCBC-1297

.NETCore SDK causes ClientFailure for SubDoc operation on Windows with muxio

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0-dp3
    • Fix Version/s: 2.4.0
    • Component/s: library
    • Labels:
      None

      Description

      SDK: .NETCore

      Environment:

      • OS: Windows Server 2012 R2
      • Mode: sync, muxio

      Error: NodeUnavailable

      Scenario: Restart all nodes

       

      Issue: After the restart phase completes, all subdoc requests fail with the error 'NodeUnavailable'. This happens because https://github.com/couchbase/couchbase-net-client/blob/master/Src/Couchbase/IO/Services/MultiplexingIOService.cs#L334 throws the following exceptions:

       

      System.Security.Authentication.AuthenticationException: Authentication failed for bucket 'default'
      System.TimeoutException: The operation has timed out.

      I'm not sure why those two exceptions are thrown, but later, when the rebound phase is almost done, I see this log message:

      Successfully connected and marking data node xxx.xxx.xxx.xxx:11210 as up.

      It appears to be printed only after all test data has been collected.
       

       

          Activity

          jmorris Jeff Morris added a comment -

          Can you attach sdk logs and sdkd graph?

          jmorris Jeff Morris added a comment -

          >>I'm not sure why those two exceptions
          It's expected that if the client tries to connect or reestablish a connection to a down node, the request might time out. I suspect that if you ran the test again, the failure would occur at some other point in the code.

          >>Successfully connected and marking data node xxx.xxx.xxx.xxx:11210 as up.

          When the client gets x IO errors within y time, it will put itself into a node-unavailable state. Once this happens, the client will sleep for 1 second and then try to reconnect. Until it reconnects, every request with a key mapped to that node will return a NodeUnavailableException. Once the client connects, it will start processing operations again.

          That being said, after all nodes have come back online (and assuming that they are functioning correctly), the client should recover. It may take some time, because the Memcached process has to start and then stop returning temp_failures.

          Q: how many times did we run the test?
          Q: did we give the client enough time to recover?
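
The down/retry behaviour described above can be illustrated with a small, language-neutral sketch. It is not the SDK's code: the class `NodeHealth`, its parameters (`max_errors`, `window_seconds`, `retry_delay`), and the `connect` callable are hypothetical names chosen for illustration, and the thresholds stand in for the unspecified "x errors in y time". Only the 1-second retry delay comes from the comment itself.

```python
import time
from collections import deque


class NodeUnavailableError(Exception):
    """Stand-in for the SDK's NodeUnavailableException (illustrative only)."""


class NodeHealth:
    """Hypothetical model of one node's availability state."""

    def __init__(self, max_errors=10, window_seconds=0.5,
                 retry_delay=1.0, clock=time.monotonic):
        self.max_errors = max_errors          # "x" IO errors (assumed value)
        self.window_seconds = window_seconds  # "y" time window (assumed value)
        self.retry_delay = retry_delay        # 1 second, per the comment above
        self._clock = clock
        self._errors = deque()                # timestamps of recent IO errors
        self._next_retry_at = 0.0
        self.is_down = False

    def record_io_error(self):
        """Count an IO error; mark the node down once the threshold is hit."""
        now = self._clock()
        self._errors.append(now)
        # Forget errors that fell out of the sliding window.
        while self._errors and now - self._errors[0] > self.window_seconds:
            self._errors.popleft()
        if len(self._errors) >= self.max_errors:
            self.is_down = True
            self._next_retry_at = now + self.retry_delay

    def try_reconnect(self, connect):
        """Call connect() once the retry delay has elapsed; return True if up."""
        if not self.is_down:
            return True
        now = self._clock()
        if now < self._next_retry_at:
            return False                      # still "sleeping"
        if connect():
            self.is_down = False
            self._errors.clear()
            return True
        self._next_retry_at = now + self.retry_delay
        return False

    def send(self, operation):
        """Run an operation, or fail fast while the node is marked down."""
        if self.is_down:
            raise NodeUnavailableError("node is marked unavailable")
        return operation()
```

In this model, every call to `send` fails immediately while `is_down` is set, which matches the symptom in the report: requests keep returning NodeUnavailable until a reconnect attempt succeeds, however long that takes.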

          jaekwon.park Jae Park [X] (Inactive) added a comment -

          A: I ran this test about 10 times and the result was the same each time.

          A: The rebound time was 4 minutes, so I think that should be enough time. The message 'Successfully connected and marking data node xxxx:11210 as up' appears only after the test is done and results are being collected.

          Here is one of the Jenkins jobs. To see the debug log in the Jenkins log, I replaced Logger with Console.WriteLine.

          I will have to rerun the test to get clean logs that focus on this, but in the meantime you can refer to http://sdkbuilds.sc.couchbase.com/view/.NET/job/sdk-net-situational-release/job/netcore-windows-watson/54/console

           

          jmorris Jeff Morris added a comment - http://review.couchbase.org/#/c/73014/

            People

            • Assignee:
              jmorris Jeff Morris
              Reporter:
              jaekwon.park Jae Park [X] (Inactive)