Couchbase Server / MB-42555

gomemcached does not mark OS and timeout errors as fatal


    Details

    • Triage:
      Untriaged
    • Story Points:
      1
    • Is this a Regression?:
      Unknown

      Description

      Marking this as query, but it affects any component that uses go-couchbase or gomemcached.
      When a gomemcached Get() ends prematurely (most likely because of a golang timeout),
      client/transport.go:getResponse() leaves the response structure uninitialized, so its status is 0.
      A 0 status is not treated as fatal, so the connection is eventually reused and the next read incorrectly picks up the data left behind on the wire.


          Activity

          build-team Couchbase Build Team added a comment -

          Build sync_gateway-3.0.0-223 contains gomemcached commit 8e4fcec with commit message:
          MB-42555 mark connections as unusable on golang / os socket errors

          ajay.bhullar Ajay Bhullar added a comment -

          Verified in 6.5.1-6312 using the repro steps above. Based on Marco's comments, we do not see any bad magic error messages.

          ajay.bhullar Ajay Bhullar added a comment -

          Based on Marco's comments, this is verified in 6.6.1-9177: we do not see any bad magic error messages. The bulk get operation error is always the one above, and it happens infrequently. Ran the script ~30 times.

          marco.greco Marco Greco added a comment -

          That is wholly correct. The testcase is designed to generate the timeouts.
          What you mustn't see are BAD MAGIC error messages following the timeouts, and in my testing you don't.

          ajay.bhullar Ajay Bhullar added a comment -

          We still see the error in 6.6.1-9177:

          tid: 0 loop: 22
          {"elapsedTime": "1.489946ms", "executionTime": "1.300617ms", "resultSize": 0, "resultCount": 0, "errorCount": 2}
          [
            {"msg": "Error performing bulk get operation - cause: read tcp 127.0.0.1:47996->127.0.0.1:11210: i/o timeout", "code": 12008, "retry": true},
            {"msg": "Timeout 1ms exceeded", "code": 1080, "retry": true}
          ]

          Using Sitaram's script in the CBSE and his repro steps:

          VM with 8 CPUs and 8 GB

          Single-node cluster, 2 GB data node
          Bucket onebigjson, 2 GB
          CREATE PRIMARY INDEX ON onebigjson;
          SELECT META(d).id, d.* FROM onebigjson AS d LIMIT 1;
          The timeout is random, as Marco Greco suggested.

          /opt/couchbase/bin/cbworkloadgen -n 127.0.0.1:8091 -u Administrator -p password -j -b onebigjson -i 1000 -s 2000000


            People

            Assignee:
            mihir.kamdar Mihir Kamdar
            Reporter:
            marco.greco Marco Greco
            Votes:
            0
            Watchers:
            4

