  1. Couchbase Server
  2. MB-5534

keys failing to persist to disk with state "ram_but_not_disk" during rebalance

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Security Level: Public
    • Labels:
      None
    • Environment:
      2.0.0-1314-rel

      Description

      Seen in viewquerytests.ViewQueryTests.test_employee_dataset_startkey_endkey_queries_rebalance_in:

      Load 200k docs
      Add+Rebalance 6 nodes to cluster while running queries

      It looks like keys being moved to the new servers are not always persisted to disk.

      From the test logs: we report when a key did not persist and therefore cannot be indexed by the view engine.

      ["query doc_id: admin0150-2008_12_27 doesn\'t exist in bucket: default", "Error expected in results for key with invalid state

      {\'key_vb_state\': \'active\', \'key_last_modification_time\': \'1339537159\', \'key_data_age\': \'0\', \'key_cas\': \'7680066560307527\', \'key_exptime\': \'0\', \'key_is_dirty\': \'0\', \'key_flags\': \'0\', \'key_valid\': \'ram_but_not_disk\'}

      ",

      There should be plenty of time for the keys to persist, as we retry the query several times and no new docs are being loaded or updated during the rebalance.

      Diags attached.
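The state check behind that log entry can be sketched as follows. This is a minimal illustration, assuming the key metadata arrives as a dict of strings shaped like the entry above; `in_ram_only` is a hypothetical helper and is not taken from testrunner or the attached get_key_meta.py.

```python
# Hypothetical helper: flag keys that the server reports as resident in RAM
# but not yet written to disk. Field names mirror the log entry in this report.

def in_ram_only(meta):
    """Return True when the key's reported state is 'ram_but_not_disk'."""
    return meta.get("key_valid") == "ram_but_not_disk"


# Metadata copied from the failing key in the test log.
meta = {
    "key_vb_state": "active",
    "key_last_modification_time": "1339537159",
    "key_data_age": "0",
    "key_cas": "7680066560307527",
    "key_exptime": "0",
    "key_is_dirty": "0",
    "key_flags": "0",
    "key_valid": "ram_but_not_disk",
}

ram_only = in_ram_only(meta)
if ram_only:
    print("key is in RAM only, so the view engine cannot index it yet")
```

A key in this state is invisible to the indexer until it reaches disk, which is why the query result is missing the document.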

      1. 10.2.2.108-8091-diag.txt.gz
        555 kB
        Tommie McAfee
      2. 10.2.2.109-8091-diag.txt.gz
        1.69 MB
        Tommie McAfee
      3. 10.2.2.60-8091-diag.txt.gz
        1.43 MB
        Tommie McAfee
      4. 10.2.2.63-8091-diag.txt.gz
        1.68 MB
        Tommie McAfee
      5. 10.2.2.64-8091-diag.txt.gz
        562 kB
        Tommie McAfee
      6. 10.2.2.65-8091-diag.txt.gz
        559 kB
        Tommie McAfee
      7. 10.2.2.67-8091-diag.txt.gz
        567 kB
        Tommie McAfee
      8. finddoc.pl
        0.9 kB
        Tommie McAfee
      9. get_key_meta.py
        0.8 kB
        Tommie McAfee

        Activity

        FilipeManana Filipe Manana (Inactive) added a comment -

        Super, thanks Jin!

        jin Jin Lim (Inactive) added a comment - http://review.couchbase.org/#change,17367
        jin Jin Lim (Inactive) added a comment - - edited

        The test now passes without any failures after locally building ep-engine with the fix, http://review.couchbase.org/#change,17367.

        Note: the number of items being persisted during rebalancing still fluctuates as vbuckets move, which is perfectly normal under the current design. Once rebalancing completes, one should expect NO data loss at all.

        tommie Tommie McAfee added a comment -

        Here's the test runner command to reproduce:
        python testrunner -i resources/jenkins/centos-64-7node-viewquery.ini -t viewquerytests.ViewQueryTests.test_employee_dataset_startkey_endkey_queries_rebalance_in

        I've disabled the job in Jenkins, so if you cannot reproduce this in your own VMs you can run this exact command and it will run the test against our servers.

        One final observation from trying to reproduce this manually: heavy queries cause the disk write queue to drain more slowly, and when items are still in the queue before the rebalance (in) starts, we can get into this state.
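When reproducing by hand, one way to tell the two cases apart is to wait for the disk write queue to drain before starting the rebalance. A minimal sketch, assuming a caller-supplied `get_queue_size` callable that returns the current queue length; `wait_for_drain` and its parameters are hypothetical, and how the queue length is read from a node is deliberately left out.

```python
import time


def wait_for_drain(get_queue_size, timeout_s=60.0, poll_s=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_queue_size() until it returns 0 or timeout_s elapses.

    Returns True if the queue drained, False if the timeout was reached.
    sleep and clock are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if get_queue_size() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


# Usage with a fake queue that drains over three polls (no real waiting).
fake_sizes = iter([3, 1, 0])
drained = wait_for_drain(lambda: next(fake_sizes), sleep=lambda _s: None)
```

If the failure still appears after the queue has drained, slow draining alone does not explain it.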

        Let me know if anything else is needed.

        thanks,

        tommie Tommie McAfee added a comment -

        More debugging tips from Filipe: we discovered update_seq = 0 for db 56, which is the one that holds this key on the active node. This db also reports "no documents", so no writes have happened there:

        /opt/couchbase/bin/couch_dbinfo 56.couch.1
        DB Info (56.couch.1)
        file format version: 10
        update_seq: 0
        no documents
        B-tree size: 0 bytes
        total disk size: 4.0 kB

        The same is reflected in the index info:
        curl -s 'http://10.2.2.63:8092/_set_view/default/_design/dev_test_view-6ffa498/_info' | json_xs
        ….

        "active_partitions" : [ 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73 ],
        "pending_transition" : null,
        "update_seqs" : { "67" : 1560, "63" : 1571, "71" : 1585, "70" : 1581, "68" : 1558, "72" : 1585, "65" : 1580, "57" : 1573, "64" : 1581, "61" : 1584, "58" : 1585, "59" : 1590, "69" : 1563, "60" : 1125, "56" : 0, "73" : 1581, "66" : 1562, "62" : 1572 },
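Spotting the empty partition in that output can be automated. A minimal sketch, assuming the `_info` response has already been fetched and parsed into a dict containing the two fields shown above; `stalled_partitions` is a hypothetical helper, not part of any Couchbase tool.

```python
# Hypothetical helper: list active partitions whose update_seq is still 0,
# meaning the index has seen no writes for them.

def stalled_partitions(info):
    """Return the sorted active partition ids whose update_seq is 0 or missing."""
    seqs = info.get("update_seqs", {})
    return sorted(
        p for p in info.get("active_partitions", [])
        if seqs.get(str(p), 0) == 0
    )


# Values copied from the index info in this comment.
info = {
    "active_partitions": list(range(56, 74)),
    "update_seqs": {
        "67": 1560, "63": 1571, "71": 1585, "70": 1581, "68": 1558,
        "72": 1585, "65": 1580, "57": 1573, "64": 1581, "61": 1584,
        "58": 1585, "59": 1590, "69": 1563, "60": 1125, "56": 0,
        "73": 1581, "66": 1562, "62": 1572,
    },
}

stalled = stalled_partitions(info)
print(stalled)
```

For this data the only partition reported is 56, matching the couch_dbinfo result for 56.couch.1.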


          People

          • Assignee:
            jin Jin Lim (Inactive)
            Reporter:
            tommie Tommie McAfee