Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-40370

[Doc_Isolation]: xattr::utils::check_len(2634346613) exceeds 287

    XMLWordPrintable

    Details

    • Triage:
      Untriaged
    • Operating System:
      Centos 64-bit
    • Story Points:
      1
    • Is this a Regression?:
      No

      Description

       

      Build: 6.6.0-7861

      Scenario:

      1. Single node cluster, Couchbase bucket (replica=0)
      2. Start two parallel transactions (one will succeed and other will rollback)
      3. Rebalance in 3 nodes into the cluster in parallel to the transactions

      Observation:

      Seeing rebalance failure with reason "mover crashed - bulk_set_vbucket_state_failed".

      cbcollect logs:
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/rebalance_failure/collectinfo-2020-07-09T180702-ns_1%40172.23.105.155.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/rebalance_failure/collectinfo-2020-07-09T180702-ns_1%40172.23.105.159.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/rebalance_failure/collectinfo-2020-07-09T180702-ns_1%40172.23.105.205.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/rebalance_failure/collectinfo-2020-07-09T180702-ns_1%40172.23.105.206.zip
      Rebalance failure prints:

      Rebalance exited with reason {mover_crashed,
      {unexpected_exit,
      {'EXIT',<0.23194.0>,
      {{bulk_set_vbucket_state_failed,
      [{'ns_1@172.23.105.206',
      {'EXIT',
      {{{{{badmatch, [{<17504.5562.0>,
      {done,exit, {socket_closed,
      {gen_server,call,
      [<17504.5215.0>,
      {maybe_close_stream,905}, infinity]}},
      [{gen_server,call,3,
      [{file,"gen_server.erl"}, {line,214}]},
      {dcp_replicator, '-handle_call/3-fun-1-',2, 
      [{file, "src/dcp_replicator.erl"}, {line,128}]},
      {dcp_replicator, '-spawn_and_wait/1-fun-0-',1,
      [{file, "src/dcp_replicator.erl"}, {line,243}]}]}}]},
      [{misc, sync_shutdown_many_i_am_trapping_exits, 1,
      [{file,"src/misc.erl"}, {line,1374}]},
      {dcp_replicator,spawn_and_wait,1,
      [{file,"src/dcp_replicator.erl"}, {line,265}]},
      {dcp_replicator,handle_call,3,
      [{file,"src/dcp_replicator.erl"}, {line,127}]},
      {gen_server,try_handle_call,4,
      [{file,"gen_server.erl"}, {line,636}]},
      {gen_server,handle_msg,6,
      [{file,"gen_server.erl"}, {line,665}]},
      {proc_lib,init_p_do_apply,3,
      [{file,"proc_lib.erl"}, {line,247}]}]},
      {gen_server,call,
      [<17504.5214.0>,get_partitions, infinity]}},
      {gen_server,call,
      ['dcp_replication_manager-default',
      {get_replicator_pid,903},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default',
      'ns_1@172.23.105.206'},
      {if_rebalance,<0.3370.0>,
      {update_vbucket_state,900,pending,
      passive,'ns_1@172.23.105.155'}},
      infinity]}}}}]},
      [{janitor_agent,bulk_set_vbucket_state,4,
      [{file,"src/janitor_agent.erl"}, {line,403}]},
      {proc_lib,init_p,3,
      [{file,"proc_lib.erl"},{line,232}]}]}}}}.
      Rebalance Operation Id = 7a1aaa2b7b2c47644b91cb745f74d3be
       
      Worker <0.16520.0> (for action 
      {move,{433,
      ['ns_1@172.23.105.155'],
      ['ns_1@172.23.105.159'],
      []}}) 
      exited with reason {unexpected_exit,
      {'EXIT', <0.17484.0>,
      {{{{{badmatch, [{<25299.4924.0>,
      {done, exit,
      {socket_closed,
      {gen_server, call,
      [<25299.4462.0>, {maybe_close_stream, 433}, infinity]}},
      [{gen_server, call, 3,
      [{file, "gen_server.erl"}, {line, 214}]},
      {dcp_replicator, '-handle_call/3-fun-1-', 2,
      [{file, "src/dcp_replicator.erl"}, {line, 128}]},
      {dcp_replicator, '-spawn_and_wait/1-fun-0-', 1,
      [{file, "src/dcp_replicator.erl"}, {line, 243}]}]}}]},
      [{misc, sync_shutdown_many_i_am_trapping_exits, 1,
      [{file, "src/misc.erl"}, {line, 1374}]},
      {dcp_replicator, spawn_and_wait, 1,
      [{file, "src/dcp_replicator.erl"}, {line, 265}]},
      {dcp_replicator, handle_call, 3,
      [{file, "src/dcp_replicator.erl"}, {line, 127}]},
      {gen_server, try_handle_call, 4,
      [{file, "gen_server.erl"}, {line, 636}]},
      {gen_server, handle_msg, 6,
      [{file, "gen_server.erl"}, {line, 665}]},
      {proc_lib, init_p_do_apply, 3,
      [{file, "proc_lib.erl"}, {line, 247}]}]},
      {gen_server, call,
      [<25299.4461.0>,
      get_partitions, infinity]}},
      {gen_server, call,
      ['dcp_replication_manager-default',
      {get_replicator_pid, 431}, infinity]}},
      {gen_server, call,
      [{'janitor_agent-default', 'ns_1@172.23.105.159'},
      {if_rebalance, <0.4111.0>,
      {dcp_takeover, 'ns_1@172.23.105.155', 433}}, infinity]}}}} 

      Testcase:

      ./testrunner -i /tmp/5-centos-nodes-jython.ini rerun=False,get-cbcollect-info=False -t Atomicity.doc_isolation.IsolationDocTest.test_transaction_with_rebalance,nodes_init=1,replicas=0,rebalance_type=in,nodes_in=3,doc_op=create,GROUP=P1
      

       

       

        Attachments

        1. __dcp_prepare.png
          __dcp_prepare.png
          355 kB
        2. __subdoc.png
          __subdoc.png
          298 kB
        3. pcaps+test_log.zip
          8.85 MB

          Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

            Activity

            ashwin.govindarajulu Ashwin Govindarajulu created issue -
            Hide
            owend Daniel Owen added a comment -

            On 172.23.105.159:

            2020-07-09T11:06:30.597186-07:00 ERROR 66: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:40961 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007f8c90437410","ewouldblock":false,"packet":{"bodylen":336,"cas":1594317990558040064,"datatype":["Snappy","Xattr"],"extlen":31,"key":"<ud>test_docs-00166742</ud>","keylen":18,"magic":"ClientRequest","opaque":34,"opcode":"DCP_PREPARE","vbucket":395},"refcount":1}] - closing connection ([ 127.0.0.1:40961 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): xattr::utils::check_len(2634346613) exceeds 287
            

            Show
            owend Daniel Owen added a comment - On 172.23.105.159: 2020-07-09T11:06:30.597186-07:00 ERROR 66: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:40961 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007f8c90437410","ewouldblock":false,"packet":{"bodylen":336,"cas":1594317990558040064,"datatype":["Snappy","Xattr"],"extlen":31,"key":"<ud>test_docs-00166742</ud>","keylen":18,"magic":"ClientRequest","opaque":34,"opcode":"DCP_PREPARE","vbucket":395},"refcount":1}] - closing connection ([ 127.0.0.1:40961 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): xattr::utils::check_len(2634346613) exceeds 287
            Hide
            owend Daniel Owen added a comment -

            Also on 172.23.105.206:

            2020-07-09T11:06:30.603926-07:00 ERROR 66: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:41454 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007f560c243010","ewouldblock":false,"packet":{"bodylen":336,"cas":1594317990559481856,"datatype":["Snappy","Xattr"],"extlen":31,"key":"<ud>test_docs-00166740</ud>","keylen":18,"magic":"ClientRequest","opaque":38,"opcode":"DCP_PREPARE","vbucket":901},"refcount":1}] - closing connection ([ 127.0.0.1:41454 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): xattr::utils::check_len(2634346613) exceeds 287
            

            Show
            owend Daniel Owen added a comment - Also on 172.23.105.206: 2020-07-09T11:06:30.603926-07:00 ERROR 66: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:41454 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007f560c243010","ewouldblock":false,"packet":{"bodylen":336,"cas":1594317990559481856,"datatype":["Snappy","Xattr"],"extlen":31,"key":"<ud>test_docs-00166740</ud>","keylen":18,"magic":"ClientRequest","opaque":38,"opcode":"DCP_PREPARE","vbucket":901},"refcount":1}] - closing connection ([ 127.0.0.1:41454 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): xattr::utils::check_len(2634346613) exceeds 287
            Hide
            owend Daniel Owen added a comment -

            Following the code in kv_engine/xattr/utils.cc

            static void check_len(uint32_t len, size_t size) {
                if (len > size) {
                    throw std::out_of_range("xattr::utils::check_len(" +
                                            std::to_string(len) + ") exceeds " +
                                            std::to_string(size));
                }
            }
            

            Only called from

            uint32_t get_body_offset(const cb::const_char_buffer& payload) {
            Expects(payload.size() > 0);
            const uint32_t* lenptr = reinterpret_cast<const uint32_t*>(payload.buf);
            auto len = ntohl(*lenptr);
            check_len(len, payload.size());
             return len + sizeof(uint32_t);
            }
             
            uint32_t get_body_offset(const cb::char_buffer& payload) {
            Expects(payload.size() > 0);
            const uint32_t* lenptr = reinterpret_cast<const uint32_t*>(payload.buf);
            auto len = ntohl(*lenptr);
            check_len(len, payload.size());
             return len + sizeof(uint32_t);
            }
            

            So it looks like len is negative (so appears as a very large unsigned value)

            Show
            owend Daniel Owen added a comment - Following the code in kv_engine/xattr/utils.cc static void check_len(uint32_t len, size_t size) { if (len > size) { throw std::out_of_range("xattr::utils::check_len(" + std::to_string(len) + ") exceeds " + std::to_string(size)); } } Only called from uint32_t get_body_offset(const cb::const_char_buffer& payload) { Expects(payload.size() > 0); const uint32_t* lenptr = reinterpret_cast<const uint32_t*>(payload.buf); auto len = ntohl(*lenptr); check_len(len, payload.size()); return len + sizeof(uint32_t); }   uint32_t get_body_offset(const cb::char_buffer& payload) { Expects(payload.size() > 0); const uint32_t* lenptr = reinterpret_cast<const uint32_t*>(payload.buf); auto len = ntohl(*lenptr); check_len(len, payload.size()); return len + sizeof(uint32_t); } So it looks like len is negative (so appears as a very large unsigned value)
            owend Daniel Owen made changes -
            Field Original Value New Value
            Summary [Doc_Isolation]: Rebalance failed with reason "mover crashed - bulk_set_vbucket_state_failed" [Doc_Isolation]: xattr::utils::check_len(2634346613) exceeds 287
            Hide
            owend Daniel Owen added a comment -

            Noticed that all the documents have the datatype

             ep_active_datatype_snappy,json
            

            Show
            owend Daniel Owen added a comment - Noticed that all the documents have the datatype ep_active_datatype_snappy,json
            Hide
            owend Daniel Owen added a comment -

            Looking at the 2 cases:
            bodylen = 336
            extlen = 31
            keylen = 18

            exceeds payload.size = 287
            (287 + 31 + 18) = 336

            Show
            owend Daniel Owen added a comment - Looking at the 2 cases: bodylen = 336 extlen = 31 keylen = 18 exceeds payload.size = 287 (287 + 31 + 18) = 336
            Hide
            owend Daniel Owen added a comment -

            Looking at

            void Item::removeBodyAndOrXattrs(
            500         IncludeValue includeVal,
            501         IncludeXattrs includeXattrs,
            502         IncludeDeletedUserXattrs includeDeletedUserXattrs) {
            503     if (!value) {
            504         // If no value (ie, no body and/or xattrs) then nothing to do
            505         return;
            506     }
            507 
            508     // Take a copy of the original datatype before proceeding, any modification
            509     // to the value may change the datatype.
            510     const auto originalDatatype = getDataType();
            511 
            512     // Note: IncludeValue acts like "include body"
            513     if (includeVal != IncludeValue::Yes) {
            514         removeBody();
            515     }
            516 
            517     if (includeXattrs == IncludeXattrs::No) {
            518         removeXattrs();
            519     }
            520 
            521     if (isDeleted() &&
            522         includeDeletedUserXattrs == IncludeDeletedUserXattrs::No) {
            523         removeUserXattrs();
            524     }
            525 
            526     // Datatype for no-value must be RAW
            527     if (getNBytes() == 0) {
            528         Expects(datatype == PROTOCOL_BINARY_RAW_BYTES);
            529     }
            530 
            531     // MB-31967: Restore the complete datatype if requested
            532     if (includeVal == IncludeValue::NoWithUnderlyingDatatype) {
            533         setDataType(originalDatatype);
            534     }
            535 }
            

            Is line 533 safe? For example if we removed the Xattrs could line 533 result in us setting the datatype to having XATTRS when they have been removed?

            Show
            owend Daniel Owen added a comment - Looking at void Item::removeBodyAndOrXattrs( 500 IncludeValue includeVal, 501 IncludeXattrs includeXattrs, 502 IncludeDeletedUserXattrs includeDeletedUserXattrs) { 503 if (!value) { 504 // If no value (ie, no body and/or xattrs) then nothing to do 505 return; 506 } 507 508 // Take a copy of the original datatype before proceeding, any modification 509 // to the value may change the datatype. 510 const auto originalDatatype = getDataType(); 511 512 // Note: IncludeValue acts like "include body" 513 if (includeVal != IncludeValue::Yes) { 514 removeBody(); 515 } 516 517 if (includeXattrs == IncludeXattrs::No) { 518 removeXattrs(); 519 } 520 521 if (isDeleted() && 522 includeDeletedUserXattrs == IncludeDeletedUserXattrs::No) { 523 removeUserXattrs(); 524 } 525 526 // Datatype for no-value must be RAW 527 if (getNBytes() == 0) { 528 Expects(datatype == PROTOCOL_BINARY_RAW_BYTES); 529 } 530 531 // MB-31967: Restore the complete datatype if requested 532 if (includeVal == IncludeValue::NoWithUnderlyingDatatype) { 533 setDataType(originalDatatype); 534 } 535 } Is line 533 safe? For example if we removed the Xattrs could line 533 result in us setting the datatype to having XATTRS when they have been removed?
            owend Daniel Owen made changes -
            Assignee Daniel Owen [ owend ] Paolo Cocchi [ paolo.cocchi ]
            owend Daniel Owen made changes -
            Link This issue is duplicated by MB-40371 [ MB-40371 ]
            Hide
            paolo.cocchi Paolo Cocchi added a comment -

            Daniel Owen

            Is line 533 safe? For example if we removed the Xattrs could line 533 result in us setting the datatype to having XATTRS when they have been removed?

            By initial code inspection, what you describe seems possible. I'll spin a test to check how we behave at Consumer in that case.

            But, that doesn't seem the case here as the Consumer is receiving some kind of value in the DcpPrepare (the 287 bytes). That would not be possible if at Producer we do IncludeValue::NoWithUnderlyingDatatype + IncludeDeletedUserXattrs::No, as that would remove everything (value-size = 0).

            Ashwin Govindarajulu I'm not able to run the test as I miss the /tmp/5-centos-nodes-jython.ini file. Could you provide what is necessary for running the test locally in a cluster_run please?
            Also, could you provide the full pcap file for the run above so I can inspect the payload of the failing DcpPrepare?
            Thanks

            Show
            paolo.cocchi Paolo Cocchi added a comment - Daniel Owen Is line 533 safe? For example if we removed the Xattrs could line 533 result in us setting the datatype to having XATTRS when they have been removed? By initial code inspection, what you describe seems possible. I'll spin a test to check how we behave at Consumer in that case. But, that doesn't seem the case here as the Consumer is receiving some kind of value in the DcpPrepare (the 287 bytes). That would not be possible if at Producer we do IncludeValue::NoWithUnderlyingDatatype + IncludeDeletedUserXattrs::No, as that would remove everything (value-size = 0). Ashwin Govindarajulu I'm not able to run the test as I miss the /tmp/5-centos-nodes-jython.ini file. Could you provide what is necessary for running the test locally in a cluster_run please? Also, could you provide the full pcap file for the run above so I can inspect the payload of the failing DcpPrepare? Thanks
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Attachment pcaps+test_log.zip [ 100504 ]
            Show
            ashwin.govindarajulu Ashwin Govindarajulu added a comment - - edited Paolo Cocchi - Please find the pcaps+test_log.zip file. *Cbcollect file:* https://cb-jira.s3.us-east-2.amazonaws.com/logs/rebalance_failure/collectinfo-2020-07-10T113341-ns_1%40172.23.105.155.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/rebalance_failure/collectinfo-2020-07-10T113341-ns_1%40172.23.105.159.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/rebalance_failure/collectinfo-2020-07-10T113341-ns_1%40172.23.105.205.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/rebalance_failure/collectinfo-2020-07-10T113341-ns_1%40172.23.105.206.zip
            paolo.cocchi Paolo Cocchi made changes -
            Attachment __dcp_prepare.png [ 100522 ]
            Attachment __subdoc.png [ 100521 ]
            owend Daniel Owen made changes -
            Due Date 13/Jul/20
            till Till Westmann made changes -
            Link This issue blocks MB-38724 [ MB-38724 ]
            till Till Westmann made changes -
            Labels 6.6.0 Transactions functional-test 6.6.0 Transactions approved-for-6.6.0 functional-test
            owend Daniel Owen made changes -
            Assignee Paolo Cocchi [ paolo.cocchi ] Ashwin Govindarajulu [ ashwin.govindarajulu ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.6.0-7879 contains kv_engine commit 5c64d40 with commit message:
            MB-40370: Remove unused code in xattr/utils.cc

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.0-7879 contains kv_engine commit 5c64d40 with commit message: MB-40370 : Remove unused code in xattr/utils.cc
            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.6.0-7879 contains kv_engine commit e31f273 with commit message:
            MB-40370: Make cb::xattr::get_body_size resilient to compressed payloads

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.0-7879 contains kv_engine commit e31f273 with commit message: MB-40370 : Make cb::xattr::get_body_size resilient to compressed payloads
            owend Daniel Owen made changes -
            Link This issue is duplicated by MB-40440 [ MB-40440 ]
            Hide
            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Not seeing this issue on latest MH build.

            Verified using 6.6.0-7880. Closing this ticket.

            Show
            ashwin.govindarajulu Ashwin Govindarajulu added a comment - Not seeing this issue on latest MH build. Verified using 6.6.0-7880. Closing this ticket.
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            drigby Dave Rigby made changes -
            Link This issue is duplicated by MB-40467 [ MB-40467 ]
            drigby Dave Rigby made changes -
            Link This issue is duplicated by MB-40467 [ MB-40467 ]
            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-2647 contains kv_engine commit 5c64d40 with commit message:
            MB-40370: Remove unused code in xattr/utils.cc

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-2647 contains kv_engine commit 5c64d40 with commit message: MB-40370 : Remove unused code in xattr/utils.cc
            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-2647 contains kv_engine commit e31f273 with commit message:
            MB-40370: Make cb::xattr::get_body_size resilient to compressed payloads

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-2647 contains kv_engine commit e31f273 with commit message: MB-40370 : Make cb::xattr::get_body_size resilient to compressed payloads
            drigby Dave Rigby made changes -
            Link This issue is duplicated by MB-40579 [ MB-40579 ]
            Hide
            mihir.kamdar Mihir Kamdar added a comment - - edited

            Reopening and resolving the issue again to update the Fix Version (to include 7.0 as well).

            Ashwin Govindarajulu pls verify the fix on 7.0 as well.

            Show
            mihir.kamdar Mihir Kamdar added a comment - - edited Reopening and resolving the issue again to update the Fix Version (to include 7.0 as well). Ashwin Govindarajulu pls verify the fix on 7.0 as well.
            mihir.kamdar Mihir Kamdar made changes -
            Resolution Fixed [ 1 ]
            Status Closed [ 6 ] Reopened [ 4 ]
            mihir.kamdar Mihir Kamdar made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]
            mihir.kamdar Mihir Kamdar made changes -
            Resolution Fixed [ 1 ]
            Status Reopened [ 4 ] Resolved [ 5 ]
            Hide
            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Works fine on latest builds. Verified using 6.6.0-7883.

            Closing this ticket.

            Show
            ashwin.govindarajulu Ashwin Govindarajulu added a comment - Works fine on latest builds. Verified using 6.6.0-7883. Closing this ticket.
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s 7.0.0 [ 17233 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]

              People

              Assignee:
              ashwin.govindarajulu Ashwin Govindarajulu
              Reporter:
              ashwin.govindarajulu Ashwin Govindarajulu
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Dates

                Due:
                Created:
                Updated:
                Resolved:

                  Gerrit Reviews

                  There are no open Gerrit changes

                    PagerDuty