  1. Couchbase Server
  2. MB-7299

[system test] database compaction crashed during and after rebalance

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Labels:
    • Environment:
      Windows 2008 R2 SP1 64bit in EC2 build 2.0.0-1967

      Description

      Environment:
      8 Windows 2008 R2 SP1 64-bit servers in EC2 (each server has a 4-core CPU, 15GB RAM, 70GB for data and 170GB for views)

      1. 10.158.47.154
      2. 10.159.31.68
      3. 10.158.47.189
      4. 10.159.13.46
      5. 10.159.31.76
      6. 10.159.31.101
      1. 10.158.45.173
      2. 10.159.31.91

      Create a 6-node cluster.
      Create a default bucket and load 45+ million items with sizes from 128 to 512 bytes.
      Create one design doc with 2 views. Let the initial index build complete.
      Mutate items with new sizes from 512 to 1024 bytes (creates/updates/gets/deletes/expirations = 10/60/20/5/5).
      Query the views at a rate of 400 queries/second.
      Add node 10.158.45.173 to the cluster and rebalance. During rebalance, monitor data and view size.
      I saw the data size on node 10.158.47.189 grow very large.
      I changed the data compaction setting to 2% (to force data compaction), but the database size stayed high.
      When the rebalance was done, the database size on node 10.158.47.189 did not go down.
      I checked the diag log of node 10.158.47.189 and saw many crashes in the database compaction process.

      =========================CRASH REPORT=========================
      crasher:
      initial call: compaction_daemon:spawn_vbucket_compactor/2-fun-0/0
      pid: <0.9691.327>
      registered_name: []
      exception error: no match of right hand side value none
      in function compaction_daemon:free_space/1
      in call from compaction_daemon:ensure_can_db_compact/1
      in call from compaction_daemon:'spawn_vbucket_compactor/2-fun-0'/4
      ancestors: [<0.9688.327>,<0.9685.327>,<0.9684.327>,compaction_daemon,
      <0.6769.208>,ns_server_sup,ns_server_cluster_sup,<0.67.0>]
      messages: []
      links: [<0.9688.327>]
      dictionary: []
      trap_exit: false
      status: running
      heap_size: 2584
      stack_size: 24
      reductions: 375
      neighbours:

      Link to manifest file of build 2.0.0-1967: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1967-rel.setup.exe.manifest.xml
      I will upload collect info soon
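
      The 2% compaction setting in the steps above can be read as a fragmentation threshold. The ticket does not spell out how the threshold is evaluated, so the following is only a minimal Python sketch, assuming fragmentation is the share of the database file not occupied by live data. The function names and the sizes are hypothetical, chosen for illustration.

```python
def fragmentation_percent(file_size: int, data_size: int) -> float:
    """Percentage of the file that is not live data (assumed definition)."""
    if file_size <= 0:
        return 0.0
    return max(0.0, (file_size - data_size) / file_size * 100.0)


def should_compact(file_size: int, data_size: int, threshold_percent: float) -> bool:
    """True when fragmentation has reached the configured threshold."""
    return fragmentation_percent(file_size, data_size) >= threshold_percent


if __name__ == "__main__":
    # Hypothetical sizes in bytes: a 60 GiB file holding 20 GiB of live data.
    file_size = 60 * 1024**3
    data_size = 20 * 1024**3
    print(round(fragmentation_percent(file_size, data_size), 1))
    print(should_compact(file_size, data_size, threshold_percent=2))
```

      With a threshold this low, almost any mutation workload would qualify the file for compaction, which fits the report: the file kept growing because the compactor crashed, not because the threshold was never reached.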


        Activity

        thuan Thuan Nguyen added a comment -

        Link to collect info of all nodes: https://s3.amazonaws.com/bugdb/jira/MB-7299/8nodes-ci-1967-data-compaction-crashed-20121130-151132.tgz
        Link to diags of node 10.158.47.189: https://s3.amazonaws.com/bugdb/jira/MB-7299/ns-diag-node-189-20121130211824.txt.gz

        farshid Farshid Ghods (Inactive) added a comment -

        Needs to be triaged before moving to the next version.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Duplicate of timeout issues. I.e. disksup is dead and thus there's no information at all about filesystems and free space.

        [error_logger:error,2012-11-29T21:24:06.016,ns_1@10.158.47.189:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]OS_MON (disksup) called by <0.8606.205>, not started

        [ns_server:error,2012-11-29T21:24:06.094,ns_1@10.158.47.189:<0.1837.0>:ns_memcached:verify_report_long_call:297]call {stats,<<>>} took too long: 17534000 us
        [stats:warn,2012-11-29T21:24:06.094,ns_1@10.158.47.189:system_stats_collector<0.1290.0>:system_stats_collector:handle_info:133]lost 4 ticks
        [error_logger:error,2012-11-29T21:24:06.110,ns_1@10.158.47.189:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
        =========================CRASH REPORT=========================
        crasher:
        initial call: compaction_daemon:spawn_view_index_compactor/6-fun-0/0
        pid: <0.8606.205>
        registered_name: []
        exception error: no match of right hand side value none
        in function compaction_daemon:free_space/1
        in call from compaction_daemon:ensure_can_view_compact/3

        We've seen that under heavy timeouts disksup and memsup die quite easily.
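
        The failure mode described in this comment can be illustrated outside Erlang. The ns_server source is not shown in this ticket, so the following is only a hypothetical Python sketch: it assumes the free-space check looks up the disk whose mount point is the longest prefix of the database path, and that the lookup yields nothing when no disk data is available. All names, paths, and numbers are made up. The strict variant fails the same way the crash report does (it assumes a match always exists), while the safe variant reports "unknown" and leaves the decision to the caller. The safe variant is one possible defensive approach, not the fix applied for this issue.

```python
from typing import List, Optional, Tuple

# Hypothetical per-disk record: (mount_point, total_kb, used_percent).
DiskInfo = Tuple[str, int, int]


def find_disk(path: str, disks: List[DiskInfo]) -> Optional[DiskInfo]:
    """Return the entry whose mount point is the longest prefix of path, or None."""
    best: Optional[DiskInfo] = None
    for disk in disks:
        mount = disk[0]
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            if best is None or len(mount) > len(best[0]):
                best = disk
    return best


def free_space_strict(path: str, disks: List[DiskInfo]) -> int:
    """Assumes a match always exists; raises TypeError when find_disk returns None."""
    _mount, total_kb, used_percent = find_disk(path, disks)
    return total_kb * (100 - used_percent) // 100


def free_space_safe(path: str, disks: List[DiskInfo]) -> Optional[int]:
    """Return free space in KB, or None when no disk information is available."""
    disk = find_disk(path, disks)
    if disk is None:
        return None
    _mount, total_kb, used_percent = disk
    return total_kb * (100 - used_percent) // 100


if __name__ == "__main__":
    disks = [("/", 1000, 50), ("/data", 2000, 25)]
    print(free_space_safe("/data/bucket", disks))  # free KB on the /data disk
    print(free_space_safe("/data/bucket", []))     # no disk data: unknown
```

        Whether a compactor should skip the run or proceed when free space is unknown is a policy choice that this sketch leaves open.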

        thuan Thuan Nguyen added a comment -

        So if there is no information about the file system and free space (as in bug MB-7239), the database file size on the node with the negative number will keep increasing until it fills the available disk space. This may cause the node to crash.

        maria Maria McDuff (Inactive) added a comment -

        Dupe.


          People

          • Assignee:
            alkondratenko Aleksey Kondratenko (Inactive)
            Reporter:
            thuan Thuan Nguyen
