  Couchbase Server
  MB-7587

System Testing: Uneven compaction + very high swap over a cluster; some nodes have much higher (70 percent) fragmentation than others.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.1
    • Component/s: test-execution
    • Security Level: Public
    • Labels:
      None
    • Environment:

      Description

      Set up a cluster as above.
      Create 2 views per bucket.
      Run time: 10+ hours.

      Seeing uneven compaction / high doc fragmentation across the nodes (screenshot attached below).

      Seeing very high swap on these nodes as well (screenshot below).

      • Indexing / compaction is running continuously, but never completes.

      Adding logs.

      • Does this high doc fragmentation (70-80 percent) indicate that compaction did not work as expected?
        Is this something we've seen before / know of?

      Please let me know if you need any other system information on this.
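
      For reference, one way to spot-check per-bucket doc fragmentation on a suspect node is the bucket stats REST endpoint. This is a sketch only; the credentials, host, and the couch_docs_fragmentation stat name are assumptions based on 2.0.x defaults:

      # Pull recent bucket stats from one node and extract the fragmentation samples.
      curl -s -u Administrator:password \
        'http://10.6.2.42:8091/pools/default/buckets/default/stats' \
        | grep -o '"couch_docs_fragmentation":\[[^]]*\]'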

      1. Screen Shot 2013-01-23 at 8.51.32 AM.png
        87 kB
        Ketaki Gangal
      2. Screen Shot 2013-01-23 at 8.53.35 AM.png
        124 kB
        Ketaki Gangal
      3. Screen Shot 2013-01-23 at 9.07.08 AM.png
        34 kB
        Ketaki Gangal

        Activity

        ketaki Ketaki Gangal created issue -
        ketaki Ketaki Gangal added a comment -

        Atop output from node 10.6.2.37 shows memcached using 2 GB of swap.
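
        A quick way to double-check per-process swap on a node (generic Linux, assuming a kernel that exposes VmSwap in /proc and that pidof finds the memcached process):

        # How much of memcached is currently swapped out, in kB.
        grep VmSwap /proc/"$(pidof memcached)"/status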

        ketaki Ketaki Gangal made changes -
        Field Original Value New Value
        Attachment Screen Shot 2013-01-23 at 9.07.08 AM.png [ 16272 ]
        FilipeManana Filipe Manana (Inactive) added a comment -

        I can look at the logs once they're available, but keep in mind:

        1) memcached is totally unrelated to the view engine;

        2) doc fragmentation (vbuckets) is unrelated to the view engine as well.

        So I think it's premature to classify the view engine as the source of the problem.

        ketaki Ketaki Gangal made changes -
        Assignee Filipe Manana [ filipemanana ] Ketaki Gangal [ ketaki ]
        FilipeManana Filipe Manana (Inactive) added a comment -

        Further, continuous indexing and compaction is not necessarily bad. If data is being loaded continuously and queries are running, continuous indexing is expected. The progress bar you see in the UI might not relate to a single indexer run; for very small indexing iterations, the UI can give an impression of staleness.

        Anyway, waiting for the logs to tell what's going on.

        farshid Farshid Ghods (Inactive) added a comment -

        Yes. Assigning this back to QE as there is not enough information available.

        Also, doc fragmentation is related to the compaction engine.

        farshid Farshid Ghods (Inactive) made changes -
        Component/s test-execution [ 10231 ]
        Component/s view-engine [ 10060 ]
        ketaki Ketaki Gangal added a comment -

        Thanks for the input, Filipe.

        Yes, I agree this may not be a view issue, but I've seen some unusual system behavior after I added views, hence the tagging.

        I've currently assigned this bug to myself and will post the logs once I am on a slightly faster network (@MV).

        memcached in higher swap is something we are aware of [seen after the change to 'mlock_all of beam pages'].
        Compaction not catching up on some nodes needs to be looked at, and this would be at the couchstore layer.

        FilipeManana Filipe Manana (Inactive) added a comment -

        Well, it depends. I would need the logs to see whether there's any problem with view indexing and/or compaction on that specific node (from what I understand this is a problem on a single node, right?).

        To my knowledge there hasn't been any such problem for a good period of time.

        Thanks, Ketaki

        ketaki Ketaki Gangal added a comment -

        cbcollect_info from node 10.6.2.42 (high doc fragmentation on this node).

        ketaki Ketaki Gangal made changes -
        Attachment out_10.6.2.42.zip [ 16274 ]
        ketaki Ketaki Gangal added a comment -

        Logs from all the nodes: https://s3.amazonaws.com/bugdb/7587/7587.tar
        FilipeManana Filipe Manana (Inactive) added a comment - - edited

        Looking at the log from node .42, the one with the largest total file size:

        [ns_server:info,2013-01-23T0:10:57.058,ns_1@10.6.2.42:<0.16991.0>:compaction_daemon:spawn_dbs_compactor:690] Compacting databases for bucket default
        [ns_server:info,2013-01-23T0:10:57.059,ns_1@10.6.2.42:<0.16992.0>:compaction_daemon:spawn_vbucket_compactor:793] Compacting <<"default/master">>
        [couchdb:error,2013-01-23T0:10:57.085,ns_1@10.6.2.42:<0.16993.0>:couch_log:error:42] Native compact for "default/master" failed due to error {exit_status,139}. Falling back to erlang.
        [error_logger:error,2013-01-23T0:10:57.087,ns_1@10.6.2.42:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76] Error in process <0.16993.0> on node 'ns_1@10.6.2.42' with exit value: {{badmatch,{error,no_valid_header}},[{couch_db_updater,start_copy_compact,2}]}
        [ns_server:warn,2013-01-23T0:10:57.087,ns_1@10.6.2.42:<0.16991.0>:compaction_daemon:do_chain_compactors:665] Compactor for vbucket `default/master` (pid [{type,vbucket},
            {name,<<"default/master">>},
            {important,false},
            {fa,{#Fun<compaction_daemon.17.36422344>,
                 [<<"default">>,{master,<<"default/master">>},[]]}},
            {on_success,{#Fun<compaction_daemon.18.44611183>,[1,293]}}])
          terminated unexpectedly (ignoring this):
          {{badmatch,{error,no_valid_header}},
           [{couch_db_updater,start_copy_compact,2}]}
        [ns_server:info,2013-01-23T0:10:57.088,ns_1@10.6.2.42:<0.16991.0>:compaction_daemon:spawn_dbs_compactor:730] Finished compaction of databases for bucket default
        [error_logger:error,2013-01-23T0:10:57.088,ns_1@10.6.2.42:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
        =========================CRASH REPORT=========================
          crasher:
            initial call: compaction_daemon:-spawn_vbucket_compactor/3-fun-0-/0
            pid: <0.16992.0>
            registered_name: []
            exception exit: {{badmatch,{error,no_valid_header}},
                             [{couch_db_updater,start_copy_compact,2}]}
              in function compaction_daemon:'-spawn_vbucket_compactor/3-fun-0-'/5
            ancestors: [<0.16991.0>,<0.16989.0>,<0.16988.0>,compaction_daemon,
                        <0.2488.0>,ns_server_sup,ns_server_cluster_sup,<0.58.0>]
            messages: []
            links: [<0.16991.0>]
            dictionary: []
            trap_exit: true
            status: running
            heap_size: 2584
            stack_size: 24
            reductions: 1084
          neighbours:

        [ns_server:debug,2013-01-23T0:10:57.088,ns_1@10.6.2.42:<0.16999.0>:compaction_daemon:file_needs_compaction:985] Estimated size for `default/_design/dev_ddoc1/main`: data 0, file 0
        [error_logger:error,2013-01-23T0:10:57.088,ns_1@10.6.2.42:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76] ** Generic server <0.16994.0> terminating
        ** Last message in was {'EXIT',<0.16993.0>,
                                {{badmatch,{error,no_valid_header}},
                                 [{couch_db_updater,start_copy_compact,2}]}}
        ** When Server state == {file,<0.16995.0>,<0.16996.0>,0}
        ** Reason for termination ==
        ** {{badmatch,{error,no_valid_header}},
           [{couch_db_updater,start_copy_compact,2}]}

        There's indeed a problem with database compaction.

        FilipeManana Filipe Manana (Inactive) added a comment -

        Couchstore's vbucket compactor exited with status 139 (seg fault).

        Check with Aaron.
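
        For reference, exit status 139 follows the usual shell convention of 128 + signal number, i.e. signal 11:

        # 139 - 128 = 11; bash maps it back to the signal name.
        kill -l $((139 - 128))    # prints SEGV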

        FilipeManana Filipe Manana (Inactive) added a comment -

        And that causes the compaction daemon (ns_server) to fail on each compaction iteration (not proceeding to other vbuckets and indexes). Every iteration fails with the same error when attempting to compact the default/master vbucket.
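
        A simple way to confirm that the failure recurs on every iteration is to count the error across the unpacked cbcollect logs, in the same egrep-over-all-files style as the pipelines later in this ticket (run from the directory containing the extracted logs):

        # Count how often the master vbucket compaction failure shows up in the logs.
        egrep -c 'Native compact for "default/master" failed' *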

        FilipeManana Filipe Manana (Inactive) added a comment -

        Looking at the logs from all nodes, neither indexing nor view compaction hit any errors or got stuck.

        In the past (before beta), the view compaction retry phase had many iterations when there was load in parallel (sometimes hundreds of iterations in extreme cases). That is no longer the case, as confirmed in the logs:

        fdmanana 21:24:59 ~/jira/MB-7587/7587 > egrep 'retry number' * | cut -d ' ' -f 16 | tr -d ')' | sort -n | uniq -c
        473 1
        180 2
        43 3
        17 4
        12 5
        7 6
        6 7
        6 8
        4 9
        3 10
        1 11
        1 12
        1 13
        1 14
        1 15
        fdmanana 21:29:36 ~/jira/MB-7587/7587 >

        Index compaction times histogram (for all nodes, all ddocs), values in seconds:

        fdmanana 21:30:01 ~/jira/MB-7587/7587 > egrep 'compaction complete in' * | cut -d ' ' -f 10 | perl -MStatistics::Histogram -e '@data = <>; chomp @data; print get_histogram(@data);'
        Count: 684
        Range: 0.746 - 6905.309; Mean: 183.588; Median: 50.588; Stddev: 523.667
        Percentiles: 90th: 332.315; 95th: 533.129; 99th: 2635.922
        0.746 - 2.997: 39 ###############
        2.997 - 8.151: 72 ###########################
        8.151 - 19.951: 100 #####################################
        19.951 - 46.965: 124 ##############################################
        46.965 - 108.811: 108 ########################################
        108.811 - 250.400: 142 #####################################################
        250.400 - 574.553: 65 ########################
        574.553 - 1316.667: 15 ######
        1316.667 - 3015.656: 14 #####
        3015.656 - 6905.309: 5 ##
        fdmanana 21:30:05 ~/jira/MB-7587/7587 >

        Indexing times histogram (again, all nodes and all ddocs), values in seconds:

        fdmanana 21:29:50 ~/jira/MB-7587/7587 > egrep 'Indexing time:' * | cut -d ' ' -f 3 | perl -MStatistics::Histogram -e '@data = <>; chomp @data; print get_histogram(@data);'
        Count: 2242
        Range: 0.566 - 9082.209; Mean: 70.312; Median: 21.408; Stddev: 255.755
        Percentiles: 90th: 160.939; 95th: 282.844; 99th: 695.317
        0.566 - 2.725: 13 #
        2.725 - 7.861: 499 ##############################################
        7.861 - 20.078: 552 ##################################################
        20.078 - 49.138: 581 #####################################################
        49.138 - 118.266: 304 ############################
        118.266 - 282.701: 180 ################
        282.701 - 673.847: 88 ########
        673.847 - 1604.276: 19 ##
        1604.276 - 3817.515: 5 |
        3817.515 - 9082.209: 1 |
        fdmanana 21:30:42 ~/jira/MB-7587/7587 >

        The fact that you see indexing happening all the time is indeed due to queries happening all the time, each triggering an index update that processes a small number of changes; hence it is hard to tell from the UI whether the progress bar relates to different indexer runs or the same one. There were 2242 index updates.
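
        As a side note, whether a query triggers an index update depends on its stale parameter; a rough illustration against the view REST API on port 8092 (the design document and view names here are hypothetical):

        # stale=ok: serve from the current index, do not trigger an update.
        curl -s 'http://10.6.2.42:8092/default/_design/ddoc1/_view/view1?stale=ok&limit=1'
        # Default (stale=update_after): return current results, then trigger an index update.
        curl -s 'http://10.6.2.42:8092/default/_design/ddoc1/_view/view1?limit=1'
        # stale=false: update the index before returning results.
        curl -s 'http://10.6.2.42:8092/default/_design/ddoc1/_view/view1?stale=false&limit=1'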

        thuan Thuan Nguyen added a comment -

        Seeing a compaction crash in a Windows cluster with build 2.0.1-140.

        Environment:

        • 9 Windows 2008 R2 64-bit servers.
        • Each server has 4 CPUs, 8 GB RAM, and an SSD disk.
        • The cluster has 2 buckets, default and a SASL bucket, with consistent views enabled.
        • Loaded 26 million items into the default bucket and 16 million items into the SASL bucket. Each key has a size from 128 to 512 bytes.
        • Each bucket has one design doc and 2 views per doc.
        • Rebalance out 2 nodes: 10.3.121.173 and 10.3.121.243.

        Starting rebalance, KeepNodes = ['ns_1@10.3.3.181','ns_1@10.3.121.47',
        'ns_1@10.3.3.214','ns_1@10.3.3.182',
        'ns_1@10.3.3.180','ns_1@10.3.121.171',
        'ns_1@10.3.121.169'], EjectNodes = ['ns_1@10.3.121.173',
        'ns_1@10.3.121.243'] ns_orchestrator004 ns_1@10.3.121.169 23:26:03 - Tue Jan 22, 2013

        =========================CRASH REPORT=========================
          crasher:
            initial call: compaction_daemon:-spawn_view_index_compactor/6-fun-0-/0
            pid: <0.29697.56>
            registered_name: []
            exception exit: {{badmatch,
                              {error,
                               {file_error,
                                "c:/view/@indexes/default/tmp_40ae216d3de3826de065e665ac0f52dd_replica/d8004c67163a37e21f8c542b9ffa9c05.compact",
                                enoent}}},
                             [{couch_set_view_compactor,merge_files,3},
                              {couch_set_view_compactor,apply_log,4},
                              {couch_set_view_compactor,maybe_retry_compact,5}]}
              in function compaction_daemon:do_spawn_view_index_compactor/5
              in call from compaction_daemon:'-spawn_view_index_compactor/6-fun-0-'/7
            ancestors: [<0.12760.55>,<0.16968.54>,<0.16931.54>,compaction_daemon,
                        <0.152.1>,ns_server_sup,ns_server_cluster_sup,<0.67.0>]
            messages: []
            links: [<0.12760.55>]
            dictionary: []
            trap_exit: true
            status: running
            heap_size: 4181
            stack_size: 24
            reductions: 2873
          neighbours:

        [error_logger:error,2013-01-23T5:02:44.219,ns_1@10.3.121.169:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76] Error in process <0.24289.57> on node 'ns_1@10.3.121.169' with exit value: {{badmatch,{error,{file_error,"c:/view/@indexes/default/tmp_40ae216d3de3826de065e665ac0f52dd_replica/d8004c67163a37e21f8c542b9ffa9c05.compact",enoent}}},[{couch_set_view_compactor,merge_files,3},{couch_set_view_compactor...

        [error_logger:error,2013-01-23T5:02:44.219,ns_1@10.3.121.169:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
        =========================CRASH REPORT=========================
          crasher:
            initial call: compaction_daemon:-spawn_view_compactor/5-fun-1-/0
            pid: <0.12760.55>
            registered_name: []
            exception exit: {{badmatch,
                              {error,
                               {file_error,
                                "c:/view/@indexes/default/tmp_40ae216d3de3826de065e665ac0f52dd_replica/d8004c67163a37e21f8c542b9ffa9c05.compact",
                                enoent}}},
                             [{couch_set_view_compactor,merge_files,3},
                              {couch_set_view_compactor,apply_log,4},
                              {couch_set_view_compactor,maybe_retry_compact,5}]}
              in function compaction_daemon:do_chain_compactors/2
            ancestors: [<0.16968.54>,<0.16931.54>,compaction_daemon,<0.152.1>,
                        ns_server_sup,ns_server_cluster_sup,<0.67.0>]
            messages: []
            links: [<0.16968.54>]
            dictionary: []
            trap_exit: true
            status: running
            heap_size: 1597
            stack_size: 24
            reductions: 4965
          neighbours:

        Link to collect_info from all nodes: https://s3.amazonaws.com/packages.couchbase/collect_info/2_0_1/201301/9nodes-col-201-140-rebalance-failed-buckets-shutdown-20130123-14-11-14.tgz

        FilipeManana Filipe Manana (Inactive) added a comment -

        Tony, that's a totally different issue from Ketaki's and on a different environment (Windows).
        Let's not mix everything here. Please create a separate issue for that and assign it to me for now.

        thuan Thuan Nguyen added a comment -

        Sure, I will create the new bug.

        aaron Aaron Miller (Inactive) added a comment -

        If we have compaction crashes, I'll at least need stack traces or cores to have any hope of diagnosing them.

        aaron Aaron Miller (Inactive) added a comment -

        Found the cores.

        It looks like tmpfile() is returning NULL on this machine, with errno 13 (permission denied). This could be caused by Couchbase not having permission to create files in /tmp.
        I will do a fix to handle this (the compactor will still not work; it'll just exit with an error that's not a segfault).

        To make sure this doesn't happen, the directory tmpfile() creates files in (typically /tmp) needs to be writable by the user Couchbase runs as.
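
        A quick way to verify that precondition on a node (assuming the service runs as the couchbase user and sudo is available):

        # Check that the service account can create files in /tmp, where tmpfile() allocates them.
        sudo -u couchbase sh -c 'f=$(mktemp /tmp/couchstore_check.XXXXXX) && rm -f "$f" && echo "/tmp is writable"'
        # Permissions and mount options are also worth a look; /tmp is normally drwxrwxrwt.
        ls -ld /tmp
        mount | grep ' /tmp '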

        dipti Dipti Borkar added a comment -

        Per bug scrub: after the fix is merged, QE will re-run the test.

        farshid Farshid Ghods (Inactive) added a comment -

        Ketaki to rerun the test with build 144+

        farshid Farshid Ghods (Inactive) made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        FilipeManana Filipe Manana (Inactive) added a comment -

        I don't think any fix was done here.
        Couchstore's compactor was only changed to exit with a status other than 139 (segfault). From the above, after database compaction failed with couchstore, the Erlang-based compactor, which it falls back to, also failed:

        [couchdb:error,2013-01-23T0:10:57.085,ns_1@10.6.2.42:<0.16993.0>:couch_log:error:42] Native compact for "default/master" failed due to error {exit_status,139}. Falling back to erlang.
        [error_logger:error,2013-01-23T0:10:57.087,ns_1@10.6.2.42:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76] Error in process <0.16993.0> on node 'ns_1@10.6.2.42' with exit value: {{badmatch,{error,no_valid_header}},[{couch_db_updater,start_copy_compact,2}]}

        It needs to be analyzed why master vbucket compaction fails (apparently the file is empty, or non-empty but with no header written, even though it has at least 1 doc).
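
        One way to check the suspect file on the affected node is couchstore's couch_dbinfo tool run against the bucket's master database file; the data path and file name below are assumptions based on a default 2.0.x install:

        # Inspect the header of the master vbucket file for the default bucket.
        /opt/couchbase/bin/couch_dbinfo /opt/couchbase/var/lib/couchbase/data/default/master.couch.1
        # A zero-byte or truncated file would also explain the no_valid_header error.
        ls -l /opt/couchbase/var/lib/couchbase/data/default/master.couch.*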


          People

          • Assignee:
            ketaki Ketaki Gangal
            Reporter:
            ketaki Ketaki Gangal
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes