  1. Couchbase Server
  2. MB-4732

Compaction seems to be stuck (or not running)

    Details

      Description

      Quoting Sharon:

      "While troubleshooting, I found many nodes where the disk size was 4 times greater than on other nodes.

      Looking at one of these nodes where data is not compacted,
      Compaction seems to be stuck.

      http://50.18.98.4:8092/default%2F101

      {"db_name":"default/101","doc_count":1807,"doc_del_count":0,"update_seq":2986,"purge_seq":0,"compact_running":false,"disk_size":4452469,"data_size":922673,"instance_start_time":"1328040896372522","disk_format_version":7,"committed_update_seq":2985}

      Cluster is at http://50.18.98.4:8091 (Administrator/password)"
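
      The figures in the response above already quantify the problem. As a rough illustration in Python (the `fragmentation` helper is a hypothetical name, not part of Couchbase), the share of the file that holds no live data can be derived from `disk_size` and `data_size`:

```python
import json

# Database info as returned by the node above, trimmed to the fields used here.
info = json.loads(
    '{"db_name":"default/101","compact_running":false,'
    '"disk_size":4452469,"data_size":922673}'
)

def fragmentation(disk_size: int, data_size: int) -> float:
    """Fraction of the on-disk file that is not occupied by live data."""
    if disk_size == 0:
        return 0.0
    return (disk_size - data_size) / disk_size

frag = fragmentation(info["disk_size"], info["data_size"])
print(f"{info['db_name']}: {frag:.1%} fragmented, "
      f"compact_running={info['compact_running']}")
```

      For this vbucket the result is roughly 79%, while `compact_running` is false, which is consistent with compaction not making progress on this node.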

      Quoting Aliaksey:
      "The cause of the compaction daemon hang is the same as that of the view hangs, so
      this is essentially the same bug."


        Activity

        Aliaksey Artamonau added a comment -

        Compaction daemon processes' backtraces:

        {<0.5083.0>,
        [

        {registered_name,[]},
        {status,waiting},
        {initial_call,{proc_lib,init_p,5}},
        {backtrace,
        [<<"Program counter: 0x00002aaaabef2a70 (gen_server:loop/6 + 256)">>,
        <<"CP: 0x0000000000000000 (invalid)">>,
        <<"arity = 0">>,<<>>,
        <<"0x00002aaaad443728 Return addr 0x00002aaaabe931c8 (proc_lib:init_p_do_apply/3 + 56)">>,
        <<"y(0) []">>,<<"y(1) infinity">>,
        <<"y(2) supervisor_cushion">>,
        <<"y(3) {state,couchbase_compaction_daemon,3000,{1328,40081,401252},<0.5084.0>}">>,
        <<"y(4) <0.5083.0>">>,<<"y(5) <0.4997.0>">>,
        <<>>,
        <<"0x00002aaaad443760 Return addr 0x000000000088e318 (<terminate process normally>)">>,
        <<"y(0) Catch 0x00002aaaabe931e8 (proc_lib:init_p_do_apply/3 + 88)">>,
        <<>>]},
        {error_handler,error_handler},
        {garbage_collection,
        [{min_bin_vheap_size,46368},
        {min_heap_size,233},
        {fullsweep_after,0},
        {minor_gcs,0}]},
        {heap_size,233},
        {total_heap_size,233},
        {links,[<0.4997.0>,<0.5084.0>]},
        {memory,2840},
        {message_queue_len,0},
        {reductions,75},
        {trap_exit,true}]},
        {<0.5084.0>,
        [{registered_name,couchbase_compaction_daemon},
        {status,waiting},
        {initial_call,{proc_lib,init_p,5}},
        {backtrace,
        [<<"Program counter: 0x00002aaaabef2a70 (gen_server:loop/6 + 256)">>,
        <<"CP: 0x0000000000000000 (invalid)">>,
        <<"arity = 0">>,<<>>,
        <<"0x00002aaabe24c5f8 Return addr 0x00002aaaabe931c8 (proc_lib:init_p_do_apply/3 + 56)">>,
        <<"y(0) []">>,<<"y(1) infinity">>,
        <<"y(2) couchbase_compaction_daemon">>,
        <<"y(3) {state,<0.5085.0>}">>,
        <<"y(4) couchbase_compaction_daemon">>,
        <<"y(5) <0.5083.0>">>,<<>>,
        <<"0x00002aaabe24c630 Return addr 0x000000000088e318 (<terminate process normally>)">>,
        <<"y(0) Catch 0x00002aaaabe931e8 (proc_lib:init_p_do_apply/3 + 88)">>,
        <<>>]},
        {error_handler,error_handler},
        {garbage_collection,
        [{min_bin_vheap_size,46368},
        {min_heap_size,233},
        {fullsweep_after,0},
        {minor_gcs,0}]},
        {heap_size,987},
        {total_heap_size,987},
        {links,[<0.5083.0>,<0.5085.0>]},
        {memory,8944},
        {message_queue_len,0},
        {reductions,2388},
        {trap_exit,true}]},
        {<0.5085.0>,
        [{registered_name,[]},
        {status,waiting},
        {initial_call,{erlang,apply,2}},
        {backtrace,
        [<<"Program counter: 0x00002aaaabe73ef0 (gen:do_call/4 + 576)">>,
        <<"CP: 0x0000000000000000 (invalid)">>,
        <<"arity = 0">>,<<>>,
        <<"0x00002aaabf118b68 Return addr 0x00002aaaabef1498 (gen_server:call/3 + 128)">>,
        <<"y(0) #Ref<0.0.51.118144>">>,
        <<"y(1) 'ns_1@10.176.215.197'">>,
        <<"y(2) []">>,<<"y(3) infinity">>,
        <<"(4) {get_group_server,<<7 bytes>>,{set_view_group,<<16 bytes>>,nil,<<7 bytes>>,<<15 by">>,
        <<"y(5) '$gen_call'">>,<<"y(6) <0.4821.0>">>,
        <<>>,
        <<"x00002aaabf118ba8 Return addr 0x00002aaaafa76380 (couch_set_view:get_group_server/2 + 128)">>,
        <<"y(0) infinity">>,
        <<"(1) {get_group_server,<<7 bytes>>,{set_view_group,<<16 bytes>>,nil,<<7 bytes>>,<<15 by">>,
        <<"y(2) couch_set_view">>,
        <<"y(3) Catch 0x00002aaaabef1498 (gen_server:call/3 + 128)">>,
        <<>>,
        <<"0x00002aaabf118bd0 Return addr 0x00002aaaafa76550 (couch_set_view:get_group_info/2 + 40)">>,
        <<>>,
        <<"x00002aaabf118bd8 Return addr 0x00002aaaafa7f9a0 (couch_set_view:'-cleanup_index_files/1-f">>,
        <<>>,
        <<"0x00002aaabf118be0 Return addr 0x00002aaaabeb06c0 (lists:map/2 + 120)">>,
        <<>>,
        <<"x00002aaabf118be8 Return addr 0x00002aaaafa76828 (couch_set_view:cleanup_index_files/1 + 5">>,
        <<"y(0) #Fun<couch_set_view.0.102244014>">>,
        <<"(1) [{doc,<<19 bytes>>,{4,<<4 bytes>>},{[{<<5 bytes>>,{[{<<11 bytes>>,{[{<<3 bytes>>,<">>,
        <<>>,
        <<"x00002aaabf118c00 Return addr 0x00002aaab0d65490 (couchbase_compaction_daemon:maybe_compac">>,
        <<"y(0) []">>,<<"y(1) []">>,
        <<"y(2) <<7 bytes>>">>,<<>>,
        <<"0x00002aaabf118c20 Return addr 0x00002aaaabeb1170 (lists:foreach/2 + 120)">>,
        <<"y(0) [<<15 bytes>>,<<19 bytes>>]">>,
        <<"(1) Catch 0x00002aaab0d654b0 (couchbase_compaction_daemon:maybe_compact_bucket/3 + 688">>,
        <<"y(2) {config,30,80,nil,false,false}">>,
        <<"(3) [<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<1">>,
        <<"y(4) <<7 bytes>>">>,<<>>,
        <<"x00002aaabf118c50 Return addr 0x00002aaab0d65028 (couchbase_compaction_daemon:compact_loop">>,
        <<"y(0) #Fun<couchbase_compaction_daemon.3.77482903>">>,
        <<"(1) [{<<14 bytes>>,[<<18 bytes>>,<<18 bytes>>,<<18 bytes>>,<<18 bytes>>,<<18 bytes>>,<">>,
        <<>>,
        <<"0x00002aaabf118c68 Return addr 0x000000000088e318 (<terminate process normally>)">>,
        <<"y(0) []">>,<<"y(1) []">>,
        <<"y(2) <0.5084.0>">>,<<>>]},
        {error_handler,error_handler},
        {garbage_collection,
        [{min_bin_vheap_size,46368},
        {min_heap_size,233},
        {fullsweep_after,0},
        {minor_gcs,0}]},
        {heap_size,46368},
        {total_heap_size,46368},
        {links,[<0.5084.0>]},
        {memory,371952},
        {message_queue_len,0},
        {reductions,390457},
        {trap_exit,false}]}
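
        The third process, <0.5085.0>, is the one doing the actual work, and its call chain can be read off the "Return addr" lines. A small Python sketch for doing that, assuming the backtrace lines are available as plain strings (the `frames` helper is hypothetical; the sample lines are copied from the dump above, truncation included):

```python
import re

# Matches the symbol in "Return addr 0x... (module:function/arity + offset)".
# Truncated lines may lack the closing parenthesis, so it is not required.
FRAME = re.compile(r"Return addr 0x[0-9a-f]+ \(([^)]*)")

def frames(lines):
    """Return the symbol of every stack frame, innermost first."""
    return [m.group(1) for line in lines if (m := FRAME.search(line))]

sample = [
    "0x00002aaabf118b68 Return addr 0x00002aaaabef1498 (gen_server:call/3 + 128)",
    "y(3) infinity",
    "x00002aaabf118ba8 Return addr 0x00002aaaafa76380 (couch_set_view:get_group_server/2 + 128)",
    "x00002aaabf118c00 Return addr 0x00002aaab0d65490 (couchbase_compaction_daemon:maybe_compac",
]

for symbol in frames(sample):
    print(symbol)
```

        Read from the innermost frame outward, the dump shows the compaction worker waiting inside a gen_server:call with an infinity timeout, issued from couch_set_view:get_group_server/2 during cleanup_index_files/1.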

        damien added a comment -

        It appears we have a btree-related bug. There is a badarith error in the logs that is causing the view compaction to crash. The badarith error is in couch_view_compactor:update_task/2, and I believe it is caused by division by zero. But in that case the indexes should be empty and update_task/2 should not be called at all.

        The only way that seems possible is if there are values in the primary btree indexes but the row counts are 0. I believe this must be caused by the cleanup of vbucket values from the indexes, which must not be computing the reductions properly when this happens.

        I believe the compactor crash then causes the couch_file for the compaction file to be leaked, which means it cannot be opened again (due to couch_file_write_guard). There is in fact a file_already_opened error in the logs, which indicates this is happening.
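
        The leak scenario can be illustrated with a toy model. This is a Python sketch, not the Erlang implementation: `WriteGuard`, `compact` and the file name are hypothetical stand-ins, assuming only that couch_file_write_guard allows a single writer per file path.

```python
class FileAlreadyOpened(Exception):
    """Raised when a second writer is requested for the same path."""

class WriteGuard:
    """Toy registry allowing at most one writer per file path."""

    def __init__(self):
        self._writers = set()

    def open(self, path):
        if path in self._writers:
            raise FileAlreadyOpened(path)
        self._writers.add(path)

    def close(self, path):
        self._writers.discard(path)

def compact(guard, path, rows):
    guard.open(path)
    # A crash on the next line skips close(), so the registration leaks.
    progress = 100 // rows
    guard.close(path)
    return progress

guard = WriteGuard()
try:
    compact(guard, "main.view.compact", rows=0)   # crashes: division by zero
except ZeroDivisionError:
    pass

try:
    compact(guard, "main.view.compact", rows=10)  # the retry is now rejected
except FileAlreadyOpened as err:
    print("file_already_opened:", err)
```

        In this toy model, releasing the registration in a `finally` block would let the retry succeed; whether the real guard can be handled the same way is not something this ticket establishes.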

        I'm adding code to check for division by zero and exit with a diagnostic message. Reassigning to Filipe to look into the btree issue.
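
        A guard of the kind described might look like the following. This is a Python sketch of the idea only: the actual change is in Erlang, in couch_view_compactor:update_task/2, and the function and parameter names here are hypothetical.

```python
def compaction_progress(changes_done: int, total_changes: int) -> int:
    """Return compaction progress as an integer percentage.

    Fails with a diagnostic message instead of an opaque arithmetic
    error when the total is zero but progress is still being reported.
    """
    if total_changes == 0:
        raise RuntimeError(
            "compaction progress requested with total_changes=0 "
            f"(changes_done={changes_done}); index row counts may be wrong"
        )
    return (changes_done * 100) // total_changes

print(compaction_progress(50, 200))
```

        The explicit check does not fix the underlying row-count inconsistency; it only makes the failure easier to recognize in the logs the next time it occurs.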

        filipe manana added a comment -

        It would be great if someone could repeat this test.

        Neither Damien nor I know how to reproduce this or why it could happen.
        The following commit will help diagnose it better the next time it happens.

        https://github.com/couchbase/couchdb/commit/dd6546cad52c72421442b54eb59fe5984d913269

        Steve Yen added a comment -

        Please try to reproduce (with Filipe's changes).

        filipe manana added a comment -

        This is the same issue as MB-4774. One of them should be closed and marked as a duplicate.
        Fix in http://review.couchbase.org/#change,13067

        filipe manana added a comment -

        Fix merged today: https://github.com/couchbase/couchdb/commit/6319846fa68c73580e5ead96dbe27868447f730f

          People

          • Assignee:
            tommie Tommie McAfee
            Reporter:
            tommie Tommie McAfee