Couchbase Server / MB-4732

Compaction seems to be stuck (or not running)

    Details

      Description

      Quoting Sharon:

      "Troubleshooting, I found many nodes where disk size was 4 times greater than on other nodes.

      Looking at one of these nodes where data is not compacted, compaction seems to be stuck.

      http://50.18.98.4:8092/default%2F101

      {"db_name":"default/101","doc_count":1807,"doc_del_count":0,"update_seq":2986,"purge_seq":0,"compact_running":false,"disk_size":4452469,"data_size":922673,"instance_start_time":"1328040896372522","disk_format_version":7,"committed_update_seq":2985}

      Cluster is at http://50.18.98.4:8091 (Administrator/password)"
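As a rough check on the figures quoted above, fragmentation for this vBucket database can be estimated from `disk_size` and `data_size`. This is only an illustrative sketch: the formula (bytes not holding live data, as a share of file size) and the 30% threshold are assumptions, not taken from the compaction daemon's code.

```python
import json

# Database info as returned by the URL quoted above.
db_info = json.loads(
    '{"db_name":"default/101","doc_count":1807,"doc_del_count":0,'
    '"update_seq":2986,"purge_seq":0,"compact_running":false,'
    '"disk_size":4452469,"data_size":922673,'
    '"instance_start_time":"1328040896372522",'
    '"disk_format_version":7,"committed_update_seq":2985}'
)


def fragmentation_pct(info):
    """Share of the file not occupied by live data, in percent (assumed formula)."""
    disk, data = info["disk_size"], info["data_size"]
    if disk == 0:
        return 0.0
    return 100.0 * (disk - data) / disk


# Hypothetical threshold, for illustration only.
ASSUMED_THRESHOLD_PCT = 30

frag = fragmentation_pct(db_info)
ratio = db_info["disk_size"] / db_info["data_size"]
needs_compaction = frag >= ASSUMED_THRESHOLD_PCT and not db_info["compact_running"]
print(f"fragmentation: {frag:.1f}%  disk/data ratio: {ratio:.2f}  "
      f"needs compaction: {needs_compaction}")
```

For the values above this comes to roughly 79% fragmentation, with the file about 4.8 times the size of the live data, while `compact_running` is false.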

      Quoting Aliaksey:
      The cause of the compaction daemon hang is the same as that of the view hangs, so
      this is essentially the same bug.


        Activity

        tommie Tommie McAfee created issue -
        Aliaksey Artamonau Aliaksey Artamonau added a comment -

        Compaction daemon processes' backtraces:

        {<0.5083.0>,
        [{registered_name,[]},
        {status,waiting},
        {initial_call,{proc_lib,init_p,5}},
        {backtrace,
        [<<"Program counter: 0x00002aaaabef2a70 (gen_server:loop/6 + 256)">>,
        <<"CP: 0x0000000000000000 (invalid)">>,
        <<"arity = 0">>,<<>>,
        <<"0x00002aaaad443728 Return addr 0x00002aaaabe931c8 (proc_lib:init_p_do_apply/3 + 56)">>,
        <<"y(0) []">>,<<"y(1) infinity">>,
        <<"y(2) supervisor_cushion">>,
        <<"y(3) {state,couchbase_compaction_daemon,3000,{1328,40081,401252},<0.5084.0>}">>,
        <<"y(4) <0.5083.0>">>,<<"y(5) <0.4997.0>">>,
        <<>>,
        <<"0x00002aaaad443760 Return addr 0x000000000088e318 (<terminate process normally>)">>,
        <<"y(0) Catch 0x00002aaaabe931e8 (proc_lib:init_p_do_apply/3 + 88)">>,
        <<>>]},
        {error_handler,error_handler},
        {garbage_collection,
        [{min_bin_vheap_size,46368},
        {min_heap_size,233},
        {fullsweep_after,0},
        {minor_gcs,0}]},
        {heap_size,233},
        {total_heap_size,233},
        {links,[<0.4997.0>,<0.5084.0>]},
        {memory,2840},
        {message_queue_len,0},
        {reductions,75},
        {trap_exit,true}]},
        {<0.5084.0>,
        [{registered_name,couchbase_compaction_daemon},
        {status,waiting},
        {initial_call,{proc_lib,init_p,5}},
        {backtrace,
        [<<"Program counter: 0x00002aaaabef2a70 (gen_server:loop/6 + 256)">>,
        <<"CP: 0x0000000000000000 (invalid)">>,
        <<"arity = 0">>,<<>>,
        <<"0x00002aaabe24c5f8 Return addr 0x00002aaaabe931c8 (proc_lib:init_p_do_apply/3 + 56)">>,
        <<"y(0) []">>,<<"y(1) infinity">>,
        <<"y(2) couchbase_compaction_daemon">>,
        <<"y(3) {state,<0.5085.0>}">>,
        <<"y(4) couchbase_compaction_daemon">>,
        <<"y(5) <0.5083.0>">>,<<>>,
        <<"0x00002aaabe24c630 Return addr 0x000000000088e318 (<terminate process normally>)">>,
        <<"y(0) Catch 0x00002aaaabe931e8 (proc_lib:init_p_do_apply/3 + 88)">>,
        <<>>]},
        {error_handler,error_handler},
        {garbage_collection,
        [{min_bin_vheap_size,46368},
        {min_heap_size,233},
        {fullsweep_after,0},
        {minor_gcs,0}]},
        {heap_size,987},
        {total_heap_size,987},
        {links,[<0.5083.0>,<0.5085.0>]},
        {memory,8944},
        {message_queue_len,0},
        {reductions,2388},
        {trap_exit,true}]},
        {<0.5085.0>,
        [{registered_name,[]},
        {status,waiting},
        {initial_call,{erlang,apply,2}},
        {backtrace,
        [<<"Program counter: 0x00002aaaabe73ef0 (gen:do_call/4 + 576)">>,
        <<"CP: 0x0000000000000000 (invalid)">>,
        <<"arity = 0">>,<<>>,
        <<"0x00002aaabf118b68 Return addr 0x00002aaaabef1498 (gen_server:call/3 + 128)">>,
        <<"y(0) #Ref<0.0.51.118144>">>,
        <<"y(1) 'ns_1@10.176.215.197'">>,
        <<"y(2) []">>,<<"y(3) infinity">>,
        <<"(4) {get_group_server,<<7 bytes>>,{set_view_group,<<16 bytes>>,nil,<<7 bytes>>,<<15 by">>,
        <<"y(5) '$gen_call'">>,<<"y(6) <0.4821.0>">>,
        <<>>,
        <<"x00002aaabf118ba8 Return addr 0x00002aaaafa76380 (couch_set_view:get_group_server/2 + 128)">>,
        <<"y(0) infinity">>,
        <<"(1) {get_group_server,<<7 bytes>>,{set_view_group,<<16 bytes>>,nil,<<7 bytes>>,<<15 by">>,
        <<"y(2) couch_set_view">>,
        <<"y(3) Catch 0x00002aaaabef1498 (gen_server:call/3 + 128)">>,
        <<>>,
        <<"0x00002aaabf118bd0 Return addr 0x00002aaaafa76550 (couch_set_view:get_group_info/2 + 40)">>,
        <<>>,
        <<"x00002aaabf118bd8 Return addr 0x00002aaaafa7f9a0 (couch_set_view:'-cleanup_index_files/1-f">>,
        <<>>,
        <<"0x00002aaabf118be0 Return addr 0x00002aaaabeb06c0 (lists:map/2 + 120)">>,
        <<>>,
        <<"x00002aaabf118be8 Return addr 0x00002aaaafa76828 (couch_set_view:cleanup_index_files/1 + 5">>,
        <<"y(0) #Fun<couch_set_view.0.102244014>">>,
        <<"(1) [{doc,<<19 bytes>>,{4,<<4 bytes>>},{[{<<5 bytes>>,{[{<<11 bytes>>,{[{<<3 bytes>>,<">>,
        <<>>,
        <<"x00002aaabf118c00 Return addr 0x00002aaab0d65490 (couchbase_compaction_daemon:maybe_compac">>,
        <<"y(0) []">>,<<"y(1) []">>,
        <<"y(2) <<7 bytes>>">>,<<>>,
        <<"0x00002aaabf118c20 Return addr 0x00002aaaabeb1170 (lists:foreach/2 + 120)">>,
        <<"y(0) [<<15 bytes>>,<<19 bytes>>]">>,
        <<"(1) Catch 0x00002aaab0d654b0 (couchbase_compaction_daemon:maybe_compact_bucket/3 + 688">>,
        <<"y(2) {config,30,80,nil,false,false}">>,
        <<"(3) [<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<11 bytes>>,<<1">>,
        <<"y(4) <<7 bytes>>">>,<<>>,
        <<"x00002aaabf118c50 Return addr 0x00002aaab0d65028 (couchbase_compaction_daemon:compact_loop">>,
        <<"y(0) #Fun<couchbase_compaction_daemon.3.77482903>">>,
        <<"(1) [{<<14 bytes>>,[<<18 bytes>>,<<18 bytes>>,<<18 bytes>>,<<18 bytes>>,<<18 bytes>>,<">>,
        <<>>,
        <<"0x00002aaabf118c68 Return addr 0x000000000088e318 (<terminate process normally>)">>,
        <<"y(0) []">>,<<"y(1) []">>,
        <<"y(2) <0.5084.0>">>,<<>>]},
        {error_handler,error_handler},
        {garbage_collection,
        [{min_bin_vheap_size,46368},
        {min_heap_size,233},
        {fullsweep_after,0},
        {minor_gcs,0}]},
        {heap_size,46368},
        {total_heap_size,46368},
        {links,[<0.5084.0>]},
        {memory,371952},
        {message_queue_len,0},
        {reductions,390457},
        {trap_exit,false}]}
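One way to read the dump above: the first two processes are sitting in `gen_server:loop/6`, which is where an idle gen_server waits for its next message, while `<0.5085.0>` is in `gen:do_call/4` beneath `gen_server:call/3` with `infinity` on its stack, so it appears to be waiting, without a timeout, for a reply to its `get_group_server` request. The sketch below shows how that distinction could be picked out mechanically. The regular expression and helper names are illustrative assumptions, not part of any Couchbase tooling.

```python
import re

# "Program counter" line of each process, copied from the dump above.
program_counters = {
    "<0.5083.0>": "Program counter: 0x00002aaaabef2a70 (gen_server:loop/6 + 256)",
    "<0.5084.0>": "Program counter: 0x00002aaaabef2a70 (gen_server:loop/6 + 256)",
    "<0.5085.0>": "Program counter: 0x00002aaaabe73ef0 (gen:do_call/4 + 576)",
}

# Captures the Module:Function/Arity part of a program counter line.
PC_RE = re.compile(r"Program counter: 0x[0-9a-f]+ \((?P<mfa>[^ ]+) \+ \d+\)")


def current_function(pc_line):
    """Return 'Module:Function/Arity' from a program counter line, or None."""
    match = PC_RE.match(pc_line)
    return match.group("mfa") if match else None


def blocked_in_call(pcs):
    """Pids whose current function is gen:do_call/4, i.e. waiting for a call reply."""
    return [pid for pid, line in pcs.items()
            if current_function(line) == "gen:do_call/4"]


print(blocked_in_call(program_counters))
```

Only `<0.5085.0>`, the worker linked to `couchbase_compaction_daemon`, is reported as blocked in a synchronous call.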

        steve Steve Yen made changes -
        Field Original Value New Value
        Assignee Aliaksey Artamonau [ aliaksey artamonau ] Damien Katz [ damien ]
        damien damien added a comment -

        It appears we have a btree-related bug. There is a badarith error in the logs that is causing the view compaction to crash. The badarith error is in couch_view_compactor:update_task/2, and I believe it is caused by division by zero; but in that case the indexes should be empty and update_task/2 should not be called at all.

        The only way that seems possible is if there are values in the primary btree indexes but the row counts are 0. I believe this must be caused by the cleanup of vbucket values from the indexes, which must not be computing the reductions properly when this happens.

        I believe the compactor crash then causes the couch_file for the compaction file to be leaked, which means it cannot be opened again (due to couch_file_write_guard). There is actually a file_already_opened error in the logs, which indicates this is happening.
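The leak described above can be illustrated with a toy model. This is a hedged sketch, not the behaviour of the real `couch_file_write_guard`: the `WriteGuard` class, the `compact` helper and the file name are all hypothetical. It only shows how a single-writer registry turns a crash without cleanup into a permanent "already opened" failure.

```python
class FileAlreadyOpened(Exception):
    """Raised when a path already has a registered writer."""


class WriteGuard:
    """Toy registry allowing one writer per file path (hypothetical model)."""

    def __init__(self):
        self._writers = {}

    def open_for_write(self, path, owner):
        if path in self._writers:
            raise FileAlreadyOpened(path)
        self._writers[path] = owner

    def release(self, path):
        self._writers.pop(path, None)


def compact(guard, path, owner, crash=False, cleanup=True):
    """Register as writer, optionally crash, and optionally release the entry."""
    guard.open_for_write(path, owner)
    try:
        if crash:
            raise ArithmeticError("badarith")
    finally:
        if cleanup:
            guard.release(path)


guard = WriteGuard()
try:
    # First compactor crashes and its registration is never released.
    compact(guard, "index.compact", "compactor-1", crash=True, cleanup=False)
except ArithmeticError:
    pass

try:
    compact(guard, "index.compact", "compactor-2")
    second_attempt = "ok"
except FileAlreadyOpened:
    second_attempt = "file_already_opened"

print(second_attempt)
```

In this model every later attempt fails in the same way until the stale entry is released, which would match a compaction that never restarts.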

        I'm adding code to check for division by zero and exit with a diagnostic message. Reassigning to Filipe to look into the btree issue.
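A minimal sketch of that kind of guard follows. The real check lives in Erlang in `couch_view_compactor:update_task/2`, whose arguments are not shown in this issue, so the function name, parameter names and message text here are assumptions.

```python
def progress_pct(changes_done, total_changes):
    """Integer progress percentage, failing loudly instead of dividing by zero."""
    if total_changes == 0:
        # Diagnostic exit: a progress update with an empty index is unexpected.
        raise ValueError(
            "progress update requested but total_changes is 0 "
            f"(changes_done={changes_done})"
        )
    return (changes_done * 100) // total_changes


print(progress_pct(50, 200))
```

Raising a descriptive error keeps the crash, but makes the unexpected state (rows present while the count is 0) visible in the logs instead of a bare badarith.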

        damien damien made changes -
        Assignee Damien Katz [ damien ] Filipe Manana [ filipe manana ]
        filipe manana filipe manana added a comment -

        It would be great if someone could repeat this test.

        Neither Damien nor I know how to reproduce this or why it could happen.
        The following commit will help diagnose it better the next time it happens.

        https://github.com/couchbase/couchdb/commit/dd6546cad52c72421442b54eb59fe5984d913269

        steve Steve Yen added a comment -

        please try to reproduce (with Filipe's changes)

        steve Steve Yen made changes -
        Assignee Filipe Manana [ filipe manana ] Tommie McAfee [ tommie ]
        filipe manana filipe manana added a comment -

        This is the same issue as MB-4774. One of them should be closed and marked as a duplicate.
        Fix in http://review.couchbase.org/#change,13067

        filipe manana filipe manana added a comment -

        Fix merged today: https://github.com/couchbase/couchdb/commit/6319846fa68c73580e5ead96dbe27868447f730f
        filipe manana filipe manana made changes -
        Status Open [ 1 ] Closed [ 6 ]
        Resolution Fixed [ 1 ]
        farshid Farshid Ghods (Inactive) made changes -
        Resolution Fixed [ 1 ]
        Status Closed [ 6 ] Reopened [ 4 ]
        farshid Farshid Ghods (Inactive) made changes -
        Labels 2.0-dev-preview-4-release-notes
        farshid Farshid Ghods (Inactive) made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        farshid Farshid Ghods (Inactive) made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        farshid Farshid Ghods (Inactive) made changes -
        Component/s couchbase-bucket [ 10173 ]
        Component/s ep_engine [ 10013 ]

          People

          • Assignee:
            tommie Tommie McAfee
            Reporter:
            tommie Tommie McAfee