Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-47482

[System Test] Building view index hung

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Not a Bug
    • 6.6.3
    • 6.6.3
    • view-engine

    Description

      Build - 6.6.3-9796
      Test -

      -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml
      

      Scale - 3
      Day - 3

      Error -
      Building view index default/_design/scale stuck and keeps on switching between 1,2 and 3%.

      Issue not observed on previous run with 6.6.3 - 9791.

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            Rebalance failed

            [user:error,2021-07-18T04:42:41.940-07:00,ns_1@172.23.104.155:<0.19338.177>:ns_orchestrator:log_rebalance_completion:1460]Rebalance exited with reason {mover_crashed,
                                          {{{function_clause,
                                             [{compaction_daemon,
                                               process_scheduler_message,
                                               [compact_views,
                                                {compaction_state,
                                                 [<<"default">>],
                                                 <25421.30680.1299>,
                                                 {state,undefined,undefined,30,
                                                  compact_views},
                                                 #Fun<compaction_daemon.2.62703019>,
                                                 {config,
                                                  {30,undefined},
                                                  {30,undefined},
                                                  undefined,false,false,
                                                  {daemon_config,30,131072,20971520}}}],
                                               [{file,"src/compaction_daemon.erl"},
                                                {line,1292}]},
                                              {compaction_daemon,handle_info,2,
                                               [{file,"src/compaction_daemon.erl"},
                                                {line,334}]},
                                              {gen_server,try_dispatch,4,
                                               [{file,"gen_server.erl"},{line,616}]},
                                              {gen_server,handle_msg,6,
                                               [{file,"gen_server.erl"},{line,686}]},
                                              {proc_lib,init_p_do_apply,3,
                                               [{file,"proc_lib.erl"},{line,247}]}]},
                                            {gen_server,call,
                                             [compaction_daemon,
                                              {uninhibit_view_compaction,<<"default">>,
                                               #Ref<25421.2849894141.663486468.234167>},
                                              infinity]}},
                                           {gen_server,call,
                                            [{'janitor_agent-default',
                                              'ns_1@172.23.108.103'},
                                             {if_rebalance,<0.32327.1435>,
                                              {uninhibit_view_compaction,
                                               #Ref<25421.2849894141.663486468.234167>}},
                                             infinity]}}}.
            Rebalance Operation Id = 91b0323dcb3c7456460b65e64e9d9d01
            [stats:error,2021-07-18T04:55:08.087-07:00,ns_1@172.23.104.155:<0.29601.1443>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100']
            [stats:error,2021-07-18T04:55:08.464-07:00,ns_1@172.23.104.155:<0.27916.1443>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100']
            [stats:error,2021-07-18T04:55:08.517-07:00,ns_1@172.23.104.155:<0.10220.1444>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100']
            [stats:error,2021-07-18T04:55:08.520-07:00,ns_1@172.23.104.155:<0.5552.1444>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100']
            [stats:error,2021-07-18T04:55:08.545-07:00,ns_1@172.23.104.155:<0.29738.1443>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100']
            [stats:error,2021-07-18T04:56:23.407-07:00,ns_1@172.23.104.155:<0.1185.1445>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100']
            [stats:error,2021-07-18T04:56:23.412-07:00,ns_1@172.23.104.155:<0.7634.1445>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100']
            [stats:error,2021-07-18T04:59:48.457-07:00,ns_1@172.23.104.155:<0.13759.1448>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100']
            [ns_server:error,2021-07-18T06:05:01.738-07:00,ns_1@172.23.104.155:<0.6577.1484>:ns_config_rep:synchronize_remote:299]Failed to synchronize config to some nodes:
            [{'ns_1@172.23.104.5',
                 {exit,
                     {shutdown,
                         {gen_server,call,
                             [{ns_config_rep,'ns_1@172.23.104.5'},
                              synchronize_everything,infinity]}}}}]
            ...
            [ns_server:error,2021-07-18T06:05:01.739-07:00,ns_1@172.23.104.155:<0.5251.1484>:misc:try_with_maybe_ignorant_after:1427]Eating exception from ignorant after-block:
            {error,
                {badmatch,
                    [{<0.11916.1484>,
                      {failed_to_update_vbucket_map,"default",244,
                          {error,
                              [{'ns_1@172.23.104.5',
                                   {exit,
                                       {shutdown,
                                           {gen_server,call,
                                               [{ns_config_rep,'ns_1@172.23.104.5'},
                                                synchronize_everything,infinity]}}}}]}}}]},
                [{misc,sync_shutdown_many_i_am_trapping_exits,1,
                     [{file,"src/misc.erl"},{line,1393}]},
                 {misc,try_with_maybe_ignorant_after,2,
                     [{file,"src/misc.erl"},{line,1425}]},
                 {ns_single_vbucket_mover,mover,6,
                     [{file,"src/ns_single_vbucket_mover.erl"},{line,52}]},
                 {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
            

            Could someone from ns_server team take a look?

            ankit.prabhu Ankit Prabhu added a comment - Rebalance failed [user:error,2021-07-18T04:42:41.940-07:00,ns_1@172.23.104.155:<0.19338.177>:ns_orchestrator:log_rebalance_completion:1460]Rebalance exited with reason {mover_crashed, {{{function_clause, [{compaction_daemon, process_scheduler_message, [compact_views, {compaction_state, [<<"default">>], <25421.30680.1299>, {state,undefined,undefined,30, compact_views}, #Fun<compaction_daemon.2.62703019>, {config, {30,undefined}, {30,undefined}, undefined,false,false, {daemon_config,30,131072,20971520}}}], [{file,"src/compaction_daemon.erl"}, {line,1292}]}, {compaction_daemon,handle_info,2, [{file,"src/compaction_daemon.erl"}, {line,334}]}, {gen_server,try_dispatch,4, [{file,"gen_server.erl"},{line,616}]}, {gen_server,handle_msg,6, [{file,"gen_server.erl"},{line,686}]}, {proc_lib,init_p_do_apply,3, [{file,"proc_lib.erl"},{line,247}]}]}, {gen_server,call, [compaction_daemon, {uninhibit_view_compaction,<<"default">>, #Ref<25421.2849894141.663486468.234167>}, infinity]}}, {gen_server,call, [{'janitor_agent-default', 'ns_1@172.23.108.103'}, {if_rebalance,<0.32327.1435>, {uninhibit_view_compaction, #Ref<25421.2849894141.663486468.234167>}}, infinity]}}}. Rebalance Operation Id = 91b0323dcb3c7456460b65e64e9d9d01 [stats:error,2021-07-18T04:55:08.087-07:00,ns_1@172.23.104.155:<0.29601.1443>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100'] [stats:error,2021-07-18T04:55:08.464-07:00,ns_1@172.23.104.155:<0.27916.1443>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100'] [stats:error,2021-07-18T04:55:08.517-07:00,ns_1@172.23.104.155:<0.10220.1444>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100'] [stats:error,2021-07-18T04:55:08.520-07:00,ns_1@172.23.104.155:<0.5552.1444>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100'] [stats:error,2021-07-18T04:55:08.545-07:00,ns_1@172.23.104.155:<0.29738.1443>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100'] [stats:error,2021-07-18T04:56:23.407-07:00,ns_1@172.23.104.155:<0.1185.1445>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100'] [stats:error,2021-07-18T04:56:23.412-07:00,ns_1@172.23.104.155:<0.7634.1445>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100'] [stats:error,2021-07-18T04:59:48.457-07:00,ns_1@172.23.104.155:<0.13759.1448>:stats_reader:log_bad_responses:238]Some nodes didn't respond: ['ns_1@172.23.106.100'] [ns_server:error,2021-07-18T06:05:01.738-07:00,ns_1@172.23.104.155:<0.6577.1484>:ns_config_rep:synchronize_remote:299]Failed to synchronize config to some nodes: [{'ns_1@172.23.104.5', {exit, {shutdown, {gen_server,call, [{ns_config_rep,'ns_1@172.23.104.5'}, synchronize_everything,infinity]}}}}] ... [ns_server:error,2021-07-18T06:05:01.739-07:00,ns_1@172.23.104.155:<0.5251.1484>:misc:try_with_maybe_ignorant_after:1427]Eating exception from ignorant after-block: {error, {badmatch, [{<0.11916.1484>, {failed_to_update_vbucket_map,"default",244, {error, [{'ns_1@172.23.104.5', {exit, {shutdown, {gen_server,call, [{ns_config_rep,'ns_1@172.23.104.5'}, synchronize_everything,infinity]}}}}]}}}]}, [{misc,sync_shutdown_many_i_am_trapping_exits,1, [{file,"src/misc.erl"},{line,1393}]}, {misc,try_with_maybe_ignorant_after,2, [{file,"src/misc.erl"},{line,1425}]}, {ns_single_vbucket_mover,mover,6, [{file,"src/ns_single_vbucket_mover.erl"},{line,52}]}, {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]} Could someone from ns_server team take a look?
            ankit.prabhu Ankit Prabhu added a comment -

            Looking at the task endpoint the totalChanges are changing for views updater(this is captured when the view updater starts). That means views are trying to index on new snapshot and old one is completed. This percentage of index build is on snapshot not on the entire bucket data.
            There are "update_after" query going on in the background and also there are ops on bucket on which views are defined.

            [couchdb:info,2021-07-19T00:40:59.333-07:00,couchdb_ns_1@cb.local:couch_query_logger<0.189.0>:couch_log:info:41]Query-Volume
            <ud>"/default/_design/scale/_view/padd"</ud> | internal={stale: {ok: 0, upd_after: 600, false: 0}, ResultBytes: 89727192} | external={stale: {ok: 0, upd_after: 0, false: 0}, ResultBytes: 0}
            <ud>"/default/_design/scale/_view/stats"</ud> | internal={stale: {ok: 0, upd_after: 600, false: 0}, ResultBytes: 9000} | external={stale: {ok: 0, upd_after: 0, false: 0}, ResultBytes: 0}
            

            task endpoint result:

            [{"statusId":"bc622ce7c5df20c72d69a2c68eb37448","type":"rebalance","status":"notRunning","statusIsStale":false,"masterRequestTimedOut":false,"lastReportURI":"/logs/rebalanceReport?reportID=df593ad583b28b5990e143507559511d"},{"cancelURI":"/controller/cancelXDCR/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","settingsURI":"/settings/replications/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","status":"running","replicationType":"xmem","continuous":true,"filterBypassExpiry":false,"filterDeletion":false,"filterExpiration":false,"filterExpression":"","id":"48e39a1f0a3383317117c525188efd29/default/remote","pauseRequested":false,"source":"default","target":"/remoteClusters/48e39a1f0a3383317117c525188efd29/buckets/remote","type":"xdcr","recommendedRefreshPeriod":10,"changesLeft":12,"docsChecked":877750378,"docsWritten":64888858,"maxVBReps":null,"errors":[]},{"type":"indexer","recommendedRefreshPeriod":2,"status":"running","bucket":"default","designDocument":"_design/scale","changesDone":3,"totalChanges":238,"progress":1},{"type":"bucket_compaction","recommendedRefreshPeriod":2,"status":"running","bucket":"ITEM","changesDone":26,"totalChanges":159,"progress":16},{"type":"bucket_compaction","recommendedRefreshPeriod":2,"status":"running","bucket":"NEW_ORDER","changesDone":464,"totalChanges":789,"progress":58},
            ...
            [{"statusId":"bc622ce7c5df20c72d69a2c68eb37448","type":"rebalance","status":"notRunning","statusIsStale":false,"masterRequestTimedOut":false,"lastReportURI":"/logs/rebalanceReport?reportID=df593ad583b28b5990e143507559511d"},{"cancelURI":"/controller/cancelXDCR/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","settingsURI":"/settings/replications/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","status":"running","replicationType":"xmem","continuous":true,"filterBypassExpiry":false,"filterDeletion":false,"filterExpiration":false,"filterExpression":"","id":"48e39a1f0a3383317117c525188efd29/default/remote","pauseRequested":false,"source":"default","target":"/remoteClusters/48e39a1f0a3383317117c525188efd29/buckets/remote","type":"xdcr","recommendedRefreshPeriod":10,"changesLeft":3,"docsChecked":881450287,"docsWritten":65188405,"maxVBReps":null,"errors":[]},{"type":"indexer","recommendedRefreshPeriod":2,"status":"running","bucket":"default","designDocument":"_design/scale","changesDone":3,"totalChanges":29,"progress":10},
            ...[{"statusId":"bc622ce7c5df20c72d69a2c68eb37448","type":"rebalance","status":"notRunning","statusIsStale":false,"masterRequestTimedOut":false,"lastReportURI":"/logs/rebalanceReport?reportID=df593ad583b28b5990e143507559511d"},{"cancelURI":"/controller/cancelXDCR/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","settingsURI":"/settings/replications/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","status":"running","replicationType":"xmem","continuous":true,"filterBypassExpiry":false,"filterDeletion":false,"filterExpiration":false,"filterExpression":"","id":"48e39a1f0a3383317117c525188efd29/default/remote","pauseRequested":false,"source":"default","target":"/remoteClusters/48e39a1f0a3383317117c525188efd29/buckets/remote","type":"xdcr","recommendedRefreshPeriod":10,"changesLeft":2,"docsChecked":882204174,"docsWritten":65196075,"maxVBReps":null,"errors":[]},{"type":"indexer","recommendedRefreshPeriod":2,"status":"running","bucket":"default","designDocument":"_design/scale","changesDone":2,"totalChanges":17,"progress":11},
            ...
            [{"statusId":"bc622ce7c5df20c72d69a2c68eb37448","type":"rebalance","status":"notRunning","statusIsStale":false,"masterRequestTimedOut":false,"lastReportURI":"/logs/rebalanceReport?reportID=df593ad583b28b5990e143507559511d"},{"cancelURI":"/controller/cancelXDCR/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","settingsURI":"/settings/replications/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","status":"running","replicationType":"xmem","continuous":true,"filterBypassExpiry":false,"filterDeletion":false,"filterExpiration":false,"filterExpression":"","id":"48e39a1f0a3383317117c525188efd29/default/remote","pauseRequested":false,"source":"default","target":"/remoteClusters/48e39a1f0a3383317117c525188efd29/buckets/remote","type":"xdcr","recommendedRefreshPeriod":10,"changesLeft":2,"docsChecked":882215626,"docsWritten":65200308,"maxVBReps":null,"errors":[]},{"type":"indexer","recommendedRefreshPeriod":2,"status":"running","bucket":"default","designDocument":"_design/scale","changesDone":2,"totalChanges":16,"progress":12},
            

            Sujay Gad could you please confirm this by stopping the update-after query or stopping ops on bucket?
            Stopping update after query will run the updater every 5 seconds(updateInterval) if unindexed seq number is more than 5000(updateMinChanges)?

            ankit.prabhu Ankit Prabhu added a comment - Looking at the task endpoint the totalChanges are changing for views updater(this is captured when the view updater starts). That means views are trying to index on new snapshot and old one is completed. This percentage of index build is on snapshot not on the entire bucket data. There are "update_after" query going on in the background and also there are ops on bucket on which views are defined. [couchdb:info,2021-07-19T00:40:59.333-07:00,couchdb_ns_1@cb.local:couch_query_logger<0.189.0>:couch_log:info:41]Query-Volume <ud>"/default/_design/scale/_view/padd"</ud> | internal={stale: {ok: 0, upd_after: 600, false: 0}, ResultBytes: 89727192} | external={stale: {ok: 0, upd_after: 0, false: 0}, ResultBytes: 0} <ud>"/default/_design/scale/_view/stats"</ud> | internal={stale: {ok: 0, upd_after: 600, false: 0}, ResultBytes: 9000} | external={stale: {ok: 0, upd_after: 0, false: 0}, ResultBytes: 0} task endpoint result: [{"statusId":"bc622ce7c5df20c72d69a2c68eb37448","type":"rebalance","status":"notRunning","statusIsStale":false,"masterRequestTimedOut":false,"lastReportURI":"/logs/rebalanceReport?reportID=df593ad583b28b5990e143507559511d"},{"cancelURI":"/controller/cancelXDCR/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","settingsURI":"/settings/replications/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","status":"running","replicationType":"xmem","continuous":true,"filterBypassExpiry":false,"filterDeletion":false,"filterExpiration":false,"filterExpression":"","id":"48e39a1f0a3383317117c525188efd29/default/remote","pauseRequested":false,"source":"default","target":"/remoteClusters/48e39a1f0a3383317117c525188efd29/buckets/remote","type":"xdcr","recommendedRefreshPeriod":10,"changesLeft":12,"docsChecked":877750378,"docsWritten":64888858,"maxVBReps":null,"errors":[]},{"type":"indexer","recommendedRefreshPeriod":2,"status":"running","bucket":"default","designDocument":"_design/scale","changesDone":3,"totalChanges":238,"progress":1},{"type":"bucket_compaction","recommendedRefreshPeriod":2,"status":"running","bucket":"ITEM","changesDone":26,"totalChanges":159,"progress":16},{"type":"bucket_compaction","recommendedRefreshPeriod":2,"status":"running","bucket":"NEW_ORDER","changesDone":464,"totalChanges":789,"progress":58}, ... [{"statusId":"bc622ce7c5df20c72d69a2c68eb37448","type":"rebalance","status":"notRunning","statusIsStale":false,"masterRequestTimedOut":false,"lastReportURI":"/logs/rebalanceReport?reportID=df593ad583b28b5990e143507559511d"},{"cancelURI":"/controller/cancelXDCR/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","settingsURI":"/settings/replications/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","status":"running","replicationType":"xmem","continuous":true,"filterBypassExpiry":false,"filterDeletion":false,"filterExpiration":false,"filterExpression":"","id":"48e39a1f0a3383317117c525188efd29/default/remote","pauseRequested":false,"source":"default","target":"/remoteClusters/48e39a1f0a3383317117c525188efd29/buckets/remote","type":"xdcr","recommendedRefreshPeriod":10,"changesLeft":3,"docsChecked":881450287,"docsWritten":65188405,"maxVBReps":null,"errors":[]},{"type":"indexer","recommendedRefreshPeriod":2,"status":"running","bucket":"default","designDocument":"_design/scale","changesDone":3,"totalChanges":29,"progress":10}, ...[{"statusId":"bc622ce7c5df20c72d69a2c68eb37448","type":"rebalance","status":"notRunning","statusIsStale":false,"masterRequestTimedOut":false,"lastReportURI":"/logs/rebalanceReport?reportID=df593ad583b28b5990e143507559511d"},{"cancelURI":"/controller/cancelXDCR/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","settingsURI":"/settings/replications/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","status":"running","replicationType":"xmem","continuous":true,"filterBypassExpiry":false,"filterDeletion":false,"filterExpiration":false,"filterExpression":"","id":"48e39a1f0a3383317117c525188efd29/default/remote","pauseRequested":false,"source":"default","target":"/remoteClusters/48e39a1f0a3383317117c525188efd29/buckets/remote","type":"xdcr","recommendedRefreshPeriod":10,"changesLeft":2,"docsChecked":882204174,"docsWritten":65196075,"maxVBReps":null,"errors":[]},{"type":"indexer","recommendedRefreshPeriod":2,"status":"running","bucket":"default","designDocument":"_design/scale","changesDone":2,"totalChanges":17,"progress":11}, ... [{"statusId":"bc622ce7c5df20c72d69a2c68eb37448","type":"rebalance","status":"notRunning","statusIsStale":false,"masterRequestTimedOut":false,"lastReportURI":"/logs/rebalanceReport?reportID=df593ad583b28b5990e143507559511d"},{"cancelURI":"/controller/cancelXDCR/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","settingsURI":"/settings/replications/48e39a1f0a3383317117c525188efd29%2Fdefault%2Fremote","status":"running","replicationType":"xmem","continuous":true,"filterBypassExpiry":false,"filterDeletion":false,"filterExpiration":false,"filterExpression":"","id":"48e39a1f0a3383317117c525188efd29/default/remote","pauseRequested":false,"source":"default","target":"/remoteClusters/48e39a1f0a3383317117c525188efd29/buckets/remote","type":"xdcr","recommendedRefreshPeriod":10,"changesLeft":2,"docsChecked":882215626,"docsWritten":65200308,"maxVBReps":null,"errors":[]},{"type":"indexer","recommendedRefreshPeriod":2,"status":"running","bucket":"default","designDocument":"_design/scale","changesDone":2,"totalChanges":16,"progress":12}, Sujay Gad could you please confirm this by stopping the update-after query or stopping ops on bucket? Stopping update after query will run the updater every 5 seconds(updateInterval) if unindexed seq number is more than 5000(updateMinChanges)?
            sujay.gad Sujay Gad added a comment - - edited

            Initially rebalance failed due to MB-47477. Issue of view index getting stuck is not observed after stopping ops on default bucket.

            sujay.gad Sujay Gad added a comment - - edited Initially rebalance failed due to MB-47477 . Issue of view index getting stuck is not observed after stopping ops on default bucket.
            ankit.prabhu Ankit Prabhu added a comment -

            Rebalance failed issue is tracked on MB-47477
            Resolving this as not a bug.

            ankit.prabhu Ankit Prabhu added a comment - Rebalance failed issue is tracked on MB-47477 Resolving this as not a bug.

            People

              ankit.prabhu Ankit Prabhu
              sujay.gad Sujay Gad
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty