  1. Couchbase Server
  2. MB-24441

Rebalance taking more time (causing view queries to time out due to view indexing)


Details

    • Triaged
    • Centos 64-bit
    • Unknown

    Description

      Setup details

      3* CBS nodes with 200 GB data 
      replica =1
      eviction = value eviction
      2* syncgateway

      We are seeing rebalance take more time, and indexes are also being rebuilt during rebalance.

      This is causing view queries to time out, leading to an outage.
      Sync Gateway goes into a hung state, and all requests to Sync Gateway time out.


        Activity

          ark7856 Arihant Rk added a comment -

          We have also set the parameters below but do not see any improvement:

          /opt/couchbase/bin/cbepctl 10.2.1.186:11210 -b ptxdata set flush_param warmup_min_memory_threshold 10
          /opt/couchbase/bin/cbepctl 10.2.1.186:11210 -b ptxdata set flush_param max_num_readers 14
          /opt/couchbase/bin/cbepctl 10.2.1.186:11210 -b ptxdata set flush_param max_num_writers 8


          dhaikney David Haikney added a comment -

          Arihant Rk We'll need a lot more information if we are to investigate this ticket.
          Firstly, please supply a set of logs from the Couchbase cluster using cbcollect_info. Also, please can you describe the problem in more detail, e.g. "rebalance taking more time": since when is it taking more time, and what changed? What makes you think the rebalance time is causing the view queries to time out?
          Those cbepctl parameters you tried: what led you to believe they would make a difference to the behaviour you were seeing?

          dhaikney David Haikney added a comment -

          Resolving this as incomplete for now. If you are able to collect the information requested above, we'll be happy to reopen and investigate.

          asif.kazi Asif Kazi (Inactive) added a comment -

          Reopening the ticket and adding the requested logs.

          sriram Sriram Ganesan (Inactive) added a comment -

          Noticed the following messages on node .82:

          2017-05-24T16:14:10.714234+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 340, Check for: 4220, Persisted upto: 3679, cookie 0x7f17f959c000
          2017-05-24T16:14:19.740135+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 339, Check for: 7106, Persisted upto: 4488, cookie 0x7f17d5228000
          2017-05-24T16:14:20.056419+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 1022, Check for: 4817, Persisted upto: 4206, cookie 0x7f17d525c000
          2017-05-24T16:14:28.505342+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 1021, Check for: 4518, Persisted upto: 2869, cookie 0x7f17d52d2000
          2017-05-24T16:14:34.819405+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 337, Check for: 26644, Persisted upto: 12218, cookie 0x7f17d5228000
          2017-05-24T16:14:45.164906+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 1020, Check for: 52565, Persisted upto: 16965, cookie 0x7f17d525c000
          2017-05-24T16:14:50.354263+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 336, Check for: 4825, Persisted upto: 4808, cookie 0x7f17d55e1000
          2017-05-24T16:14:54.846921+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 1019, Check for: 5302, Persisted upto: 4264, cookie 0x7f17d52d2000
          2017-05-24T16:15:04.332396+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 1018, Check for: 5207, Persisted upto: 3199, cookie 0x7f17d525c000
          2017-05-24T16:15:20.888261+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 334, Check for: 5345, Persisted upto: 5326, cookie 0x7f17d55e1000
          2017-05-24T16:15:21.893911+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 1017, Check for: 6245, Persisted upto: 4868, cookie 0x7f17d52d2000
          2017-05-24T16:15:39.742261+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 1016, Check for: 4025, Persisted upto: 3596, cookie 0x7f17d525c000
          2017-05-24T16:15:44.136704+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 1015, Check for: 4264, Persisted up..
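The warnings quoted above each report a requested seqno ("Check for") and the seqno actually persisted ("Persisted upto"). A minimal Python sketch, assuming only the line layout visible in this ticket (not any documented log format), can rank vbuckets by how far persistence was behind. The names `SEQNO_TIMEOUT` and `persistence_lag` are hypothetical helpers, not part of any Couchbase tool.

```python
import re

# Pattern derived solely from the warning lines quoted in this ticket.
SEQNO_TIMEOUT = re.compile(
    r"seqno persistence for vbucket (?P<vb>\d+), "
    r"Check for: (?P<wanted>\d+), Persisted upto: (?P<persisted>\d+)"
)

def persistence_lag(lines):
    """Return {vbucket: requested_seqno - persisted_seqno} for matching lines."""
    lag = {}
    for line in lines:
        m = SEQNO_TIMEOUT.search(line)
        if m:  # truncated or unrelated lines are skipped
            lag[int(m["vb"])] = int(m["wanted"]) - int(m["persisted"])
    return lag

# Three of the lines from the comment above, copied verbatim.
sample = [
    "2017-05-24T16:14:10.714234+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 340, Check for: 4220, Persisted upto: 3679, cookie 0x7f17f959c000",
    "2017-05-24T16:14:45.164906+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 1020, Check for: 52565, Persisted upto: 16965, cookie 0x7f17d525c000",
    "2017-05-24T16:14:50.354263+05:30 WARNING (ptxdata) Notified the timeout on seqno persistence for vbucket 336, Check for: 4825, Persisted upto: 4808, cookie 0x7f17d55e1000",
]

lags = persistence_lag(sample)
# Print vbuckets with the largest backlog first.
for vb, lag in sorted(lags.items(), key=lambda kv: kv[1], reverse=True):
    print(f"vb {vb}: {lag} seqnos behind")
```

In this sample the backlog ranges from 17 seqnos (vbucket 336) to 35600 (vbucket 1020), which may help decide whether the lag is uniform or concentrated in a few vbuckets.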

          sriram Sriram Ganesan (Inactive) added a comment -

          Reasons for the view query and Sync Gateway timeouts:

          2017-05-24T16:14:10.518136+05:30 WARNING (ptxdata) DCP (Producer) eq_dcpq:mapreduce_view: ptxdata _design/locationGrpView (prod/main) - (vb 1023) Stream request failed because this vbucket is in backfill state
          2017-05-24T16:14:10.518193+05:30 WARNING (ptxdata) DCP (Producer) eq_dcpq:mapreduce_view: ptxdata _design/ptxGroup (prod/main) - (vb 1023) Stream request failed because this vbucket is in backfill state
          2017-05-24T16:14:10.600141+05:30 WARNING (ptxdata) DCP (Producer) eq_dcpq:mapreduce_view: ptxdata _design/sync_gateway (prod/replica) - (vb 339) Stream request failed because this vbucket is in backfill state
          2017-05-24T16:14:10.604108+05:30 WARNING (ptxdata) DCP (Producer) eq_dcpq:mapreduce_view: ptxdata _design/locationGrpView (prod/replica) - (vb 339) Stream request failed because this vbucket is in backfill state
          2017-05-24T16:14:10.617106+05:30 WARNING (ptxdata) DCP (Producer) eq_dcpq:mapreduce_view: ptxdata _design/sync_housekeeping (prod/main) - (vb 1023) Stream request failed because this vbucket is in backfill state
          2017-05-24T16:14:10.617146+05:30 WARNING (ptxdata) DCP (Producer) eq_dcpq:mapreduce_view: ptxdata _design/sync_gateway (prod/main) - (vb 1023) Stream request failed because this vbucket is in backfill state
          2017-05-24T16:14:10.619101+05:30 WARNING (ptxdata) DCP (Producer) eq_dcpq:mapreduce_view: ptxdata _design/locationGrpView (prod/main) - (vb 1023) Stream request failed because this vbucket is in backfill state
          2017-05-24T16:14:10.619166+05:30 WARNING (ptxdata) DCP (Producer) eq_dcpq:mapreduce_view: ptxdata _design/ptxGroup (prod/main) - (vb 1023) Stream request failed because this vbucket is in backfill state
          2017-05-24T16:14:10.701123+05:30 WARNING (ptxdata) DCP (Producer) eq_dcpq:mapreduce_view: ptxdata _design/sync_gateway (prod/replica) - (vb 339) Stream request failed because this vbucket is in backfill state
          2017-05-24T16:14:10.705154+05:30 WARNING (ptxdata) DCP (Producer) eq_dcpq:mapreduce_view: ptxdata _design/locationGrpView (prod/replica) - (vb 339) Stream request failed because this vbucket is in backfill state

          The streams are not getting created because the associated vbuckets are still in backfilling state.
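To see which design documents and vbuckets are retrying most often, the stream-request warnings above can be tallied. The following is a rough Python sketch that assumes only the line layout shown in this ticket; `STREAM_FAIL` and `count_failures` are hypothetical names introduced for illustration.

```python
import re
from collections import Counter

# Pattern derived solely from the DCP producer warnings quoted in this ticket.
STREAM_FAIL = re.compile(
    r"_design/(?P<ddoc>\S+) \(prod/(?P<kind>main|replica)\) - "
    r"\(vb (?P<vb>\d+)\) Stream request failed because this vbucket is in backfill state"
)

def count_failures(lines):
    """Count failed stream requests per (design doc, index type, vbucket)."""
    counts = Counter()
    for line in lines:
        m = STREAM_FAIL.search(line)
        if m:
            counts[(m["ddoc"], m["kind"], int(m["vb"]))] += 1
    return counts

# Three of the lines from the comment above, copied verbatim.
sample = [
    "2017-05-24T16:14:10.518136+05:30 WARNING (ptxdata) DCP (Producer) eq_dcpq:mapreduce_view: ptxdata _design/locationGrpView (prod/main) - (vb 1023) Stream request failed because this vbucket is in backfill state",
    "2017-05-24T16:14:10.600141+05:30 WARNING (ptxdata) DCP (Producer) eq_dcpq:mapreduce_view: ptxdata _design/sync_gateway (prod/replica) - (vb 339) Stream request failed because this vbucket is in backfill state",
    "2017-05-24T16:14:10.619101+05:30 WARNING (ptxdata) DCP (Producer) eq_dcpq:mapreduce_view: ptxdata _design/locationGrpView (prod/main) - (vb 1023) Stream request failed because this vbucket is in backfill state",
]

failures = count_failures(sample)
for (ddoc, kind, vb), n in failures.most_common():
    print(f"{ddoc} ({kind}) vb {vb}: {n} failed stream requests")
```

Repeated entries for the same design document and vbucket roughly 100 ms apart, as in the excerpt, would be consistent with the view engine retrying while the vbucket is still backfilling.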

          sriram Sriram Ganesan (Inactive) added a comment -

          After examining the logs on node .82:

          (i) When the node was rebalanced in for full recovery, timeouts were observed on seqno
          persistence. There was no conclusive evidence of what was causing the timeouts. It could simply
          be that the disk was slow and hence the writes to disk were taking more time.

          (ii) Views and Sync Gateway were timing out. Evidence from the logs suggests that the stream
          requests were failing because the associated vbuckets were in backfilling state. This could
          just be because rebalance is taking place. Given that the disk is slow during this time, the
          backfills (basically reads from disk) are also slower, thus resulting in timeouts.

          The suggestion would be to inspect the hardware on node .82, or to try using a different node to see if that ameliorates this particular case. Please update the ticket if you have more details to share.
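One rough way to sanity-check the slow-disk hypothesis on a node is to time small synchronous writes on the volume in question. The Python sketch below is an assumption-laden probe, not a Couchbase tool: `fsync_latencies` is a hypothetical helper, the 4 KiB write size and write count are arbitrary, and the temporary directory used here should be replaced with a directory on the same disk as the Couchbase data path.

```python
import os
import tempfile
import time

def fsync_latencies(directory, writes=20, size=4096):
    """Time `writes` small write+fsync cycles in `directory`; returns seconds."""
    payload = b"\0" * size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the write to stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies

# Assumption: the system temp directory stands in for the data volume.
samples = fsync_latencies(tempfile.gettempdir(), writes=10)
print(f"max {max(samples) * 1000:.2f} ms, "
      f"mean {sum(samples) / len(samples) * 1000:.2f} ms")
```

Running the same probe on a healthy node gives a baseline to compare against; it measures only small synchronous writes, so it cannot by itself confirm or rule out a hardware problem.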

          People

            sriram Sriram Ganesan (Inactive)
            ark7856 Arihant Rk