Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-20465

[GSI]Indexer missing DCP mutations

    XMLWordPrintable

Details

    • Untriaged
    • Yes

    Description

      Testrunner command: ./testrunner -i /tmp/centos_rec_job4.ini verify_query_result=False,gsi_type=forestdb,get-cbcollect-info=True,get-coredumps=True -t 2i.indexscans_2i.SecondaryIndexingScanTests.test_multi_create_query_explain_drop_index_with_concurrent_mutations,nodes_init=4,services_init=kv:n1ql-kv-kv:index-index,groups=simple,dataset=default,use_gsi_for_primary=True,use_gsi_for_secondary=True,doc-per-day=50,doc_ops=True,create_ops_per=.5,delete_ops_per=.2,update_ops_per=.2,run_async=True,scan_consistency=request_plus,dgm_run=True,active_resident_threshold=30,eviction_policy=fullEviction,GROUP=VIEW;GSI

      Testrunner Logs: http://qa.sc.couchbase.com/job/cen7-2i-set7-job3-crud-recovery-dgm/176/consoleFull
      Test #17

      Steps:
      1. Configure cluster, create buckets. set memory quota of data nodes to lesser value
      2. Load documents till dgm is achieved.
      3. Start KV operations and create and query indexes concurrently.

      Create index is failing

      Logs: https://s3.amazonaws.com/bugdb/jira/mb-007/create_index_dgm.tar

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          Thanks Dave for taking a look. The bucket delete happens at 2016-08-08T02:50:55 as part of test cleanup after waiting for 10mins for the mutations.

          The below backfill was scheduled at 2016-08-08T02:40:59 but these mutations never reached Indexer.

          2016-08-08T02:40:59.449045-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/0 - (vb 358) Scheduling backfill from 374 to 375, reschedule flag : True
          

          And when the stream got closed 10 mins later, the DCP logs also seem to indicate the mutations were sent only upto seqno 373.

          2016-08-08T02:50:55.955121-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/0 - (vb 358) Stream closing, sent until seqno 373 remaining items 0, reason: The stream closed early because the conn was disconnected.
          

          deepkaran.salooja Deepkaran Salooja added a comment - Thanks Dave for taking a look. The bucket delete happens at 2016-08-08T02:50:55 as part of test cleanup after waiting for 10mins for the mutations. The below backfill was scheduled at 2016-08-08T02:40:59 but these mutations never reached Indexer. 2016-08-08T02:40:59.449045-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/0 - (vb 358) Scheduling backfill from 374 to 375, reschedule flag : True And when the stream got closed 10 mins later, the DCP logs also seem to indicate the mutations were sent only upto seqno 373. 2016-08-08T02:50:55.955121-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/0 - (vb 358) Stream closing, sent until seqno 373 remaining items 0, reason: The stream closed early because the conn was disconnected.
          drigby Dave Rigby added a comment -

          Hmm, I suspect this might be related to cursor dropping. @Manu - could you take a look please?

          drigby Dave Rigby added a comment - Hmm, I suspect this might be related to cursor dropping. @Manu - could you take a look please?
          manu Manu Dhundi (Inactive) added a comment - - edited

          From the logs,

          346 backfills are scheduled, but only 342 of them are run.

          cat memcached.log | grep -e "Scheduling backfill" | grep -e "INIT_STREAM_TOPIC_76" | wc -l
               346
          cat memcached.log | grep -e "Backfill complete" | grep -e "INIT_STREAM_TOPIC_76" | wc -l
               342
          

          Of the 5 rescheduled backfills,

          2016-08-08T02:40:58.949595-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/3 - (vb 565) Scheduling backfill from 398 to 398, reschedule flag : True
          2016-08-08T02:40:59.443957-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/0 - (vb 350) Scheduling backfill from 372 to 372, reschedule flag : True
          2016-08-08T02:40:59.449045-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/0 - (vb 358) Scheduling backfill from 374 to 375, reschedule flag : True
          2016-08-08T02:40:59.461183-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/0 - (vb 362) Scheduling backfill from 448 to 449, reschedule flag : True
          2016-08-08T02:40:59.465253-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/0 - (vb 370) Scheduling backfill from 395 to 395, reschedule flag : True
          

          1 of them is run correctly to completion,

           
          2016-08-08T02:40:58.954913-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/3 - (vb 565) Backfill complete, 239 items read from disk, 120 from memory, last seqno read: 398
          

          but not the other 4.

          Also, as expected the system is under severe memory pressure as it is a DGM test.
          Sometimes BackfillManagerTask is suspended temporarily.

          2016-08-08T02:40:58.183254-07:00 NOTICE (default) DCP backfilling task temporarily suspended because the current memory usage is too high
          2016-08-08T02:40:58.946745-07:00 Message repeated 2 times
          

          So, it is most likely not specific to cursor dropping.
          Something to do with
          1) BackfillManagerTask not running in some scenario or
          2) Some queued backfills being run by the BackfillManagerTask.

          Prasanna Gholap [X], 256MB for KV node seems less even for a DGM test (I agree that we claim to support it, but it is quite small).
          If this is blocking you or if you want to confirm that this a sporadic failure, you could run the test case with higher mem quota for KV node/bucket.

          Also note that, the stats.log does not contain the stats. There was some problem in the stats collection. The failure happens around 2016-08-08T02:40. But the logs are collected only around 2016-08-08T05:54. Is it possible to collect logs at the time of failure.

          manu Manu Dhundi (Inactive) added a comment - - edited From the logs, 346 backfills are scheduled, but only 342 of them are run. cat memcached.log | grep -e "Scheduling backfill" | grep -e "INIT_STREAM_TOPIC_76" | wc -l 346 cat memcached.log | grep -e "Backfill complete" | grep -e "INIT_STREAM_TOPIC_76" | wc -l 342 Of the 5 rescheduled backfills, 2016-08-08T02:40:58.949595-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/3 - (vb 565) Scheduling backfill from 398 to 398, reschedule flag : True 2016-08-08T02:40:59.443957-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/0 - (vb 350) Scheduling backfill from 372 to 372, reschedule flag : True 2016-08-08T02:40:59.449045-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/0 - (vb 358) Scheduling backfill from 374 to 375, reschedule flag : True 2016-08-08T02:40:59.461183-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/0 - (vb 362) Scheduling backfill from 448 to 449, reschedule flag : True 2016-08-08T02:40:59.465253-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/0 - (vb 370) Scheduling backfill from 395 to 395, reschedule flag : True 1 of them is run correctly to completion, 2016-08-08T02:40:58.954913-07:00 NOTICE (default) DCP (Producer) eq_dcpq:secidx:proj-default-INIT_STREAM_TOPIC_76:80:e7:51:2d:d1:a6:5a-2990058650266308743/3 - (vb 565) Backfill complete, 239 items read from disk, 120 from memory, last seqno read: 398 but not the other 4. Also, as expected the system is under severe memory pressure as it is a DGM test. Sometimes BackfillManagerTask is suspended temporarily. 2016-08-08T02:40:58.183254-07:00 NOTICE (default) DCP backfilling task temporarily suspended because the current memory usage is too high 2016-08-08T02:40:58.946745-07:00 Message repeated 2 times So, it is most likely not specific to cursor dropping. Something to do with 1) BackfillManagerTask not running in some scenario or 2) Some queued backfills being run by the BackfillManagerTask. Prasanna Gholap [X] , 256MB for KV node seems less even for a DGM test (I agree that we claim to support it, but it is quite small). If this is blocking you or if you want to confirm that this a sporadic failure, you could run the test case with higher mem quota for KV node/bucket. Also note that, the stats.log does not contain the stats. There was some problem in the stats collection. The failure happens around 2016-08-08T02:40. But the logs are collected only around 2016-08-08T05:54. Is it possible to collect logs at the time of failure.

          Prasanna Gholap [X], can you please reproduce this problem.
          With the unavailability of stats.log we can't infer more.

          Also, if possible, can you modify the test script such that the logs are collected immediately on the failure?

          manu Manu Dhundi (Inactive) added a comment - Prasanna Gholap [X] , can you please reproduce this problem. With the unavailability of stats.log we can't infer more. Also, if possible, can you modify the test script such that the logs are collected immediately on the failure?

          The issue is not reproducible on latest 4.7 build (4.7.0-1177). Hence, resolving.
          If observed, the issue will be reopened.

          prasanna.gholap Prasanna Gholap [X] (Inactive) added a comment - The issue is not reproducible on latest 4.7 build (4.7.0-1177). Hence, resolving. If observed, the issue will be reopened.

          People

            prasanna.gholap Prasanna Gholap [X] (Inactive)
            prasanna.gholap Prasanna Gholap [X] (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty