Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-37689

[System Test] : index building for a partitioned primary index is stuck on 1 indexer node

    XMLWordPrintable

    Details

      Description

      Build : 7.0.0-1210
      Test : -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml
      Scale : 3
      Day : 3

      Seeing that building of a partitioned primary index on CUSTOMER bucket has been stuck at 100% on indexer node 172.23.96.223.

      Some interesting stats for this index :
      "CUSTOMER:#primary:avg_drain_rate":0,
      "CUSTOMER:#primary:avg_item_size":0,
      "CUSTOMER:#primary:build_progress":100,
      "CUSTOMER:#primary:items_count":0,
      "CUSTOMER:#primary:num_docs_indexed":0,
      "CUSTOMER:#primary:num_docs_pending":0,
      "CUSTOMER:mutation_queue_size":0,
      "CUSTOMER:num_mutations_queued":10462620,

      On the other indexer node - 172.23.96.216, where the build progress is also 100% for this index, the same stats have the following values.
      "CUSTOMER:#primary:avg_drain_rate":0,
      "CUSTOMER:#primary:avg_item_size":7,
      "CUSTOMER:#primary:build_progress":100,
      "CUSTOMER:#primary:items_count":148215,
      "CUSTOMER:#primary:num_docs_indexed":148215,
      "CUSTOMER:#primary:num_docs_pending":0,
      "CUSTOMER:mutation_queue_size":0,
      "CUSTOMER:num_mutations_queued":11830267,

      Index creation was initiated at 2020-01-27T03:11:45

      [2020-01-27T03:11:45-08:00, sequoiatools/cbq:ba233f] -e=http://172.23.97.148:8093 -u=Administrator -p=password -script=create primary index on `CUSTOMER` partition by hash(rating,result,claim) using GSI with {"num_partition":4}
      

      The indexer logs on 172.23.96.223 shows some errors before and after the index creation time :

      2020-01-27T02:53:36.556-08:00 [Error] feed.DcpGetSeqnos(): read tcp 172.23.96.223:43674->172.23.96.14:11210: read: connection reset by peer
      2020-01-27T02:53:54.480-08:00 [Error] feed.DcpGetSeqnos(): read tcp 172.23.96.223:43728->172.23.96.14:11210: read: connection reset by peer
      2020-01-27T02:54:23.427-08:00 [Error] feed.DcpGetSeqnos(): EOF
      2020-01-27T02:54:54.467-08:00 [Error] DATP[->dataport ":9105"] worker "172.23.96.14:57938" exit: EOF
      2020-01-27T02:54:54.467-08:00 [Error] DATP[->dataport ":9105"] remote "172.23.96.14:57938" closed
      2020-01-27T02:54:59.550-08:00 [Error] PeerPipe.doRecieve() : ecounter error when received mesasage from Peer 172.23.96.216:43484.  Error = EOF. Kill Pipe.
      2020-01-27T03:12:04.432-08:00 [Error] PeerPipe.doRecieve() : ecounter error when received mesasage from Peer 172.23.96.216:9100.  Error = read tcp 172.23.96.223:45848->172.23.96.216:9100: use of closed network connection. Kill Pipe.
      2020-01-27T03:12:04.432-08:00 [Error] PeerPipe.doRecieve() : ecounter error when received mesasage from Peer 172.23.96.223:9100.  Error = read tcp 172.23.96.223:59288->172.23.96.223:9100: use of closed network connection. Kill Pipe.
      2020-01-27T03:12:04.432-08:00 [Error] PeerPipe.doRecieve() : ecounter error when received mesasage from Peer 172.23.96.223:59288.  Error = EOF. Kill Pipe.
      2020-01-27T03:12:06.958-08:00 [Error] PeerPipe.doRecieve() : ecounter error when received mesasage from Peer 172.23.96.216:43732.  Error = EOF. Kill Pipe.
      2020-01-27T03:13:06.827-08:00 [Error] DATP[->dataport ":9103"] Accept() Error: accept tcp [::]:9103: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] receiving packet: read tcp 172.23.96.223:9103->172.23.96.122:55304: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] DATP[->dataport ":9103"] worker "172.23.96.122:55304" exit: read tcp 172.23.96.223:9103->172.23.96.122:55304: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] receiving packet: read tcp 172.23.96.223:9103->172.23.96.190:34134: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] DATP[->dataport ":9103"] worker "172.23.96.190:34134" exit: read tcp 172.23.96.223:9103->172.23.96.190:34134: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] receiving packet: read tcp 172.23.96.223:9103->172.23.96.191:33560: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] DATP[->dataport ":9103"] worker "172.23.96.191:33560" exit: read tcp 172.23.96.223:9103->172.23.96.191:33560: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] receiving packet: read tcp 172.23.96.223:9103->172.23.97.74:46710: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] DATP[->dataport ":9103"] worker "172.23.97.74:46710" exit: read tcp 172.23.96.223:9103->172.23.97.74:46710: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] receiving packet: read tcp 172.23.96.223:9103->172.23.96.254:36816: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] DATP[->dataport ":9103"] worker "172.23.96.254:36816" exit: read tcp 172.23.96.223:9103->172.23.96.254:36816: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] receiving packet: read tcp 172.23.96.223:9103->172.23.96.18:53666: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] DATP[->dataport ":9103"] worker "172.23.96.18:53666" exit: read tcp 172.23.96.223:9103->172.23.96.18:53666: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] receiving packet: read tcp 172.23.96.223:9103->172.23.96.183:44420: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] DATP[->dataport ":9103"] worker "172.23.96.183:44420" exit: read tcp 172.23.96.223:9103->172.23.96.183:44420: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] receiving packet: read tcp 172.23.96.223:9103->172.23.96.207:45080: use of closed network connection
      2020-01-27T03:13:06.828-08:00 [Error] DATP[->dataport ":9103"] worker "172.23.96.207:45080" exit: read tcp 172.23.96.223:9103->172.23.96.207:45080: use of closed network connection
      2020-01-27T03:13:06.829-08:00 [Error] receiving packet: read tcp 172.23.96.223:9103->172.23.96.209:39492: use of closed network connection
      2020-01-27T03:13:06.829-08:00 [Error] DATP[->dataport ":9103"] worker "172.23.96.209:39492" exit: read tcp 172.23.96.223:9103->172.23.96.209:39492: use of closed network connection
      2020-01-27T03:13:06.829-08:00 [Error] receiving packet: read tcp 172.23.96.223:9103->172.23.96.210:47340: use of closed network connection
      2020-01-27T03:13:06.829-08:00 [Error] DATP[->dataport ":9103"] worker "172.23.96.210:47340" exit: read tcp 172.23.96.223:9103->172.23.96.210:47340: use of closed network connection
      2020-01-27T03:13:06.829-08:00 [Error] receiving packet: read tcp 172.23.96.223:9103->172.23.96.212:41140: use of closed network connection
      2020-01-27T03:13:06.829-08:00 [Error] DATP[->dataport ":9103"] worker "172.23.96.212:41140" exit: read tcp 172.23.96.223:9103->172.23.96.212:41140: use of closed network connection
      2020-01-27T03:13:07.818-08:00 [Error] KVSender::sendDelBucketsRequest Unexpected Error During Del Buckets Request Projector 172.23.96.18:9999 Topic INIT_STREAM_TOPIC_25becdd68259f2ef88ec15e3bb31eeed Buckets [CUSTOMER]. Err genServer.closed
      2020-01-27T03:13:07.818-08:00 [Error] KVSender::deleteBucketsFromStream INIT_STREAM CUSTOMER Error Received genServer.closed from 172.23.96.18:9999
      2020-01-27T03:13:07.845-08:00 [Error] KVSender::sendDelBucketsRequest Unexpected Error During Del Buckets Request Projector 172.23.96.190:9999 Topic INIT_STREAM_TOPIC_25becdd68259f2ef88ec15e3bb31eeed Buckets [CUSTOMER]. Err genServer.closed
      2020-01-27T03:13:07.845-08:00 [Error] KVSender::deleteBucketsFromStream INIT_STREAM CUSTOMER Error Received genServer.closed from 172.23.96.190:9999
      2020-01-27T03:13:07.861-08:00 [Error] KVSender::sendDelBucketsRequest Unexpected Error During Del Buckets Request Projector 172.23.96.191:9999 Topic INIT_STREAM_TOPIC_25becdd68259f2ef88ec15e3bb31eeed Buckets [CUSTOMER]. Err genServer.closed
      2020-01-27T03:13:07.861-08:00 [Error] KVSender::deleteBucketsFromStream INIT_STREAM CUSTOMER Error Received genServer.closed from 172.23.96.191:9999
      2020-01-27T03:13:07.908-08:00 [Error] KVSender::sendDelBucketsRequest Unexpected Error During Del Buckets Request Projector 172.23.96.210:9999 Topic INIT_STREAM_TOPIC_25becdd68259f2ef88ec15e3bb31eeed Buckets [CUSTOMER]. Err genServer.closed
      2020-01-27T03:13:07.908-08:00 [Error] KVSender::deleteBucketsFromStream INIT_STREAM CUSTOMER Error Received genServer.closed from 172.23.96.210:9999
      2020-01-27T03:13:07.952-08:00 [Error] KVSender::sendDelBucketsRequest Unexpected Error During Del Buckets Request Projector 172.23.97.74:9999 Topic INIT_STREAM_TOPIC_25becdd68259f2ef88ec15e3bb31eeed Buckets [CUSTOMER]. Err genServer.closed
      2020-01-27T03:13:07.952-08:00 [Error] KVSender::deleteBucketsFromStream INIT_STREAM CUSTOMER Error Received genServer.closed from 172.23.97.74:9999
      

        Attachments

          Issue Links

          For Gerrit Dashboard: MB-37689
          # Subject Branch Project Status CR V

            Activity

            Hide
            mihir.kamdar Mihir Kamdar added a comment -

            Varun Velamuri following are the set of logs which were collected from the cluster at around 2020-02-29 21:03. I have logs from before this time stamp as well, let me know if you need them.

            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.122.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.18.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.183.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.190.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.191.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.207.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.209.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.210.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.212.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.214.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.216.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.219.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.220.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.221.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.223.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.254.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.48.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.97.148.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.97.149.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.97.150.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.97.151.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.97.74.zip

            Show
            mihir.kamdar Mihir Kamdar added a comment - Varun Velamuri following are the set of logs which were collected from the cluster at around 2020-02-29 21:03. I have logs from before this time stamp as well, let me know if you need them. url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.122.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.18.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.183.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.190.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.191.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.207.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.209.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.210.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.212.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.214.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.216.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.219.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.220.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.221.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.223.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.254.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.96.48.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.97.148.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.97.149.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.97.150.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.97.151.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1583039009/collectinfo-2020-03-01T050329-ns_1%40172.23.97.74.zip
            Hide
            deepkaran.salooja Deepkaran Salooja added a comment -

            The issue seen here is the same as tracked by MB-36980.

            4 of the buckets hit into the flush getting aborted:

            2020-02-29T00:46:17.365-08:00 [Warn] Indexer::MutationQueue Dequeue Aborted For Seqno 35199 Bucket ITEM Vbucket 122. Last Dequeue 35175 Head Seqno 35202.
            2020-02-29T00:48:46.080-08:00 [Warn] Indexer::MutationQueue Dequeue Aborted For Seqno 78832 Bucket default Vbucket 600. Last Dequeue 78143 Head Seqno 79236.
            2020-02-29T00:47:54.752-08:00 [Warn] Indexer::MutationQueue Dequeue Aborted For Seqno 147 Bucket ORDERS Vbucket 407. Last Dequeue 0 Head Seqno 160.
            2020-02-29T00:48:47.763-08:00 [Warn] Indexer::MutationQueue Dequeue Aborted For Seqno 5271 Bucket HISTORY Vbucket 95. Last Dequeue 5220 Head Seqno 5272.
            

            The same 4 buckets are reported by flushMonitor as being stuck:

            2020-02-29T13:12:29.004-08:00 [Warn] Timekeeper::flushMonitor Waiting for flush to finish for 44690 seconds. Stream MAINT_STREAM Bucket HISTORY.
            2020-02-29T13:12:31.243-08:00 [Warn] Timekeeper::flushMonitor Waiting for flush to finish for 44720 seconds. Stream MAINT_STREAM Bucket default.
            2020-02-29T13:12:31.429-08:00 [Warn] Timekeeper::flushMonitor Waiting for flush to finish for 44720 seconds. Stream MAINT_STREAM Bucket ORDERS.
            2020-02-29T13:12:33.595-08:00 [Warn] Timekeeper::flushMonitor Waiting for flush to finish for 44880 seconds. Stream MAINT_STREAM Bucket ITEM.
            

            Mihir Kamdar, have there been any changes done to the system test for 7.0 as compared to 6.5.0? Has this issue been seen in any 6.5.x system test run?

            Show
            deepkaran.salooja Deepkaran Salooja added a comment - The issue seen here is the same as tracked by MB-36980 . 4 of the buckets hit into the flush getting aborted: 2020-02-29T00:46:17.365-08:00 [Warn] Indexer::MutationQueue Dequeue Aborted For Seqno 35199 Bucket ITEM Vbucket 122. Last Dequeue 35175 Head Seqno 35202. 2020-02-29T00:48:46.080-08:00 [Warn] Indexer::MutationQueue Dequeue Aborted For Seqno 78832 Bucket default Vbucket 600. Last Dequeue 78143 Head Seqno 79236. 2020-02-29T00:47:54.752-08:00 [Warn] Indexer::MutationQueue Dequeue Aborted For Seqno 147 Bucket ORDERS Vbucket 407. Last Dequeue 0 Head Seqno 160. 2020-02-29T00:48:47.763-08:00 [Warn] Indexer::MutationQueue Dequeue Aborted For Seqno 5271 Bucket HISTORY Vbucket 95. Last Dequeue 5220 Head Seqno 5272. The same 4 buckets are reported by flushMonitor as being stuck: 2020-02-29T13:12:29.004-08:00 [Warn] Timekeeper::flushMonitor Waiting for flush to finish for 44690 seconds. Stream MAINT_STREAM Bucket HISTORY. 2020-02-29T13:12:31.243-08:00 [Warn] Timekeeper::flushMonitor Waiting for flush to finish for 44720 seconds. Stream MAINT_STREAM Bucket default. 2020-02-29T13:12:31.429-08:00 [Warn] Timekeeper::flushMonitor Waiting for flush to finish for 44720 seconds. Stream MAINT_STREAM Bucket ORDERS. 2020-02-29T13:12:33.595-08:00 [Warn] Timekeeper::flushMonitor Waiting for flush to finish for 44880 seconds. Stream MAINT_STREAM Bucket ITEM. Mihir Kamdar , have there been any changes done to the system test for 7.0 as compared to 6.5.0? Has this issue been seen in any 6.5.x system test run?
            Hide
            mihir.kamdar Mihir Kamdar added a comment -

            Deepkaran Salooja The test hasnt been changed since 6.5.0. This issue wasn't seen in 6.5.0 or 6.5.1 runs

            Show
            mihir.kamdar Mihir Kamdar added a comment - Deepkaran Salooja The test hasnt been changed since 6.5.0. This issue wasn't seen in 6.5.0 or 6.5.1 runs
            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-1527 contains indexing commit 8d8a710 with commit message:
            MB-37689 use sessionId in sync message

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-1527 contains indexing commit 8d8a710 with commit message: MB-37689 use sessionId in sync message
            Hide
            mihir.kamdar Mihir Kamdar added a comment -

            Verified on 7.0.0-3625

            Show
            mihir.kamdar Mihir Kamdar added a comment - Verified on 7.0.0-3625

              People

              Assignee:
              deepkaran.salooja Deepkaran Salooja
              Reporter:
              mihir.kamdar Mihir Kamdar
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Gerrit Reviews

                  There are no open Gerrit changes

                    PagerDuty