Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-49004

Performance : Deployment hung

    XMLWordPrintable

Details

    Description

      Build: 7.1.0 build 1517 , not seen on 1472

      Test job: http://perf.jenkins.couchbase.com/job/themis/13189/console

      Error in logs

      2021-10-19T03:09:05.825-07:00 [Warn] client::Serve [worker_perf-test1_16:/tmp/127.0.0.1:8091_16_1711403943.sock:48944] Exiting c++ worker with error: exit status 1
      2021-10-19T03:09:05.825-07:00 [Info] client::Serve [worker_perf-test1_16:/tmp/127.0.0.1:8091_16_1711403943.sock:48944] After worker exit, stopCalled: false
      2021-10-19T03:09:05.825-07:00 [Info] eventing-consumer [worker_perf-test1_7:/tmp/127.0.0.1:8091_7_1711403943.sock:48984] [lcb,confmon L:183 I:2818644853] Setting initial configuration. Received via CCCP (bucket="eventing-bucket-1", rev=1:139, address="172.23.96.16")
      2021-10-19T03:09:05.825-07:00 [Info] client::Serve [worker_perf-test1_16:/tmp/127.0.0.1:8091_16_1711403943.sock:48944] Informing Eventing.Producer to stop Eventing.Consumer instance: consumer => app: perf-test1 name: worker_perf-test1_16 tcpPort: /tmp/127.0.0.1:8091_16_1711403943.sock ospid: 48944 dcpEventProcessed:  v8EventProcessed: log_level:1 thr_map:1 thr_count:1 v8_init:1 v8_load:1 vb_map:1
      2021-10-19T03:09:05.825-07:00 [Info] Producer::KillAndRespawnEventingConsumer [perf-test1:11] IndexToPurge: 1 ConsumerIndex: 16 Shutting down Eventing.Consumer instance: consumer => app: perf-test1 name: worker_perf-test1_16 tcpPort: /tmp/127.0.0.1:8091_16_1711403943.sock ospid: 48944 dcpEventProcessed:  v8EventProcessed: thr_map:1 thr_count:1 v8_init:1 v8_load:1 vb_map:1 log_level:1
      2021-10-19T03:09:05.825-07:00 [Info] Supervisor::Remove Token id: {8589966985}
      2021-10-19T03:09:05.825-07:00 [Info] Producer::KillAndRespawnEventingConsumer [perf-test1:11] IndexToPurge: 1 ConsumerIndex: 16 Closing down listener handles
      2021-10-19T03:09:05.825-07:00 [Info] DCPT[eventing:FkUq1baH-295655:{eventing:FkUq1baH-295654:worker_perf-test1_21_172.23.96.16:11210_172.23.97.177:8096}/0] ##abcd feed started ... collectionAware: true
      2021-10-19T03:09:05.825-07:00 [Info] Consumer::Stop [worker_perf-test1_16:/tmp/127.0.0.1:8091_16_1711403943.sock:48944] Gracefully shutting down consumer routine
      2021-10-19T03:09:05.825-07:00 [Info] Consumer::RemoveSupervisorToken [worker_perf-test1_16:/tmp/127.0.0.1:8091_16_1711403943.sock:48944] Removing supervisor token
      2021-10-19T03:09:05.825-07:00 [Info] Supervisor::Remove Token id: {139139760521216}
      2021-10-19T03:09:05.825-07:00 [Info] Consumer::Stop [worker_perf-test1_16:/tmp/127.0.0.1:8091_16_1711403943.sock:48944] Requested to remove supervision of eventing-consumer
      2021-10-19T03:09:05.825-07:00 [Error] Producer::handleV8Consumer [perf-test1:11] Accept failed in main loop, err: accept unix /tmp/127.0.0.1:8091_16_1711403943.sock: use of closed network connection
      2021-10-19T03:09:05.825-07:00 [Info] Consumer::Stop [worker_perf-test1_16:/tmp/127.0.0.1:8091_16_1711403943.sock:48944] Stopped checkpoint, restart vb dcp stream and stats tickers
      2021-10-19T03:09:05.825-07:00 [Info] Consumer::sendMessageLoop [worker_perf-test1_16:/tmp/127.0.0.1:8091_16_1711403943.sock:48944] Exiting send message routine
      2021-10-19T03:09:05.825-07:00 [Info] Consumer::Stop [worker_perf-test1_16:/tmp/127.0.0.1:8091_16_1711403943.sock:48944] Sent signal over channel to stop timer routines
      2021-10-19T03:09:05.825-07:00 [Info] Consumer::Stop [worker_perf-test1_16:/tmp/127.0.0.1:8091_16_1711403943.sock:48944] Sent signal to stop cpp worker stat collection routine
      2021-10-19T03:09:05.825-07:00 [Info] Consumer::Stop [worker_perf-test1_16:/tmp/127.0.0.1:8091_16_1711403943.sock:48944] Sent signal over channel to stop checkpointing routine
      2021-10-19T03:09:05.825-07:00 [Error] Producer::handleV8Consumer [perf-test1:11] Accept failed in feedback loop, err: accept unix /tmp/f_127.0.0.1:8091_16_1711403943.sock: use of closed network connection
      2021-10-19T03:09:05.825-07:00 [Info] Producer::KillAndRespawnEventingConsumer [perf-test1:11] ConsumerIndex: 16 respawning the Eventing.Consumer instance
      2021-10-19T03:09:05.825-07:00 [Info] DCPT[eventing:FkUq1baH-295570:{eventing:FkUq1baH-295569:worker_perf-test1_16_172.23.96.16:11210_172.23.97.177:8096}/0] doReceive(): connection closed
      2021-10-19T03:09:05.825-07:00 [Info] client::Stop [worker_perf-test1_16:/tmp/127.0.0.1:8091_16_1711403943.sock:48944] Exiting c++ worker
      2021-10-19T03:09:05.825-07:00 [Error] client::Stop [worker_perf-test1_16:/tmp/127.0.0.1:8091_16_1711403943.sock:48944] Unable to notify  c++ worker, err: os: process already finished
      2021-10-19T03:09:05.825-07:00 [Info] DCPT[eventing:FkUq1baH-295570:{eventing:FkUq1baH-295569:worker_perf-test1_16_172.23.96.16:11210_172.23.97.177:8096}/0] ##abcd ... stopped 

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            Vikas Chaudhary Is it always reproducible or only once or sometimes? I ask because we do not see this issue in other tests.

            jeelan.poola Jeelan Poola added a comment - Vikas Chaudhary Is it always reproducible or only once or sometimes? I ask because we do not see this issue in other tests.
            sujay.gad Sujay Gad added a comment - - edited

            Jeelan Poola I am seeing this issue continuously in functional as well as system tests. Latest component system test run for eventing was blocked due to this issue - MB-49002.

            sujay.gad Sujay Gad added a comment - - edited Jeelan Poola I am seeing this issue continuously in functional as well as system tests. Latest component system test run for eventing was blocked due to this issue - MB-49002 .

            Jeelan Poola Its happening with all the tests.

            vikas.chaudhary Vikas Chaudhary added a comment - Jeelan Poola  Its happening with all the tests.

            Vikas ChaudharySujay Gad Thank you! Ankit Prabhu is looking into resolving it asap.

            jeelan.poola Jeelan Poola added a comment - Vikas Chaudhary Sujay Gad Thank you! Ankit Prabhu is looking into resolving it asap.

            Looks like AddressSanitiser aborting the eventing consumer process. Looking at the message some code trying to access 4 bytes and which is not in the accessible region.

            ==eventing-consumer==27554==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fbaff2edcc0 at pc 0x7fbb062b321e bp 0x7fbaff2ed250 sp 0x7fbaff2ed248
            READ of size 4 at 0x7fbaff2edcc0 thread T3
            

            Looking at the shadow memory

            Shadow bytes around the buggy address:
              0x0ff7dfe55b40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
              0x0ff7dfe55b50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
              0x0ff7dfe55b60: 00 00 00 00 00 00 00 00 00 00 00 00 00 f3 f3 f3
              0x0ff7dfe55b70: f3 f3 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
              0x0ff7dfe55b80: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 01 f2 01 f2
            =>0x0ff7dfe55b90: 01 f2 01 f2 01 f2 01 f2[01]f2 01 f2 01 f2 01 f2
              0x0ff7dfe55ba0: 01 f2 01 f2 01 f2 f8 f2 f8 f2 04 f2 f8 f2 f8 f2
              0x0ff7dfe55bb0: 04 f2 f8 f2 f8 f2 04 f2 04 00 00 f2 00 00 00 f2
              0x0ff7dfe55bc0: 00 00 f8 f2 00 00 00 f2 f2 f2 00 00 00 00 f2 f2
              0x0ff7dfe55bd0: f2 f2 00 00 00 00 f2 f2 f2 f2 f8 f8 f8 f8 f2 f2
              0x0ff7dfe55be0: f2 f2 f8 f8 f8 f8 f2 f2 f2 f2 00 00 00 00 f2 f2
            Shadow byte legend (one shadow byte represents 8 application bytes):
            

            There is only 1 byte used out of 8 bytes and program tried to read 4 bytes which is in red zone.

            Looking at object its "enable_detailed_err_codes" which is a bool variable which has 1 byte of memory.
            https://github.com/couchbase/eventing/blob/ca57e4aa712406824d8f2cf2fc23b77ca2df64e0/features/src/bucket.cc#L163
            https://github.com/couchbase/eventing-ee/blob/44cef1c76cab866bca4cfc0d942fdb07f8597380/features/src/timer_store.cc#L377

            Looking at the lcb code "detailed_neterr" is an unsigned int and lcb tried to convert it to int.
            https://github.com/couchbase/libcouchbase/blob/b3294247961261db72fe6f0d413b8f972a59047d/src/cntl.cc#L202

            AddressSanitiser detected it since its converting 1 byte to 4 bytes and aborted the process.

            This looks like regressed in build 1513
            http://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.1.0&fromBuild=1512&toVersion=7.1.0&toBuild=1513&f_asterixdb=off&f_backup=off&f_build=on&f_cbas-core=off&f_cbbs=on&f_cbgt=off&f_chronicle=on&f_couchbase-cli=off&f_couchdb=off&f_eventing=on&f_eventing-ee=off&f_goxdcr=off&f_indexing=off&f_kv_engine=off&f_libcouchbase=off&f_magma=on&f_n1fty=on&f_ns_server=off&f_product-metadata=off&f_query=off&f_testrunner=off&f_tlm=on&f_voltron=on

            Looks like http://review.couchbase.org/c/voltron/+/162538 and http://review.couchbase.org/c/build/+/162540 are making changes to packaging of the centos builds.
            Chris Hillery, Could you please take a look why we are including address sanitiser in centos builds?
            This can effect the performance as well
             

            ankit.prabhu Ankit Prabhu added a comment - Looks like AddressSanitiser aborting the eventing consumer process. Looking at the message some code trying to access 4 bytes and which is not in the accessible region. ==eventing-consumer==27554==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fbaff2edcc0 at pc 0x7fbb062b321e bp 0x7fbaff2ed250 sp 0x7fbaff2ed248 READ of size 4 at 0x7fbaff2edcc0 thread T3 Looking at the shadow memory Shadow bytes around the buggy address: 0x0ff7dfe55b40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0ff7dfe55b50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0ff7dfe55b60: 00 00 00 00 00 00 00 00 00 00 00 00 00 f3 f3 f3 0x0ff7dfe55b70: f3 f3 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 0x0ff7dfe55b80: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 01 f2 01 f2 =>0x0ff7dfe55b90: 01 f2 01 f2 01 f2 01 f2[01]f2 01 f2 01 f2 01 f2 0x0ff7dfe55ba0: 01 f2 01 f2 01 f2 f8 f2 f8 f2 04 f2 f8 f2 f8 f2 0x0ff7dfe55bb0: 04 f2 f8 f2 f8 f2 04 f2 04 00 00 f2 00 00 00 f2 0x0ff7dfe55bc0: 00 00 f8 f2 00 00 00 f2 f2 f2 00 00 00 00 f2 f2 0x0ff7dfe55bd0: f2 f2 00 00 00 00 f2 f2 f2 f2 f8 f8 f8 f8 f2 f2 0x0ff7dfe55be0: f2 f2 f8 f8 f8 f8 f2 f2 f2 f2 00 00 00 00 f2 f2 Shadow byte legend (one shadow byte represents 8 application bytes): There is only 1 byte used out of 8 bytes and program tried to read 4 bytes which is in red zone. Looking at object its "enable_detailed_err_codes" which is a bool variable which has 1 byte of memory. https://github.com/couchbase/eventing/blob/ca57e4aa712406824d8f2cf2fc23b77ca2df64e0/features/src/bucket.cc#L163 https://github.com/couchbase/eventing-ee/blob/44cef1c76cab866bca4cfc0d942fdb07f8597380/features/src/timer_store.cc#L377 Looking at the lcb code "detailed_neterr" is an unsigned int and lcb tried to convert it to int. https://github.com/couchbase/libcouchbase/blob/b3294247961261db72fe6f0d413b8f972a59047d/src/cntl.cc#L202 AddressSanitiser detected it since its converting 1 byte to 4 bytes and aborted the process. This looks like regressed in build 1513 http://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.1.0&fromBuild=1512&toVersion=7.1.0&toBuild=1513&f_asterixdb=off&f_backup=off&f_build=on&f_cbas-core=off&f_cbbs=on&f_cbgt=off&f_chronicle=on&f_couchbase-cli=off&f_couchdb=off&f_eventing=on&f_eventing-ee=off&f_goxdcr=off&f_indexing=off&f_kv_engine=off&f_libcouchbase=off&f_magma=on&f_n1fty=on&f_ns_server=off&f_product-metadata=off&f_query=off&f_testrunner=off&f_tlm=on&f_voltron=on Looks like http://review.couchbase.org/c/voltron/+/162538 and http://review.couchbase.org/c/build/+/162540 are making changes to packaging of the centos builds. Chris Hillery , Could you please take a look why we are including address sanitiser in centos builds? This can effect the performance as well  

            Another duplicate of MB-48996. Resolved in builds starting with 7.1.0-1528.

            ceej Chris Hillery added a comment - Another duplicate of MB-48996 . Resolved in builds starting with 7.1.0-1528.

            Build couchbase-server-7.1.0-1580 contains eventing commit 4913161 with commit message:
            MB-49004: Pass correct type to enable detailed error codes

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1580 contains eventing commit 4913161 with commit message: MB-49004 : Pass correct type to enable detailed error codes

            Build couchbase-server-7.1.0-1580 contains eventing-ee commit b8fcd22 with commit message:
            MB-49004: Pass correct type to enable detailed error

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1580 contains eventing-ee commit b8fcd22 with commit message: MB-49004 : Pass correct type to enable detailed error

            People

              vikas.chaudhary Vikas Chaudhary
              vikas.chaudhary Vikas Chaudhary
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty