Couchbase Server / MB-46221

memcached crash in CouchKVStore::commit


Details

    • Untriaged
    • 1
    • Unknown
    • KV-Engine CC Final Sprint

    Description

      In the 4B rebalance runs, auto-failover was triggered during the load phase. We increased the failover threshold from 5 seconds to 30 seconds for the second run (rebalance-in), but that run still hit auto-failover. Neither run hit the issue with the 6.6.2 build.

       

      Rebalance-out (min), 5 -> 4, 4B x 1KB, 15K ops/sec (90/10 R/W), 10% cache miss rate

      Build: 7.0.0-5071

      Job: http://perf.jenkins.couchbase.com/job/titan-reb/2119/ 

      [user:info,2021-05-05T12:00:57.796-07:00,ns_1@172.23.96.100:<0.5861.0>:auto_failover:log_failover_success:544]Node ('ns_1@172.23.96.100') was automatically failed over. Reason: The data service did not respond for the duration of the auto-failover threshold. Either none of the buckets have warmed up or there is an issue with the data service.

       

      Rebalance-in (min), 4 -> 5, 4B x 1KB, 15K ops/sec (90/10 R/W), 10% cache miss rate

      Build: 7.0.0-5071

      Job: http://perf.jenkins.couchbase.com/job/titan-reb/2124/ 

      [user:info,2021-05-07T20:39:41.757-07:00,ns_1@172.23.96.100:<0.6131.0>:auto_failover:log_failover_success:544]Node ('ns_1@172.23.96.102') was automatically failed over. Reason: The data service did not respond for the duration of the auto-failover threshold. Either none of the buckets have warmed up or there is an issue with the data service.

      Attachments

        1. 101.bt_txt
          47 kB
        2. 103.txt
          3 kB


          Activity

            drigby Dave Rigby added a comment -

            Bo-Chun Wang that’s very helpful - thanks.

            If you aren’t already, it would be great to now schedule an ASan run - from Jim’s comments we appear to have some form of memory corruption; the minidumps might allow us to track that down but address-sanitizer should pinpoint exactly where the issue is (assuming it is memory corruption due to a bug in Couchbase code).

            jwalker Jim Walker added a comment -

            Thanks Bo-Chun Wang, ASan next.

            For reference https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-05-12T194355-ns_1%40172.23.96.103.zip again crashed in jemalloc whilst doing a free. Very much the same backtrace as seen in the earlier run on node 101 -> https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-05-12T162058-ns_1%40172.23.96.101.zip

            jwalker Jim Walker added a comment -

            Double free in couchstore::replay

            Issue occurs when a size threshold is exceeded and we take a different path.

            docinfo allocated in here:
            https://github.com/couchbase/couchstore/blob/master/src/couch_db.cc#L1060
            https://github.com/couchbase/couchstore/blob/master/src/couch_db.cc#L773

            spool 'saves' docinfo here:
            https://github.com/couchbase/couchstore/blob/3c944de8e81a2c335cf6e8b55d44fc88b6b1ba77/src/db_compact.cc#L551

            may call flush, if a threshold is hit here:
            https://github.com/couchbase/couchstore/blob/3c944de8e81a2c335cf6e8b55d44fc88b6b1ba77/src/db_compact.cc#L554-L558

            flush is freeing docinfo:
            https://github.com/couchbase/couchstore/blob/3c944de8e81a2c335cf6e8b55d44fc88b6b1ba77/src/db_compact.cc#L575-L577

            but so is the outer caller here:
            https://github.com/couchbase/couchstore/blob/master/src/couch_db.cc#L1100
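
For readers unfamiliar with this class of bug, the pattern described above (a buffer frees an object on a threshold-triggered flush while the caller also frees it) can be illustrated with a small sketch. This is not the couchstore code and not necessarily how commit b7ce6cf fixes it. `DocInfo`, `Spool`, `replay`, and the counters below are hypothetical names chosen for illustration. The sketch shows one common way to rule out a double free: hand ownership to the buffer via `std::unique_ptr`, so that exactly one party can release each object regardless of whether the flush path is taken.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical counters used only to check that every allocation is
// released exactly once.
static int g_live = 0;
static int g_destroyed = 0;

// Hypothetical stand-in for a per-document metadata record.
struct DocInfo {
    std::size_t size;
    explicit DocInfo(std::size_t s) : size(s) { ++g_live; }
    ~DocInfo() { --g_live; ++g_destroyed; }
};

// Hypothetical spool buffer. It owns every DocInfo it saves and releases
// them when flushed, which may happen inside save() once a size threshold
// is reached.
class Spool {
public:
    explicit Spool(std::size_t threshold) : threshold_(threshold) {}

    // Ownership moves into the spool, so the caller has nothing to free.
    void save(std::unique_ptr<DocInfo> info) {
        buffered_ += info->size;
        pending_.push_back(std::move(info));
        if (buffered_ >= threshold_) {
            flush();
        }
    }

    void flush() {
        pending_.clear(); // destroys each buffered DocInfo exactly once
        buffered_ = 0;
        ++flushes_;
    }

    int flushes() const { return flushes_; }

private:
    std::size_t threshold_;
    std::size_t buffered_ = 0;
    int flushes_ = 0;
    std::vector<std::unique_ptr<DocInfo>> pending_;
};

// Hypothetical replay loop: pushes `count` documents of `docSize` bytes
// through the spool and returns how many threshold-triggered flushes ran.
int replay(int count, std::size_t docSize, std::size_t threshold) {
    Spool spool(threshold);
    for (int i = 0; i < count; ++i) {
        std::unique_ptr<DocInfo> info(new DocInfo(docSize));
        spool.save(std::move(info));
        // `info` is now null: the outer caller cannot free it a second
        // time, whether or not save() flushed.
    }
    int thresholdFlushes = spool.flushes();
    spool.flush(); // release anything still buffered
    return thresholdFlushes;
}
```

Calling `replay(10, 1024, 4096)` exercises both paths (documents freed by a threshold flush and documents freed by the final flush), and the counters can then be checked to confirm that each `DocInfo` was destroyed once.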

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-5176 contains couchstore commit b7ce6cf with commit message:
            MB-46221: Fix double-free in replay

            jwalker Jim Walker added a comment - - edited

            The toy build with the fix passed the 4B test.

            7.0.0-5176 is proceeding well: 3 hours in and fine. Previously it would have failed by now.
            http://perf.jenkins.couchbase.com/job/titan-reb/2153/

            Closing out.

            People

              bo-chun.wang Bo-Chun Wang
              bo-chun.wang Bo-Chun Wang