Couchbase Server / MB-6638

beam.smp (erl_bestfit_alloc.c:435) segmentation fault when rebalancing a new node into a cluster with 2 buckets and 1 design doc per bucket

    Details

      Description

      Cluster information:

      • Loaded 9 million items into each bucket
      • Cluster has 2 buckets: default (3GB) and saslbucket (3GB)
      • Each bucket has one design doc with 2 views (default: d1; saslbucket: d11)
      • Cluster of 10 nodes, all running Couchbase Server 2.0.0-1708:

      10.3.121.13
      10.3.121.14
      10.3.121.15
      10.3.121.16
      10.3.121.17
      10.3.121.20
      10.3.121.22
      10.3.121.24
      10.3.121.25
      10.3.121.23

      • Data path: /data
      • View path: /data

      Performed rebalance in, rebalance out, and failover operations.
      The last failure was a rebalance adding 2 nodes, .23 and .24, into the cluster.
      During the rebalance, node .23 suddenly went down and the rebalance failed, as expected.
      Later, memcached on node .13 was killed and restarted.

      [ns_server:debug,2012-09-12T17:06:05.705,ns_1@10.3.121.13:<0.28110.33>:ns_pubsub:do_subscribe_link:134]Parent process of subscription

      {ns_config_events,<0.28109.33>} exited with reason {timeout,
      {gen_server,
      call,
      [ns_node_disco,
      nodes_wanted]}}
      [ns_server:debug,2012-09-12T17:06:05.712,ns_1@10.3.121.13:<0.28166.33>:ns_pubsub:do_subscribe_link:134]Parent process of subscription {ns_config_events,<0.28165.33>} exited with reason shutdown
      [ns_server:error,2012-09-12T17:06:05.718,ns_1@10.3.121.13:<0.29417.33>:ns_memcached:verify_report_long_call:274]call {stats,<<>>} took too long: 5881903 us
      [ns_server:debug,2012-09-12T17:06:05.720,ns_1@10.3.121.13:<0.28132.33>:ns_pubsub:do_subscribe_link:134]Parent process of subscription {ns_node_disco_events,<0.28130.33>} exited with reason shutdown
      [ns_server:debug,2012-09-12T17:06:05.732,ns_1@10.3.121.13:<0.28110.33>:ns_pubsub:do_subscribe_link:149]Deleting {ns_config_events,<0.28109.33>}

      event handler: ok
      [ns_server:debug,2012-09-12T17:06:05.733,ns_1@10.3.121.13:<0.28166.33>:ns_pubsub:do_subscribe_link:149]Deleting

      {ns_config_events,<0.28165.33>}

      event handler: ok
      [ns_server:debug,2012-09-12T17:06:05.739,ns_1@10.3.121.13:<0.5306.465>:compaction_daemon:do_chain_compactors:500]Got exit signal from parent:

      {'EXIT',<0.4974.465>,shutdown}

      [ns_server:debug,2012-09-12T17:06:05.740,ns_1@10.3.121.13:<0.28132.33>:ns_pubsub:do_subscribe_link:149]Deleting

      {ns_node_disco_events,<0.28130.33>}

      event handler: ok
      [ns_server:debug,2012-09-12T17:06:05.740,ns_1@10.3.121.13:ns_node_disco:ns_node_disco:init:103]Initting ns_node_disco with ['ns_1@10.3.121.15','ns_1@10.3.121.16',
      'ns_1@10.3.121.17','ns_1@10.3.121.22',
      'ns_1@10.3.121.20','ns_1@10.3.121.14',
      'ns_1@10.3.121.25','ns_1@10.3.121.24']
      [ns_server:debug,2012-09-12T17:06:05.741,ns_1@10.3.121.13:<0.28129.33>:ns_pubsub:do_subscribe_link:134]Parent process of subscription

      {ns_config_events,<0.28128.33>} exited with reason shutdown
      [ns_server:debug,2012-09-12T17:06:05.743,ns_1@10.3.121.13:<0.19269.466>:janitor_agent:new_style_query_vbucket_states_loop:111]Exception from query_vbucket_states of "saslbucket":'ns_1@10.3.121.23'
      {'EXIT',{{nodedown,'ns_1@10.3.121.23'},
      {gen_server,call,
      [{'janitor_agent-saslbucket','ns_1@10.3.121.23'},
      query_vbucket_states,infinity]}}}
      [ns_server:debug,2012-09-12T17:06:05.745,ns_1@10.3.121.13:<0.5309.465>:compaction_daemon:do_chain_compactors:500]Got exit signal from parent: {'EXIT',<0.5306.465>,shutdown}
      [ns_server:debug,2012-09-12T17:06:05.746,ns_1@10.3.121.13:<0.28126.33>:ns_pubsub:do_subscribe_link:134]Parent process of subscription {ns_config_events,<0.28125.33>} exited with reason shutdown
      [ns_server:debug,2012-09-12T17:06:05.761,ns_1@10.3.121.13:<0.19269.466>:janitor_agent:new_style_query_vbucket_states_loop_next_step:116]Waiting for "saslbucket" on 'ns_1@10.3.121.23'
      [ns_server:debug,2012-09-12T17:06:05.897,ns_1@10.3.121.13:<0.28129.33>:ns_pubsub:do_subscribe_link:149]Deleting {ns_config_events,<0.28128.33>}

      event handler: ok
      [ns_server:debug,2012-09-12T17:06:05.898,ns_1@10.3.121.13:<0.14047.466>:compaction_daemon:do_spawn_view_index_compactor:742]Got exit signal from parent:

      {'EXIT',<0.5309.465>,shutdown}

      [ns_server:debug,2012-09-12T17:06:05.903,ns_1@10.3.121.13:ns_cookie_manager:ns_cookie_manager:do_cookie_sync:115]ns_cookie_manager do_cookie_sync
      [ns_server:debug,2012-09-12T17:06:05.903,ns_1@10.3.121.13:<0.28126.33>:ns_pubsub:do_subscribe_link:149]Deleting

      {ns_config_events,<0.28125.33>}

      event handler: ok
      [couchdb:info,2012-09-12T17:06:05.906,ns_1@10.3.121.13:<0.5719.436>:couch_log:info:39]Set view `default`, replica group `_design/d1`, canceling compaction (pid <0.18999.466>)
      [ns_server:debug,2012-09-12T17:06:06.000,ns_1@10.3.121.13:ns_cookie_manager:ns_cookie_manager:do_cookie_save:152]saving cookie to "/opt/couchbase/var/lib/couchbase/couchbase-server.cookie"
      [ns_server:debug,2012-09-12T17:06:06.002,ns_1@10.3.121.13:<0.28122.33>:ns_pubsub:do_subscribe_link:134]Parent process of subscription

      {ns_stats_event,<0.28121.33>} exited with reason shutdown
      [ns_server:debug,2012-09-12T17:06:06.009,ns_1@10.3.121.13:<0.28122.33>:ns_pubsub:do_subscribe_link:149]Deleting {ns_stats_event,<0.28121.33>}

      event handler: ok
      [ns_server:debug,2012-09-12T17:06:06.040,ns_1@10.3.121.13:<0.28120.33>:ns_pubsub:do_subscribe_link:134]Parent process of subscription

      {ns_tick_event,<0.28119.33>} exited with reason shutdown
      [ns_server:debug,2012-09-12T17:06:06.044,ns_1@10.3.121.13:<0.28648.34>:single_bucket_sup:top_loop:28]Delegating exit {'EXIT',<0.28117.33>,shutdown} to child supervisor: <0.28649.34>

      [ns_server:debug,2012-09-12T17:06:06.047,ns_1@10.3.121.13:ns_cookie_manager:ns_cookie_manager:do_cookie_save:154]attempted to save cookie to "/opt/couchbase/var/lib/couchbase/couchbase-server.cookie": ok
      [ns_server:debug,2012-09-12T17:06:06.074,ns_1@10.3.121.13:<0.28120.33>:ns_pubsub:do_subscribe_link:149]Deleting {ns_tick_event,<0.28119.33>}

      event handler: ok
      [ns_server:debug,2012-09-12T17:06:06.074,ns_1@10.3.121.13:<0.19276.466>:ns_node_disco:do_nodes_wanted_updated_fun:202]ns_node_disco: nodes_wanted updated: ['ns_1@10.3.121.13','ns_1@10.3.121.14',
      'ns_1@10.3.121.15','ns_1@10.3.121.16',
      'ns_1@10.3.121.17','ns_1@10.3.121.20',
      'ns_1@10.3.121.22','ns_1@10.3.121.23',
      'ns_1@10.3.121.24','ns_1@10.3.121.25'], with cookie: oetfisqvqrrxvaat
      [ns_server:debug,2012-09-12T17:06:06.079,ns_1@10.3.121.13:<0.28689.34>:ns_pubsub:do_subscribe_link:134]Parent process of subscription

      {ns_stats_event,<0.28688.34>} exited with reason shutdown
      [ns_server:debug,2012-09-12T17:06:06.080,ns_1@10.3.121.13:<0.28687.34>:ns_pubsub:do_subscribe_link:134]Parent process of subscription {ns_tick_event,<0.28686.34>} exited with reason shutdown
      [ns_server:debug,2012-09-12T17:06:06.083,ns_1@10.3.121.13:<0.28689.34>:ns_pubsub:do_subscribe_link:149]Deleting {ns_stats_event,<0.28688.34>}

      event handler: ok
      [ns_server:debug,2012-09-12T17:06:06.084,ns_1@10.3.121.13:<0.28687.34>:ns_pubsub:do_subscribe_link:149]Deleting

      {ns_tick_event,<0.28686.34>}

      event handler: ok
      [stats:warn,2012-09-12T17:06:06.087,ns_1@10.3.121.13:<0.29425.33>:stats_collector:latest_tick:201]Dropped 7 ticks
      [rebalance:info,2012-09-12T17:06:06.091,ns_1@10.3.121.13:<0.25334.457>:ebucketmigrator_srv:do_confirm_sent_messages:655]Got close ack!

      [rebalance:info,2012-09-12T17:06:06.120,ns_1@10.3.121.13:<0.23919.457>:ebucketmigrator_srv:do_confirm_sent_messages:655]Got close ack!

      [ns_server:debug,2012-09-12T17:06:06.120,ns_1@10.3.121.13:<0.1936.453>:ns_pubsub:do_subscribe_link:149]Deleting

      {master_activity_events_ingress,<0.28014.33>}

      event handler:

      {error, module_not_found}

      [rebalance:info,2012-09-12T17:06:06.128,ns_1@10.3.121.13:<0.23192.457>:ebucketmigrator_srv:do_confirm_sent_messages:655]Got close ack!

      [rebalance:info,2012-09-12T17:06:06.147,ns_1@10.3.121.13:<0.22947.457>:ebucketmigrator_srv:do_confirm_sent_messages:655]Got close ack!

      [rebalance:info,2012-09-12T17:06:06.201,ns_1@10.3.121.13:<0.21996.457>:ebucketmigrator_srv:do_confirm_sent_messages:655]Got close ack!

      [rebalance:info,2012-09-12T17:06:06.218,ns_1@10.3.121.13:<0.11912.457>:ebucketmigrator_srv:do_confirm_sent_messages:655]Got close ack!

      [rebalance:info,2012-09-12T17:06:06.223,ns_1@10.3.121.13:<0.10815.457>:ebucketmigrator_srv:do_confirm_sent_messages:655]Got close ack!

      [ns_server:debug,2012-09-12T17:06:06.228,ns_1@10.3.121.13:<0.28050.33>:ns_pubsub:do_subscribe_link:134]Parent process of subscription

      {buckets_events,<0.28049.33>} exited with reason {shutdown,
      {gen_server,
      call,
      ['ns_vbm_new_sup-default',
      which_children,
      infinity]}}
      [ns_server:debug,2012-09-12T17:06:06.239,ns_1@10.3.121.13:<0.28050.33>:ns_pubsub:do_subscribe_link:149]Deleting {buckets_events,<0.28049.33>}

      event handler: ok
      [user:info,2012-09-12T17:06:06.245,ns_1@10.3.121.13:'ns_memcached-default':ns_memcached:terminate:625]Shutting down bucket "default" on 'ns_1@10.3.121.13' for server shutdown
      [ns_server:info,2012-09-12T17:06:06.261,ns_1@10.3.121.13:'ns_memcached-default':ns_memcached:terminate:636]This bucket shutdown is not due to bucket deletion. Doing nothing
      [ns_server:debug,2012-09-12T17:06:06.276,ns_1@10.3.121.13:<0.28651.34>:ns_pubsub:do_subscribe_link:134]Parent process of subscription

      {ns_config_events,<0.28650.34>} exited with reason shutdown
      [ns_server:debug,2012-09-12T17:06:06.276,ns_1@10.3.121.13:<0.28663.34>:ns_pubsub:do_subscribe_link:134]Parent process of subscription {mc_couch_events,<0.28650.34>} exited with reason shutdown
      [ns_server:debug,2012-09-12T17:06:06.276,ns_1@10.3.121.13:<0.28662.34>:ns_pubsub:do_subscribe_link:134]Parent process of subscription {ns_node_disco_events,<0.28650.34>} exited with reason shutdown
      [ns_server:debug,2012-09-12T17:06:06.276,ns_1@10.3.121.13:<0.28648.34>:single_bucket_sup:top_loop:24]per-bucket supervisor for "default" died with reason shutdown

      [ns_server:debug,2012-09-12T17:06:06.279,ns_1@10.3.121.13:<0.28651.34>:ns_pubsub:do_subscribe_link:149]Deleting {ns_config_events,<0.28650.34>}

      event handler: ok
      [ns_server:debug,2012-09-12T17:06:06.281,ns_1@10.3.121.13:<0.28663.34>:ns_pubsub:do_subscribe_link:149]Deleting

      {mc_couch_events,<0.28650.34>}

      event handler: ok
      [ns_server:debug,2012-09-12T17:06:06.283,ns_1@10.3.121.13:<0.28662.34>:ns_pubsub:do_subscribe_link:149]Deleting

      {ns_node_disco_events,<0.28650.34>}

      event handler: ok
      [ns_server:debug,2012-09-12T17:06:06.283,ns_1@10.3.121.13:<0.29393.33>:single_bucket_sup:top_loop:28]Delegating exit

      {'EXIT',<0.28117.33>,shutdown}

      to child supervisor: <0.29394.33>

      [ns_server:debug,2012-09-12T17:06:06.299,ns_1@10.3.121.13:<0.29428.33>:ns_pubsub:do_subscribe_link:134]Parent process of subscription

      {ns_stats_event,<0.29427.33>} exited with reason shutdown
      [ns_server:debug,2012-09-12T17:06:06.300,ns_1@10.3.121.13:<0.29426.33>:ns_pubsub:do_subscribe_link:134]Parent process of subscription {ns_tick_event,<0.29425.33>} exited with reason shutdown
      [ns_server:debug,2012-09-12T17:06:06.305,ns_1@10.3.121.13:<0.29428.33>:ns_pubsub:do_subscribe_link:149]Deleting {ns_stats_event,<0.29427.33>}

      event handler: ok
      [ns_server:debug,2012-09-12T17:06:06.307,ns_1@10.3.121.13:<0.29426.33>:ns_pubsub:do_subscribe_link:149]Deleting

      {ns_tick_event,<0.29425.33>}

      event handler: ok
      [rebalance:info,2012-09-12T17:06:06.310,ns_1@10.3.121.13:<0.11978.466>:ebucketmigrator_srv:do_confirm_sent_messages:655]Got close ack!

      [ns_server:info,2012-09-12T17:06:06.323,ns_1@10.3.121.13:ns_port_memcached:ns_port_server:log:169]memcached<0.28094.33>: Thu Sep 13 00:06:06.117323 3: TAP (Consumer) eq_tapq:anon_509 - disconnected
      memcached<0.28094.33>: Thu Sep 13 00:06:06.123981 3: TAP (Consumer) eq_tapq:anon_1652 - disconnected
      memcached<0.28094.33>: Thu Sep 13 00:06:06.144392 3: TAP (Consumer) eq_tapq:anon_480 - disconnected
      memcached<0.28094.33>: Thu Sep 13 00:06:06.151955 3: TAP (Consumer) eq_tapq:anon_488 - disconnected
      memcached<0.28094.33>: Thu Sep 13 00:06:06.203038 3: TAP (Consumer) eq_tapq:anon_504 - disconnected
      memcached<0.28094.33>: Thu Sep 13 00:06:06.221683 3: TAP (Consumer) eq_tapq:anon_719 - disconnected
      memcached<0.28094.33>: Thu Sep 13 00:06:06.227307 3: TAP (Consumer) eq_tapq:anon_499 - disconnected

      memcached log from node .13 at the time it was killed:

      Thu Sep 13 00:06:07.606021 3: Schedule cleanup of "eq_tapq:anon_20"
      Thu Sep 13 00:06:07.606058 3: Schedule cleanup of "eq_tapq:anon_34"
      Thu Sep 13 00:06:07.606083 3: Schedule cleanup of "eq_tapq:anon_36"
      Thu Sep 13 00:06:07.606115 3: Schedule cleanup of "eq_tapq:anon_40"
      Thu Sep 13 00:06:07.606142 3: Schedule cleanup of "eq_tapq:anon_730"
      Thu Sep 13 00:06:07.606165 3: Schedule cleanup of "eq_tapq:anon_2425"
      Thu Sep 13 00:06:21.096648 3: Had to wait 13 s for shutdown
      Thu Sep 13 00:06:31.465158 3: Shutting down tap connections!
      Thu Sep 13 00:06:31.465199 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.121.15"
      Thu Sep 13 00:06:31.465278 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.121.16"
      Thu Sep 13 00:06:31.465357 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.121.22"
      Thu Sep 13 00:06:31.465405 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.121.17"
      Thu Sep 13 00:06:31.465414 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.15 - Clear the tap queues by force
      Thu Sep 13 00:06:31.465441 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.121.20"
      Thu Sep 13 00:06:31.465471 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.121.14"
      Thu Sep 13 00:06:31.465492 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.121.25"
      Thu Sep 13 00:06:31.465517 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.121.24"
      Thu Sep 13 00:06:31.465604 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.16 - Clear the tap queues by force
      Thu Sep 13 00:06:31.467071 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.22 - Clear the tap queues by force
      Thu Sep 13 00:06:31.467363 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.20 - Clear the tap queues by force
      Thu Sep 13 00:06:31.467606 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.24 - Clear the tap queues by force
      Thu Sep 13 00:06:31.467710 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.25 - Clear the tap queues by force
      Thu Sep 13 00:06:31.467917 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.17 - Clear the tap queues by force
      Thu Sep 13 00:06:31.468102 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.14 - Clear the tap queues by force

      Link to diags of all nodes: https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/10nodes-1708-memcached-killed-after-reb-failed-20120912.tgz

      Link to atop output of node .13: https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-node13-20120912

      1. memcached.log.7.txt.gz
        27 kB
        Thuan Nguyen
      2. memcached.log.6.txt.gz
        217 kB
        Thuan Nguyen

        Activity

        Aleksey Kondratenko (Inactive) added a comment:

        Tested on the /data2 partition, which is on LVM, and found that write barriers actually don't work.

        So the system-test physical box is NOT OK. It looks like older (or maybe even modern) LVM does not provide a durable disk configuration.
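        One quick way to spot this condition is to check the barrier= mount option for the data partition; a minimal sketch (the /proc/mounts entry below is hypothetical, not taken from this cluster):

```shell
# Hypothetical /proc/mounts entry for the data partition (illustrative only)
entry="/dev/mapper/vg0-data2 /data2 ext4 rw,noatime,barrier=0 0 0"

# ext3/ext4 report write-barrier state via the barrier= mount option;
# barrier=0 means barriers are off, so fsync() may not guarantee that
# data has reached stable storage
if echo "$entry" | grep -q 'barrier=0'; then
  echo "barriers disabled on /data2"
fi
```

        On a real node the entry would come from `grep /data2 /proc/mounts`; note that even barrier=1 does not help if an intermediate layer such as LVM drops the barriers, which is what the comment above suspects.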

        damien added a comment:

        Mounting evidence suggests this bug is caused by the +A Erlang startup setting, which turns on async I/O for port drivers, using a pool of threads to perform the I/O. We still don't understand the root cause, but there appears to be a race condition/cache-coherency problem in how port drivers are freed in the VM.

        Reassigning to Alk, as he will check in the change that disables the async threads and bumps up the number of schedulers to mitigate timeout problems caused by blocking I/O.
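        For context, the async I/O pool and scheduler count are set by the +A and +S emulator flags on the beam.smp command line. A sketch of inspecting them follows; the command line shown is hypothetical, not captured from these nodes:

```shell
# Hypothetical beam.smp command line, as `ps -o args= -C beam.smp` might show it
cmdline="beam.smp +A 16 +S 4:4 -- -root /opt/couchbase/lib/erlang -noshell"

# +A N sizes the async I/O thread pool (the setting implicated above);
# +A 0 disables the pool entirely. +S sets the scheduler thread count.
async=$(echo "$cmdline" | grep -o '+A [0-9]*' | awk '{print $2}')
schedulers=$(echo "$cmdline" | grep -o '+S [0-9:]*' | awk '{print $2}')
echo "async threads: $async, schedulers: $schedulers"
```

        With the mitigation described here, one would expect +A 0 (async pool disabled) and a larger +S value, so that blocking file I/O does not starve the schedulers.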

        Aleksey Kondratenko (Inactive) added a comment:

        We merged the "async off" commit.

        kzeller added a comment:

        Added to RN: Couchbase Server had intermittently crashed during rebalance due to Erlang virtual machine issues; we now disable asynchronous threads and perform garbage collection more often to avoid timeouts and process crashes.

        Thuan Nguyen added a comment:

        Integrated in github-couchdb-preview #544 (See http://qa.hq.northscale.net/job/github-couchdb-preview/544/)
        MB-6638: use fork of file_sorter that doesn't use compression (Revision 6bbe1cf89b2f6b5c9cf098b81c5ea60d339f8f0a)

        Result = SUCCESS
        Aliaksey Kandratsenka :
        Files :

        • src/couch_set_view/src/couch_set_view_updater.erl
        • src/couchdb/couch_db_updater.erl
        • src/couchdb/Makefile.am
        • Makefile.am
        • src/couchdb/file_sorter_2.erl

          People

          • Assignee: damien
          • Reporter: Thuan Nguyen
          • Votes: 0
          • Watchers: 6


              Gerrit Reviews

              There are no open Gerrit changes