Couchbase Server: MB-7129

Multiple nodes go down with an Erlang crash (crash.dump available) in a 10:10 XDCR setup; Erlang possibly hung on a couple of other nodes as well, all in the same cluster

    Details

      Description

      • Front-end loads running for biXDCR_bucket on C1 and C2, and for uniXDCR_src on C1, with replication in progress
      • On C2:
        • 3 nodes down, with erl_crash.dump files generated (will be attached)
        • 2 nodes with Erlang possibly hung, in pend state (in top, beam.smp keeps appearing and disappearing, using up 1.0G of resident memory; no cores generated, no erl_crash.dump files; memcached seems to still be running)
        • Unable to grab diags off any of these nodes
      • Result: all items in biXDCR_bucket on C2 lost; half the items in uniXDCR_dest on C2 lost

      Noticed a whole bunch of these crash reports on one of the "Pending" nodes on C2:

        • Reason for termination ==
          {noproc,
           {gen_server,call,
            [remote_clusters_info,
             {get_remote_bucket,
              [{hostname,"ec2-177-71-147-19.sa-east-1.compute.amazonaws.com:8091"},
               {uuid,<<"0b3a63d5d8805e0c6670c619cc346299">>},
               {name,"SANPAULO (C2)"},
               {username,"Administrator"},
               {password,"password"}],
              "biXDCR_bucket",false,30000},
             infinity]}}

          [error_logger:error,2012-11-07T5:57:56.025,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
          =========================CRASH REPORT=========================
          crasher:
          initial call: xdc_vbucket_rep:init/1
          pid: <0.28161.8>
          registered_name: []
          exception exit: {noproc,
          {gen_server,call,
          [remote_clusters_info,
          {get_remote_bucket,
          [{hostname, "ec2-177-71-147-19.sa-east-1.compute.amazonaws.com:8091"},
          {uuid, <<"0b3a63d5d8805e0c6670c619cc346299">>},
          {name,"SANPAULO (C2)"}, {username,"Administrator"}, {password,"password"}],
          "biXDCR_bucket",false,30000},
          infinity]}}
          in function gen_server:terminate/6
          ancestors: [<0.3608.5>,<0.3603.5>,xdc_replication_sup,ns_server_sup,
          ns_server_cluster_sup,<0.64.0>]
          messages: []
          links: [<0.3608.5>]
          dictionary: []
          trap_exit: true
          status: running
          heap_size: 514229
          stack_size: 24
          reductions: 35035
          neighbours:
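The noproc exit above means the registered remote_clusters_info gen_server was no longer alive when xdc_vbucket_rep tried to call it. A minimal sketch (illustrative only, not Couchbase source; the module and request term are made up) of how gen_server:call to a dead or unregistered name produces exactly this exit shape:

%% Illustrative only -- not Couchbase source. Calling a gen_server by a
%% registered name that is not alive exits the caller with
%% {noproc, {gen_server,call,[Name, Request, Timeout]}}, the same shape
%% seen in the crash report above.
-module(noproc_demo).
-export([run/0]).

run() ->
    %% remote_clusters_info is not registered in this demo, so the call
    %% raises a noproc exit, which we catch and return for inspection.
    try
        gen_server:call(remote_clusters_info, {get_remote_bucket, []}, 30000)
    catch
        exit:{noproc, _} = Reason ->
            {caught, Reason}
    end.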

        • Reason for termination ==
        • killed

      [error_logger:error,2012-11-07T5:58:41.704,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: couch_db:init/1
      pid: <0.19405.4>
      registered_name: []
      exception exit: killed
      in function gen_server:terminate/6
      ancestors: [couch_server,couch_primary_services,couch_server_sup,
      cb_couch_sup,ns_server_cluster_sup,<0.64.0>]
      messages: []
      links: []
      dictionary: []
      trap_exit: true
      status: running
      heap_size: 1597
      stack_size: 24
      reductions: 11968
      neighbours:

      Attached are the grabbed diags from one of the non-down nodes on C2.

      1. ec2-122-248-217-156.ap-southeast-1.compute.amazonaws.com-8091-diag.txt.gz
        4.27 MB
        Abhinav Dangeti

        Activity

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        In MB-7115 a seemingly identical issue is causing the system to run out of fds before it could die, perhaps because there are more fds spent on vbuckets, as it was apparently just 2 nodes.
        steve Steve Yen added a comment -

        From bug-scrub: this also reproduces on 2:2.
        steve Steve Yen added a comment -

        The bug fix for MB-7133 should make this very unlikely.
        steve Steve Yen added a comment -

        Per bug-scrub: for repro.
        junyi Junyi Xie (Inactive) added a comment -

        Talked to Abhinav; the issue was not reproduced in his latest large-scale 10:10 test (bidir + unidir XDCR).

          People

          • Assignee:
            abhinav Abhinav Dangeti
            Reporter:
            abhinav Abhinav Dangeti
          • Votes:
            0
            Watchers:
            4


              Gerrit Reviews

              There are no open Gerrit changes