Couchbase Server / MB-6799

[RN 2.0.1][system test] view index disk size grows too big during rebalance

Details

    • Release Note

    Description

      Cluster information:

      • 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
      • Each server has 32 GB RAM and a 400 GB SSD.
      • 24.8 GB RAM is allocated to Couchbase Server on each node
      • The SSD is formatted as ext4 and mounted on /data
      • Each server has its own SSD drive; no disk is shared with another server.
      • Create a cluster of 6 nodes with Couchbase Server 2.0.0-1781 installed
      • The cluster has 2 buckets, default (12 GB) and saslbucket (12 GB).
      • Each bucket has one design doc with 2 views (d1 for default, d11 for saslbucket)
      • Consistent view is enabled on the cluster (the default)

      10.6.2.37
      10.6.2.38
      10.6.2.44
      10.6.2.45
      10.6.2.42
      10.6.2.43

      • Load 14 million items into both buckets. Each key has a size of 512 to 1024 bytes
      • Query all 4 views from the 2 design docs

      10.6.2.39
      10.6.2.40

      • Data path /data
      • View path /data

      Manifest info from build 1781
      http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1781-rel.rpm.manifest.xml

      • Add 2 nodes, 39 and 40, and rebalance. During the rebalance, reboot nodes 42 and 43. The rebalance failed as expected.
      • After the nodes finished warmup, rebalance again. The rebalance failed with bug MB-6490 on node 44.
      • Fail over node 44 and rebalance
      • Monitor the disk usage of all nodes. Nodes 45 and 37 have the highest disk usage

      Thuans-MacBook-Pro:testrunner thuan$ python scripts/ssh.py -i ../ini/10-c-long.ini "df -kh | grep data"
      10.6.2.44
      394G 468M 394G 1% /data
      10.6.2.39
      394G 44G 331G 12% /data
      10.6.2.42
      394G 69G 326G 18% /data
      10.6.2.40
      394G 55G 319G 15% /data
      10.6.2.45
      394G 346G 48G 88% /data
      10.6.2.43
      394G 110G 284G 28% /data
      10.6.2.37
      394G 299G 76G 80% /data
      10.6.2.38
      394G 184G 191G 50% /data
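The df output above can be checked mechanically. The following is a minimal sketch, not part of testrunner: the function names and the 75% threshold are assumptions chosen for illustration, and the input is the output pasted above (node IP on one line, df fields on the next).

```python
import re

# Output copied from the report: node IP, then "size used avail use% mount".
DF_OUTPUT = """\
10.6.2.44
394G 468M 394G 1% /data
10.6.2.39
394G 44G 331G 12% /data
10.6.2.42
394G 69G 326G 18% /data
10.6.2.40
394G 55G 319G 15% /data
10.6.2.45
394G 346G 48G 88% /data
10.6.2.43
394G 110G 284G 28% /data
10.6.2.37
394G 299G 76G 80% /data
10.6.2.38
394G 184G 191G 50% /data
"""

IP_RE = re.compile(r"\d+\.\d+\.\d+\.\d+")


def parse_df(text):
    """Return {node_ip: use_percent} from alternating IP / df lines."""
    usage = {}
    node = None
    for line in text.splitlines():
        line = line.strip()
        if IP_RE.fullmatch(line):
            node = line
        elif node and line:
            # Fields: size, used, available, use%, mount point.
            _size, _used, _avail, pct, _mount = line.split()
            usage[node] = int(pct.rstrip("%"))
            node = None
    return usage


def nodes_over(usage, threshold=75):
    """Nodes at or above the (assumed) threshold, fullest first."""
    hot = [n for n, pct in usage.items() if pct >= threshold]
    return sorted(hot, key=lambda n: -usage[n])


if __name__ == "__main__":
    print(nodes_over(parse_df(DF_OUTPUT)))
```

With the figures above, only nodes 45 and 37 cross the threshold, which matches the observation in this report.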

      • Then check the index file sizes on all nodes. The replica index file on node 45 is far bigger than on the other nodes: 114 GB.

      total 2.4G
      -rw-r--r--. 1 couchbase couchbase 686M Oct 2 14:36 main_ae72f9d24da5d9368eed3fb3519c1687.view.21
      -rw-r--r--. 1 couchbase couchbase 1.7G Oct 2 14:38 replica_ae72f9d24da5d9368eed3fb3519c1687.view.57
      drwxr-xr-x. 2 couchbase couchbase 4.0K Oct 1 14:59 tmp_ae72f9d24da5d9368eed3fb3519c1687_main
      drwxr-xr-x. 2 couchbase couchbase 4.0K Oct 1 11:03 tmp_ae72f9d24da5d9368eed3fb3519c1687_replica

      10.6.2.43
      total 2.1G
      -rw-r--r--. 1 couchbase couchbase 674M Oct 2 14:33 main_ae72f9d24da5d9368eed3fb3519c1687.view.16
      -rw-r--r--. 1 couchbase couchbase 1.4G Oct 2 14:37 replica_ae72f9d24da5d9368eed3fb3519c1687.view.22
      drwxr-xr-x. 2 couchbase couchbase 4.0K Sep 29 13:20 tmp_ae72f9d24da5d9368eed3fb3519c1687_main
      drwxr-xr-x. 2 couchbase couchbase 4.0K Sep 29 13:20 tmp_ae72f9d24da5d9368eed3fb3519c1687_replica

      10.6.2.39
      total 2.4G
      -rw-r--r--. 1 couchbase couchbase 702M Oct 2 14:36 main_ae72f9d24da5d9368eed3fb3519c1687.view.10
      -rw-r--r--. 1 couchbase couchbase 1.8G Oct 2 14:40 replica_ae72f9d24da5d9368eed3fb3519c1687.view.52
      drwxr-xr-x. 2 couchbase couchbase 4.0K Oct 1 14:44 tmp_ae72f9d24da5d9368eed3fb3519c1687_main
      drwxr-xr-x. 2 couchbase couchbase 4.0K Oct 1 11:03 tmp_ae72f9d24da5d9368eed3fb3519c1687_replica

      10.6.2.45
      total 132G
      -rw-r--r--. 1 couchbase couchbase 18G Oct 2 14:40 main_ae72f9d24da5d9368eed3fb3519c1687.view.13
      -rw-r--r--. 1 couchbase couchbase 114G Oct 2 14:41 replica_ae72f9d24da5d9368eed3fb3519c1687.view.72
      -rw-r--r--. 1 couchbase couchbase 4.0M Oct 2 14:41 replica_ae72f9d24da5d9368eed3fb3519c1687.view.72.compact
      -rw-r--r--. 1 couchbase couchbase 0 Oct 2 14:40 replica_ae72f9d24da5d9368eed3fb3519c1687.view.log
      drwxr-xr-x. 2 couchbase couchbase 4.0K Sep 30 01:51 tmp_ae72f9d24da5d9368eed3fb3519c1687_main
      drwxr-xr-x. 2 couchbase couchbase 4.0K Sep 29 19:29 tmp_ae72f9d24da5d9368eed3fb3519c1687_replica

      10.6.2.42
      total 12G
      -rw-r--r--. 1 couchbase couchbase 620M Oct 2 14:41 main_ae72f9d24da5d9368eed3fb3519c1687.view.18
      -rw-r--r--. 1 couchbase couchbase 11G Oct 2 14:41 replica_ae72f9d24da5d9368eed3fb3519c1687.view.18
      -rw-r--r--. 1 couchbase couchbase 27M Oct 2 14:41 replica_ae72f9d24da5d9368eed3fb3519c1687.view.18.compact
      -rw-r--r--. 1 couchbase couchbase 0 Oct 2 14:41 replica_ae72f9d24da5d9368eed3fb3519c1687.view.log
      drwxr-xr-x. 2 couchbase couchbase 4.0K Sep 29 13:20 tmp_ae72f9d24da5d9368eed3fb3519c1687_main
      drwxr-xr-x. 2 couchbase couchbase 4.0K Sep 29 13:20 tmp_ae72f9d24da5d9368eed3fb3519c1687_replica

      10.6.2.38
      total 2.1G
      -rw-r--r--. 1 couchbase couchbase 682M Oct 2 14:38 main_ae72f9d24da5d9368eed3fb3519c1687.view.11
      -rw-r--r--. 1 couchbase couchbase 1.4G Oct 2 14:40 replica_ae72f9d24da5d9368eed3fb3519c1687.view.12
      drwxr-xr-x. 2 couchbase couchbase 4.0K Sep 29 13:20 tmp_ae72f9d24da5d9368eed3fb3519c1687_main
      drwxr-xr-x. 2 couchbase couchbase 4.0K Sep 29 13:21 tmp_ae72f9d24da5d9368eed3fb3519c1687_replica

      10.6.2.37
      total 67G
      -rw-r--r--. 1 couchbase couchbase 4.1G Oct 2 14:36 main_ae72f9d24da5d9368eed3fb3519c1687.view.12
      -rw-r--r--. 1 couchbase couchbase 63G Oct 2 14:41 replica_ae72f9d24da5d9368eed3fb3519c1687.view.16
      -rw-r--r--. 1 couchbase couchbase 9.8M Oct 2 14:41 replica_ae72f9d24da5d9368eed3fb3519c1687.view.16.compact
      -rw-r--r--. 1 couchbase couchbase 0 Oct 2 14:41 replica_ae72f9d24da5d9368eed3fb3519c1687.view.log
      drwxr-xr-x. 2 couchbase couchbase 4.0K Sep 29 13:20 tmp_ae72f9d24da5d9368eed3fb3519c1687_main
      drwxr-xr-x. 2 couchbase couchbase 4.0K Sep 29 13:20 tmp_ae72f9d24da5d9368eed3fb3519c1687_replica
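To put "too big compared to other nodes" into numbers, the replica index sizes listed above can be compared against the smallest replica index in the cluster. This is only a sketch: the helper names are hypothetical, sizes are the rounded `ls -h` values from the listings, and the first listing is left out because its node IP is not shown.

```python
# Multipliers for the ls -h suffixes that appear in the listings.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(size):
    """Convert an ls -h style size such as '686M' or '1.7G' to bytes."""
    if size[-1] in UNITS:
        return float(size[:-1]) * UNITS[size[-1]]
    return float(size)


# (main, replica) index file sizes per node, copied from the listings above.
INDEX_SIZES = {
    "10.6.2.43": ("674M", "1.4G"),
    "10.6.2.39": ("702M", "1.8G"),
    "10.6.2.45": ("18G", "114G"),
    "10.6.2.42": ("620M", "11G"),
    "10.6.2.38": ("682M", "1.4G"),
    "10.6.2.37": ("4.1G", "63G"),
}


def replica_growth(sizes):
    """Return {node: replica size / smallest replica size in the cluster}."""
    replicas = {node: to_bytes(rep) for node, (_main, rep) in sizes.items()}
    smallest = min(replicas.values())
    return {node: size / smallest for node, size in replicas.items()}


if __name__ == "__main__":
    growth = replica_growth(INDEX_SIZES)
    for node in sorted(growth, key=growth.get, reverse=True):
        print(f"{node}: replica index is {growth[node]:.1f}x the smallest")
```

On these figures, node 45 stands out by the widest margin, followed by node 37 and then node 42; those are also the three nodes with a `.compact` file present.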

      • In the couchdb log of node 45, I see that index compaction started at Tue Oct 02 2012 13:21:14 and stopped at 2 percent (see logs couchdb.9 and couchdb.10)

      In log couchdb.9:
      [couchdb:debug,2012-10-02T13:21:16.593,ns_1@10.6.2.45:couch_task_status:couch_log:debug:36]New task status for <0.11726.933>: [{changes_done,10000},
      {design_documents,[<<"_design/d11">>]},
      {indexer_type,replica},
      {original_target,{[{type,bucket}]}},
      {progress,0},
      {set,<<"saslbucket">>},
      {signature, <<"ae72f9d24da5d9368eed3fb3519c1687">>},
      {started_on,1349209274},
      {total_changes,6339225},
      {trigger_type,scheduled},
      {type,view_compaction},
      {updated_on,1349209276}]

      [couchdb:debug,2012-10-02T13:21:23.124,ns_1@10.6.2.45:couch_task_status:couch_log:debug:36]New task status for <0.11726.933>: [{changes_done,40000},
      {design_documents,[<<"_design/d11">>]},
      {indexer_type,replica},
      {original_target,{[{type,bucket}]}},
      {progress,0},
      {set,<<"saslbucket">>},
      {signature, <<"ae72f9d24da5d9368eed3fb3519c1687">>},
      {started_on,1349209274},
      {total_changes,6339225},
      {trigger_type,scheduled},
      {type,view_compaction},
      {updated_on,1349209283}]

      [couchdb:debug,2012-10-02T13:21:30.203,ns_1@10.6.2.45:couch_task_status:couch_log:debug:36]New task status for <0.11726.933>: [{changes_done,70000},
      {design_documents,[<<"_design/d11">>]},
      {indexer_type,replica},
      {original_target,{[{type,bucket}]}},
      {progress,1},
      {set,<<"saslbucket">>},
      {signature, <<"ae72f9d24da5d9368eed3fb3519c1687">>},
      {started_on,1349209274},
      {total_changes,6339225},
      {trigger_type,scheduled},
      {type,view_compaction},
      {updated_on,1349209290}]

      [couchdb:debug,2012-10-02T13:21:42.138,ns_1@10.6.2.45:couch_task_status:couch_log:debug:36]New task status for <0.11726.933>: [{changes_done,130000},
      {design_documents,[<<"_design/d11">>]},
      {indexer_type,replica},
      {original_target,{[{type,bucket}]}},
      {progress,2},
      {set,<<"saslbucket">>},
      {signature, <<"ae72f9d24da5d9368eed3fb3519c1687">>},
      {started_on,1349209274},
      {total_changes,6339225},
      {trigger_type,scheduled},
      {type,view_compaction},
      {updated_on,1349209302}]

      [couchdb:debug,2012-10-02T13:21:52.828,ns_1@10.6.2.45:couch_task_status:couch_log:debug:36]New task status for <0.11726.933>: [{changes_done,140000},
      {design_documents,[<<"_design/d11">>]},
      {indexer_type,replica},
      {original_target,{[{type,bucket}]}},
      {progress,2},
      {set,<<"saslbucket">>},
      {signature, <<"ae72f9d24da5d9368eed3fb3519c1687">>},
      {started_on,1349209274},
      {total_changes,6339225},
      {trigger_type,scheduled},
      {type,view_compaction},
      {updated_on,1349209312}]

      In log couchdb.10:

      [couchdb:debug,2012-10-02T13:22:13.490,ns_1@10.6.2.45:couch_task_status:couch_log:debug:36]New task status for <0.11726.933>: [{changes_done,150000},
      {design_documents,[<<"_design/d11">>]},
      {indexer_type,replica},
      {original_target,{[{type,bucket}]}},
      {progress,2},
      {set,<<"saslbucket">>},
      {signature, <<"ae72f9d24da5d9368eed3fb3519c1687">>},
      {started_on,1349209274},
      {total_changes,6339225},
      {trigger_type,scheduled},
      {type,view_compaction},
      {updated_on,1349209333}]

      [couchdb:debug,2012-10-02T13:22:34.317,ns_1@10.6.2.45:couch_task_status:couch_log:debug:36]New task status for <0.11726.933>: [{changes_done,160000},
      {design_documents,[<<"_design/d11">>]},
      {indexer_type,replica},
      {original_target,{[{type,bucket}]}},
      {progress,2},
      {set,<<"saslbucket">>},
      {signature, <<"ae72f9d24da5d9368eed3fb3519c1687">>},
      {started_on,1349209274},
      {total_changes,6339225},
      {trigger_type,scheduled},
      {type,view_compaction},
      {updated_on,1349209354}]
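The reported progress values are consistent with changes_done / total_changes truncated to a whole percent (an inference from the entries above, not confirmed against the couchdb source). The sketch below works through that arithmetic and adds a rough rate and time-to-finish estimate; the function names are hypothetical and the samples are the (updated_on, changes_done) pairs from the seven status entries shown.

```python
# (updated_on, changes_done) pairs from the task status entries above.
SAMPLES = [
    (1349209276, 10000),
    (1349209283, 40000),
    (1349209290, 70000),
    (1349209302, 130000),
    (1349209312, 140000),
    (1349209333, 150000),
    (1349209354, 160000),
]
TOTAL_CHANGES = 6339225


def percent_done(changes_done, total=TOTAL_CHANGES):
    """Progress as a floating point percentage."""
    return 100.0 * changes_done / total


def average_rate(samples):
    """Average changes per second between the first and last sample."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    return (c1 - c0) / (t1 - t0)


def eta_seconds(samples, total=TOTAL_CHANGES):
    """Seconds left at the average rate; assumes the rate stays constant."""
    remaining = total - samples[-1][1]
    return remaining / average_rate(samples)


if __name__ == "__main__":
    for _ts, done in SAMPLES:
        print(done, int(percent_done(done)))
    print(f"rate: {average_rate(SAMPLES):.0f} changes/s")
    print(f"eta:  {eta_seconds(SAMPLES) / 60:.0f} min")
```

Over these samples the average rate is roughly 1,900 changes per second, which would put a full pass at a little under an hour. The last two intervals are much slower (10,000 changes per 21 seconds), so the real figure could be considerably longer.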



      [root@localhost logs]# grep "started_on,1349209274}" couchdb.*
      couchdb.10: {started_on,1349209274},
      couchdb.10: {started_on,1349209274},
      couchdb.9: {started_on,1349209274},
      couchdb.9: {started_on,1349209274},
      couchdb.9: {started_on,1349209274},
      couchdb.9: {started_on,1349209274},
      couchdb.9: {started_on,1349209274},
      couchdb.9: {started_on,1349209274},

      [root@localhost logs]# ls -latrh | grep couchdb
      -rw-r--r--. 1 couchbase couchbase 13 Sep 28 20:12 couchdb.siz
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 05:06 couchdb.13
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 05:36 couchdb.14
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 06:04 couchdb.15
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 06:33 couchdb.16
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 06:59 couchdb.17
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 07:29 couchdb.18
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 07:55 couchdb.19
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 08:22 couchdb.20
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 09:15 couchdb.1
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 10:13 couchdb.2
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 10:36 couchdb.3
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 11:04 couchdb.4
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 11:36 couchdb.5
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 12:05 couchdb.6
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 12:28 couchdb.7
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 12:55 couchdb.8
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 13:21 couchdb.9
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 13:46 couchdb.10
      -rw-r--r--. 1 couchbase couchbase 170 Oct 2 14:12 couchdb.idx
      -rw-r--r--. 1 couchbase couchbase 10M Oct 2 14:12 couchdb.11
      -rw-r--r--. 1 couchbase couchbase 9.7M Oct 2 14:37 couchdb.12

      • Then compaction restarted at Tue Oct 02 2012 13:47:05 and again stopped at 2 percent, as seen in logs couchdb.10 and couchdb.11

      [root@localhost logs]# grep "started_on,1349210733" couchdb.*
      couchdb.10: {started_on,1349210733},
      couchdb.10: {started_on,1349210733},
      couchdb.10: {started_on,1349210733},
      couchdb.10: {started_on,1349210733},
      couchdb.10: {started_on,1349210733},
      couchdb.11: {started_on,1349210733},
      couchdb.11: {started_on,1349210733},
      couchdb.11: {started_on,1349210733},
      couchdb.11: {started_on,1349210733},
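The started_on values used in the grep patterns are Unix epoch seconds. A small sketch for turning them back into wall-clock time follows. The UTC-7 offset is an assumption: it is the offset at which the first value lands on the 13:21:14 start time quoted in this report. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the node's log timestamps are in UTC-7 (US Pacific Daylight Time).
LOG_TZ = timezone(timedelta(hours=-7))


def started_on_to_local(epoch, tz=LOG_TZ):
    """Format a started_on epoch value as local wall-clock time."""
    return datetime.fromtimestamp(epoch, tz).strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    first, second = 1349209274, 1349210733
    print(first, started_on_to_local(first))
    print(second, started_on_to_local(second))
    print("gap between the two compaction runs:", second - first, "seconds")
```

Under that assumption the first value converts to 2012-10-02 13:21:14 and the second to 2012-10-02 13:45:33, a gap of 1459 seconds. The second figure is slightly earlier than the 13:47:05 restart time given in this report, which may have been read from a log line timestamp rather than from started_on.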

      Link to the collect_info archive for all nodes: https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201210/8nodes-col-info-1781-rebalance-hang-20121002-114333.tgz

          People

            kzeller kzeller
            thuan Thuan Nguyen