Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-37179

[System test]: Eventing rebalance failing with Some apps are being paused

    XMLWordPrintable

Details

    Description

      Build: 6.5.0-4917, not seen on 6.5.0-4908

      Test: MH longevity 

      Cycle: 2nd

      Day: 2nd

      Seeing multiple rebalance failure for eventing 

      [user:error,2019-12-05T15:10:18.944-08:00,ns_1@172.23.108.103:<0.13067.0>:ns_orchestrator:log_rebalance_completion:1445]Rebalance exited with reason {service_rebalance_failed,eventing,
                                    {worker_died,
                                     {'EXIT',<0.29188.1182>,
                                      {{badmatch,
                                        {error,
                                         {unknown_error,
                                          <<"Some apps are being paused, node: 172.23.96.148:8096">>}}},
                                       [{service_rebalancer,rebalance_worker,1,
                                         [{file,"src/service_rebalancer.erl"},
                                          {line,170}]},
                                        {proc_lib,init_p,3,
                                         [{file,"proc_lib.erl"},{line,232}]}]}}}}.
      Rebalance Operation Id = da3ae28e304d693f84567e0bdc3cab39
       
       
      [user:error,2019-12-05T15:35:01.557-08:00,ns_1@172.23.108.103:<0.13067.0>:ns_orchestrator:log_rebalance_completion:1445]Rebalance exited with reason {service_rebalance_failed,eventing,
                                    {worker_died,
                                     {'EXIT',<0.23000.1196>,
                                      {{badmatch,
                                        {error,
                                         {unknown_error,
                                          <<"Some apps are being paused, node: 172.23.96.148:8096">>}}},
                                       [{service_rebalancer,rebalance_worker,1,
                                         [{file,"src/service_rebalancer.erl"},
                                          {line,170}]},
                                        {proc_lib,init_p,3,
                                         [{file,"proc_lib.erl"},{line,232}]}]}}}}.
      Rebalance Operation Id = 5ee66f3b1ce6892b90e0b7af66d97beb
       
       
       
       
      [user:error,2019-12-05T16:11:57.527-08:00,ns_1@172.23.108.103:<0.13067.0>:ns_orchestrator:log_rebalance_completion:1445]Rebalance exited with reason {service_rebalance_failed,eventing,
                                    {worker_died,
                                     {'EXIT',<0.9006.1218>,
                                      {{badmatch,
                                        {error,
                                         {unknown_error,
                                          <<"Some apps are being paused, node: 172.23.96.148:8096">>}}},
                                       [{service_rebalancer,rebalance_worker,1,
                                         [{file,"src/service_rebalancer.erl"},
                                          {line,170}]},
                                        {proc_lib,init_p,3,
                                         [{file,"proc_lib.erl"},{line,232}]}]}}}}.
      Rebalance Operation Id = 56a74378a8d5a9737f816839f813f608 

      Rebalance is even failing when it involves only index. We should ignore other services rebalance

      [2019-12-05T15:05:15-08:00, sequoiatools/couchbase-cli:6.5:6c7db0] server-add -c 172.23.108.103:8091 --server-add https://172.23.104.164 -u Administrator -p password --server-add-username Administrator --server-add-password password --services index
      [2019-12-05T15:05:44-08:00, sequoiatools/couchbase-cli:6.5:580796] rebalance -c 172.23.108.103:8091 -u Administrator -p password
      → 
       
       
      Error occurred on container - sequoiatools/couchbase-cli:6.5:[rebalance -c 172.23.108.103:8091 -u Administrator -p password]
       
       
      docker logs 580796
      docker start 580796
       
       
      *Unable to display progress bar on this os
      JERROR: Rebalance failed. See logs for detailed reason. You can try again.
      [2019-12-05T15:10:35-08:00, sequoiatools/cmd:3ee376] 60
       
       
       
       
      [2019-12-05T15:28:52-08:00, sequoiatools/couchbase-cli:6.5:5bfb5a] rebalance -c 172.23.108.103:8091 --server-remove 172.23.97.242 -u Administrator -p password
      warning using 'json' filter:  unexpected end of JSON input []
      → 
       
       
      Error occurred on container - sequoiatools/couchbase-cli:6.5:[rebalance -c 172.23.108.103:8091 --server-remove 172.23.97.242 -u Administrator -p password]
       
       
      docker logs 5bfb5a
      docker start 5bfb5a
       
       
      *Unable to display progress bar on this os
      JERROR: Rebalance failed. See logs for detailed reason. You can try again.
      [2019-12-05T15:35:16-08:00, sequoiatools/cmd:a23d6f] 60
       
       
       
       
      [2019-12-05T16:07:42-08:00, sequoiatools/couchbase-cli:6.5:c0ad80] rebalance -c 172.23.108.103:8091 --server-remove 172.23.97.242 -u Administrator -p password
      → 
       
       
      Error occurred on container - sequoiatools/couchbase-cli:6.5:[rebalance -c 172.23.108.103:8091 --server-remove 172.23.97.242 -u Administrator -p password]
       
       
      docker logs c0ad80
      docker start c0ad80
       
       
      *Unable to display progress bar on this os
      JERROR: Rebalance failed. See logs for detailed reason. You can try again.
      [2019-12-05T16:12:18-08:00, sequoiatools/cmd:12bade] 60
       

      Test not recovered after this all rebalance failing 

      eventing : 3 ===== > [172.23.104.87:8091 172.23.96.148:8091 172.23.98.135:8091]

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          vikas.chaudhary Vikas Chaudhary created issue -
          jeelan.poola Jeelan Poola made changes -
          Field Original Value New Value
          Assignee Jeelan Poola [ jeelan.poola ] Ankit Prabhu [ ankit.prabhu ]

          Ankit Prabhu Could you please look into this on priority?

          jeelan.poola Jeelan Poola added a comment - Ankit Prabhu Could you please look into this on priority?
          jeelan.poola Jeelan Poola made changes -
          Assignee Ankit Prabhu [ ankit.prabhu ] Gautham Banasandra [ gautham.banasandra ]

          Node 172.23.96.148 was rebalanced in. Each of the consumer processes for Function timer_op tried to send a mutation and got stuck while trying to write to the socket -

          3 @ 0x90f27a 0x90a59a 0x909c17 0x9a4bcb 0x9a4cbd 0x9a65b7 0xa6ed1f 0xa7fdba 0x96f256 0x1093977 0x1091769 0x10ad168 0x109d2a1 0x93cb11
          #	0x909c16	internal/poll.runtime_pollWait+0x56						/home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/runtime/netpoll.go:173
          #	0x9a4bca	internal/poll.(*pollDesc).wait+0x9a						/home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/internal/poll/fd_poll_runtime.go:85
          #	0x9a4cbc	internal/poll.(*pollDesc).waitWrite+0x3c					/home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/internal/poll/fd_poll_runtime.go:94
          #	0x9a65b6	internal/poll.(*FD).Write+0x236							/home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/internal/poll/fd_unix.go:264
          #	0xa6ed1e	net.(*netFD).Write+0x4e								/home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/net/fd_unix.go:220
          #	0xa7fdb9	net.(*conn).Write+0x69								/home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/net/net.go:188
          #	0x96f255	bytes.(*Buffer).WriteTo+0xb5							/home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/bytes/buffer.go:240
          #	0x1093976	github.com/couchbase/eventing/consumer.(*Consumer).sendMessage+0xbb6		/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/consumer/handle_messages.go:670
          #	0x1091768	github.com/couchbase/eventing/consumer.(*Consumer).sendDcpEvent+0x5f8		/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/consumer/handle_messages.go:455
          #	0x10ad167	github.com/couchbase/eventing/consumer.(*Consumer).sendEvent+0x787		/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/consumer/process_events.go:1169
          #	0x109d2a0	github.com/couchbase/eventing/consumer.(*Consumer).processDCPEvents+0x5660	/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/consumer/process_events.go:91
          

          The rest of the go routines which tried to write to the socket were waiting to get the lock. For example, the stats updater routine is trying to get the lock -

          3 @ 0x90f27a 0x90f32e 0x91fc14 0x91f92d 0x944ce8 0x945bed 0x1092ebf 0x1090e8f 0x10b91a8 0x93cb11
          #	0x91f92c	sync.runtime_SemacquireMutex+0x3c						/home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/runtime/sema.go:71
          #	0x944ce7	sync.(*Mutex).Lock+0x107							/home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/sync/mutex.go:134
          #	0x945bec	sync.(*RWMutex).Lock+0x2c							/home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/sync/rwmutex.go:93
          #	0x1092ebe	github.com/couchbase/eventing/consumer.(*Consumer).sendMessage+0xfe		/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/consumer/handle_messages.go:631
          #	0x1090e8e	github.com/couchbase/eventing/consumer.(*Consumer).sendGetExecutionStats+0x1fe	/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/consumer/handle_messages.go:384
          #	0x10b91a7	github.com/couchbase/eventing/consumer.(*Consumer).updateWorkerStats+0x347	/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/consumer/stats_updater.go:111
          

          The fact that the write to socket is getting stuck seems to be an issue in golang - https://github.com/golang/go/issues/27752 . The solution suggested by them is to upgrade to golang version 1.11 - https://github.com/golang/go/issues/27752#issuecomment-424387032 . Eventing is using golang version 1.10.3.

          Since the stats update messages aren't going through, it prevents the producer from trying to kill and respawn the consumer processes.

          Gautham.Banasandra Gautham Banasandra (Inactive) added a comment - Node 172.23.96.148 was rebalanced in. Each of the consumer processes for Function timer_op tried to send a mutation and got stuck while trying to write to the socket - 3 @ 0x90f27a 0x90a59a 0x909c17 0x9a4bcb 0x9a4cbd 0x9a65b7 0xa6ed1f 0xa7fdba 0x96f256 0x1093977 0x1091769 0x10ad168 0x109d2a1 0x93cb11 # 0x909c16 internal/poll.runtime_pollWait+0x56 /home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/runtime/netpoll.go:173 # 0x9a4bca internal/poll.(*pollDesc).wait+0x9a /home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/internal/poll/fd_poll_runtime.go:85 # 0x9a4cbc internal/poll.(*pollDesc).waitWrite+0x3c /home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/internal/poll/fd_poll_runtime.go:94 # 0x9a65b6 internal/poll.(*FD).Write+0x236 /home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/internal/poll/fd_unix.go:264 # 0xa6ed1e net.(*netFD).Write+0x4e /home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/net/fd_unix.go:220 # 0xa7fdb9 net.(*conn).Write+0x69 /home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/net/net.go:188 # 0x96f255 bytes.(*Buffer).WriteTo+0xb5 /home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/bytes/buffer.go:240 # 0x1093976 github.com/couchbase/eventing/consumer.(*Consumer).sendMessage+0xbb6 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/consumer/handle_messages.go:670 # 0x1091768 github.com/couchbase/eventing/consumer.(*Consumer).sendDcpEvent+0x5f8 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/consumer/handle_messages.go:455 # 0x10ad167 github.com/couchbase/eventing/consumer.(*Consumer).sendEvent+0x787 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/consumer/process_events.go:1169 # 0x109d2a0 github.com/couchbase/eventing/consumer.(*Consumer).processDCPEvents+0x5660 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/consumer/process_events.go:91 The rest of the go routines which tried to write to the socket were waiting to get the lock. For example, the stats updater routine is trying to get the lock - 3 @ 0x90f27a 0x90f32e 0x91fc14 0x91f92d 0x944ce8 0x945bed 0x1092ebf 0x1090e8f 0x10b91a8 0x93cb11 # 0x91f92c sync.runtime_SemacquireMutex+0x3c /home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/runtime/sema.go:71 # 0x944ce7 sync.(*Mutex).Lock+0x107 /home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/sync/mutex.go:134 # 0x945bec sync.(*RWMutex).Lock+0x2c /home/couchbase/.cbdepscache/exploded/x86_64/go-1.10.3/go/src/sync/rwmutex.go:93 # 0x1092ebe github.com/couchbase/eventing/consumer.(*Consumer).sendMessage+0xfe /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/consumer/handle_messages.go:631 # 0x1090e8e github.com/couchbase/eventing/consumer.(*Consumer).sendGetExecutionStats+0x1fe /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/consumer/handle_messages.go:384 # 0x10b91a7 github.com/couchbase/eventing/consumer.(*Consumer).updateWorkerStats+0x347 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/consumer/stats_updater.go:111 The fact that the write to socket is getting stuck seems to be an issue in golang - https://github.com/golang/go/issues/27752 . The solution suggested by them is to upgrade to golang version 1.11 - https://github.com/golang/go/issues/27752#issuecomment-424387032 . Eventing is using golang version 1.10.3. Since the stats update messages aren't going through, it prevents the producer from trying to kill and respawn the consumer processes.
          jeelan.poola Jeelan Poola added a comment -

          Given the signature of the problem, which is caused by a WriteTo stuck in go lang, this is not a regression. Have not seen this signature before.

          jeelan.poola Jeelan Poola added a comment - Given the signature of the problem, which is caused by a WriteTo stuck in go lang, this is not a regression. Have not seen this signature before.
          jeelan.poola Jeelan Poola made changes -
          Is this a Regression? Yes [ 10450 ] No [ 10451 ]
          lynn.straus Lynn Straus made changes -
          Labels system-test approved-for-mad-hatter system-test
          lynn.straus Lynn Straus made changes -
          Due Date 10/Dec/19
          lynn.straus Lynn Straus made changes -
          Link This issue blocks MB-36676 [ MB-36676 ]

          Build couchbase-server-6.5.0-4935 contains eventing commit ad17df9 with commit message:
          MB-37179: Takes sleep out of critical section

          build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-4935 contains eventing commit ad17df9 with commit message: MB-37179 : Takes sleep out of critical section

          Build couchbase-server-7.0.0-1119 contains eventing commit 8d41973 with commit message:
          MB-37179: Takes sleep out of critical section

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-1119 contains eventing commit 8d41973 with commit message: MB-37179 : Takes sleep out of critical section

          Build couchbase-server-7.0.0-1119 contains eventing commit c4caf9e with commit message:
          MB-37179 Move stats loading into a separate go routine

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-1119 contains eventing commit c4caf9e with commit message: MB-37179 Move stats loading into a separate go routine

          Build couchbase-server-6.5.0-4936 contains eventing commit 9253758 with commit message:
          MB-37179 Move stats loading into a separate go routine

          build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-4936 contains eventing commit 9253758 with commit message: MB-37179 Move stats loading into a separate go routine
          jeelan.poola Jeelan Poola made changes -
          Resolution Fixed [ 1 ]
          Status Open [ 1 ] Resolved [ 5 ]

          Build couchbase-server-6.5.1-6005 contains eventing commit 9253758 with commit message:
          MB-37179 Move stats loading into a separate go routine

          build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.1-6005 contains eventing commit 9253758 with commit message: MB-37179 Move stats loading into a separate go routine

          Build couchbase-server-6.5.1-6005 contains eventing commit ad17df9 with commit message:
          MB-37179: Takes sleep out of critical section

          build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.1-6005 contains eventing commit ad17df9 with commit message: MB-37179 : Takes sleep out of critical section

          Not seen on 6.5.0-4947

          vikas.chaudhary Vikas Chaudhary added a comment - Not seen on 6.5.0-4947
          vikas.chaudhary Vikas Chaudhary made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

          People

            Gautham.Banasandra Gautham Banasandra (Inactive)
            vikas.chaudhary Vikas Chaudhary
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty