  1. Couchbase Server
  2. MB-8687

Server failed, data corrupted and view intermittent

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.2.0
    • Component/s: ns_server, view-engine
    • Security Level: Public
    • Labels:
    • Environment:
      The cluster has three machines, all with 3.1GHz processors and 8GB RAM. Two of them have 1TB HDDs and run Ubuntu 12.04.2 LTS; the third has a 500GB HDD and runs Debian 6.0.7.

      Couchbase 2.0.1 (community version) is installed.

      Description

      One of the Ubuntu servers was automatically failed over tonight. Although it came back up, downtime was reported on the website that uses it, so the failure was still noticeable, which I understood the cluster was supposed to prevent. At least one of the documents was also corrupted. This particular document is stored in bzcompress() format (via PHP) and is a relatively core piece of data for the website, so it had to be rebuilt. Fortunately it was only cached data rather than primary data, but this is very concerning because I'm currently trying to move away from MySQL and rely solely on Couchbase, and I can't have that data corrupted.
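
      One way to detect a corrupted compressed value on read, so the cache entry can be rebuilt instead of being served, is to store a checksum alongside it. The following is only a minimal sketch: it uses Python's bz2 module as a stand-in for PHP's bzcompress(), and the pack/unpack helper names and the 4-byte CRC32 prefix layout are assumptions made for illustration, not anything the reporter's application is known to do.

      ```python
      import bz2
      import zlib


      def pack(payload: bytes) -> bytes:
          """Compress payload and prepend a CRC32 of the compressed bytes (hypothetical layout)."""
          compressed = bz2.compress(payload)
          crc = zlib.crc32(compressed) & 0xFFFFFFFF
          return crc.to_bytes(4, "big") + compressed


      def unpack(blob: bytes) -> bytes:
          """Verify the CRC32 prefix, then decompress. Raises ValueError on a mismatch."""
          if len(blob) < 4:
              raise ValueError("blob too short to contain a checksum")
          expected = int.from_bytes(blob[:4], "big")
          compressed = blob[4:]
          if zlib.crc32(compressed) & 0xFFFFFFFF != expected:
              raise ValueError("checksum mismatch: stored value is corrupted")
          return bz2.decompress(compressed)
      ```

      A caller could treat ValueError from unpack() as a cache miss and regenerate the document. bzip2 streams carry their own internal checksums, so a failed decompression already signals damage; the explicit prefix simply makes the check cheap and independent of the compressor.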

      We also have some view queries that run on page load to retrieve certain critical things, and while the rebalance was taking place the results of those view queries were intermittent, to say the least. All queries used "stale=false", so the index was being updated on each request; according to the documentation, however, this is supposed to be handled too.

      I'm also unsure as to why the server failed in the first place as there doesn't look to have been anything wrong with the actual server.

      I've run cbcollect_info and attached the output. Any advice or recommendations on preventing this from happening again would be gratefully received!

      Will delete the zip from the server when someone can confirm it's ok to do so.

      Thanks,

      Graeme


        Activity

        chiyoung Chiyoung Seo added a comment -

        Sundar,

        Can you take a look at this issue to see whether anything suspicious could have caused the data corruption?

        Thanks,

        sundar Sundar Sridharan added a comment -

        I hope the following messages are not suspicious:
        Tue Jul 23 17:50:28.185420 BST 3: Had to wait 1062 usec for shutdown
        Tue Jul 23 17:50:30.969603 BST 3: Total memory in memoryDeallocated() >= GIGANTOR !!! Disable the memory tracker...
        ------------------------------------------------------------
        and the following order of log messages...
        ------------------------------------------------------------
        Tue Jul 23 17:50:59.689204 BST 3: warmup completed in 33 ms
        Tue Jul 23 17:51:07.902808 BST 3: metadata loaded in 13 s
        Tue Jul 23 17:51:07.938839 BST 3: 25 items loaded from access log, completed in 34 usec
        Tue Jul 23 17:51:07.938906 BST 3: warmup completed in 13 s
        Tue Jul 23 17:52:25.202369 BST 3: Shutting down tap connections!

        If this is normal, I am not able to spot anything else suspicious in the memcached logs.
        Thanks
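
        The ordering noted above (two "warmup completed" messages a few seconds apart) can be checked mechanically. This is a rough sketch under stated assumptions: the line layout is inferred only from the excerpt quoted in this comment, all lines are assumed to fall on the same day, and the parse_line and warmup_gaps helpers are hypothetical names, not part of any Couchbase tool.

        ```python
        import re
        from datetime import datetime

        # Layout inferred from the quoted excerpt:
        # "<weekday> <month> <day> <HH:MM:SS.ffffff> <tz> <level>: <message>"
        LINE_RE = re.compile(
            r"^\w{3} \w{3} +\d+ (\d{2}:\d{2}:\d{2}\.\d{6}) \w+ \d+: (.*)$"
        )


        def parse_line(line):
            """Return (time_of_day, message), or None if the line does not match."""
            match = LINE_RE.match(line.strip())
            if match is None:
                return None
            return datetime.strptime(match.group(1), "%H:%M:%S.%f"), match.group(2)


        def warmup_gaps(lines):
            """Seconds between consecutive 'warmup completed' messages (same day assumed)."""
            parsed = [p for p in map(parse_line, lines) if p is not None]
            times = [t for t, msg in parsed if msg.startswith("warmup completed")]
            return [(later - earlier).total_seconds()
                    for earlier, later in zip(times, times[1:])]


        LOG = [
            "Tue Jul 23 17:50:59.689204 BST 3: warmup completed in 33 ms",
            "Tue Jul 23 17:51:07.902808 BST 3: metadata loaded in 13 s",
            "Tue Jul 23 17:51:07.938839 BST 3: 25 items loaded from access log, completed in 34 usec",
            "Tue Jul 23 17:51:07.938906 BST 3: warmup completed in 13 s",
            "Tue Jul 23 17:52:25.202369 BST 3: Shutting down tap connections!",
        ]

        print(warmup_gaps(LOG))
        ```

        Run against the excerpt, this reports a single gap of roughly 8.25 seconds between the two warmup completions, which is the oddity the ordering above points at.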

        anil Anil Kumar added a comment -

        Please reopen if you can reproduce it. There is not enough information to troubleshoot or arrive at a solution.

        glambert Graeme Lambert added a comment -

        Had another incident of the same data being corrupted today after one node failed over. Is there a way to attach the screenshot?

        I think the views being intermittent was because of using full_set=true during a rebalance, not sure why but removing that made them work every time afterwards.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        >> Is there a way to attach the screenshot?

        I think simply attaching an image file will work.

        >> I think the views being intermittent was because of using full_set=true during a rebalance, not sure why but removing that made them work every time afterwards.

        That's weird. full_set is only taken into account when it's sent to a development view. That seems to imply that you're actually using development views and not production views.
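
        To make the distinction concrete, here is a small sketch of how the two kinds of view URL differ in the Couchbase 2.0 view REST API: development design documents carry a dev_ prefix, and full_set only makes sense for them. The view_url helper and the host, bucket, design document and view names are hypothetical placeholders; the reporter's actual names are not known from this ticket.

        ```python
        from urllib.parse import quote, urlencode


        def view_url(host, bucket, ddoc, view, *, development=False,
                     stale=None, full_set=False, port=8092):
            """Build a view query URL; full_set is only added for development views."""
            design = "dev_" + ddoc if development else ddoc
            params = {}
            if stale is not None:
                params["stale"] = stale  # "false", "ok" or "update_after"
            if full_set and development:
                params["full_set"] = "true"
            path = "/{}/_design/{}/_view/{}".format(
                quote(bucket, safe=""), quote(design, safe=""), quote(view, safe="")
            )
            query = urlencode(params)
            return "http://{}:{}{}".format(host, port, path) + ("?" + query if query else "")


        # Production view: no dev_ prefix, and full_set is dropped even if requested.
        print(view_url("node1.example", "default", "stats", "by_day",
                       stale="false", full_set=True))
        # Development view: dev_ prefix, and full_set=true is passed through.
        print(view_url("node1.example", "default", "stats", "by_day",
                       development=True, stale="false", full_set=True))
        ```

        If the application's URLs contain _design/dev_, it is querying development views, which would be consistent with full_set having any effect at all.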


          People

          • Assignee:
            sundar Sundar Sridharan
            Reporter:
            glambert Graeme Lambert
          • Votes:
            0
            Watchers:
            8

