Details

      Description

      (updated by alk: I cannot fix the top posting, but I took some names out of this)

      Set up a node, fill the filesystem, watch processes run but see memcached take connections and just fail to respond.

      Also, set up a node, stop Couchbase. Fill the filesystem. Start Couchbase.

      On Wed, Mar 28, 2012 at 5:07 PM, Sharon Barr <XXXX> wrote:

      Unix is more mature than Couchbase at the edge cases. We are getting there... or trying NOT to get there at all (another alternative...).

      From: Matt Ingenthron
      Sent: Wednesday, March 28, 2012 8:04 AM
      To: Frank Weigel; Perry Krug; Dipti Borkar
      Cc: Sharon Barr; Alex Ma; support-internal

      Subject: Re: YYYY having issues

      Incidentally, while testing the hotfix for AAAA with TMP_OOM, I accidentally ran my CentOS out of disk. The OS is running happily and so are our processes, but moxi is just returning errors and the memcached process isn't responding to stats requests.

      There is still free memory available, but happily we've (kinda) lived within our quota. Confusingly the quota is set to 512MByte, but the resident memory size of memcached is only 445MByte. The virtual size is larger, but it's likely not tried to allocate.

      So at least this UNIX-like OS is fine when out of disk.

      Matt

      On 3/27/12 9:55 PM, "Frank Weigel" <XXXXXX> wrote:

      In principle I agree, but if this is the only disk, UNIX doesn't do well when entirely out of disk AFAIK, so we may need to do this when the poor man's disk alert kicks in?

      That's a myth. Only buggy UNIXes (or UNIX-like OSs) don't do well there. I've worked with many a UNIX that is perfectly fine with a full disk.*

      I agree with Perry that it should end in TMP_OOM. We should leave ourselves some memory of course (since we need to receive the packet to respond with TMP_OOM), but there is no reason why this is not doable. It's simply a matter of writing and testing the software.

      Matt

      * The myth came from BSD, which way, way, way back required swap space of 2x each process's memory to keep going. That "2x" is another myth that seems to keep perpetuating.

      From: Perry Krug <XXX>
      Date: Tue, 27 Mar 2012 02:27:01 -0700
      To: Frank Weigel <XXX>
      Cc: (skipped)
      Subject: Re: YYYYY having issues

      Can we please actually do something about this in the code so that the entire server doesn't just crash? We should start sending tmp_oom or something as soon as we detect that we are unable to write to disk.
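A minimal sketch of the behaviour requested above, written in Python for illustration and not taken from the server's code. `Status`, `MIN_FREE_BYTES`, `disk_has_room`, `persist`, `handle_set`, and the `mutations.log` file name are all hypothetical. The idea is to check free space before accepting a write, and to turn a full-disk error into a temporary-failure reply instead of letting it crash the process.

```python
import errno
import os
import shutil
from enum import Enum


class Status(Enum):
    OK = "ok"
    TMP_OOM = "tmp_oom"  # temporary failure: the client should back off and retry


# Hypothetical threshold; a real value would need tuning and testing.
MIN_FREE_BYTES = 64 * 1024 * 1024


def disk_has_room(data_dir, min_free=MIN_FREE_BYTES):
    """Return True if the filesystem holding data_dir has at least min_free bytes free."""
    return shutil.disk_usage(data_dir).free >= min_free


def persist(path, payload):
    """Append payload to path and fsync it; report a full disk as TMP_OOM."""
    try:
        with open(path, "ab") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return Status.TMP_OOM
        raise  # anything other than "no space left" is not handled here
    return Status.OK


def handle_set(data_dir, key, value, min_free=MIN_FREE_BYTES):
    """Reject the write early when the disk is nearly full, otherwise persist it."""
    if not disk_has_room(data_dir, min_free):
        return Status.TMP_OOM
    record = key + b"=" + value + b"\n"
    return persist(os.path.join(data_dir, "mutations.log"), record)
```

The early check keeps some headroom on disk, and the `ENOSPC` handler covers the case where the disk fills between the check and the write.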

      From: Sharon Barr <xxX>
      Date: Mon, 26 Mar 2012 17:11:58 -0700
      To: Alex Ma <XXX>, Perry Krug <xXX>
      Cc: skipped
      Subject: RE: YYYYY having issues

      Apparently they ran out of disk space on all nodes.

          Activity

          peter peter added a comment -

          This issue cannot be addressed in the 2.0 timeframe. Deferred to .next for now.

          chiyoung Chiyoung Seo added a comment -

          Assign it back to me as Liang will work on the ns-server for the time being.

          perry Perry Krug added a comment -

          Raising the awareness of this bug.

          We have had numerous occasions where a full disk has caused a total cluster collapse, corruption on a node and various other problems.

          The biggest source of pain is when the configuration partition fills up and ns_server can't successfully write the config/stats/etc out. This causes very strange and unpredictable behavior like the node becoming zombied (beam.smp not able to shut down, etc) as well as the rest of the cluster not being able to fail it over or communicate.

          chiyoung Chiyoung Seo added a comment -

          We should first discuss how Couchbase Server can handle running out of disk space across multiple components (e.g., ep-engine, view-engine, ns-server) in a consistent way.

          Please refer to the ticket:

          http://www.couchbase.com/issues/browse/MB-8067

          pvarley Patrick Varley added a comment -

          Supportability scrub - Closing this as a Dup.


            People

            • Assignee:
              pvarley Patrick Varley
              Reporter:
              steve Steve Yen
            • Votes:
              2
              Watchers:
              5
