Couchbase Server / MB-4849

Server Crash - {write_loop_died,{badmatch,{error,enospc}}}

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Won't Fix
    • Affects Version/s: 2.0-developer-preview-4
    • Fix Version/s: 2.0-beta
    • Component/s: view-engine
    • Security Level: Public
    • Environment:
      dp4 build 717
      3 node cluster
      5 million docs
      20 ddocs (1 view each - generic emit-all map functions)

      Description

      Looks like a cluster I left overnight to create replica indexes has crashed. At the time of the crash an empty MnesiaCore file was created, and attempts to restart the couchbase service create an empty erl_crash.dump. Excerpt from the error log below, with diags attached:

      [error_logger:error] [2012-02-28 22:40:00] [ns_1@10.2.2.32:error_logger:ale_error_logger_handler:log_msg:76] ** Generic server <0.29993.3> terminating
      ** Last message in was {'EXIT',<0.29996.3>,{badmatch,{error,enospc}}}
      ** When Server state == {file,<0.29995.3>,<0.29996.3>,15623309}
      ** Reason for termination ==
      ** {write_loop_died,{badmatch,{error,enospc}}}

      [error_logger:error] [2012-02-28 22:40:00] [ns_1@10.2.2.32:error_logger:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
        initial call: couch_file:init/1
        pid: <0.29993.3>
        registered_name: []
        exception exit: {write_loop_died,{badmatch,{error,enospc}}}
          in function gen_server:terminate/6
          in call from couch_file:init/1
        ancestors: [<0.29990.3>,<0.29975.3>,<0.29974.3>]
        messages: [{'$gen_call',
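
      For context (my reading of the crash, not text from the original report): Erlang's file module returns {error, enospc} rather than raising when the filesystem is full, so a writer process that pattern-matches the write result against ok dies with a badmatch, which is exactly the {write_loop_died,{badmatch,{error,enospc}}} exit above. A minimal sketch of that pattern, with a made-up module name and not the actual couch_file code:

      %% Illustrative sketch only (not the real couch_file): a writer loop
      %% that asserts every write succeeds. Once the disk is full,
      %% file:write/2 returns {error, enospc}, the ok = ... match fails with
      %% badmatch, and the linked owner sees the write loop die.
      -module(write_loop_sketch).
      -export([start/1]).

      start(Path) ->
          {ok, Fd} = file:open(Path, [append, raw, binary]),
          spawn_link(fun() -> loop(Fd) end).

      loop(Fd) ->
          receive
              {write, Bin} ->
                  ok = file:write(Fd, Bin),   %% badmatch here on {error, enospc}
                  loop(Fd);
              stop ->
                  ok = file:close(Fd)
          end.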

      Attachments

      1. diags.tar.gz (5.95 MB) - Tommie McAfee
      2. 10.2.2.31_errors.1 (432 kB) - Tommie McAfee
      3. Screen Shot 2012-02-29 at 10.06.58 AM.png (45 kB)

        Activity

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        No space means no space. There's not much you can do when you exhaust FS space.

        tommie Tommie McAfee added a comment -

        Did compaction fail?

        The cluster has 120 GB of space.

        tommie Tommie McAfee added a comment -

        Spoke with Filipe about this, who explained that compaction doesn't start until indexing finishes.
        So what happens is: my couch data disk size is 3.5 GB, and Couchbase is going to try to create main and replica index files for 20 views. Worst case (if I re-emit the entire db), my cluster would have to reserve an extra 140 GB (3.5 GB * 40) for queries.

        Filipe says it's possible to implement some sort of incremental compaction, or possibly give compaction threads priority when necessary.
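
        For illustration (my arithmetic, using the numbers from this comment): the worst case is data size x number of views x 2 copies (main + replica), i.e. 3.5 GB x 20 x 2 = 140 GB, since compaction can't reclaim anything until indexing finishes. In an Erlang shell:

        %% Rough worst-case index footprint, assuming every view re-emits the
        %% whole data set and both main and replica indexes are built before
        %% compaction gets a chance to run.
        1> WorstCaseGB = fun(DataGB, NumViews) -> DataGB * NumViews * 2 end.
        2> WorstCaseGB(3.5, 20).
        140.0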

        tommie Tommie McAfee added a comment -

        UI view of disk usage overhead (see attached screenshot).

        damien damien added a comment -

        Filipe, can you look at this? If this bug is invalid or a limitation, just mark it as Won't Fix with a short explanation.

        FilipeManana Filipe Manana (Inactive) added a comment -

        Unfortunately, once we run out of disk space, we can't query views with ?stale=ok or ?stale=update_after (the default).
        We get a file_error from within ns_server (somewhere in the HTTP handlers / ALE logger):

        $ curl 'http://localhost:9500/default/_design/test/_view/view1?limit=10'
        {"error":"badmatch","reason":"{error,{file_error,\"logs/n_0/log\",enospc}}"}

        The relevant full stack trace:

        [menelaus:warn] [2012-04-16 15:10:51] [n_0@192.168.1.80:<0.29010.0>:menelaus_web:loop:358] Server error during processing: ["web request failed",
          {path,"/pools/default"},
          {type,error},
          {what,function_clause},
          {trace,
           [{menelaus_stats,'invoke_archiver/3-lc$^0/1-0',
             [{'EXIT',
               {{badmatch,{error,{file_error,"logs/n_0/log",enospc}}},
                [{'ale_logger-stats',error,5},
                 {stats_reader,latest,4},
                 {menelaus_stats,invoke_archiver,3},
                 {menelaus_stats,last_membase_sample,2},
                 {menelaus_stats,last_bucket_stats,3},
                 {menelaus_stats,basic_stats,3},
                 {ns_storage_conf,'-do_cluster_storage_info/1-fun-2-',3},
                 {lists,foldl,3}]}}]},
            {menelaus_stats,last_membase_sample,2},
            {menelaus_stats,last_bucket_stats,3},
            {menelaus_stats,basic_stats,3},
            {ns_storage_conf,'-do_cluster_storage_info/1-fun-2-',3},
            {lists,foldl,3},
            {ns_storage_conf,do_cluster_storage_info,1},
            {menelaus_web,build_pool_info,4}]}]

        Technically the view engine is capable of serving queries with stale=ok|update_after when there's no space left on disk, as long as the logger doesn't crash in that situation.
        Queries with ?stale=false will always get an error mentioning the POSIX error code 'enospc'.
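
        In other words (my paraphrase of the trace above, not Filipe's wording): serving a stale view needs no disk writes, but the stats/HTTP path logs synchronously and asserts the log write succeeded, so an {error,{file_error,...,enospc}} from the logger crashes the request anyway. A minimal self-contained sketch of that failure mode, with made-up module and function names (not the actual ns_server or ALE code):

        %% Hypothetical sketch: a handler that could answer from memory but
        %% first asserts that a synchronous log write succeeded. With a full
        %% disk, log_line/2 returns {error,{file_error,Path,enospc}} and the
        %% ok = ... match crashes the request with a badmatch, mirroring the
        %% trace above.
        -module(enospc_sketch).
        -export([handle_request/1]).

        handle_request(CachedReply) ->
            ok = log_line("logs/n_0/log", <<"serving /pools/default\n">>),
            {ok, CachedReply}.

        log_line(Path, Bin) ->
            case file:write_file(Path, Bin, [append]) of
                ok -> ok;
                {error, Reason} -> {error, {file_error, Path, Reason}}
            end.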

        FilipeManana Filipe Manana (Inactive) added a comment -

        Damien, what's your call?

        damien damien added a comment -

        Running out of disk space shouldn't cause corruption, but other than that there is nothing we can do. A possible future feature is an admin function to purge all indexes, which would have them rebuilt from scratch, but that would cost a lot of disk I/O and potential application downtime, and the problem might be more easily resolved in another way by the administrator.


          People

          • Assignee: damien damien
          • Reporter: tommie Tommie McAfee
          • Votes: 0
          • Watchers: 0

            Dates

            • Created:
            • Updated:
            • Resolved:

              Gerrit Reviews

              There are no open Gerrit changes