  1. Couchbase Server
  2. MB-7791

[windows] queries failed with batch_sort_failed eaccess error

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.1.0
    • Component/s: 3rd-party
    • Security Level: Public
    • Labels:
    • Environment:

      Description

      test viewquerytests.ViewQueryTests.test_employee_dataset_startkey_endkey_queries

      Test case info:

      Test uses employee data set:
      -documents are structured as

      {"name": name<string>, "join_yr" : year<int>, "join_mo" : month<int>, "join_day" : day<int>, "email": email<string>, "job_title" : title<string>, "type" : type<string>, "desc" : desc<string>}

      Steps to repro:
      1. Start loading data
      2. Simultaneously start querying (startkey, endkey, descending,
      inclusive_end combinations)

      View definitions:
      Views : [
      'test_view-af3f718 : map_fn=function (doc) { if(doc.job_title !== undefined) emit([doc.join_yr, doc.join_mo, doc.join_day], [doc.name, doc.email] ); }, reduce_fn=None',
      'test_view-63a8408 : map_fn=function (doc) { if(doc.job_title !== undefined) { var myregexp = new RegExp("^UI "); if(doc.job_title.match(myregexp)){ emit([doc.join_yr, doc.join_mo, doc.join_day], [doc.name, doc.email] );}}}, reduce_fn=None',
      'test_view-cc3009e : map_fn=function (doc) { if(doc.job_title !== undefined) { var myregexp = new RegExp("^System "); if(doc.job_title.match(myregexp)){ emit([doc.join_yr, doc.join_mo, doc.join_day], [doc.name, doc.email] );}}}, reduce_fn=None',
      'test_view-3c844d7 : map_fn=function (doc) { if(doc.job_title !== undefined) { var myregexp = new RegExp("^Senior "); if(doc.job_title.match(myregexp)){ emit([doc.join_yr, doc.join_mo, doc.join_day], [doc.name, doc.email] );}}}, reduce_fn=None',
      'test_view-8c85da1 : map_fn=function (doc) { if(doc.job_title !== undefined) emit([doc.join_yr, doc.join_mo, doc.join_day], [doc.name, doc.email] ); }, reduce_fn=_count',
      'test_view-c3d973a : map_fn=function (doc, meta) { if(doc.job_title !== undefined) { var myregexp = new RegExp("^admin"); if(meta.id.match(myregexp)) { emit([doc.join_yr, doc.join_mo, doc.join_day], [doc.name, doc.email] );} }}, reduce_fn=None'
      ]

      Got partial results, and an eacces error from one of the nodes:
      [{u'reason': u'{batch_sort_failed,{file_error,"c:/Program Files/Couchbase/Server/var/lib/couchbase/data/@indexes/default/tmp_ab46cb6651b5cb601374dd94a9f8b1b4_main/22fe520aafb45bdc63def45dd3095140.sort",\n eacces}}', u'from': u'http://10.3.2.243:8092/_view_merge/?startkey=%5B2008%2C7%2Cnull%5D&stale=false&debug=true'}]
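
      The query parameters in the failing URL are percent-encoded, which makes the
      startkey hard to read. A minimal Python sketch for decoding them follows; the
      helper name decode_view_params is made up for illustration and is not part of
      the test framework.

```python
import json
from urllib.parse import urlsplit, parse_qs

# URL copied from the 'from' field of the error shown above.
url = ("http://10.3.2.243:8092/_view_merge/"
       "?startkey=%5B2008%2C7%2Cnull%5D&stale=false&debug=true")


def decode_view_params(url):
    """Return the query parameters of a view URL as a plain dict."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs returns a list of values per key; each key appears once here.
    return {name: values[0] for name, values in query.items()}


params = decode_view_params(url)
# The startkey value is JSON; JSON null becomes Python None.
startkey = json.loads(params["startkey"])
print(params)
print(startkey)
```

      The decoded startkey is the JSON array [2008,7,null], i.e. a partial key on
      join_yr and join_mo for the views listed above.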

      attaching logs


        Activity

        iryna iryna added a comment -

        logs:
        https://s3.amazonaws.com/bugdb/jira/MB-7791/4e64d520/10.3.2.239-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-7791/4e64d520/10.3.2.243-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-7791/4e64d520/10.3.3.38-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-7791/4e64d520/10.3.3.39-diag.zip

        FilipeManana Filipe Manana (Inactive) added a comment -

        Thanks Iryna.

        This problem, with file eacces errors, which is happening all the time in different places, won't go away without CBD-790 being addressed.

        farshid Farshid Ghods (Inactive) added a comment -

        Siri/Filipe,

        Does it make sense to rerun with the script that Siri gave Iryna last time, to find out which process is locking the file?

        FilipeManana Filipe Manana (Inactive) added a comment -

        No Farshid, we know what the problem is. It's pointless to keep doing workarounds on our code base when the issue is really in Erlang.
        Please read the mail I just sent, where you amongst others are included:

        "We've had many file eacces issues on Windows, happening in many different places,
        mostly somewhere in couchdb storage and view engine parts, which are file operation
        intensive:

        MB-7791
        MB-7788
        MB-7569
        MB-7371
        MB-6957
        CBSE-298
        CBSE-367
        (etc, etc, etc...)

        Why it happens was explained on MB-6957, an issue in Erlang's file driver where not
        all file open calls use all the necessary windows specific share flags. A patch was
        submitted upstream to fix the issue and is included in Erlang R16A (a beta release):

        https://github.com/erlang/otp/commit/0e02f488971b32ff9ab88a3f0cb144fe5db161b2

        On MB-7569, Siri gave some help by using a Windows specific tool that monitors and
        logs any file access and reports with which share flags files are open. The tool's log
        attached there verified indeed that often Erlang opens files without any share flags,
        which causes eacces errors when other Erlang processes (or async IO threads to be
        more exact if +A is used, or scheduler threads otherwise) attempt to access the same
        file (read, rename, delete):

        $ egrep 'ShareMode' Logfile.CSV | perl -ne '/ShareMode: (.*?), AllocationSize/; print "ShareFlags: $1\n";' | sort | uniq -c
        736 ShareFlags: None
        44129 ShareFlags: Read, Write, Delete

        The underlying Erlang VM file driver C functions that open files without all the necessary
        share flags are the following:

        • efile_fileinfo()
        • efile_readlink()
        • efile_write_info()

        The first 2 are run when the Erlang functions file:read_file_info/1 and file:read_link_info/1
        are called. These 2 functions are used by ns_server to periodically determine the amount of
        disk space used by databases and indexes. There might be other places that call these functions
        against view files, but none of them is in couchdb itself.

        These eacces errors are a problem that is showing up all the time in different places - almost
        on a weekly basis I get a Jira issue related to an eacces error on a specific place - it gets
        a workaround committed and later we get another issue but in a different place, and this loop goes
        on and on.
        As these places are uncovered, we attempt to work around the issue by retrying file operations
        for up to 5 seconds. Sometimes this is not enough, so I end up re-ordering certain file operations
        in the hope that the eacces error frequency is minimized - this is just another workaround.
        This makes our code bigger and harder to maintain, while still not making it
        possible to fully address the issue, which is really in the Erlang VM.

        CBD-790 is needed to address this issue once and for all. It is a task for us to build our own
        Erlang on Windows, so that we can apply the Erlang patch that fixes the eacces errors.
        We also need to build our own Erlang on Windows because of CBD-753, since the
        maximum number of allowed open file descriptors on Windows is much smaller than on GNU/Linux - after
        several (or maybe many) rebalances on a Windows cluster, nodes will run out of file descriptors soon
        or run out of disk space - this makes the system unusable and will require nodes to be restarted.

        So I would like to request to get priority on CBD-790 if Windows is really a platform that we
        want to fully support and have people using it for our software.
        "

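        The retry workaround described in the mail above (retrying a file operation
        for up to 5 seconds when it fails with eacces) can be illustrated roughly as
        follows. This is only a Python sketch of the general pattern, not the actual
        couchdb code, which is written in Erlang; the name retry_on_eacces and the
        timeout and interval parameters are assumptions made for the example.

```python
import errno
import os
import time


def retry_on_eacces(operation, timeout=5.0, interval=0.1):
    """Call operation() until it succeeds or `timeout` seconds have passed.

    Only access-denied errors (EACCES) are retried; any other error
    propagates immediately. The last EACCES error is re-raised on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return operation()
        except OSError as exc:
            # Give up on unrelated errors, or once the deadline has passed.
            if exc.errno != errno.EACCES or time.monotonic() >= deadline:
                raise
            time.sleep(interval)


def rename_with_retry(src, dst):
    """Hypothetical usage: rename a file, tolerating transient EACCES."""
    return retry_on_eacces(lambda: os.rename(src, dst))
```

        As the mail points out, a wrapper like this only hides the symptom: if another
        thread keeps the file open without share flags for longer than the timeout,
        the operation still fails.
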
        siri Sriram Melkote added a comment -

        Trond has offered to build the Erlang VM as a one-off with the patches from Filipe for 2.0.1.
        For 2.0.2, hopefully Phil will be able to address CBD-790.
        Thanks Trond!

        FilipeManana Filipe Manana (Inactive) added a comment -

        Same as MB-7772, addressed by CBD-790 (patched erlang or erlang r16b)

        maria Maria McDuff (Inactive) added a comment -

        MB-7772

          People

          • Assignee:
            trond Trond Norbye
            Reporter:
            iryna iryna
          • Votes:
            0
            Watchers:
            5
