  1. Couchbase Server
  2. MB-7587

System Testing: Uneven compaction and very high swap; across a cluster, some nodes have much higher fragmentation (70 percent) than others.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.1
    • Component/s: test-execution
    • Security Level: Public
    • Labels:
      None
    • Environment:

      Description

      Set up a cluster as above.
      Create 2 views per bucket.
      Run time: 10+ hours

      Seeing uneven compaction and high doc fragmentation across the nodes (screenshot attached below).

      Seeing very high swap on these nodes as well (screenshot below).

      • Indexing/compaction is running continuously, but never completes.

      Adding logs.

      • Does this high doc fragmentation (70-80 percent) indicate that compaction did not work as expected?
        Is this something we've seen before or know of?

      Please let me know if you need any other system information on this.

      1. Screen Shot 2013-01-23 at 8.51.32 AM.png
        87 kB
        Ketaki Gangal
      2. Screen Shot 2013-01-23 at 8.53.35 AM.png
        124 kB
        Ketaki Gangal
      3. Screen Shot 2013-01-23 at 9.07.08 AM.png
        34 kB
        Ketaki Gangal

        Activity

        aaron Aaron Miller (Inactive) added a comment -

        If we have compaction crashes I'll at least need stack traces or cores to have any hope of diagnosing them.

        aaron Aaron Miller (Inactive) added a comment -

        Found the cores.

        It looks like tmpfile() is returning NULL on this machine with error 13 (EACCES, permission denied). This could be caused by Couchbase not having permission to create files in /tmp.
        I will add a fix to handle this (the compactor will still not work; it will just exit with an error that is not a segfault).

        To make sure this doesn't happen, the directory that tmpfile() creates files in (typically /tmp) needs to be writable by the user Couchbase runs as.
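As a rough way to check the condition described in this comment, the sketch below tries to create a throwaway temporary file in a given directory and reports the errno on failure. It is only an illustration: the helper name `can_create_temp_file` is hypothetical, the default of /tmp is an assumption taken from the comment, and it uses Python's `tempfile` module rather than the C `tmpfile()` call that the compactor itself uses. To be meaningful it would have to be run as the same user Couchbase runs as.

```python
import errno
import tempfile


def can_create_temp_file(directory="/tmp"):
    """Try to create (and immediately discard) a temporary file in `directory`.

    Returns (True, None) on success, or (False, errno_code) on failure.
    """
    try:
        # The file is removed automatically when the context manager exits.
        with tempfile.TemporaryFile(dir=directory):
            pass
        return True, None
    except OSError as exc:
        return False, exc.errno


if __name__ == "__main__":
    ok, code = can_create_temp_file("/tmp")
    if ok:
        print("/tmp is writable by the current user")
    elif code == errno.EACCES:
        print("permission denied (errno 13): check ownership and mode of /tmp")
    else:
        print("could not create a temp file, errno:", code)
```

A result of `(False, errno.EACCES)` would correspond to the permission-denied failure reported here.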

        dipti Dipti Borkar added a comment -

        Per bug scrub: after the fix is merged, QE will re-run the test.

        farshid Farshid Ghods (Inactive) added a comment -

        Ketaki to rerun the test with build 144+

        FilipeManana Filipe Manana (Inactive) added a comment -

        I don't think any fix was done here.
        Couchstore's compactor was only changed to exit with a status other than 139 (segfault). From above, after database compaction failed with couchstore, the Erlang-based compactor, which it falls back to, also failed:

        [couchdb:error,2013-01-23T0:10:57.085,ns_1@10.6.2.42:<0.16993.0>:couch_log:error:42]Native compact for "default/master" failed due to error {exit_status,139}. Falling back to erlang.
        [error_logger:error,2013-01-23T0:10:57.087,ns_1@10.6.2.42:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]Error in process <0.16993.0> on node 'ns_1@10.6.2.42' with exit value: badmatch,{error,no_valid_header,[{couch_db_updater,start_copy_compact,2}]}

        It still needs to be analyzed why master vbucket compaction fails (apparently the file is empty, or non-empty with no header written, as it has at least 1 doc).
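To find every occurrence of this failure in a collected log, a small scan like the one below may help. It is a sketch only: the regular expression is modelled on the single log line quoted in this comment, the names `find_native_compact_failures` and `signal_name` are hypothetical, and decoding an exit status above 128 as 128 plus a signal number is an assumption that matches the 139 = segfault reading used in this ticket.

```python
import re
import signal

# Pattern based on the "Native compact ... failed" line quoted in this ticket.
NATIVE_FAIL_RE = re.compile(
    r'Native compact for "([^"]+)" failed due to error\s*\{exit_status,(\d+)\}'
)

SAMPLE = (
    '[couchdb:error,2013-01-23T0:10:57.085,ns_1@10.6.2.42:<0.16993.0>:'
    'couch_log:error:42]Native compact for "default/master" failed due to '
    'error {exit_status,139}. Falling back to erlang.'
)


def signal_name(exit_status):
    """Return the signal name for a 128+N exit status, or None if not applicable."""
    if exit_status <= 128:
        return None
    try:
        return signal.Signals(exit_status - 128).name
    except ValueError:
        return None


def find_native_compact_failures(log_text):
    """Return (database, exit_status, signal_name) for each native compaction failure."""
    return [
        (db, int(status), signal_name(int(status)))
        for db, status in NATIVE_FAIL_RE.findall(log_text)
    ]


if __name__ == "__main__":
    for db, status, sig in find_native_compact_failures(SAMPLE):
        print(db, status, sig)
```

Entries that still show a signal name after the couchstore change would suggest the native compactor is crashing for some other reason than the tmpfile() failure.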


          People

          • Assignee:
            ketaki Ketaki Gangal
            Reporter:
            ketaki Ketaki Gangal
