Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0-developer-preview-4
    • Fix Version/s: 2.0-developer-preview-4
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None

      Description

      Created a 10-node cluster. Created a view {"map":"function (doc) {\n emit(doc._id, null);\n}","reduce":"_count"} and uploaded 100k JSON items using mcsoda. Queried the view with stale=false; the result was correct. Started removing nodes one by one from the cluster while running view queries. After the second node was removed, the view started returning more than 100k items. I figured out that all duplicated rows come from a single node, and on this node all the duplicated rows come from three vbuckets: 215, 216, 217. There was a period of time when these vbuckets were reported by the set views as both passive and replica:

      Set view `default`, main group `_design/dev_test`, partition states updated
      active partitions before: [73,74,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,101,102,103,240,241,242]
      active partitions after: [73,74,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,101,102,103,240,241,242]
      passive partitions before: [215,216,217]
      passive partitions after: [215,216,217]
      cleanup partitions before: []
      cleanup partitions after: []
      replica partitions before: [6,7,8,32,33,34,58,59,60,113,114,115,127,139,140,141,155,164,165,188,189,190,208,211,214,215,216,217,233,236,239,244,249]
      replica partitions after: [6,7,8,32,33,34,58,59,60,113,114,115,127,139,140,141,155,164,165,188,189,190,208,211,214,215,216,217,233,236,239,244,249]
      replicas on transfer before: [215,216,217]
      replicas on transfer after: [215,216,217]

      The sequence of calls performed by ns_server seems to be correct. I'm attaching full logs and a diag from this node.
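
      For reference, a minimal sketch of the kind of check used to attribute the duplicate rows to vbuckets. It is not one of the attached scripts; the host, port, and view name are placeholders, and it assumes the default 1024 vbuckets and the standard CRC32-based key-to-vbucket mapping.

      # Minimal sketch (not an attached script): query the view with stale=false,
      # find duplicated keys, and map them to vbuckets. Host, port, and view name
      # are placeholders; assumes the default 1024 vbuckets and the standard
      # CRC32-based key-to-vbucket mapping.
      import json
      import urllib2
      import zlib
      from collections import Counter

      VIEW_URL = ("http://node1:8092/default/_design/dev_test/"
                  "_view/test?stale=false&reduce=false")
      NUM_VBUCKETS = 1024

      def vbucket_of(key):
          crc = zlib.crc32(key.encode("utf-8")) & 0xffffffff
          return ((crc >> 16) & 0x7fff) % NUM_VBUCKETS

      result = json.load(urllib2.urlopen(VIEW_URL))
      keys = [row["key"] for row in result["rows"]]
      duplicated = [k for k, n in Counter(keys).items() if n > 1]

      print "total rows:", len(keys)
      print "duplicated keys:", len(duplicated)
      print "vbuckets with duplicates:", sorted(set(vbucket_of(k) for k in duplicated))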

      1. add.py
        0.2 kB
        damien
      2. del.py
        0.1 kB
        damien
      3. incorrect_results.tar.bz2
        13.64 MB
        Aliaksey Artamonau
      4. logs.tar.bz2
        380 kB
        Aliaksey Artamonau
      5. ns-diag-20120124155027.txt.bz2
        792 kB
        Aliaksey Artamonau

        Activity

        Aliaksey Artamonau Aliaksey Artamonau created issue -
        Aliaksey Artamonau Aliaksey Artamonau added a comment -

        Following Filipe's advice, I added an additional check to couch_set_view:modify_bitmasks for an intersection between the active partitions in the main index and the partitions from the replica index. But these sets are disjoint even when the view gives an incorrect result.
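
        This is not the actual Erlang change, just a conceptual sketch in Python of the check described above, using the partition lists from the log snippet in the description. Partition state is tracked as bitmasks, one bit per vbucket, and the main index's active set is expected to stay disjoint from the replica partitions.

        # Conceptual sketch only (the real check lives in couch_set_view, in Erlang):
        # partitions are tracked as bitmasks, one bit per vbucket, and the main
        # index's active set should never intersect the replica index's partitions.

        def partitions_to_bitmask(partitions):
            mask = 0
            for p in partitions:
                mask |= 1 << p
            return mask

        def assert_disjoint(main_active, replica_partitions):
            overlap = partitions_to_bitmask(main_active) & partitions_to_bitmask(replica_partitions)
            if overlap:
                shared = [p for p in range(overlap.bit_length()) if overlap >> p & 1]
                raise AssertionError("active/replica overlap on partitions %s" % shared)

        # With the states from the log above, 215-217 are passive and replica at the
        # same time, but active and replica are still disjoint, so this check passes
        # even while the view returns duplicates.
        active = [73, 74, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
                  93, 94, 95, 96, 97, 98, 101, 102, 103, 240, 241, 242]
        replica = [6, 7, 8, 32, 33, 34, 58, 59, 60, 113, 114, 115, 127, 139, 140, 141,
                   155, 164, 165, 188, 189, 190, 208, 211, 214, 215, 216, 217, 233,
                   236, 239, 244, 249]
        assert_disjoint(active, replica)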

        Aliaksey Artamonau Aliaksey Artamonau added a comment -

        http://review.couchbase.org/#change,12711 does not make any difference for me.

        damien damien added a comment -

        This is an exact list of instructions for reproducing bad views with discrete, non-concurrent steps (100% reproducible; no need to try to make race conditions happen).

        Get a fresh repo named couchbase, and copy the del.py and add.py files to couchbase/ep-engine/management/

        cd couchbase
        make
        cd ns_server
        make dataclean
        ./cluster_run --nodes=2

        From another terminal:

        cd couchbase/ns_server
        ./cluster_connect -n 1
        cd ../ep-engine/management
        python add.py

        From the web ui, create a new view.
        Click "Views" at top
        Click "Create Development View" button
        Enter test names
        Edit the view and change the map function to:

        function (doc) { emit(doc._id, 1); }

        For the reduce, click _count.
        Click "Save" button
        Click "Full Cluster Data Set" button
        Click the generated URL to open the raw JSON view in another browser window
        Keep refreshing until the value is 100000 (or whatever you expect)

        From previous terminal window:
        python del.py

        NOTE: DO NOT REFRESH THE VIEW FROM BROWSER YET!
        From the web ui click "Server Nodes" at top.
        Click "Add Server" button
        Enter the same IP address, and increment the port by one:
        Example:
        Server:10.2.1.60:9001
        Username:Administrator
        Password: asdasd

        Click "Add Server" button
        Click "Rebalance" button

        When rebalance finishes, go back to your raw JSON view in the browser, and refresh. Keep refreshing until the value stops changing (see the polling sketch after these steps).

        The value should be non-zero. BUG!!!!!

        Now go back to "Server Nodes" in web ui
        Click the "Remove" button for the newly added node.
        Click "Rebalance" button

        When rebalance finishes, go back to your raw JSON view in the browser, and refresh. The value should be the same as before. This indicates the bad values are coming from the first node.
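
        To save manual refreshing in the steps above, here is a small sketch that polls the dev view until the reduced value stops changing. The URL is a placeholder for one node's view endpoint in this local cluster_run setup, the design doc and view names are assumed to be dev_test/test, and full_set=true stands in for the "Full Cluster Data Set" button.

        # Sketch: poll the dev view until the reduced count stops changing.
        # The URL is a placeholder (local cluster_run node, design doc and view
        # assumed to be named "dev_test"/"test"); full_set=true queries the full
        # data set for a development view.
        import json
        import time
        import urllib2

        VIEW_URL = ("http://127.0.0.1:9500/default/_design/dev_test/"
                    "_view/test?full_set=true")

        def current_count():
            result = json.load(urllib2.urlopen(VIEW_URL))
            rows = result.get("rows", [])
            return rows[0]["value"] if rows else 0

        previous = None
        while True:
            count = current_count()
            print "reduce value:", count
            if count == previous:
                break
            previous = count
            time.sleep(5)

        # After del.py and the rebalance described above, a stable non-zero count
        # here is the bug.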

        damien damien added a comment -

        Used to reproduce steps from damien

        damien damien made changes -
        Field Original Value New Value
        Attachment add.py [ 12075 ]
        damien damien added a comment -

        Used to reproduce steps from damien

        damien damien made changes -
        Attachment del.py [ 12076 ]
        damien damien made changes -
        Attachment add.py [ 12075 ]
        damien damien made changes -
        Attachment del.py [ 12076 ]
        damien damien made changes -
        Attachment add.py [ 12077 ]
        damien damien made changes -
        Attachment del.py [ 12078 ]
        filipe manana filipe manana added a comment -

        http://review.couchbase.org/#change,12767 fixes it

        Aliaksey Artamonau Aliaksey Artamonau added a comment - - edited

        Reproduced it with all the latest fixes using the same scenario (though it definitely happens less frequently). After another rebalance-out, the view constantly returns more items than there are in the bucket. I figured out that one of the nodes returns items from vbucket 250, which is not activated in the index. It used to be active, but then set_partition_states with cleanup_partitions=[250] was called. Will attach full logs from this node soon.

        Aliaksey Artamonau Aliaksey Artamonau made changes -
        Attachment incorrect_results.tar.bz2 [ 12093 ]
        farshid Farshid Ghods (Inactive) added a comment -

        ./testrunner -i b/resources/dev-4-nodes.ini -t viewtests.ViewTests.test_count_reduce_100k_docs

        It happens even with a single node, but less frequently than before.

        filipe manana filipe manana added a comment -

        @Aliaksey

        Need more info on how to reproduce this. Are the query results inconsistent during failover or rebalance (or both)? Are they temporary (only during rebalance or failover) or permanent?

        Please make sure all your nodes have the following couchdb commit:
        https://github.com/couchbase/couchdb/commit/43c6b744c8a110c5a1f6f9a2039fcc405cbff1a9

        @Farshid

        Farshid, I ran that test locally; it sometimes fails for me too.
        One thing I noticed is that the test's queries don't specify ?stale=false. I think this is what's making the test fail so often.
        I changed the test viewtests.ViewTests.test_count_reduce_100k_docs locally to add stale=false to all queries, and with that change the test always passes for me:

        http://friendpaste.com/5OUPCfOUHxEG4HBB0qU7r9

        Can you verify that?

        Aliaksey Artamonau Aliaksey Artamonau added a comment -

        Results were permanently inconsistent after rebalancing out several nodes. All the nodes were built with the commit you're referring to.

        steve Steve Yen added a comment -

        need repro?

        steve Steve Yen made changes -
        Assignee Filipe Manana [ filipe manana ] Karan Kumar [ karan ]
        karan Karan Kumar (Inactive) added a comment -

        Confirmed that test_count_reduce_x_docs passes.

        karan Karan Kumar (Inactive) made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        karan Karan Kumar (Inactive) made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        farshid Farshid Ghods (Inactive) made changes -
        Component/s view-merging [ 10145 ]
        Component/s b-superstar [ 10143 ]
        peter peter made changes -
        Component/s ns_server [ 10019 ]
        Component/s view-merging [ 10145 ]

          People

          • Assignee:
            karan Karan Kumar (Inactive)
          • Reporter:
            Aliaksey Artamonau Aliaksey Artamonau
          • Votes:
            0
          • Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes