Description
While running the test suite py-view-pre-merge.conf, reached a case (once, and only once so far) where queries were being retried forever, or at least for so long that no progress seemed to happen and the system was basically idle. The cause: vbucket 0 was not marked as active in the index, yet ns_server was passing the view merger a vbucket map that listed vbucket 0 as active (see the sketch after the log excerpt below):
[couchdb:info] [2012-07-29 17:42:02] [n_0@192.168.1.80:<0.6954.1>:couch_log:info:39] Set view `default`, group `_design/dev_test_view-b2fa892`, missing partitions: [0]
[couchdb:info] [2012-07-29 17:42:07] [n_0@192.168.1.80:<0.6978.1>:couch_log:info:39] Set view `default`, group `_design/dev_test_view-b2fa892`, missing partitions: [0]
[couchdb:info] [2012-07-29 17:42:12] [n_0@192.168.1.80:<0.7006.1>:couch_log:info:39] Set view `default`, group `_design/dev_test_view-b2fa892`, missing partitions: [0]
(.... repeated lots of times ...)
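
The retry loop behind these log lines can be pictured with a minimal sketch in Erlang, assuming the merger simply waits until every vbucket listed as active in ns_server's map is also active in the set view index (module, function and parameter names below are hypothetical, not the actual couch_index_merger code):

-module(merge_retry_sketch).
-export([wait_for_partitions/2]).

%% WantedActive: vbuckets that ns_server's vbucket map lists as active on this node.
%% GetIndexActive: fun() returning the vbuckets currently marked active in the set view index.
wait_for_partitions(WantedActive, GetIndexActive) ->
    IndexActive = GetIndexActive(),
    Missing = ordsets:subtract(ordsets:from_list(WantedActive),
                               ordsets:from_list(IndexActive)),
    case Missing of
        [] ->
            ok;                           % every wanted vbucket is active in the index
        _ ->
            io:format("missing partitions: ~w~n", [Missing]),
            timer:sleep(5000),            % the log above shows roughly 5s between retries
            wait_for_partitions(WantedActive, GetIndexActive)
    end.

Under these assumptions, merge_retry_sketch:wait_for_partitions([0], fun() -> [] end) never returns, which matches what the log shows: as long as the index never activates vbucket 0, the loop keeps reporting the same missing partition.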
In the views.1 log (used by ns_server's capi_set_view_manager), vbucket 0 was marked for cleanup in the main index (where it had previously been marked as active) at timestamp "2012-07-29 17:41:55", and its removal from the replica index was requested as well (a no-op, since it was not marked as replica there).
The queries started failing around timestamp "2012-07-29 17:42:02", shortly after vbucket 0 was marked for cleanup in the main index of node n_0.
This can be seen in the logs of node n_0 at the end of views.1 and couchdb.1 (state transitions in both logs seem to match each other).
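
A minimal sketch of that state transition, assuming partition states are tracked per index as a map from vbucket to state (module name and data layout are made up for illustration only):

-module(partition_state_sketch).
-export([demo/0]).

demo() ->
    %% Before 17:41:55: vbucket 0 is active in the main index.
    Main0 = #{0 => active, 1 => active},
    %% 17:41:55: vbucket 0 is marked for cleanup in the main index.
    Main1 = Main0#{0 := cleanup},
    %% The matching removal from the replica index is a no-op,
    %% since vbucket 0 was never marked as replica there.
    Replica = #{},
    %% 17:42:02: a query that expects vbucket 0 to be active now reports it missing.
    Missing = [P || {P, S} <- maps:to_list(Main1), S =/= active],
    {Main1, Replica, Missing}.

demo() returns {#{0 => cleanup, 1 => active}, #{}, [0]}: once the cleanup transition happens, vbucket 0 is no longer in the active set, so every subsequent query that expects it to be active ends up in the retry loop sketched earlier.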
Not sure if this means that node n_0 was not supposed to mark vbucket 0 for cleanup, or if it was later supposed to mark it as active again. Vbucket 0 doesn't seem to be marked as active on any of the other 3 nodes either.
Logs attached.