Aliaksey Artamonau Thanks.
If we want to try to improve this in the Mad-Hatter timeframe, I don't think we have time to implement the fully-flexible proposal outlined in Changes in ns_server for Sync Replication - STAT vbucket-details. I think what is feasible is something along the lines of the following:
1. Add a new stat group - vbucket-details-durability <VBID> or similar - which returns just the 2 stats ns_server typically needs. That would reduce the traffic to 1024 STAT.request packets and 2048 STAT.response packets per call, i.e. 25x fewer.
   - We could actually encode this along the lines of the above proposal - i.e. STAT(key="vbucket-details", value="[\"state\", \"high_seqno\"]") - with the limitation that kv_engine literally only supports filtering for those two specific stats. That should make it easier to expand the list of stats which can be requested in future.
2. If the reduction to 2 stats per vBucket still isn't sufficient (i.e. the problem is sending 1024x request and decoding 2048x response TCP packets), then we might be able to add a single stat call which returns the necessary stats for all vBuckets in a single JSON payload. I guess it depends on where the cost lies in ns_server.
In the case of (1) we can definitely get you some PoC code in a day or so to test out and see what difference it makes; for (2) that's a bit more work (need to do a bunch of aggregation across all vBuckets) so I'd rather see the results of (1) first and see if that's sufficient.
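To make the two options concrete, here is a rough Python sketch of the intended behaviour. It is not kv_engine code: the function names, the "vb_<id>" key format in the aggregated payload, and the error handling are all assumptions made purely for illustration. Only the filter value format and the two stat names come from the proposal above.

```python
import json

# Assumption: as in option (1), only these two stats can be requested.
SUPPORTED_STATS = {"state", "high_seqno"}


def filter_vbucket_stats(all_stats, filter_value):
    """Option (1): return only the stats named in the JSON filter value.

    all_stats    -- dict of every stat for one vBucket (hypothetical input)
    filter_value -- JSON array of stat names, e.g. '["state", "high_seqno"]'
    """
    requested = json.loads(filter_value)
    unsupported = [name for name in requested if name not in SUPPORTED_STATS]
    if unsupported:
        raise ValueError("unsupported stats requested: %s" % unsupported)
    return {name: all_stats[name] for name in requested}


def aggregate_vbucket_stats(per_vbucket, filter_value):
    """Option (2): one JSON payload holding the filtered stats of all vBuckets.

    per_vbucket -- dict mapping vBucket id to that vBucket's full stats dict
    The "vb_<id>" key naming is a guess, not an agreed format.
    """
    payload = {
        "vb_%d" % vbid: filter_vbucket_stats(stats, filter_value)
        for vbid, stats in sorted(per_vbucket.items())
    }
    return json.dumps(payload)
```

With option (1) the filtering would run once per STAT request (one per vBucket); with option (2) ns_server would issue a single request and decode a single JSON document, at the cost of kv_engine doing the aggregation across all vBuckets.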
Let me know if you want us to prepare a patch for testing - including exactly which stats you need in the new stat group.
See also Changes in ns_server for Sync Replication - STAT vbucket-details.