
MB-37127: [BP MB-37100] - 6.0.3-MP1 back port - View queries become stuck due to merge connection timeouts


    Description

      There is a regression, introduced in Couchbase Server 5.5.0 by the changes for MB-16737, that intermittently causes view queries to hang for a very long time.

      Steps to reproduce

      1. Create a multi node cluster (I used a 4 node vagrant cluster) running Data on all nodes.
      2. Create a bucket and load it with JSON documents (I loaded ~500,000), e.g.:

         /opt/couchbase/bin/cbc pillowfight -U "couchbase://localhost/test" -u Administrator -P password -J -I 500000 --populate-only
        

      3. Create a view with complex array keys and a reduce function. It is not important how complex the keys are or what the reduce actually does; the key aspect is that each node has to spend a reasonable amount of time preparing its results for the merge query.
        For example (this is contrived to aggravate the problematic behavior):
        Map

        function (doc, meta) {
          for (var i = 0; i < 100; i++) {
            var l = [];
            for (var j = 0; j < 10; j++) {
              l.push(j * i);
            }
            emit(l, Math.random());
          }
        }
        

        Reduce

        _stats
        

      4. Once the index is built, query it in a loop using a high group_level, for example:

        { while true; do date 1>&2; time curl -sS -u Administrator:password 'http://10.143.192.101:8092/test/_design/testy/_view/testy?stale=ok&inclusive_end=true&skip=0&group_level=5' ; done } > /dev/null
        

      Expected Behavior

      This query loop runs indefinitely with reasonably consistent response times.

      Actual Behavior

      After roughly 60 seconds, a view query becomes 'stuck' for a while; once it completes (with errors), subsequent queries are fast again for another 60 seconds or so, and then the cycle repeats.
      For example:

      Fri Nov 29 03:18:03 PST 2019
      real	0m0.936s
      user	0m0.000s
      sys	0m0.004s
       
      ...
       
      Fri Nov 29 03:19:01 PST 2019
      real	0m0.975s
      user	0m0.004s
      sys	0m0.000s
       
      Fri Nov 29 03:19:02 PST 2019
      real	1m0.313s
      user	0m0.004s
      sys	0m0.000s
       
      Fri Nov 29 03:20:02 PST 2019
      real	0m1.001s
      user	0m0.000s
      sys	0m0.004s
      

      Investigation

       From a process dump on the view engine being queried, I can see that it is stuck waiting for the results of the merge call to the other nodes in the cluster; in this example, node 10.143.192.102:

       {backtrace,
             [<<"Program counter: 0x00007f74e8735048 (lhttpc:get_body_part/2 + 56)">>,
              <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,<<>>,
              <<"0x00007f74c11c4e30 Return addr 0x00007f74caf8cc10 (couch_index_merger:next_chunk/2 + 40)">>,
              <<"y(0)     60000">>,<<"y(1)     <0.17011.11>">>,<<>>,
              <<"0x00007f74c11c4e48 Return addr 0x00007f74caf92228 (couch_http_view_streamer:next_streamer_state/2 + 232)">>,
              <<>>,
              <<"0x00007f74c11c4e50 Return addr 0x00007f74caf91cd8 (couch_http_view_streamer:stream_loop/4 + 112)">>,
              <<"y(0)     #Fun<couch_index_merger.16.2253377>">>,
              <<"y(1)     <<0 bytes>>">>,<<>>,
              <<"0x00007f74c11c4e68 Return addr 0x00007f74caf91c10 (couch_http_view_streamer:parse/3 + 272)">>,
              <<"y(0)     []">>,<<"y(1)     <<119 bytes>>">>,
              <<"y(2)     #Fun<couch_index_merger.16.2253377>">>,
              <<"y(3)     <0.17004.11>">>,<<"y(4)     <<0 bytes>>">>,<<>>,
              <<"0x00007f74c11c4e98 Return addr 0x00007f74caf8b220 (couch_index_merger:run_http_index_folder/5 + 1024)">>,
              <<>>,
              <<"0x00007f74c11c4ea0 Return addr 0x00007f74caf8a908 (couch_index_merger:http_index_folder/5 + 304)">>,
              <<"y(0)     Catch 0x00007f74caf8b338 (couch_index_merger:run_http_index_folder/5 + 1304)">>,
              <<"y(1)     Catch 0x00007f74caf8b500 (couch_index_merger:run_http_index_folder/5 + 1760)">>,
              <<"y(2)     []">>,<<"y(3)     []">>,<<"y(4)     []">>,
              <<"y(5)     <0.17004.11>">>,
              <<"(6)     \"http://%40ns_server:a728b878680ed4605cb597a67c98a8e09372963eac710802bf7d1e982628ff97@10.143.192.102:8092/_view_">>,
              <<"y(7)     []">>,<<>>,
              <<"0x00007f74c11c4ee8 Return addr 0x000000000089d2f8 (<terminate process normally>)">>,
              <<"y(0)     Catch 0x00007f74caf8a928 (couch_index_merger:http_index_folder/5 + 336)">>,
              <<"y(1)     Catch 0x00007f74caf8ab20 (couch_index_merger:http_index_folder/5 + 840)">>,
              <<"y(2)     []">>,<<"y(3)     []">>,
              <<"(4)     {index_merge,[{set_view_spec,<<4 bytes>>,<<13 bytes>>,<<5 bytes>>,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,1">>,
              <<"y(5)     []">>,<<"y(6)     []">>,<<>>]},
      ...
         {messages,
             [<<"{response,<0.17011.11>,{error,connection_closed}}">>,
              <<"{ok,#Ref<0.0.79.74304>}">>]},
         {message_queue_len,2},
      ...
         {current_location,
             {lhttpc,get_body_part,2,
                 [{file,
                      "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc.erl"},
                  {line,600}]}},
         {dictionary,
             [{from_url,
                  <<"http://%40ns_server:a728b878680ed4605cb597a67c98a8e09372963eac710802bf7d1e982628ff97@10.143.192.102:8092/_view_merge/">>},
              {streamer_pid,<12402.17011.11>},
              {queue_ref,#Ref<12402.0.79.74304>}]}]},
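
      (A dump like the one above, including the backtrace, messages, and dictionary sections, can be collected from an Erlang remote shell with erlang:process_info/2; the pid below is illustrative:)

        erlang:process_info(pid(0,17004,11),
                            [backtrace, messages, message_queue_len,
                             current_location, dictionary]).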
      

      What's interesting is that it seems to be waiting for results from the other node, but it has actually already received

      {error,connection_closed}

      in the process's mailbox, indicating that the connection was closed, yet for some reason it is still waiting.
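
      This is the classic shape of a stuck Erlang selective receive: the receive only matches the message pattern it expects, so a non-matching error tuple sits unread in the mailbox while the process keeps waiting for its timeout to fire. A minimal illustrative sketch (hypothetical names and message shapes, not the actual lhttpc/couch_index_merger code):

        %% sketch.erl - illustrative only.
        -module(sketch).
        -export([get_body_part/2]).

        get_body_part(StreamerPid, Timeout) ->
            receive
                %% Only the success shape is matched. A message such as
                %% {response, StreamerPid, {error, connection_closed}} does not
                %% match this clause, so it stays queued in the mailbox and the
                %% process blocks here until the 'after' timeout fires.
                {response, StreamerPid, {ok, Bin}} ->
                    {ok, Bin}
            after Timeout ->
                    {error, timeout}
            end.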

      Based on this analysis, I took a tcpdump on port 8092 to see what the network traffic was like at the time; this is attached as mb-37100.pcap.
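
      (For reference, a capture like this can be taken with something along the lines of the following; the interface name is illustrative:)

        tcpdump -i eth1 -w mb-37100.pcap port 8092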
      We can see that eventually the view query returned with a partial result set, showing connection_closed errors:

      "errors":[
      {"from":"10.143.192.103:8092","reason":"connection_closed"},
      {"from":"10.143.192.102:8092","reason":"connection_closed"},
      {"from":"10.143.192.104:8092","reason":"connection_closed"}
      ]
      

      From the pcap, we can also see that the connections to the other data nodes were terminated ~200ms after the call to /_view_merge (in the attached screenshot, the green packets are the HTTP call and the grey packets are the FIN, ACK closing the connection).

      This then causes the result set to be incomplete, as the primary view node did not receive the full results from the nodes it is scatter-gathering from.
      From the pcap we can also see that these connection closes happened approximately 60 seconds after the initial view query:

      [Timestamps]
          [Time since first frame in this TCP stream: 59.413167000 seconds]
          [Time since previous frame in this TCP stream: 0.000075000 seconds]
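
      (The same timing information can be pulled out of the capture on the command line, for example with a tshark invocation along these lines; the fields are standard Wireshark display-filter fields:)

        tshark -r mb-37100.pcap -Y "tcp.flags.fin == 1 && tcp.port == 8092" \
            -T fields -e frame.time_relative -e ip.src -e ip.dst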
      

      Repeating the same test over and over shows a similar pattern, where the connections are closed ~60 seconds after they were opened.

      Digging deeper, this happens because the view engine re-uses the same TCP connections for multiple /_view_merge calls (which is sensible, as it is efficient), and as a result of the changes added in MB-16737 these connections are closed after the default internal HTTP timeout of 60000 milliseconds - http://src.couchbase.org/source/xref/6.0.1/couchdb/src/couch_index_merger/src/couch_index_merger.erl#576-578.

      We will need to fix this within the view engine so that sockets are not closed on a timeout if they have successfully served an HTTP call.
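
      In other words, the timeout should behave as an idle timeout that is reset every time a request is served, rather than as a hard lifetime on the socket. A minimal sketch of that pattern (hypothetical names, not the actual fix):

        %% conn_keeper.erl - illustrative only. Because 'after' restarts on
        %% every pass through the receive, the timeout fires only after
        %% Timeout ms of idleness, so a connection that keeps serving
        %% /_view_merge calls is never closed mid-request.
        -module(conn_keeper).
        -export([loop/2]).

        loop(Socket, Timeout) ->
            receive
                {request, From, Req} ->
                    From ! {response, serve(Socket, Req)},
                    loop(Socket, Timeout)
            after Timeout ->
                    gen_tcp:close(Socket)
            end.

        %% Placeholder for the real request handling.
        serve(_Socket, _Req) ->
            ok.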

      Workaround

      If your application allows it (SDKs such as the Java SDK cannot set this), you can set a custom connection_timeout value as part of the view query.
      Setting this to a very high value for all of your view queries mitigates the impact, as it takes much longer for the inter-node view connections to be closed; rather than a query hanging every minute, one would hang only once per timeout period.
      For example, to set it to 1 day (the value is in milliseconds):

      http://10.143.192.101:8092/test/_design/testy/_view/testy?stale=ok&inclusive_end=true&skip=0&group_level=5&connection_timeout=86400000
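
      Applied to the query loop from step 4, for example:

        { while true; do date 1>&2; time curl -sS -u Administrator:password 'http://10.143.192.101:8092/test/_design/testy/_view/testy?stale=ok&inclusive_end=true&skip=0&group_level=5&connection_timeout=86400000' ; done } > /dev/null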
      

      However, this is not a complete workaround: because the inter-node connections are shared, any query that does not use a high connection_timeout value can still cause connections to be closed, affecting the queries that do use the higher value.

      Attachments

          mb-37100.pcap