  1. Couchbase Server
  2. MB-38550

Add ability to create lookup table cache(s) in Eventing


Details

    Description

      The Issue

      This is a version of CBSE-8043

      Exposing a function to cache a value or document for a period of time, much as crc64() was exposed, is fairly straightforward.

      For customers that make extensive use of lookup tables, Eventing code sometimes utilizes a complex N1QL statement (or statements) in the handler's JavaScript to perform the lookup for each mutation, even though the lookup table never (or rarely) changes.

      Ideally all caching should be global in KV, but there are some benefits / use cases for caching on a per-thread basis:

      1. Lookup tables globally on a per thread basis.
      2. Special IDs (as a key into KV on a per thread basis)

      1. Lookup Tables

      Lookup tables that use KV are faster, but our customers are trying to use N1QL from inside their JavaScript Eventing functions as much as possible (for clarity and convenience). This creates a large performance bottleneck; I will describe the reasons later in this discussion.

      The Solution: an Example

      The basic idea is to be able to cache the object returned by an arbitrary function for some duration. Below, the loader function is loadData(meta) and the cache duration is "300s", i.e. five minutes.

      var my_cache = my_bucket.cached_lookup(loadData(meta), "300s");

      The above can be easily implemented and added to Eventing (the actual syntax, of course, is subject to change). A concrete example might look like the following:

      function loadData(meta) {
          log("beg build cache for worker/thread triggered via", meta.id);
          var t0 = new Date();
          var airline_destairport_2_metaid = {};
          // seems to be faster to get DISTINCT VALUE of concatenated items
          var iter_cache = SELECT DISTINCT VALUE airline || '|' || 
                               destinationairport || '|' || 
                               META().id FROM `travel-sample`
          WHERE `type` = 'route';
          for (var citem of iter_cache) {
              var tmpary = citem.split('|');
              //log ('A: '+tmpary);
              if (airline_destairport_2_metaid[tmpary[0]] === undefined) {
              airline_destairport_2_metaid[tmpary[0]] = {};
              }
              airline_destairport_2_metaid[tmpary[0]][tmpary[1]] = tmpary[2];
          }
          iter_cache.close(); // remove if pre 6.5
          var d0 = new Date() - t0;
          log("end build cache took "+d0+" ms. for worker/thread triggered via",
              meta.id);
          return airline_destairport_2_metaid;
      }
       
      function OnUpdate(doc, meta) {
          if (doc.type != "airport" || doc.iata === null) return;

          // cache the results of function loadData for the next 300 seconds.
          var my_cache = my_bucket.cached_lookup(loadData(meta), "300s");

          // does the airline fly to SFO? - avoid a N1QL query on every invocation
          var flies_to_sfo = my_cache[doc.iata] !== undefined &&
                             my_cache[doc.iata]['SFO'] !== undefined;

          // find all destination airports serviced
          var airports_serviced = {};
          if (my_cache[doc.iata] !== undefined) {
              // get the destination keys from our thread's cache
              for (var k of Object.keys(my_cache[doc.iata])) {
                  airports_serviced[k] = 'y';
              }
          }
      }
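      As a rough illustration of the intended cached_lookup semantics, the following plain-JavaScript sketch memoizes a loader's result for a TTL. The makeCachedLookup name and API here are a hypothetical stand-in, not the proposed Eventing syntax:

```javascript
// Hypothetical sketch: memoize the result of a loader function and
// rebuild it only after the TTL expires, mimicking cached_lookup.
function makeCachedLookup(loader, ttlMs) {
    var value = null;
    var expiresAt = 0; // epoch ms; 0 forces a build on first use
    return function (meta) {
        var now = Date.now();
        if (now >= expiresAt) {
            value = loader(meta);    // e.g. run the N1QL and build the table
            expiresAt = now + ttlMs; // next rebuild after the TTL
        }
        return value;
    };
}

// usage: the loader runs once; later calls reuse the cached table
var builds = 0;
var lookup = makeCachedLookup(function (meta) {
    builds++;
    return { AA: { SFO: 'route::1' } }; // stand-in for the N1QL-built table
}, 300 * 1000); // "300s"

var t1 = lookup({ id: 'doc1' });
var t2 = lookup({ id: 'doc2' }); // cache hit; the loader is not re-run
```

The loader is deliberately passed the mutation's meta so a log line can record which mutation triggered the (rare) rebuild, as loadData above does.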

      The current performance bottleneck

      The performance bottleneck is that every Eventing thread (and there are two threads per worker) essentially runs a "fresh VM", so to speak, on each and every mutation, starting with no prior or stored state other than the doc and meta for OnUpdate, or the meta for OnDelete.

      Thus consider an Eventing Handler that needs to do a lookup, and that lookup via N1QL takes 25 ms. We have just limited the overall performance to 40 mutations per second per thread.
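      The 40 mutations/sec. ceiling is simple arithmetic:

```javascript
// If every mutation performs one synchronous 25 ms N1QL lookup,
// one thread can do at most 1000 / 25 lookups per second.
var lookupMs = 25;
var maxPerThread = 1000 / lookupMs;  // 40 mutations/sec per thread
var maxPerWorker = 2 * maxPerThread; // 2 threads per worker => 80/sec
```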

      The situation gets even worse; consider a handler that needs to perform two N1QL calls as a set of nested lookups:

       

      iter_dest = SELECT DISTINCT VALUE destinationairport FROM `travel-sample` 
                      WHERE `airline` = $iata AND `type` = 'route';
      for (var ditem of iter_dest) {
          var dest = ditem; // DISTINCT VALUE returns the bare value
          airports_serviced[dest] = 'y';
          var iter_city = SELECT city FROM `travel-sample` 
                              WHERE type = 'airport' AND faa = $dest;
          for (var citem of iter_city) {
              var city = citem["city"];
              cities_serviced[city] = 'y';
          }
          iter_city.close(); // remove if pre 6.5
      }
      iter_dest.close(); // remove if pre 6.5

       
      In the above case we are lucky to hit 2 mutations per second with one worker (on a single node with 12 cores).

      Even with eight workers that is still a measly 16 mutations per second, and this assumes perfectly linear scaling; we are slamming the N1QL Query service/engine with lookups here.

      In my experience, on a one-node system with any sort of complex query, the N1QL engine gets saturated at about 300-400 QPS. When the N1QL comes from Eventing I haven't seen over 200 QPS on my small one-node test system.

      Back to the Solution: an Example

      In Eventing a caching function can be added. Let's consider the 'outer select' only, where for one worker (2 threads) we can maintain about 80 mutations/sec. if we have the proper index.

      So let's cache a massive data set (about 500 KB). The below query to determine airports serviced is contrived but illustrates the issue; the return set is 24,024 docs totaling 476,147 bytes:
       

      var iter_dest = SELECT DISTINCT VALUE destinationairport FROM `travel-sample`
                          WHERE `airline` = $iata AND `type` = 'route';
      for (var ditem of iter_dest) {
          var dest = ditem;
          airports_serviced[dest] = 'y';
      }
      iter_dest.close();
      

       
      As I said, even with the best possible N1QL covering index (one-node test case), a single worker thread is limited to about 80 mutations/second (on a single node with 12 cores).

      The idea is to cache the above data into a lookup table in something called, say, "my_cache" and have this cache persist for each thread (or worker), refreshing on a user-defined interval. We showed how to grab the data in "The Solution: an Example" above; once it is in a cache and we have a reference to it, say my_cache, we can reuse it over and over again without incurring N1QL penalties (except for periodic reloads, which I did not implement).

      if (my_cache[iata] !== undefined) {
          // get the destination keys from our thread's cache
          var keys = Object.keys(my_cache[iata]);
          for (var k of keys) {
              airports_serviced[k] = 'y';
          }
      }

      The above is fast, between 0 and 1 ms. (much closer to 0), even if we have 24,024 keys.
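      A standalone micro-benchmark (hypothetical plain JavaScript, not Eventing code, with generated stand-in keys) makes the in-memory cost easy to verify:

```javascript
// Hypothetical micro-benchmark: even scanning ~24,000 in-memory keys
// is on the order of a millisecond, versus a 25 ms N1QL round trip.
var cache = {};
for (var i = 0; i < 24024; i++) cache['APT' + i] = 'y';

var t0 = Date.now();
var airports_serviced = {};
for (var k of Object.keys(cache)) {
    airports_serviced[k] = 'y';
}
var elapsedMs = Date.now() - t0; // typically a few ms at most
```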

      With guidance from Engineering in India I actually implemented such a cache (sans the refresh logic). The results can be astounding.

      ===========================================================
      Test #1: easy N1QL, e.g. 149 airlines, with just 58 flying to an actual set of destinations
      ===========================================================

      USE_CACHE = false; total time was 1.004 seconds processing 289 items out of 31,591 items in the bucket (287.8 items/sec. 1 worker)

      USE_CACHE = true; total time was 1.025 seconds processing 289 items out of 31,591 items in the bucket (282.0 items/sec. 1 worker). Note the cache build times were 707 ms. and 734 ms. for the two threads; this dominated the test. If we subtract 707 ms. we get 908.8 items/sec. with 1 worker (if we had a lot of data).
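      The 908.8 figure is just the measured items divided by the wall clock minus the one-time cache build:

```javascript
// Steady-state rate once the one-time cache build is excluded.
var items = 289;
var totalSec = 1.025;  // measured wall clock
var buildSec = 0.707;  // one-time cache build for the thread
var steadyRate = items / (totalSec - buildSec); // ~908.8 items/sec
```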

      ===========================================================
      Test #2: hard N1QL, e.g. 31,591 airlines (forced), all with 432 destinations to fly to
      ===========================================================

      USE_CACHE = false; total time was 425.117 seconds processing 31591 items out of 31,591 items in the bucket (74.3 items/sec. 1 worker)

      USE_CACHE = true; total time was 3.599 seconds processing 31591 items out of 31,591 items in the bucket (8777.716 items/sec. 1 worker)

      Summary

      Using this caching technique I hit 8.7K items/sec. (each item scanning 24,024 keys), compared to a much slower 80 items/sec. using direct N1QL queries. This is a 110X speedup.

      But it gets better: if the cached lookup table held just 8,192 (8K) items, USE_CACHE = true gives 21,519 items/sec (total time 1.468 sec. across 31,591 items), while USE_CACHE = false gives 74 items/sec (total time 429.0 sec. across 31,591 items). Here we have a 292X speedup.
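      The 292X figure falls out of the two total run times:

```javascript
// The speedup is just the ratio of total run times over the same
// 31,591 mutations with the 8K-item lookup table.
var uncachedSec = 429.0; // USE_CACHE = false
var cachedSec = 1.468;   // USE_CACHE = true
var speedup = uncachedSec / cachedSec; // ~292X
```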

      We see the further improvement because my test scans across all keys; note that in many use cases we would just be looking up a single value, and that is just an O(1) operation.
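      For the single-value case, the lookup is just a chain of hash accesses (plain-JavaScript illustration with made-up sample data):

```javascript
// With the nested table in memory, checking one route is O(1):
// airline -> destination airport -> META().id
var my_cache = { AA: { SFO: 'route::1', LAX: 'route::2' } };
var fliesToSFO = my_cache['AA'] !== undefined &&
                 my_cache['AA']['SFO'] !== undefined;
```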

      2. Special IDs (as a key into KV on a per thread basis)

      ===========================================================
      Interim implementation of poor man's feeds
      ===========================================================

      As indicated above, all caching should ideally be global in KV, but there are some benefits / use cases for caching a rarely changing ID, as a KEY into persistent storage, on a per-thread basis.

      A real-world use case where we can benefit from adding a caching feature is an interim solution to overcome the fact that Eventing does not have or support the concept of feeds (as in a Kafka lite).

      Assuming that we have a local (per-thread) cache, we can also avoid race conditions when operating on KV items: by using a cached per-thread KEY we are guaranteed not to have contention or races between threads. For example, I supplied the Eventing function "curl_bundle_via_cache.json" to Engineering.

      This Eventing Handler is a high-speed distributed batching system for Eventing's built-in cURL that on a 12-core, 1-node system can maintain 13K cURL docs per second. This is accomplished by batching the requests and saving them in a KV doc on each mutation until they are flushed as arrays (or doc streams) via a single cURL call, either when a thread's doc count gets above a threshold (say 512 docs) or when specific elapsed timeouts have been exceeded. This implementation is distributed and stores all data to be sent in KV (so it is robust and can recover from node failures). Unfortunately, without a caching API, this function, which runs on 6.5 today, uses a non-approved technique for caching a UUID and so cannot be exposed to customers.

      It should be noted that if a UUID is not cached / bound to a thread, race conditions are introduced, requiring additional KV documents (and reads and writes) plus the use of crc64() to ensure correct operation in 6.5.0 (this may change if CAS is exposed).
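      The batching technique itself can be sketched in plain JavaScript. The makeBatcher name, threshold, and flush callback below are hypothetical; the real handler persists each thread's batch in a KV doc and flushes it via a single cURL call:

```javascript
// Hypothetical sketch of the per-thread batcher: accumulate docs and
// flush them as one array when a count threshold or max age is reached.
function makeBatcher(threshold, maxAgeMs, flush) {
    var batch = [];
    var firstAt = 0; // epoch ms of the oldest pending doc
    return {
        add: function (doc) {
            if (batch.length === 0) firstAt = Date.now();
            batch.push(doc);
            var tooOld = Date.now() - firstAt >= maxAgeMs;
            if (batch.length >= threshold || tooOld) {
                flush(batch); // one cURL call instead of batch.length calls
                batch = [];
            }
        }
    };
}

// usage: flush every 3 docs here (a real threshold might be 512)
var flushes = [];
var batcher = makeBatcher(3, 60 * 1000, function (docs) {
    flushes.push(docs.length);
});
for (var i = 0; i < 7; i++) batcher.add({ n: i });
// two flushes of 3 docs each; the 7th doc is still pending
```

Because the batch (and in the real handler, the KV doc key) is owned by exactly one thread, no other thread ever reads or writes it, which is what eliminates the races described above.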


            People

              sujay.gad Sujay Gad
              jon.strabala Jon Strabala