  1. Couchbase Server
  2. MB-28799

Investigate adding SDK-authored JSON/Snappy datatype bit

Details

    • Improvement
    • Resolution: Won't Do
    • Major
    • None
    • 5.5.0
    • clients, couchbase-bucket
    • None

    Description

      Spun out from MB-28409 discussion. Originally by Matt Ingenthron:

      I feel like there is probably another path here that preserves existing user expectations. It might turn this into a win amplifying a win. Suspend disbelief for a moment.

      Before I describe further, consider these three cases. This is a Java example, but there is an analog in each high-level SDK.

              // The Majority Case
              // JsonDocument;    << will be json (modulo bugs)  << majority case
              // Create a JSON Document
              JsonObject aFragment = JsonObject.create()
                      .put("name", "Matt Ingenthron");
              JsonDocument userDoc = JsonDocument.create("u:ingenthr", aFragment);
       
       
              // An integration case, still common, still important.
              // RawJsonDocument;  << should be json
              RawJsonDocument userDocRaw = RawJsonDocument.create("u:ingenthr1", "{\"name\": \"Matt Ingenthron\"}");
       
              // A case that exists already, semantics since CB 2.0 is that (server) components detect JSON even when we don't know
              // BinaryDocument  << could be json
              String mightBeJsonStringButIsnt = "{\"name\": \"Matt Ingenthron\"}";
              ByteBuf maybeJson = Unpooled.wrappedBuffer(mightBeJsonStringButIsnt.getBytes("UTF-16"));
              BinaryDocument userDocBinary = BinaryDocument.create("u:ingenthr2", maybeJson);
              // upsert here, before release
              maybeJson.release();
      

      I might propose for the two bits in datatype that we ascribe a different meaning.
      1 => The client says this is JSON/snappy compressed
      0 => This is maybe JSON or maybe snappy compressed
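
      A minimal sketch of how a consumer might read the two bits under this proposal. The bit values 0x01 (JSON) and 0x02 (Snappy) are the datatype values from the memcached binary protocol; the class name, the Claim enum, and the helper methods are hypothetical names made up for illustration, not part of any SDK or of kv_engine.

```java
public final class DatatypeBits {
    // Datatype bit values from the memcached binary protocol.
    public static final byte JSON = 0x01;
    public static final byte SNAPPY = 0x02;

    // Hypothetical reading of a bit under the proposal:
    // 1 => the client asserts it, 0 => unknown ("maybe").
    public enum Claim { CLIENT_ASSERTED, MAYBE }

    public static Claim json(byte datatype) {
        return (datatype & JSON) != 0 ? Claim.CLIENT_ASSERTED : Claim.MAYBE;
    }

    public static Claim snappy(byte datatype) {
        return (datatype & SNAPPY) != 0 ? Claim.CLIENT_ASSERTED : Claim.MAYBE;
    }

    // Only the "maybe" case would need a detection pass by a consumer.
    public static boolean needsJsonDetection(byte datatype) {
        return json(datatype) == Claim.MAYBE;
    }

    public static void main(String[] args) {
        byte fromJsonDocument = JSON;  // e.g. JsonDocument or RawJsonDocument
        byte fromBinaryDocument = 0;   // e.g. BinaryDocument
        System.out.println("JsonDocument:   " + json(fromJsonDocument));
        System.out.println("BinaryDocument: " + json(fromBinaryDocument));
    }
}
```

      In this reading, the majority case (JsonDocument) skips detection entirely and only the BinaryDocument-style case falls back to "maybe".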

      The key distinction with this proposal is that it optimizes for the right cases. Non-JSON cases are not as likely to be using views/GSI/FTS (though it's possible on metadata). The most common case is also the fastest case. The reasonably common case, when done correctly, will also be fast.

      The objection might be one of purity or one of the efficiency gains by amortizing it to kv_engine. Since the feature here isn't really about validation, in that we don't reject non-JSON, I think with sufficient definition the purity is sound. 0 just means maybe, just like it has since inception.

      For the amortization, I'm not sure that benefits us that much. I acknowledge what [@marco] says there. My guess is that they could have done the same amortization in process and the big efficiency gain was by just not parsing the full object over and over. At least if projector is the only consumer, then operational complexity is the same, regardless of which process does it.

      One objection might be: what if we leak something the client said was JSON, but it isn't? In every case I can think of, the production of the data is the same, and the ability for GSI or Views or FTS to process it is also the same. The functional behavior can be the same, though we'd need to test that to make sure there's not a failure somewhere. Said another way, in every case I can think of where something is marked as JSON, the next step is to try to process it as JSON, typically with some kind of platform parser. That could be verified.

      There are some variations on this proposal of course.

      One variation which preserves the amortization would be to trust but verify. This would mean doing the JSON/snappy verification on the DCP path. Since most of the consumers relying on this information on DCP are by definition behind, and you want the amortization anyway, letting kv_engine do the work before shipping it is okay. It still keeps this check outside the critical path.
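
      The trust-but-verify variation could be sketched roughly as follows. This is an illustration only: kv_engine is not written in Java, the class and method names are hypothetical, the value is assumed to be already uncompressed, and the validator is a deliberately crude placeholder standing in for a real JSON parser. Leaving the "maybe" case untouched is also an assumption; the proposal does not say what should happen to it on the DCP path.

```java
import java.util.function.Predicate;

public final class TrustButVerify {
    // JSON datatype bit from the memcached binary protocol.
    public static final byte JSON = 0x01;

    /**
     * Returns the datatype to ship downstream. The client's JSON claim is
     * kept only if the validator confirms it; otherwise the bit is cleared.
     * A datatype without the JSON bit ("maybe") is passed through unchanged.
     */
    public static byte verifyForDcp(byte datatype, byte[] value,
                                    Predicate<byte[]> isJson) {
        if ((datatype & JSON) == 0) {
            return datatype; // no claim was made, nothing to verify
        }
        return isJson.test(value) ? datatype : (byte) (datatype & ~JSON);
    }

    // Placeholder validator: only checks the first byte. A real check
    // would run a full JSON parser over the value.
    public static boolean looksLikeJsonObject(byte[] value) {
        return value.length > 0 && value[0] == '{';
    }
}
```

      Because the check runs when the item is shipped rather than when it is stored, the write path only ever copies the client's bit.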

      Another variation would be to have the snappy bit be a True/False and the JSON bit be Maybe/Json. This could run into some minor issues in situations where people are diving low level and doing their own snappy compression. It'd be snappy compressed and come across as False, but then converted to True and decompressed on cmd_get with compression enabled at the cluster. This seems release-noteable and disable-able.

      A couple of other notes:

      • I believe Couchbase Server 2.0 didn't do JSON detection in ep_engine. As I recall, in that era the cost was when views read the files from disk.
      • I do agree that we absolutely need to establish size thresholds. The threshold implementation will be easy but it'll be somewhat hard for users to determine when to set it. I think it'll be unfortunate if we set the thresholds high and expect users to turn it down and "pay the price" if they want compression and have no choice on JSON detection, so I'd rather we not do that.
      • We still have some dangerous waters here. I snuck in something that may throw off some of our JSON parsers in the code sample above. There are more variations on the theme though. There's a 3.5 year old bug I'd filed on this: MB-12788. While it is a bit nitpicky I admit, I seem to recall having seen it in the field with some very large numbers that parsers tripped on, etc. In practice, this hasn't hurt us too much, but since we're only one of three JSON databases I can think of, it seems like we should strive for correctness.
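
      The ticket does not say which detail in the code sample was snuck in. One plausible candidate, and this is only a guess, is the UTF-16 encoding used for the BinaryDocument case. The sketch below shows what that encoding does to the bytes; the class and helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public final class EncodingProbe {
    // True if the buffer starts with a UTF-16 byte-order mark
    // (FE FF for big-endian, FF FE for little-endian).
    public static boolean startsWithUtf16Bom(byte[] b) {
        if (b.length < 2) {
            return false;
        }
        boolean bigEndian = b[0] == (byte) 0xFE && b[1] == (byte) 0xFF;
        boolean littleEndian = b[0] == (byte) 0xFF && b[1] == (byte) 0xFE;
        return bigEndian || littleEndian;
    }

    public static void main(String[] args) {
        String json = "{\"name\": \"Matt Ingenthron\"}";
        byte[] utf8 = json.getBytes(StandardCharsets.UTF_8);
        // Java's "UTF-16" charset encodes big-endian and prepends a BOM.
        byte[] utf16 = json.getBytes(StandardCharsets.UTF_16);

        System.out.println("UTF-8 length:  " + utf8.length);
        System.out.println("UTF-16 length: " + utf16.length);
        System.out.println("UTF-16 has BOM: " + startsWithUtf16Bom(utf16));
    }
}
```

      A parser that expects UTF-8 would see a byte-order mark followed by interleaved zero bytes rather than an opening brace, so the same logical text may or may not be detected as JSON depending on the parser.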

          People

            drigby Dave Rigby (Inactive)
            drigby Dave Rigby (Inactive)