  1. Couchbase Server
  2. MB-19856

[FTS] FTS Collections Support

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • 7.0.0
    • 5.0.0
    • fts
    • None
    • Untriaged
    • Unknown

    Description

      FTS needs to support collections. This work item is obviously dependent on Couchbase Server supporting collections elsewhere.

      Throwing in some history here.

      I’ll start with my basic understanding of the requirements as stated in the PRD (please correct me if I am mistaken):
      (a) Retain complete backward compatibility with today’s functionality, allowing FTS to create an index against the entire bucket.
      (b) The new ability to select one or more Collections and create an index based on those ("I don’t want the whole bucket, just give me the docs belonging to the ‘Products’ and the ‘Orders’ collections").

      Trying to avoid getting too deep into the implementation specifics, I have a few questions about the information you will require to make these work. For example:

      • Regarding (b) - Supposing you have requested several collections - as the documents come streaming in ready to be indexed, is it important to know which of the specific collections they belong to? That is, if you have a "multi-collection index", is it necessary to then be able to limit the search to a specific collection within that index?
      • The same question also applies to (a) - if you’re creating an index on the entire bucket, do you care / need to know that the documents may or may not belong to a collection?
      • Again re (b) - How do you plan to cope with a collection being deleted? Say you’ve asked for ‘Products’ and ‘Orders’ but then the ‘Orders’ collection is deleted. Can you cope with a one-time notification that the collection has been deleted, or would you require individual delete notifications for every document that was affected?

      Reply from FTS team

      In FTS we already have a concept we call "type". The primary function of types is to let users say, index these things one way, and index these other things some other way. Currently, our ability to detect/determine type is quite limited (by literal value of a field inside the document).

      My first thought, from just reading the questions, is that users will want to be able to determine the FTS type of a document based on which collection it came from. So, if my beer-sample bucket had 2 collections, beers and breweries, they could use the collection name to map the type within FTS.

      So, the answer to the first 2 questions I think would be yes, in both cases (index on set of collections and index on whole bucket) it would be desirable to be able to know which collection a document came from.
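To make the mapping idea concrete, here is a minimal sketch of how a document's FTS type might be chosen once the collection name is available. Everything in it is hypothetical: `resolve_fts_type`, the `collection` argument, the `type_field` default, and the `_default` fallback are illustrative names, not FTS APIs. It assumes the collection name wins when present and that today's field-based detection stays as the fallback.

```python
from typing import Any, Mapping, Optional

# Hypothetical fallback type for documents that match nothing else.
DEFAULT_TYPE = "_default"


def resolve_fts_type(
    doc: Mapping[str, Any],
    collection: Optional[str],
    type_field: str = "type",
) -> str:
    """Pick the type mapping to apply to a document (illustrative only)."""
    # Assumption: the mutation stream tells us which collection the
    # document came from; if so, use the collection name as the type.
    if collection:
        return collection
    # Otherwise fall back to the literal value of a field in the document.
    value = doc.get(type_field)
    if isinstance(value, str) and value:
        return value
    return DEFAULT_TYPE
```

With this ordering, an index on the whole bucket and an index on a set of collections would both resolve types the same way, which matches the "yes, in both cases" answer above.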

      The 3rd question is a bit trickier. At the moment, we don't index the type, so we couldn't easily just drop everything of one type without knowing all the IDs. But this is an enhancement we expect in the future, so it probably wouldn't be a problem.
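As a rough illustration of why recording the collection (or type) in the index matters for deletion, the toy structure below tracks document IDs per collection so that a single "collection deleted" notification is enough to drop everything. The class and method names are hypothetical and say nothing about how FTS actually stores its data.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class CollectionAwareIndex:
    """Toy index that remembers which collection each document ID belongs to."""

    def __init__(self) -> None:
        self._ids_by_collection: DefaultDict[str, Set[str]] = defaultdict(set)

    def upsert(self, collection: str, doc_id: str) -> None:
        # Record the document under its collection as mutations stream in.
        self._ids_by_collection[collection].add(doc_id)

    def drop_collection(self, collection: str) -> int:
        # One notification removes every document of that collection;
        # no per-document delete messages are required.
        return len(self._ids_by_collection.pop(collection, set()))

    def doc_count(self) -> int:
        return sum(len(ids) for ids in self._ids_by_collection.values())
```

Without the per-collection bookkeeping, the only way to honour a collection deletion would be to receive (or already know) every affected document ID.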

      There is one other issue I'd like to ask about (I have read parts but not all of the Collections spec). As I mentioned, since FTS already had this notion of "type", we already supported users creating indexes on just beers or just breweries (albeit in an inefficient way, still streaming all the data). But it often confused users, because the stats displayed about "completeness" still had to be in the context of the whole bucket. To give a concrete example, if we only indexed "beer" documents, we could never know that our index was at 100%, because we didn't know how many beer documents there were. And showing 45% constantly in the UI made it look broken. Our only recourse was to index stubs for all documents. So essentially we would index everything anyway, but only meaningfully for the "beers". This is a long setup for the question: will we be able to do better with collections? That is, will I know a document count for the collection? And will DCP stream "completeness" be able to be determined for these collections?

      And one more reply

      Thanks so much for such a prompt and useful reply. Mapping collection name to type makes complete sense. We have been debating whether or not to expose the Collection property via the mutation stream - having someone (finally) say it’s a desirable feature and provide a tangible use-case is extremely helpful.

      Delivering individual deletion notifications on collection deletions is something we’d probably like to avoid if possible. It’s tricky to do in an efficient manner and to keep everything in sync if we were to encounter a failover during the deletion. I propose we see what the other DCP consumers (XDCR, 2i, query, indexing…) have to say and see if it becomes an obvious necessity.

      Regarding the stats and indexing progress, Jim or Manu can probably offer more insight, but I believe we are planning to have individual collection counts, so you would know how many beers and how many breweries to expect ahead of time. That should make the progress figures more accurate.
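The effect of per-collection counts on the progress figure can be sketched with a little arithmetic. The helper below is hypothetical, and the document counts are made up purely to reproduce the "stuck at 45%" example from the FTS reply.

```python
def index_progress(indexed: int, expected: int) -> float:
    """Return indexing progress as a percentage, capped at 100."""
    if expected <= 0:
        # Nothing is expected, so there is nothing left to index.
        return 100.0
    return min(100.0, 100.0 * indexed / expected)


# Illustrative numbers: a bucket of 100 documents, 45 of which are beers.
bucket_total = 100
beer_total = 45
beers_indexed = 45

# Only the bucket-wide count is known: a finished beer index looks stuck.
print(index_progress(beers_indexed, bucket_total))  # 45.0

# The collection's own count is known: the same index reports completion.
print(index_progress(beers_indexed, beer_total))  # 100.0
```

The only change between the two calls is the denominator, which is exactly what a per-collection document count would supply.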

      I’m going to try to make sure all of this is accurately captured in the design spec. I’ll let you know when I’ve done so. If you (or others) have more questions or suggestions in the meantime, we’ll be glad to hear them.

          People

            abhinav Abhi Dangeti
            will.gardella Will Gardella (Inactive)