  Couchbase Server / MB-19314

[FTS] indexing tweet dataset fails, 100% cpu, no errors anywhere


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: 4.5.0
    • Fix Version/s: 4.5.0
    • Component/s: fts
    • Labels: None
    • Triage: Untriaged
    • Is this a Regression?: Unknown

    Description

      I thought it would be cool to index Eben's tweet dataset for today's demo with Ravi. Unfortunately it doesn't work.

      1. Export Eben's data - ./cbbackup http://research.hq.couchbase.com:8091 /tmp/research/
      2. Create bucket Tweets
      3. Import it into your couchbase - ./cbrestore -b Tweets /tmp/research/ http://localhost:9000
      4. Create an FTS index

      I started with a custom mapping to ONLY index the "text" field, thinking that this would help limit any problems related to the sheer size/complexity of the docs.
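      For reference, that mapping amounts to something like the following in bleve terms (a minimal sketch using the bleve Go API directly; the real index was defined through the FTS UI, and the package and function names here are made up):

      package tweetindex

      import "github.com/blevesearch/bleve"

      // openTweetIndex builds a mapping that indexes ONLY the "text" field and
      // disables dynamic mapping for everything else in the (large) tweet docs.
      func openTweetIndex(path string) (bleve.Index, error) {
          textField := bleve.NewTextFieldMapping()

          tweet := bleve.NewDocumentMapping()
          tweet.Dynamic = false // don't index any other fields
          tweet.AddFieldMappingsAt("text", textField)

          idxMapping := bleve.NewIndexMapping()
          idxMapping.DefaultMapping = tweet

          return bleve.New(path, idxMapping)
      }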

      Observe that the index count stays at 0, cbft CPU sits at 100%, and no errors show up in the UI or the logs. Stats show we're not executing any batches either.

      I've tried different mappings; it makes no difference.

      I thought maybe the mapping was leading to an error, but I added logging for the mapped documents and I see reasonable output like:

      2016/04/21 09:51:27 mapped doc &document.Document{ID:631730930814713856, Fields: &document.TextField{Name:text, Options: INDEXED, TV, Analyzer: &{[] 0x5350b78 [0x5350b78 0xc8203d19b0]}, Value: Open Source CMS Built on Node.js and MongoDB - http://t.co/ofDwTwejaX via @remelehane, ArrayPositions: []}, CompositeFields: &document.CompositeField{name:"_all", includedFields:map[string]bool{}, excludedFields:map[string]bool{}, defaultInclude:true, options:5, totalLength:0, compositeFrequencies:analysis.TokenFrequencies{}}}
      

      1 field + the _all field, content looks reasonable.
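      For context, the logging was roughly along these lines (a rough sketch; it assumes the bleve build in use exposes MapDocument on the index mapping, and the real change was made inside the cbft/bleve indexing path, so the function below is purely illustrative):

      package tweetindex

      import (
          "log"

          "github.com/blevesearch/bleve"
          "github.com/blevesearch/bleve/document"
      )

      // mapAndLog runs a source document through the index mapping and logs the
      // resulting bleve document, to confirm the mapping itself isn't failing.
      func mapAndLog(idxMapping *bleve.IndexMapping, id string, data interface{}) error {
          doc := document.NewDocument(id)
          if err := idxMapping.MapDocument(doc, data); err != nil {
              return err
          }
          log.Printf("mapped doc %+v", doc)
          return nil
      }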

      Then I looked for any bleve batch errors and found that some of the batches are getting REALLY big, perhaps just growing forever:

      2016/04/21 09:55:52 now batch size: 14354

      I thought at one point Steve Yen had put in some max batch size, but it doesn't seem to behave that way right now.
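      A cap would look something like the sketch below (a minimal illustration, not the cbft code; maxBatchSize, the mutation type, and the channel plumbing are all invented for the example): execute the batch whenever it grows past a limit instead of waiting indefinitely for a snapshot boundary.

      package tweetindex

      import (
          "log"

          "github.com/blevesearch/bleve"
      )

      // maxBatchSize is an illustrative cap; a real limit would be configurable.
      const maxBatchSize = 1000

      type mutation struct {
          ID  string
          Doc interface{}
      }

      // indexMutations accumulates DCP mutations into a bleve batch and forces
      // execution once the batch reaches maxBatchSize, so a long backfill with
      // no snapshot markers can't grow a single batch without bound.
      func indexMutations(idx bleve.Index, mutations <-chan mutation) error {
          batch := idx.NewBatch()
          for m := range mutations {
              if err := batch.Index(m.ID, m.Doc); err != nil {
                  return err
              }
              if batch.Size() >= maxBatchSize {
                  log.Printf("executing batch, size: %d", batch.Size())
                  if err := idx.Batch(batch); err != nil {
                      return err
                  }
                  batch = idx.NewBatch()
              }
          }
          // Flush whatever is left when the feed closes.
          if batch.Size() > 0 {
              return idx.Batch(batch)
          }
          return nil
      }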

      Theory: some combination of data size (> 1M) plus doc size (somewhat large JSON) plus configured RAM size (512MB) leads to DCP behaving differently. I'm not sure of the terminology, but something like "backfill". It's possible that we get fewer (or no?) snapshots.

      This then causes us to build ridiculously huge batches.

      NOTE: it's also possible this is now another source of OOMs that we haven't accounted for.


          People

            Assignee: Mihir Kamdar (Inactive)
            Reporter: Marty Schoch [X] (Inactive)
            Votes: 0
            Watchers: 4
