  Couchbase Server / MB-19314

[FTS] indexing tweet dataset fails, 100% cpu, no errors anywhere


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: 4.5.0
    • Fix Version/s: 4.5.0
    • Component/s: fts
    • Labels: None
    • Triage: Untriaged
    • Is this a Regression?: Unknown

    Description

      I thought it would be cool to index Eben's tweet dataset for today's demo with Ravi. Unfortunately it doesn't work.

      1. Export Eben's data - ./cbbackup http://research.hq.couchbase.com:8091 /tmp/research/
      2. Create bucket Tweets
      3. Import it into your couchbase - ./cbrestore -b Tweets /tmp/research/ http://localhost:9000
      4. Create an FTS index

      I started with a custom mapping to ONLY index the "text" field, thinking that this would help limit any problems related to the sheer size/complexity of the docs.
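      For reference, that mapping amounts to something like the following in bleve terms (a minimal sketch using the bleve Go API directly; the real index was defined through the FTS UI, and the package and function names here are made up):

      package tweetindex

      import "github.com/blevesearch/bleve"

      // openTweetIndex builds a mapping that indexes ONLY the "text" field and
      // disables dynamic mapping for everything else in the (large) tweet docs.
      func openTweetIndex(path string) (bleve.Index, error) {
          textField := bleve.NewTextFieldMapping()

          tweet := bleve.NewDocumentMapping()
          tweet.Dynamic = false // don't index any other fields
          tweet.AddFieldMappingsAt("text", textField)

          idxMapping := bleve.NewIndexMapping()
          idxMapping.DefaultMapping = tweet

          return bleve.New(path, idxMapping)
      }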

      Observe that the index count stays at 0, cbft CPU sits at 100%, and no errors show up in the UI or the logs. Stats show we're not executing any batches either.

      I've tried different mappings; it makes no difference.

      I thought maybe the mapping was leading to an error, but I added logging for the mapped documents and I see reasonable output like:

      2016/04/21 09:51:27 mapped doc &document.Document{ID:631730930814713856, Fields: &document.TextField{Name:text, Options: INDEXED, TV, Analyzer: &{[] 0x5350b78 [0x5350b78 0xc8203d19b0]}, Value: Open Source CMS Built on Node.js and MongoDB - http://t.co/ofDwTwejaX via @remelehane, ArrayPositions: []}, CompositeFields: &document.CompositeField{name:"_all", includedFields:map[string]bool{}, excludedFields:map[string]bool{}, defaultInclude:true, options:5, totalLength:0, compositeFrequencies:analysis.TokenFrequencies{}}}
      

      1 field + the _all field, content looks reasonable.
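      For context, the logging was roughly along these lines (a rough sketch; it assumes the bleve build in use exposes MapDocument on the index mapping, and the real change was made inside the cbft/bleve indexing path, so the function below is purely illustrative):

      package tweetindex

      import (
          "log"

          "github.com/blevesearch/bleve"
          "github.com/blevesearch/bleve/document"
      )

      // mapAndLog runs a source document through the index mapping and logs the
      // resulting bleve document, to confirm the mapping itself isn't failing.
      func mapAndLog(idxMapping *bleve.IndexMapping, id string, data interface{}) error {
          doc := document.NewDocument(id)
          if err := idxMapping.MapDocument(doc, data); err != nil {
              return err
          }
          log.Printf("mapped doc %+v", doc)
          return nil
      }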

      Then I looked for any bleve batch errors and found that some of the batches are getting REALLY big, perhaps just growing forever:

      2016/04/21 09:55:52 now batch size: 14354

      I thought at one point Steve Yen had put in some max batch size, but it doesn't seem to behave that way right now.
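      A cap would look something like the sketch below (a minimal illustration, not the cbft code; maxBatchSize, the mutation type, and the channel plumbing are all invented for the example): execute the batch whenever it grows past a limit instead of waiting indefinitely for a snapshot boundary.

      package tweetindex

      import (
          "log"

          "github.com/blevesearch/bleve"
      )

      // maxBatchSize is an illustrative cap; a real limit would be configurable.
      const maxBatchSize = 1000

      type mutation struct {
          ID  string
          Doc interface{}
      }

      // indexMutations accumulates DCP mutations into a bleve batch and forces
      // execution once the batch reaches maxBatchSize, so a long backfill with
      // no snapshot markers can't grow a single batch without bound.
      func indexMutations(idx bleve.Index, mutations <-chan mutation) error {
          batch := idx.NewBatch()
          for m := range mutations {
              if err := batch.Index(m.ID, m.Doc); err != nil {
                  return err
              }
              if batch.Size() >= maxBatchSize {
                  log.Printf("executing batch, size: %d", batch.Size())
                  if err := idx.Batch(batch); err != nil {
                      return err
                  }
                  batch = idx.NewBatch()
              }
          }
          // Flush whatever is left when the feed closes.
          if batch.Size() > 0 {
              return idx.Batch(batch)
          }
          return nil
      }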

      Theory: some combination of data size (> 1M) plus doc size (somewhat large JSON) plus configured RAM size (512MB) leads to DCP behaving differently. I'm not sure of the terminology, but something like "backfill". It's possible that we get fewer (or no?) snapshots.

      This then causes us to build ridiculously huge batches.

      NOTE: it's also possible this is now another source of OOMs that we haven't accounted for.


          People

            Assignee: Mihir Kamdar (Inactive)
            Reporter: Marty Schoch [X] (Inactive)
            Votes: 0
            Watchers: 4
