Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
Description
Hello, my name is Joel LaFall and this is my first time committing to this project. I work for a company called Hart, which uses Apache Spark and Couchbase extensively. While evaluating Spark Structured Streaming with the Couchbase sink on a simple word-count example (Kafka as the source, Couchbase as the sink), I noticed that the aggregation state was not being saved: every time a new occurrence of an already-seen word arrived from Kafka and was counted by the Spark driver, the count effectively "reset" to 1. Looking into the Apache Spark source, I found that calling Dataset.rdd breaks the state management machinery. The only way I found to get around this is to call Dataset.queryExecution.toRdd, which returns an RDD[InternalRow] and does not break state management. After making this change to the Couchbase connector, everything worked as expected.
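A minimal Scala sketch of the change described above. The method name `write` and the surrounding structure are illustrative assumptions, not the connector's actual API; the real point is the swap from `Dataset.rdd` to `Dataset.queryExecution.toRdd`, both of which are real Spark APIs:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical sink-side write method, for illustration only.
def write(data: DataFrame): Unit = {
  // Before (breaks streaming state management):
  //   val rows = data.rdd
  // Dataset.rdd deserializes rows and re-plans the query, which
  // detaches the sink from the incremental streaming execution.

  // After: queryExecution.toRdd stays on the existing physical plan
  // and yields unserialized InternalRow objects.
  val rows: RDD[InternalRow] = data.queryExecution.toRdd

  rows.foreachPartition { partition: Iterator[InternalRow] =>
    partition.foreach { row =>
      // Convert each InternalRow to a Couchbase document and write it
      // (conversion and Couchbase client calls omitted; connector-specific).
    }
  }
}
```

Note that `queryExecution.toRdd` is an internal Spark API: the `InternalRow` format is unstable across Spark versions, so a connector using it accepts a maintenance cost in exchange for correct streaming-state behavior.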
Issue Links
- links to