Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
Description
Hello, my name is Joel LaFall and this is my first time committing to this project. I work for a company called Hart, which uses Apache Spark and Couchbase extensively. While evaluating Spark Structured Streaming with the Couchbase sink on a simple word-count example (Kafka as the source, Couchbase as the sink), I noticed that the aggregation state was not being saved: every time a new occurrence of an already-seen word arrived from Kafka and was counted by the Spark driver, the count effectively "reset" to 1. Looking into the Apache Spark source, I found that calling Dataset.rdd breaks the state management machinery. The only way I found to get around this is to call Dataset.queryExecution.toRdd, which returns an RDD[InternalRow] and does not break state management. After making this change to the Couchbase connector, everything worked as expected.
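A minimal Scala sketch of the change described above. The method name `write` and the surrounding structure are illustrative assumptions, not the connector's actual API; the real point is the swap from `Dataset.rdd` to `Dataset.queryExecution.toRdd`, both of which are real Spark APIs:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical sink-side write method, for illustration only.
def write(data: DataFrame): Unit = {
  // Before (breaks streaming state management):
  //   val rows = data.rdd
  // Dataset.rdd deserializes rows and re-plans the query, which
  // detaches the sink from the incremental streaming execution.

  // After: queryExecution.toRdd stays on the existing physical plan
  // and yields unserialized InternalRow objects.
  val rows: RDD[InternalRow] = data.queryExecution.toRdd

  rows.foreachPartition { partition: Iterator[InternalRow] =>
    partition.foreach { row =>
      // Convert each InternalRow to a Couchbase document and write it
      // (conversion and Couchbase client calls omitted; connector-specific).
    }
  }
}
```

Note that `queryExecution.toRdd` is an internal Spark API: the `InternalRow` format is unstable across Spark versions, so a connector using it accepts a maintenance cost in exchange for correct streaming-state behavior.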
Issue Links
- links to