Uploaded image for project: 'Couchbase Spark Connector'
  1. Couchbase Spark Connector
  2. SPARKC-86

Fix CouchbaseSink so it can use Spark state management for aggregations

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 2.3.0
    • None
    • None
    • None
    • 1

    Description

      Hello, my name is Joel LaFall and this is my first time commiting to this project. I work for a company called Hart which uses Apache Spark and Couchbase extensively. When evaluating Spark Structured Streaming and the Couchbase sink with a simple word count example using Kafka as the source and Couchbase as the sink, I noticed that the state of the aggregation was not being saved. And so every time a new instance of a word already seen came into Kafka and counted by the Spark driver, the count would effectively "reset" to 1. Looking into the Apache Spark source, I found that calling Dataset.rdd will breaking the state management machinery and the only way I found to get around this is to call Dataset.queryExecution.toRdd, which returns an RDD[InternRow] that does not break state management. After making the change to the Couchbase connector, everything worked as expected.

      See also:

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              david.nault David Nault
              david.nault David Nault
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty