  Couchbase Server / MB-46344

[CX] Infer schema from CSV header

Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • Morpheus
    • Morpheus
    • analytics
    • 1

    Description

      When creating an external dataset from CSV files, the system should be able to infer the attribute names from the file header, if present, and to sample records to infer the attributes' data types. For example, a CREATE statement could accept an "infer" flag that gives the number of records to scan, as in the following open source syntax example (where the number of records is 10):

      CREATE EXTERNAL DATASET Employee() USING localfs (("path"="localhost:///employees.csv"), ("format"="delimited-text"), ("delimiter"=","), ("header"=true), ("infer"=10))
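
As a rough illustration of what such an "infer" step might do, the following Python sketch reads the header for attribute names and scans up to N data rows to pick a type per column. It is not the Analytics implementation: the function names (`classify`, `unify`, `infer_schema`), the type names, and the widening rules (integer widens to double, anything else mixed falls back to string) are all assumptions made for this example.

```python
import csv
import io
import re
from itertools import islice

# Patterns for the two numeric types; both are assumptions of this sketch.
INT_RE = re.compile(r"[+-]?\d+")
FLOAT_RE = re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")


def classify(cell):
    """Return the narrowest (illustrative) type name for one CSV cell."""
    if cell == "":
        return "null"  # an empty cell carries no type information
    if INT_RE.fullmatch(cell):
        return "bigint"
    if FLOAT_RE.fullmatch(cell):
        return "double"
    if cell.lower() in ("true", "false"):
        return "boolean"
    return "string"


def unify(a, b):
    """Combine two observed types into one that can hold both."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"bigint", "double"}:
        return "double"
    return "string"  # fallback for any other mix


def infer_schema(csv_text, n, delimiter=","):
    """Infer (name, type) pairs from the header plus the first n data rows."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader)
    types = ["null"] * len(header)
    for row in islice(reader, n):
        for i, cell in enumerate(row[: len(header)]):
            types[i] = unify(types[i], classify(cell))
    return list(zip(header, types))


sample = (
    "id,name,salary,active\n"
    "1,Ann,1000,true\n"
    "2,Bob,1200.5,false\n"
    "3,,900,true\n"
)
schema_10 = infer_schema(sample, 10)
schema_1 = infer_schema(sample, 1)
```

With this sample, scanning only one row types `salary` as `bigint`, while scanning all three rows widens it to `double`, which shows why the number of scanned records matters.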

      One could imagine offering several "infer" options, for example:
      "infer" = N: look at the header plus the first N rows to infer a likely schema
      "infer" = ALL: look at the whole file to infer a (bullet-proof) schema
      "infer" = SAMPLE(N): pick N rows at random to infer a schema (it is unclear whether this is useful)
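
The three options differ only in which data rows are fed to type inference. The Python sketch below shows one way the row selection could work; the function `select_rows`, its argument encoding (an integer, the string "ALL", or a ("SAMPLE", N) tuple), and the use of reservoir sampling for SAMPLE(N) are assumptions of this example, not part of the proposal.

```python
import random
from itertools import islice


def select_rows(rows, infer, seed=None):
    """Pick the data rows to inspect (header already removed).

    infer is an int N, the string "ALL", or the tuple ("SAMPLE", N);
    this encoding is hypothetical and exists only for the sketch.
    """
    if infer == "ALL":
        return list(rows)
    if isinstance(infer, int):
        return list(islice(rows, infer))  # first N rows only
    kind, n = infer
    if kind != "SAMPLE":
        raise ValueError(f"unknown infer option: {infer!r}")
    # Reservoir sampling: one pass over the rows, at most n rows in memory.
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < n:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = row
    return reservoir


data = [[str(i)] for i in range(100)]
first_three = select_rows(iter(data), 3)
everything = select_rows(iter(data), "ALL")
sampled = select_rows(iter(data), ("SAMPLE", 5), seed=0)
```

Under these assumptions, SAMPLE(N) still reads the whole file once, as ALL does, but runs type inference on only N rows, whereas a plain N stops reading after the first N rows.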

      One could also imagine no-header versions of the above, where the field names come from the CREATE statement but the data types are inferred, though that seems less worthwhile because it saves the user less work.

      The overall goal would be to make the user experience for CSV/TSV as close as we can get to the ease of dealing with JSON.

            People

              till Till Westmann