Description
When creating an external dataset from CSV files, the system should be able to infer the attribute names from the file header, if present, and sample records to infer the attributes' data types. For example, the create statement could accept an "infer" flag that takes the number of records to scan, as in the following example in the open source syntax (where the prescribed number is 10):
CREATE EXTERNAL DATASET Employee() USING localfs (("path"="localhost:///employees.csv"), ("format"="delimited-text"), ("delimiter"=","), ("header"=true), ("infer"=10))
One could imagine offering several different "infer" options, e.g.:
"infer" = N — look at the header plus the first N rows to infer a likely schema
"infer" = ALL — look at the whole file to infer a (bullet-proof) schema
"infer" = SAMPLE(N) — pick N rows at random to infer a schema (don't know if this is useful or not)
One could also imagine no-header versions of the above, where field names come from the CREATE statement but the data types are inferred, though that seems to make less sense (as it saves the user less work).
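To make the proposed options concrete, here is a minimal Python sketch of how the three "infer" modes and the no-header variant might behave. It is only an illustration, not the proposed implementation: the function name `infer_schema`, its parameters, the way the modes are encoded, and the choice of candidate types (`bigint`, `double`, `string`) are all assumptions made for this example.

```python
import csv
import io
import random

# Candidate types ordered from narrowest to widest (assumed for illustration).
_ORDER = ["bigint", "double", "string"]


def _type_of(value):
    """Return the narrowest candidate type that can hold one CSV field."""
    for name, parse in (("bigint", int), ("double", float)):
        try:
            parse(value)
            return name
        except ValueError:
            pass
    return "string"


def infer_schema(text, infer=10, header=True, names=None, delimiter=",", seed=0):
    """Infer (field name, type) pairs from delimited text.

    infer: an int N (first N rows), "ALL" (every row), or
           ("SAMPLE", N) (N rows picked at random).
    names: field names to use when header is False (hypothetical stand-in
           for names supplied in the CREATE statement).
    """
    # Reading everything up front keeps the sketch short; a real
    # implementation would stop after N rows in the first-N mode.
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if header:
        names, rows = rows[0], rows[1:]
    elif names is None:
        raise ValueError("field names are required when there is no header")

    if infer == "ALL":
        scanned = rows
    elif isinstance(infer, tuple):
        scanned = random.Random(seed).sample(rows, min(infer[1], len(rows)))
    else:
        scanned = rows[:infer]

    # Widen each column's type whenever a scanned value demands it.
    types = [None] * len(names)
    for row in scanned:
        for i, value in enumerate(row[:len(names)]):
            if value == "":
                continue  # treat empty fields as missing, not as evidence
            found = _type_of(value)
            if types[i] is None or _ORDER.index(found) > _ORDER.index(types[i]):
                types[i] = found
    # Columns with no observed values fall back to the widest type.
    return [(n, t or "string") for n, t in zip(names, types)]
```

With a file whose third salary value is `n/a`, scanning only the first two rows would yield `double` for that column, while `"ALL"` would yield `string`. That is the trade-off between a cheap prefix scan and a full scan that the options above describe.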
The overall goal would be to make the user experience for CSV/TSV as close as we can get to the ease of dealing with JSON.