CSVFileFormat

CSVFileFormat is a TextBasedFileFormat for the csv format, i.e. it registers itself to handle files in CSV format and converts them to Spark SQL rows.
```scala
spark.read.format("csv").load("csv-datasets")

// or the same as above using a shortcut
spark.read.csv("csv-datasets")
```
CSVFileFormat uses CSV options (that in turn are used to configure the underlying CSV parser from the uniVocity-parsers project).
Option | Default Value | Description
---|---|---
`charset` | `UTF-8` | Alias of `encoding`
`charToEscapeQuoteEscaping` | | One character to…FIXME
`codec` | | Compression codec that can be either one of the known aliases or a fully-qualified class name. Alias of `compression`
`compression` | | Compression codec that can be either one of the known aliases or a fully-qualified class name. Alias of `codec`
`dateFormat` | `yyyy-MM-dd` | Uses `en_US` locale
`delimiter` | `,` (comma) | Alias of `sep`
`encoding` | `UTF-8` | Alias of `charset`
`mode` | `PERMISSIVE` | Possible values: `DROPMALFORMED`, `PERMISSIVE`, `FAILFAST`
`nullValue` | (empty string) |
`sep` | `,` (comma) | Alias of `delimiter`
`timestampFormat` | `yyyy-MM-dd'T'HH:mm:ss.SSSXXX` | Uses `timeZone` and `en_US` locale
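These options are set per read using `option` (or `options`). A minimal sketch follows; the `csv-datasets` path reuses the earlier example, while the option values themselves are made up for illustration:

```scala
// Illustrative option values only
val cities = spark.read
  .option("header", "true")      // use the first line for column names
  .option("delimiter", ";")      // equivalent to .option("sep", ";")
  .option("inferSchema", "true") // infer column types (an extra pass over the data)
  .option("mode", "PERMISSIVE")  // the default mode: keep malformed records
  .csv("csv-datasets")
```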
Preparing Write Job — prepareWrite Method
```scala
prepareWrite(
  sparkSession: SparkSession,
  job: Job,
  options: Map[String, String],
  dataSchema: StructType): OutputWriterFactory
```
Note: prepareWrite is part of the FileFormat Contract to prepare a write job.
prepareWrite…FIXME
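For context, prepareWrite gets involved whenever a structured query is saved in csv format. A minimal sketch (the `csv-output` path is made up for illustration):

```scala
// Saving a Dataset in csv format creates a write job for CSVFileFormat
spark.range(5)
  .write
  .format("csv")
  .option("header", "true")
  .save("csv-output")
```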
Building Partitioned Data Reader — buildReader Method
```scala
buildReader(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): (PartitionedFile) => Iterator[InternalRow]
```
Note: buildReader is part of the FileFormat Contract to build a PartitionedFile reader.
buildReader…FIXME
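For context, buildReader is behind every csv scan. A minimal sketch with an explicit schema (the schema and the `csv-datasets` path are made up for illustration); selecting a single column narrows `requiredSchema` down to just that column:

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

// Scanning the files uses the per-partition reader function
// that buildReader returns
val people = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("csv-datasets")
people.select("name").show()
```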