

FileFormat — Data Sources to Read and Write Data In Files

FileFormat is the contract for data sources that read and write data stored in files.

Table 1. FileFormat Contract
Method Description

buildReader

Builds a Catalyst data reader, i.e. a function that reads a single file (as a PartitionedFile) into an Iterator of InternalRows.

buildReader throws an UnsupportedOperationException by default (and should therefore be overridden to work).

Used exclusively when FileFormat is requested to buildReaderWithPartitionValues

buildReaderWithPartitionValues

buildReaderWithPartitionValues builds a data reader with partition column values appended, i.e. a function that reads a single file (as a PartitionedFile) into an Iterator of InternalRows (like buildReader), with the partition values appended to every row.

Used exclusively when FileSourceScanExec physical operator is requested for the inputRDD (when requested for the inputRDDs and execution)

inferSchema

Infers (returns) the schema of the given files (as Hadoop FileStatuses) if supported. Otherwise, None should be returned.

isSplitable

Controls whether the format (under the given path as a Hadoop Path) can be split or not.

isSplitable is disabled (false) by default.

Used exclusively when FileSourceScanExec physical operator is requested to create an RDD for non-bucketed reads (when requested for the inputRDD and neither the optional bucketing specification of the HadoopFsRelation is defined nor bucketing is enabled)

prepareWrite

Prepares a write job and returns an OutputWriterFactory

Used exclusively when FileFormatWriter is requested to write query result

supportBatch

Controls whether the format supports columnar batches (i.e. vectorized decoding) or not.

supportBatch is off (false) by default.

Used exclusively when FileSourceScanExec physical operator is requested for the supportsBatch

vectorTypes

vectorTypes defines the concrete column vector class names to use for each column in a columnar batch (when columnar batch support is enabled).

vectorTypes is undefined (None) by default.

Used exclusively when FileSourceScanExec physical operator is requested for the vectorTypes
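Taken together, the contract can be pictured as a trait whose defaults are mostly disabled, with buildReader throwing until overridden. The following is a minimal, self-contained Scala sketch; InternalRow, PartitionedFile, the schema type, and the method signatures are simplified stand-ins (the real methods also take a SparkSession, options, and a Hadoop Configuration), not Spark's actual API:

```scala
// Simplified, self-contained model of the FileFormat contract.
// All names below are toy stand-ins for illustration only.
object FileFormatSketch {
  type InternalRow = Seq[Any]
  case class PartitionedFile(filePath: String, start: Long, length: Long)

  trait FileFormat {
    // Infers the schema of the given files, or None if unsupported.
    def inferSchema(files: Seq[String]): Option[Seq[String]] = None
    // Splitability is disabled (false) by default.
    def isSplitable(path: String): Boolean = false
    // Columnar-batch (vectorized) support is off (false) by default.
    def supportBatch: Boolean = false
    // Concrete column vector class names; undefined (None) by default.
    def vectorTypes: Option[Seq[String]] = None
    // Throws by default and must be overridden to be usable.
    def buildReader(): PartitionedFile => Iterator[InternalRow] =
      throw new UnsupportedOperationException(s"buildReader is not supported for $this")
  }

  // A toy line-oriented format that overrides the defaults.
  object ToyTextFormat extends FileFormat {
    override def isSplitable(path: String): Boolean = true
    override def buildReader(): PartitionedFile => Iterator[InternalRow] =
      file => Iterator(Seq(file.filePath, file.start, file.length))
  }
}
```

The sketch mirrors the default behaviours listed in Table 1: a format that overrides nothing is non-splitable, non-vectorized, and unusable for reading until it provides its own buildReader.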

Table 2. FileFormats (Direct Implementations and Extensions)
FileFormat Description

AvroFileFormat

Avro data source

HiveFileFormat

Writes Hive tables

OrcFileFormat

ORC data source

ParquetFileFormat

Parquet data source

TextBasedFileFormat

Base for text-based FileFormats that can be splitable

Building Data Reader With Partition Column Values Appended — buildReaderWithPartitionValues Method

buildReaderWithPartitionValues is simply an enhanced buildReader that appends partition column values to the internal rows produced by the reader function from buildReader.

Internally, buildReaderWithPartitionValues builds a data reader with the input parameters and gives a data reader function (of a PartitionedFile to an Iterator[InternalRow]) that does the following:

  1. Creates a converter by requesting GenerateUnsafeProjection to generate an UnsafeProjection for the attributes of the input requiredSchema and partitionSchema

  2. Reads the PartitionedFile with the data reader and, for every row produced, applies the converter to the joined row, i.e. the data row with the partition column values appended.

Note
buildReaderWithPartitionValues is used exclusively when FileSourceScanExec physical operator is requested for the input RDDs.
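The two steps above can be sketched in a few lines of self-contained Scala. Rows are modeled as Seq[Any] and the generated UnsafeProjection is replaced by an identity converter, so this is only an illustration of the shape of the transformation, not Spark's implementation:

```scala
// Toy sketch of buildReaderWithPartitionValues: wrap the plain data reader
// so that the partition column values are appended to every row it produces.
object PartitionValuesSketch {
  type InternalRow = Seq[Any]
  case class PartitionedFile(filePath: String, partitionValues: InternalRow)

  def buildReaderWithPartitionValues(
      dataReader: PartitionedFile => Iterator[InternalRow]
  ): PartitionedFile => Iterator[InternalRow] = { file =>
    // Converter stand-in: Spark generates an UnsafeProjection over
    // requiredSchema ++ partitionSchema; here it is the identity.
    val convert: InternalRow => InternalRow = identity
    // Joined row: data columns first, partition columns appended.
    dataReader(file).map(row => convert(row ++ file.partitionValues))
  }
}
```

For example, a data reader that yields Seq(1, "a") for a file under a year=2018 partition directory would, after wrapping, yield Seq(1, "a", 2018).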