
TextBasedFileFormat — Base for Text Splitable FileFormats

TextBasedFileFormat is an extension of the FileFormat contract for text-based formats whose files can be split, so that a single file can be read in several parts.

Table 1. TextBasedFileFormats
TextBasedFileFormat	Description
CSVFileFormat
JsonFileFormat
`LibSVMFileFormat`	Used in Spark MLlib
TextFileFormat

TextBasedFileFormat uses Hadoop’s CompressionCodecFactory to find the proper compression codec for the given file.

`isSplitable` Method



isSplitable(
  sparkSession: SparkSession,
  options: Map[String, String],
  path: Path): Boolean

Note	`isSplitable` is part of the FileFormat contract and tells whether a given file can be split.

isSplitable requests the CompressionCodecFactory to find the compression codec for the given file (as the input path) based on its filename suffix.

isSplitable returns true when no compression codec is found for the file (i.e. the lookup returns null) or when the codec is a Hadoop SplittableCompressionCodec (e.g. BZip2Codec). Otherwise, it returns false.

If the CompressionCodecFactory has not been created yet, isSplitable creates it first. The factory is given a Hadoop Configuration, which isSplitable gets by requesting the SessionState for a new Hadoop Configuration with the extra options.

Note	`isSplitable` uses the input `sparkSession` to access SessionState.
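
The steps above can be sketched with Hadoop's API alone. This is a minimal illustration, not Spark's own class: `SplittableCheck` is a hypothetical name, a plain Hadoop `Configuration` stands in for the one Spark builds from the SessionState and the extra options, and `hadoop-common` is assumed to be on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

// Hypothetical helper (not part of Spark): mirrors the described steps
// with Hadoop classes only, so no SparkSession is required.
object SplittableCheck {
  // Created on first use, like the factory described above.
  // Assumption: a default Configuration replaces the one from SessionState.
  private var codecFactory: CompressionCodecFactory = null

  def isSplitable(path: Path): Boolean = {
    if (codecFactory == null) {
      codecFactory = new CompressionCodecFactory(new Configuration())
    }
    // The codec is looked up by filename suffix; null means no codec matched.
    val codec = codecFactory.getCodec(path)
    codec == null || codec.isInstanceOf[SplittableCompressionCodec]
  }

  def main(args: Array[String]): Unit = {
    Seq("data.csv", "data.csv.gz", "data.csv.bz2").foreach { name =>
      println(s"$name -> ${isSplitable(new Path(name))}")
    }
  }
}
```

With Hadoop's default codecs, the uncompressed and bzip2 paths should be reported as splittable and the gzip path as not splittable, since GzipCodec does not implement SplittableCompressionCodec. Only the file name is inspected, so the files do not have to exist.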

Note

The SplittableCompressionCodec interface is for compression codecs that are capable of compressing and decompressing a stream starting at an arbitrary position.

Such codecs are highly valuable, especially in the context of Hadoop, because a compressed input file can be split and its parts processed by multiple machines in parallel.

One such compression codec is BZip2Codec that provides output and input streams for bzip2 compression and decompression.
