TextBasedFileFormat — Base for Text Splitable FileFormats
TextBasedFileFormat is an extension of the FileFormat contract for text-based formats that can be splitable.
TextBasedFileFormat uses Hadoop's CompressionCodecFactory to find the proper compression codec for the given file.
isSplitable Method
isSplitable(
  sparkSession: SparkSession,
  options: Map[String, String],
  path: Path): Boolean
Note
isSplitable is part of the FileFormat contract and tells whether a given file is splitable or not.
isSplitable requests the CompressionCodecFactory to find the compression codec for the given file (the input path) based on its filename suffix.
isSplitable returns true when no compression codec is used (i.e. the codec is null) or when the codec is a Hadoop SplittableCompressionCodec (e.g. BZip2Codec).
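
The check described above can be tried outside Spark with Hadoop's own classes. The following is a minimal sketch, not Spark's source: the object name SplitableCheck and the sample file names are made up for illustration, and it assumes hadoop-common is on the classpath with its default codecs registered.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

// Hypothetical helper object (not part of Spark or Hadoop).
object SplitableCheck {

  // A file is treated as splitable when no codec matches its filename suffix
  // or when the matching codec is a SplittableCompressionCodec.
  def isSplitable(factory: CompressionCodecFactory, path: Path): Boolean = {
    // getCodec returns null when no registered codec matches the suffix
    val codec = factory.getCodec(path)
    codec == null || codec.isInstanceOf[SplittableCompressionCodec]
  }

  def main(args: Array[String]): Unit = {
    val factory = new CompressionCodecFactory(new Configuration())
    // Illustrative file names only; the files do not have to exist,
    // because the lookup is based on the suffix alone.
    Seq("data.csv", "data.csv.gz", "data.csv.bz2").foreach { name =>
      println(s"$name -> ${isSplitable(factory, new Path(name))}")
    }
  }
}
```

Since the lookup only inspects the filename suffix, a compressed file with a missing or misleading extension would be classified by its name and not by its contents.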
If the CompressionCodecFactory has not been created yet, isSplitable creates one first. The factory is built from a Hadoop Configuration that isSplitable gets by requesting the SessionState for a new Hadoop Configuration with the extra options.
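
The create-on-first-use step can be sketched as follows. This is an illustration under stated assumptions, not Spark's implementation: LazyCodecLookup and confWithOptions are hypothetical names, and confWithOptions is only a stand-in for the SessionState request, assuming the extra options are simply copied into a fresh Hadoop Configuration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

// Hypothetical class (not part of Spark): keeps a single factory
// and creates it only when the first lookup arrives.
class LazyCodecLookup(newHadoopConf: Map[String, String] => Configuration) {

  private var codecFactory: CompressionCodecFactory = null

  def isSplitable(options: Map[String, String], path: Path): Boolean = {
    if (codecFactory == null) {
      // First call: build the factory from a configuration with the extra options
      codecFactory = new CompressionCodecFactory(newHadoopConf(options))
    }
    val codec = codecFactory.getCodec(path)
    codec == null || codec.isInstanceOf[SplittableCompressionCodec]
  }
}

object LazyCodecLookupDemo {

  // Assumption: stand-in for asking the SessionState for a Hadoop Configuration
  // with extra options; here the options are just set on a new Configuration.
  def confWithOptions(options: Map[String, String]): Configuration = {
    val conf = new Configuration()
    options.foreach { case (key, value) => if (value != null) conf.set(key, value) }
    conf
  }

  def main(args: Array[String]): Unit = {
    val lookup = new LazyCodecLookup(confWithOptions)
    println(lookup.isSplitable(Map.empty, new Path("events.json")))
    println(lookup.isSplitable(Map.empty, new Path("events.json.gz")))
  }
}
```

One consequence of this design is that the factory is built from the options of the first call only; later calls reuse it regardless of the options they pass.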
Note
isSplitable uses the input sparkSession to access SessionState.
Note
The SplittableCompressionCodec interface is for compression codecs that are capable of compressing and decompressing a stream starting at an arbitrary position. Such codecs are highly valuable, especially in the context of Hadoop, because a compressed input file can be split and hence processed by multiple machines in parallel. One such compression codec is BZip2Codec, which provides output and input streams for bzip2 compression and decompression.