FileStreamSource

FileStreamSource is a Source that reads text files from the path directory as they appear. It uses LongOffset offsets.

Note
It is used by DataSource.createSource for FileFormat.

At instantiation time you can provide the schema of the data and dataFrameBuilder, the function that getBatch uses to build a DataFrame.

Batches are indexed.

It lives in the org.apache.spark.sql.execution.streaming package.

It tracks already-processed files in the seenFiles hash map.

Tip

Enable DEBUG or TRACE logging level for org.apache.spark.sql.execution.streaming.FileStreamSource to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.streaming.FileStreamSource=TRACE

Refer to Logging.

Creating FileStreamSource Instance

Caution
FIXME

Options

maxFilesPerTrigger

The maxFilesPerTrigger option specifies the maximum number of files per trigger (batch). The file stream source reads at most that many files at a time, which gives you rate limiting.

It also lets a static set of files be used like a stream for testing, because the file set is processed maxFilesPerTrigger files at a time.
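The effect on a static file set can be pictured with a small model. This is only a sketch of the rate limit, not Spark code: plan_batches is a hypothetical helper, and it assumes files are taken in listing order.

```python
from typing import Iterator, List, Optional


def plan_batches(
    files: List[str], max_files_per_trigger: Optional[int]
) -> Iterator[List[str]]:
    """Yield the files each trigger would pick up from a static file set.

    Hypothetical helper: it models the per-trigger cap only,
    not Spark's scheduling or file ordering.
    """
    if max_files_per_trigger is None:
        # No limit configured: all unread files land in one batch.
        if files:
            yield list(files)
        return
    if max_files_per_trigger <= 0:
        raise ValueError("max_files_per_trigger must be a positive integer")
    # Walk the file list in steps of at most max_files_per_trigger.
    for start in range(0, len(files), max_files_per_trigger):
        yield files[start:start + max_files_per_trigger]


static_files = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"]
batches = list(plan_batches(static_files, max_files_per_trigger=2))
print(batches)
```

With five files and a limit of two, the model produces three batches, so a fixed directory behaves like a short stream.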

schema

If the schema is specified at instantiation time (using the optional dataSchema constructor parameter), it is returned.

Otherwise, the fetchAllFiles internal method is called to list all the files in the directory.

When there is at least one file, the schema is calculated using the dataFrameBuilder constructor parameter function. When there are no files, an IllegalArgumentException("No schema specified") is thrown, unless the provider (the providerName constructor parameter) is text, in which case the default schema with a single value column of type StringType is assumed.

Note
text as the value of the providerName constructor parameter denotes the text file stream provider.
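The decision order described above can be written as a short model. This is a sketch, not Spark's implementation: resolve_schema, the tuple-based Schema type, and the infer_schema callback (standing in for the schema of whatever dataFrameBuilder returns) are all assumptions made for illustration, and ValueError stands in for IllegalArgumentException.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# (column name, type name) pairs; a stand-in for Spark's StructType.
Schema = List[Tuple[str, str]]


def resolve_schema(
    data_schema: Optional[Schema],
    provider_name: str,
    fetch_all_files: Callable[[], Sequence[str]],
    infer_schema: Callable[[Sequence[str]], Schema],
) -> Schema:
    """Hypothetical model of the schema lookup order described in the text."""
    if data_schema is not None:
        # 1. A schema given at instantiation time wins.
        return data_schema
    files = fetch_all_files()
    if files:
        # 2. Otherwise infer it from the files that are already present.
        return infer_schema(files)
    if provider_name == "text":
        # 3. The text provider falls back to a single string column.
        return [("value", "StringType")]
    # 4. Nothing to go on.
    raise ValueError("No schema specified")


explicit = resolve_schema([("id", "LongType")], "json", lambda: [], lambda fs: [])
text_default = resolve_schema(None, "text", lambda: [], lambda fs: [])
print(explicit, text_default)
```

Calling it with no schema, no files, and a provider other than text raises the error, which mirrors the "No schema specified" case.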

getOffset Method

The maximum offset (getOffset) is calculated by fetching all the files in path, excluding files whose names start with _ (underscore), and then filtering out the files that were already seen (tracked in the seenFiles internal registry).

While computing the maximum offset, getOffset writes DEBUG messages to the logs, including a message that depends on the status of a file.
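A rough model of the filtering may help. It is a sketch under assumptions, not the Spark source: OffsetTracker is a hypothetical class, the offset is assumed to grow by one whenever a call finds new files, and -1 is used here to mean "nothing seen yet".

```python
import os
from typing import Dict, List, Set


class OffsetTracker:
    """Hypothetical model of offset computation over a directory listing."""

    def __init__(self) -> None:
        self.seen_files: Set[str] = set()       # plays the role of seenFiles
        self.max_batch_id = -1                  # -1 means no batch yet (model convention)
        self.batches: Dict[int, List[str]] = {}  # batch id -> files in that batch

    def get_offset(self, listed_paths: List[str]) -> int:
        # Skip files whose names start with an underscore, e.g. _SUCCESS.
        visible = [
            p for p in listed_paths
            if not os.path.basename(p).startswith("_")
        ]
        # Keep only files that were not seen by an earlier call.
        new_files = sorted(p for p in visible if p not in self.seen_files)
        if new_files:
            # Assumption: new files form one new batch and advance the offset.
            self.max_batch_id += 1
            self.seen_files.update(new_files)
            self.batches[self.max_batch_id] = new_files
        return self.max_batch_id


tracker = OffsetTracker()
first = tracker.get_offset(["in/a.txt", "in/_SUCCESS"])
second = tracker.get_offset(["in/a.txt", "in/_SUCCESS"])
third = tracker.get_offset(["in/a.txt", "in/b.txt"])
print(first, second, third)
```

The second call returns the same offset as the first because nothing new appeared; only the arrival of in/b.txt moves the offset forward.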

Generating DataFrame for Streaming Batch — getBatch Method

FileStreamSource.getBatch asks metadataLog for the batch.

getBatch writes INFO and DEBUG messages to the logs as it runs.

The method to create a result batch is given at instantiation time (as dataFrameBuilder constructor parameter).
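The lookup-then-build flow can be sketched as follows. This is an illustrative model only: the dictionary stands in for metadataLog, get_batch is a hypothetical function, and the range convention (start exclusive, end inclusive) is an assumption rather than something the text states.

```python
from typing import Callable, Dict, List, Optional, TypeVar

T = TypeVar("T")


def get_batch(
    metadata_log: Dict[int, List[str]],
    start: Optional[int],
    end: int,
    data_frame_builder: Callable[[List[str]], T],
) -> T:
    """Hypothetical model: collect the logged files for a range of batches
    and hand them to the builder supplied at instantiation time."""
    # Assumption: a missing start offset means "from the very first batch".
    start_id = -1 if start is None else start
    files: List[str] = []
    for batch_id in range(start_id + 1, end + 1):
        files.extend(metadata_log.get(batch_id, []))
    return data_frame_builder(files)


metadata_log = {0: ["a.txt"], 1: ["b.txt", "c.txt"], 2: ["d.txt"]}
# The builder here just wraps the file list; in Spark it would return a DataFrame.
batch = get_batch(metadata_log, start=0, end=2, data_frame_builder=lambda fs: {"files": fs})
print(batch)
```

Keeping the builder as a parameter is what lets the same source serve different file formats: the source only decides which files belong to a batch.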

metadataLog

metadataLog is the metadata storage kept at the metadataPath path (which is a constructor parameter).

Note
It extends HDFSMetadataLog[Seq[String]].
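To make the idea of a per-batch metadata store concrete, here is a minimal file-backed model. It is not the HDFSMetadataLog API: FileMetadataLog, its method names, the one-file-per-batch layout, and the JSON encoding are all assumptions chosen for illustration.

```python
import json
import os
import tempfile
from typing import List, Optional, Tuple


class FileMetadataLog:
    """Hypothetical log that stores the file list of each batch under metadata_path."""

    def __init__(self, metadata_path: str) -> None:
        self.metadata_path = metadata_path
        os.makedirs(metadata_path, exist_ok=True)

    def _path(self, batch_id: int) -> str:
        return os.path.join(self.metadata_path, str(batch_id))

    def add(self, batch_id: int, files: List[str]) -> bool:
        try:
            # Mode "x" fails if the file exists, so a batch is written at most once.
            with open(self._path(batch_id), "x", encoding="utf-8") as out:
                json.dump(files, out)
        except FileExistsError:
            return False
        return True

    def get(self, batch_id: int) -> Optional[List[str]]:
        if not os.path.exists(self._path(batch_id)):
            return None
        with open(self._path(batch_id), encoding="utf-8") as src:
            return json.load(src)

    def get_latest(self) -> Optional[Tuple[int, List[str]]]:
        ids = [int(n) for n in os.listdir(self.metadata_path) if n.isdigit()]
        if not ids:
            return None
        latest = max(ids)
        return latest, self.get(latest)


with tempfile.TemporaryDirectory() as metadata_path:
    log = FileMetadataLog(metadata_path)
    first_write = log.add(0, ["in/a.txt"])
    second_write = log.add(0, ["in/other.txt"])  # same batch id, rejected
    log.add(1, ["in/b.txt", "in/c.txt"])
    latest = log.get_latest()
    print(first_write, second_write, latest)
```

Refusing to overwrite an existing batch keeps a batch's file list stable once it has been recorded, which is the property a replayable source needs from such a log.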
Caution
FIXME Review HDFSMetadataLog