关注 spark技术分享,
撸spark源码 玩spark最佳实践

RateStreamSource

RateStreamSource

RateStreamSource is a streaming source that generates consecutive numbers with timestamp that can be useful for testing and PoCs.

RateStreamSource is created for rate format (that is registered by RateSourceProvider).

Table 1. RateStreamSource’s Options
Name Default Value Description

numPartitions

(default parallelism)

Number of partitions to use

rampUpTime

0 (seconds)

rowsPerSecond

1

Number of rows to generate per second (has to be greater than 0)

RateStreamSource uses a predefined schema that cannot be changed.

Table 2. RateStreamSource’s Dataset Schema (in the positional order)
Name Type

timestamp

TimestampType

value

LongType

Table 3. RateStreamSource’s Internal Registries and Counters
Name Description

clock

lastTimeMs

maxSeconds

startTimeMs

Tip

Enable INFO or DEBUG logging levels for org.apache.spark.sql.execution.streaming.RateStreamSource to see what happens inside.

Add the following line to conf/log4j.properties:

Refer to Logging.

Getting Maximum Available Offsets — getOffset Method

Note
getOffset is a part of the Source Contract.
Caution
FIXME

Generating DataFrame for Streaming Batch — getBatch Method

Note
getBatch is a part of Source Contract.

Internally, getBatch calculates the seconds to start from and end at (from the input start and end offsets) or assumes 0.

getBatch then calculates the values to generate for the start and end seconds.

You should see the following DEBUG message in the logs:

If the start and end ranges are equal, getBatch creates an empty DataFrame (with the schema) and returns.

Otherwise, when the ranges are different, getBatch creates a DataFrame using SparkContext.range operator (for the start and end ranges and numPartitions partitions).

Creating RateStreamSource Instance

RateStreamSource takes the following when created:

  • SQLContext

  • Path to the metadata

  • Rows per second

  • RampUp time in seconds

  • Number of partitions

  • Flag to whether to use ManualClock (true) or SystemClock (false)

RateStreamSource initializes the internal registries and counters.

赞(0) 打赏
未经允许不得转载:spark技术分享 » RateStreamSource
分享到: 更多 (0)

关注公众号:spark技术分享

联系我们联系我们

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏