
KafkaSource

Note
Kafka topics are checked for new records every trigger, so there is some noticeable delay between when the records arrive in Kafka topics and when a Spark application processes them.
Note

Structured Streaming support for Kafka is in a separate spark-sql-kafka-0-10 module (aka library dependency).

spark-sql-kafka-0-10 module is not included by default so you have to start spark-submit (and “derivatives” like spark-shell) with --packages command-line option to “install” it.
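
For example (a sketch; the Scala version suffix and the module version have to match your Spark installation):

    ./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0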

Replace the version of spark-sql-kafka-0-10 module (e.g. 2.4.0 above) with one of the available versions found at The Central Repository’s Search that matches your version of Spark.

KafkaSource uses the streaming metadata log directory to persist offsets. The directory is named after the source ID and lives under the sources directory in the checkpointRoot (of the StreamExecution).
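
For example, assuming a query checkpointed at /tmp/checkpoint, the first source (with source ID 0) would keep its metadata under /tmp/checkpoint/sources/0.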

Note

The checkpointRoot directory is one of the following:

  • checkpointLocation option of the streaming query

  • spark.sql.streaming.checkpointLocation configuration property

KafkaSource is created for the kafka format (which is registered by KafkaSourceProvider).

Figure 1. KafkaSource Is Created for kafka Format by KafkaSourceProvider
Table 1. KafkaSource’s Options

kafkaConsumer.pollTimeoutMs

Default: spark.network.timeout (or 120s if not set)

The time (in milliseconds) to spend waiting in Consumer.poll when data is not available in the buffer.

Used exclusively to create a KafkaSourceRDD when KafkaSource is requested to generate a DataFrame with records from Kafka for a streaming batch.

maxOffsetsPerTrigger

Default: (empty)

The maximum number of records to fetch per trigger (i.e. an upper bound on how many records a single streaming batch reads).

Unless defined, KafkaSource requests KafkaOffsetReader for the latest offsets.

startingoffsets

Possible values:

  • latest

  • earliest

  • JSON with topics, partitions and their offsets, e.g. {"topicA": {"0": 23, "1": -1}, "topicB": {"0": -2}} (in this JSON, -1 denotes the latest and -2 the earliest offset of a partition); see also the combined example after this table.

Tip

Use Scala’s triple quotes for the JSON for topics, partitions and offsets.

assign

Topic subscription strategy that accepts a JSON with topic names and partitions, e.g. {"topicA": [0, 1], "topicB": [2, 4]}

Note
Exactly one topic subscription strategy is allowed (that KafkaSourceProvider validates before creating KafkaSource).

subscribe

Topic subscription strategy that accepts topic names as a comma-separated string, e.g. topic1,topic2

Note
Exactly one topic subscription strategy is allowed (that KafkaSourceProvider validates before creating KafkaSource).

subscribepattern

Topic subscription strategy that uses Java’s java.util.regex.Pattern as a regular expression matching the names of the topics to subscribe to, e.g. topic\d

Tip

Use Scala’s triple quotes for the regular expression for the topic subscription regex pattern.

Note
Exactly one topic subscription strategy is allowed (that KafkaSourceProvider validates before creating KafkaSource).
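
To see the options in context, the following is a minimal streaming query (a sketch only; the bootstrap servers, topic names, partitions and offsets are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("KafkaSourceDemo").getOrCreate()

    // Streaming read with one subscription strategy (subscribe), per-partition starting
    // offsets (triple-quoted JSON; -2 = earliest, -1 = latest) and a per-trigger cap.
    // When specific offsets are given, all partitions of the subscribed topics are
    // expected to be covered.
    val records = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic1,topic2")
      .option("startingoffsets", """{"topic1": {"0": 23, "1": -2}, "topic2": {"0": -1}}""")
      .option("maxOffsetsPerTrigger", "10000")
      .load()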

KafkaSource uses a predefined (fixed) schema that cannot be changed.

Table 2. KafkaSource’s Dataset Schema (in the positional order)

  • key: BinaryType

  • value: BinaryType

  • topic: StringType

  • partition: IntegerType

  • offset: LongType

  • timestamp: TimestampType

  • timestampType: IntegerType

Tip

Use cast method (of Column) to cast BinaryType to a string (for key and value columns).
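
For example (a sketch, assuming the records Dataset from the streaming example above):

    import org.apache.spark.sql.functions.col

    // Convert the binary key and value columns to strings using Column.cast.
    val stringified = records.select(
      col("key").cast("string"),
      col("value").cast("string"))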

KafkaSource also supports batch Datasets.
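
For example, a minimal batch query could look as follows (again a sketch, reusing the spark session from the streaming example above; the bootstrap servers and topic name are placeholders):

    // Batch (non-streaming) read of the current contents of a topic.
    val batchRecords = spark
      .read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic1")
      .load()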

Table 3. KafkaSource’s Internal Registries and Counters
Name Description

currentPartitionOffsets

Current partition offsets (as Map[TopicPartition, Long])

Initially None and set when KafkaSource is requested to get the maximum available offsets or generate a DataFrame with records from Kafka for a streaming batch.

Tip

Enable INFO or DEBUG logging levels for org.apache.spark.sql.kafka010.KafkaSource to see what happens inside.

Add the following line to conf/log4j.properties:
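
    log4j.logger.org.apache.spark.sql.kafka010.KafkaSource=DEBUG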

Refer to Logging.

rateLimit Internal Method

rateLimit requests KafkaOffsetReader to fetchEarliestOffsets.

Caution
FIXME
Note
rateLimit is used exclusively when KafkaSource gets available offsets (when maxOffsetsPerTrigger option is specified).
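
For a rough intuition (assuming the limit is spread across partitions in proportion to how many new records each has available): with maxOffsetsPerTrigger set to 100 and two partitions lagging by 300 and 100 records since the last batch, about 75 and 25 records would be admitted, respectively.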

getSortedExecutorList Method

Caution
FIXME

reportDataLoss Internal Method

Caution
FIXME
Note

reportDataLoss is used when KafkaSource generates a DataFrame with records from Kafka for a streaming batch (getBatch) and fetches and verifies specific offsets (fetchAndVerify).

Generating DataFrame with Records From Kafka for Streaming Batch — getBatch Method

Note
getBatch is a part of the Source Contract.

getBatch initializes initial partition offsets (unless initialized already).

You should see the following INFO message in the logs:

getBatch requests KafkaSourceOffset for end partition offsets for the input end offset (known as untilPartitionOffsets).

getBatch requests KafkaSourceOffset for start partition offsets for the input start offset (if defined) or uses initial partition offsets (known as fromPartitionOffsets).

getBatch finds the new partitions (as the difference between the topic partitions in untilPartitionOffsets and fromPartitionOffsets) and requests KafkaOffsetReader to fetch their earliest offsets.

getBatch reports a data loss if the new partitions don’t match what KafkaOffsetReader fetched.

You should see the following INFO message in the logs:

getBatch reports a data loss if the earliest offsets of the new partitions are not 0.

getBatch reports a data loss if the fromPartitionOffsets partitions differ from untilPartitionOffsets partitions.

You should see the following DEBUG message in the logs:

getBatch gets the executors (sorted by executorId and host of the registered block managers).

Important
That is when getBatch goes very low-level to allow for cached KafkaConsumers in the executors to be re-used to read the same partition in every batch (aka location preference).

You should see the following DEBUG message in the logs:

getBatch creates a KafkaSourceRDDOffsetRange per TopicPartition.

getBatch filters out the KafkaSourceRDDOffsetRanges whose until offsets are smaller than their from offsets, and reports a data loss for any such range.

getBatch creates a KafkaSourceRDD (with executorKafkaParams, pollTimeoutMs and reuseKafkaConsumer flag enabled) and maps it to an RDD of InternalRow.

Important
getBatch creates a KafkaSourceRDD with reuseKafkaConsumer flag enabled.

You should see the following INFO message in the logs:

getBatch sets currentPartitionOffsets if it was empty (which is when…FIXME)

In the end, getBatch creates a DataFrame from the RDD of InternalRow and schema.
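
The offset-range computation above can be summarized with the following self-contained sketch (not the actual Spark source; OffsetRange is a hypothetical stand-in for KafkaSourceRDDOffsetRange, and preferred executor locations as well as the reportDataLoss calls are left out):

    import org.apache.kafka.common.TopicPartition

    case class OffsetRange(topicPartition: TopicPartition, fromOffset: Long, untilOffset: Long)

    def offsetRanges(
        fromPartitionOffsets: Map[TopicPartition, Long],
        untilPartitionOffsets: Map[TopicPartition, Long],
        newPartitionEarliestOffsets: Map[TopicPartition, Long]): Seq[OffsetRange] =
      untilPartitionOffsets.toSeq
        .map { case (tp, until) =>
          // New partitions start from their fetched earliest offsets (expected to be 0);
          // existing partitions continue from fromPartitionOffsets.
          val from = fromPartitionOffsets.getOrElse(tp, newPartitionEarliestOffsets.getOrElse(tp, 0L))
          OffsetRange(tp, from, until)
        }
        // Ranges whose until offset is smaller than the from offset indicate a data loss and are dropped.
        .filter(range => range.untilOffset >= range.fromOffset)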

Fetching Offsets (From Metadata Log or Kafka Directly) — getOffset Method

Note
getOffset is a part of the Source Contract.

Internally, getOffset fetches the initial partition offsets (from the metadata log or Kafka directly).

Figure 2. KafkaSource Initializing initialPartitionOffsets While Fetching Initial Offsets
Note
initialPartitionOffsets is a lazy value and is initialized the very first time getOffset is called (which is when StreamExecution constructs a streaming batch).

getOffset requests KafkaOffsetReader to fetchLatestOffsets (known later as latest).

Note
(Possible performance degradation?) It is possible that getOffset will request the latest offsets from Kafka twice, i.e. while initializing initialPartitionOffsets (when no metadata log is available and KafkaSource’s KafkaOffsetRangeLimit is LatestOffsetRangeLimit) and always as part of getOffset itself.

getOffset then calculates currentPartitionOffsets based on the maxOffsetsPerTrigger option.

Table 4. getOffset’s Offset Calculation per maxOffsetsPerTrigger

  • Unspecified (i.e. None): latest

  • Defined (but currentPartitionOffsets is empty): rateLimit with maxOffsetsPerTrigger as the limit, initialPartitionOffsets as from and latest as until

  • Defined (and currentPartitionOffsets contains partitions and offsets): rateLimit with maxOffsetsPerTrigger as the limit, currentPartitionOffsets as from and latest as until

You should see the following DEBUG message in the logs:

In the end, getOffset creates a KafkaSourceOffset with offsets (as Map[TopicPartition, Long]).
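
Purely as an illustration of the shape of that map (the topic name, partitions and offsets below are made up):

    import org.apache.kafka.common.TopicPartition

    // The kind of value getOffset wraps in a KafkaSourceOffset.
    val offsets: Map[TopicPartition, Long] = Map(
      new TopicPartition("topic1", 0) -> 100L,
      new TopicPartition("topic1", 1) -> 42L)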

Creating KafkaSource Instance

KafkaSource takes the following when created:

KafkaSource initializes the internal registries and counters.

Fetching and Verifying Specific Offsets — fetchAndVerify Internal Method

fetchAndVerify requests KafkaOffsetReader to fetchSpecificOffsets for the given specificOffsets.

fetchAndVerify makes sure that the starting offsets in specificOffsets are the same as in Kafka and reports a data loss otherwise.

In the end, fetchAndVerify creates a KafkaSourceOffset (with the result of KafkaOffsetReader).

Note
fetchAndVerify is used exclusively when KafkaSource initializes initial partition offsets.

Initial Partition Offsets (of 0th Batch) — initialPartitionOffsets Internal Lazy Property

initialPartitionOffsets is the initial partition offsets for the batch 0 that were already persisted in the streaming metadata log directory or persisted on demand.

As the very first step, initialPartitionOffsets creates a custom HDFSMetadataLog (of KafkaSourceOffset metadata) in the streaming metadata log directory.

initialPartitionOffsets requests the HDFSMetadataLog for the metadata of the 0th batch (as KafkaSourceOffset).

If the metadata is available, initialPartitionOffsets requests the metadata for the collection of TopicPartitions and their offsets.

If the metadata could not be found, initialPartitionOffsets creates a new KafkaSourceOffset per KafkaOffsetRangeLimit:

  • For EarliestOffsetRangeLimit, initialPartitionOffsets requests KafkaOffsetReader to fetchEarliestOffsets

  • For LatestOffsetRangeLimit, initialPartitionOffsets requests KafkaOffsetReader to fetchLatestOffsets

  • For SpecificOffsetRangeLimit, initialPartitionOffsets fetches and verifies the specific offsets (fetchAndVerify)

initialPartitionOffsets requests the custom HDFSMetadataLog to add the offsets to the metadata log (as the metadata of the 0th batch).

initialPartitionOffsets prints out the following INFO message to the logs:

Note

initialPartitionOffsets is used when KafkaSource is requested for the available offsets (getOffset, including rateLimit) and a DataFrame with records from Kafka for a streaming batch (getBatch).

HDFSMetadataLog.serialize

Note
serialize is part of the HDFSMetadataLog Contract to…FIXME.

serialize requests the OutputStream to write a zero byte (to support Spark 2.1.0 as per SPARK-19517).

serialize creates a BufferedWriter over an OutputStreamWriter over the OutputStream (with UTF_8 charset encoding).

serialize requests the BufferedWriter to write the v1 version indicator followed by a new line.

serialize then requests the KafkaSourceOffset for a JSON-serialized representation and the BufferedWriter to write it out.

In the end, serialize requests the BufferedWriter to flush (the underlying stream).
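
The steps above can be captured in the following simplified sketch (not the actual Spark source; offsetsJson stands in for the JSON-serialized KafkaSourceOffset):

    import java.io.{BufferedWriter, OutputStream, OutputStreamWriter}
    import java.nio.charset.StandardCharsets.UTF_8

    def serializeSketch(out: OutputStream, offsetsJson: String): Unit = {
      out.write(0)                 // the zero byte for Spark 2.1.0 compatibility (SPARK-19517)
      val writer = new BufferedWriter(new OutputStreamWriter(out, UTF_8))
      writer.write("v1")           // the version indicator
      writer.write("\n")
      writer.write(offsetsJson)    // the JSON-serialized offsets
      writer.flush()               // flush the underlying stream
    }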
