
KafkaSourceRDDOffsetRange

KafkaSourceRDDOffsetRange is an offset range that one KafkaSourceRDDPartition partition of a KafkaSourceRDD has to read.

KafkaSourceRDDOffsetRange is created when:

KafkaSourceRDDOffsetRange takes the following when created:

Note
TopicPartition is a topic name and partition number.
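A minimal sketch of the shape of KafkaSourceRDDOffsetRange, assuming it carries the topic partition, the from/until offsets and an optional preferred executor location (the field names and helper methods are assumptions; the real class may differ in modifiers):

  import org.apache.kafka.common.TopicPartition

  // Sketch only: field names are assumptions based on the description above.
  case class KafkaSourceRDDOffsetRange(
      topicPartition: TopicPartition,
      fromOffset: Long,
      untilOffset: Long,
      preferredLoc: Option[String]) {
    def topic: String = topicPartition.topic
    def partition: Int = topicPartition.partition
    def size: Long = untilOffset - fromOffset  // number of records in the range
  }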


KafkaSourceRDD

KafkaSourceRDD is an RDD of Kafka’s ConsumerRecords (with keys and values being collections of bytes, i.e. Array[Byte]).

KafkaSourceRDD uses KafkaSourceRDDPartition for the partitions.

KafkaSourceRDD has a specialized API for the following RDD operators:

KafkaSourceRDD is created when:

Creating KafkaSourceRDD Instance

KafkaSourceRDD takes the following when created:

  • SparkContext

  • Collection of key-value settings for executors reading records from Kafka topics

  • Collection of KafkaSourceRDDOffsetRanges

  • Timeout (in milliseconds) to poll data from Kafka

    Used exclusively when KafkaSourceRDD is requested to compute an RDD partition (and requests a KafkaDataConsumer for a ConsumerRecord)

  • failOnDataLoss flag to control…​FIXME

  • reuseKafkaConsumer flag to control…​FIXME

KafkaSourceRDD initializes the internal registries and counters.
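Put together, a hedged sketch of the constructor shape (the parameter names follow the list above and reuse the KafkaSourceRDDOffsetRange sketch from earlier; the real class is package-private in org.apache.spark.sql.kafka010 and may differ):

  import java.{util => ju}
  import org.apache.kafka.clients.consumer.ConsumerRecord
  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // Sketch only: mirrors the constructor parameters listed above.
  class KafkaSourceRDDSketch(
      sc: SparkContext,
      executorKafkaParams: ju.Map[String, Object],
      offsetRanges: Seq[KafkaSourceRDDOffsetRange],
      pollTimeoutMs: Long,
      failOnDataLoss: Boolean,
      reuseKafkaConsumer: Boolean)
    extends RDD[ConsumerRecord[Array[Byte], Array[Byte]]](sc, Nil) {

    // One KafkaSourceRDDPartition per offset range (stub).
    override def getPartitions: Array[Partition] = ???

    // Reads Kafka records for the partition's offset range (stub).
    override def compute(split: Partition, context: TaskContext): Iterator[ConsumerRecord[Array[Byte], Array[Byte]]] = ???
  }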

Computing Partition (in TaskContext) — compute Method

Note
compute is part of Spark Core’s RDD Contract to compute a partition (in a TaskContext).

compute…​FIXME

count Operator

Note
count is part of Spark Core’s RDD Contract to…​FIXME.

count…​FIXME

countApprox Operator

Note
countApprox is part of Spark Core’s RDD Contract to…​FIXME.

countApprox…​FIXME

isEmpty Operator

Note
isEmpty is part of Spark Core’s RDD Contract to…​FIXME.

isEmpty…​FIXME

persist Operator

Note
persist is part of Spark Core’s RDD Contract to…​FIXME.

persist…​FIXME

getPartitions Method

Note
getPartitions is part of Spark Core’s RDD Contract to…​FIXME

getPreferredLocations Method

Note
getPreferredLocations is part of the RDD Contract to…​FIXME.

getPreferredLocations…​FIXME

resolveRange Internal Method

resolveRange…​FIXME

Note
resolveRange is used exclusively when KafkaSourceRDD is requested to compute a partition.


KafkaRelation

KafkaRelation is a BaseRelation with a TableScan.

KafkaRelation is created exclusively when KafkaSourceProvider is requested to create a BaseRelation (as a RelationProvider).

KafkaRelation uses the fixed schema.

Table 1. KafkaRelation’s Schema (in the positional order)

  Field Name      Data Type
  key             BinaryType
  value           BinaryType
  topic           StringType
  partition       IntegerType
  offset          LongType
  timestamp       TimestampType
  timestampType   IntegerType
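That fixed schema can be inspected on any Kafka-sourced DataFrame; below, df stands for any DataFrame loaded with the kafka format (see Kafka Data Source later in this page):

  df.printSchema
  // root
  //  |-- key: binary (nullable = true)
  //  |-- value: binary (nullable = true)
  //  |-- topic: string (nullable = true)
  //  |-- partition: integer (nullable = true)
  //  |-- offset: long (nullable = true)
  //  |-- timestamp: timestamp (nullable = true)
  //  |-- timestampType: integer (nullable = true)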

KafkaRelation uses the following human-readable text representation:

Table 2. KafkaRelation’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

pollTimeoutMs

Timeout (in milliseconds) to poll data from Kafka (pollTimeoutMs for KafkaSourceRDD)

Initialized with the value of the first of the following configuration properties that is set:

  1. kafkaConsumer.pollTimeoutMs in the source options

  2. spark.network.timeout in the SparkConf

If neither is set, defaults to 120s.

Used exclusively when KafkaRelation is requested to build a distributed data scan with column pruning (and creates a KafkaSourceRDD).
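A minimal sketch of that resolution order (sourceOptions and sc are assumed to be the source options map and the SparkContext, respectively):

  // Falls back from the source option to spark.network.timeout, then to 120s.
  val pollTimeoutMs: Long = sourceOptions.getOrElse(
    "kafkaConsumer.pollTimeoutMs",
    (sc.getConf.getTimeAsSeconds("spark.network.timeout", "120s") * 1000L).toString
  ).toLong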

Tip

Enable INFO or DEBUG logging level for org.apache.spark.sql.kafka010.KafkaRelation logger to see what happens inside.

Add the following line to conf/log4j.properties:
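  log4j.logger.org.apache.spark.sql.kafka010.KafkaRelation=DEBUG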

Refer to Logging.

Creating KafkaRelation Instance

KafkaRelation takes the following when created:

  • SQLContext

  • ConsumerStrategy

  • Source options (as Map[String, String]) that directly correspond to the options of DataFrameReader

  • User-defined Kafka parameters (as Map[String, String])

  • failOnDataLoss flag

  • Starting offsets (as KafkaOffsetRangeLimit)

  • Ending offsets (as KafkaOffsetRangeLimit)

KafkaRelation initializes the internal registries and counters.

Building Distributed Data Scan with Column Pruning (as TableScan) — buildScan Method

Note
buildScan is part of TableScan Contract to build a distributed data scan with column pruning.

buildScan creates the Kafka parameters for the driver (kafkaParamsForDriver) from the user-defined Kafka parameters and uses them to create a KafkaOffsetReader (together with the ConsumerStrategy, the source options and a unique group ID of the format spark-kafka-relation-[randomUUID]-driver).

buildScan then uses the KafkaOffsetReader to getPartitionOffsets for the starting and ending offsets and closes it right after.

buildScan creates a KafkaSourceRDDOffsetRange for every pair of the starting and ending offsets.

buildScan prints out the following INFO message to the logs:

buildScan then creates the Kafka parameters for executors (kafkaParamsForExecutors) and uses them to create a KafkaSourceRDD (with the pollTimeoutMs) and maps over all the elements (using the RDD.map operator, which creates a MapPartitionsRDD).

Tip
Use RDD.toDebugString to see the two RDDs, i.e. KafkaSourceRDD and MapPartitionsRDD, in the RDD lineage.

In the end, buildScan requests the SQLContext to create a DataFrame from the KafkaSourceRDD and the schema.

buildScan throws an IllegalStateException when the topic partitions of the starting offsets are different from those of the ending offsets:

getPartitionOffsets Internal Method

getPartitionOffsets requests the input KafkaOffsetReader to fetchTopicPartitions.

getPartitionOffsets uses the input KafkaOffsetRangeLimit to return the mapping of offsets per Kafka TopicPartition fetched (see the sketch after this list):

  1. For EarliestOffsetRangeLimit, getPartitionOffsets returns a map with every TopicPartition and -2L (as the offset)

  2. For LatestOffsetRangeLimit, getPartitionOffsets returns a map with every TopicPartition and -1L (as the offset)

  3. For SpecificOffsetRangeLimit, getPartitionOffsets returns a map from validateTopicPartitions
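A sketch of that mapping follows; offsetRangeLimit, partitions and validateTopicPartitions are assumed names for the input KafkaOffsetRangeLimit, the fetched TopicPartitions and the inner method described below. -2L and -1L are Kafka's sentinel values for the earliest and latest offsets.

  val partitionOffsets: Map[TopicPartition, Long] = offsetRangeLimit match {
    case EarliestOffsetRangeLimit =>
      partitions.map(tp => tp -> -2L).toMap   // earliest
    case LatestOffsetRangeLimit =>
      partitions.map(tp => tp -> -1L).toMap   // latest
    case SpecificOffsetRangeLimit(specified) =>
      validateTopicPartitions(partitions, specified)
  }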

Note
getPartitionOffsets is used exclusively when KafkaRelation is requested to build a distributed data scan with column pruning (as a TableScan).

Validating TopicPartitions (Against Partition Offsets) — validateTopicPartitions Inner Method

Note
validateTopicPartitions is a Scala inner method of getPartitionOffsets, i.e. it is defined within the body of getPartitionOffsets and is therefore visible and usable only there.

validateTopicPartitions asserts that the input set of Kafka TopicPartitions is exactly the set of the keys in the input partitionOffsets.

validateTopicPartitions prints out the following DEBUG message to the logs:

In the end, validateTopicPartitions returns the input partitionOffsets.

If the input set of Kafka TopicPartitions is not the set of the keys in the input partitionOffsets, validateTopicPartitions throws an AssertionError:


KafkaSourceProvider

KafkaSourceProvider is a DataSourceRegister and registers itself to handle kafka data source format.

Note
KafkaSourceProvider uses the META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file (available in the source code of Apache Spark) for the registration.

KafkaSourceProvider is a RelationProvider and a CreatableRelationProvider.

KafkaSourceProvider uses a fixed schema (and makes sure that a user did not set a custom one).

Note

KafkaSourceProvider is also a StreamSourceProvider, a StreamSinkProvider, a StreamWriteSupport and a ContinuousReadSupport that are contracts used in Spark Structured Streaming.

You can find more on Spark Structured Streaming in my gitbook Spark Structured Streaming.

Creating BaseRelation — createRelation Method (from RelationProvider)

Note
createRelation is part of RelationProvider Contract to create a BaseRelation (for reading or writing).

createRelation starts by validating the Kafka options (for batch queries) in the input parameters.

createRelation collects all kafka.-prefixed key options (in the input parameters) and creates a local specifiedKafkaParams with the keys without the kafka. prefix (e.g. kafka.whatever is simply whatever).
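A sketch of that prefix handling (parameters stands for the case-insensitive options passed to createRelation):

  import java.util.Locale

  // Keep only kafka.-prefixed options and drop the "kafka." prefix from the keys.
  val specifiedKafkaParams: Map[String, String] =
    parameters
      .keySet
      .filter(_.toLowerCase(Locale.ROOT).startsWith("kafka."))
      .map { k => k.drop(6) -> parameters(k) }
      .toMap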

createRelation gets the desired KafkaOffsetRangeLimit with the startingoffsets offset option key (in the given parameters) and EarliestOffsetRangeLimit as the default offsets.

createRelation makes sure that the KafkaOffsetRangeLimit is not LatestOffsetRangeLimit or throws an AssertionError.

createRelation gets the desired KafkaOffsetRangeLimit, but this time with the endingoffsets offset option key (in the given parameters) and LatestOffsetRangeLimit as the default offsets.

createRelation makes sure that the KafkaOffsetRangeLimit is not EarliestOffsetRangeLimit or throws an AssertionError.

In the end, createRelation creates a KafkaRelation with the subscription strategy (in the given parameters), failOnDataLoss option, and the starting and ending offsets.

Validating Kafka Options (for Batch Queries) — validateBatchOptions Internal Method

validateBatchOptions gets the desired KafkaOffsetRangeLimit for the startingoffsets option in the input caseInsensitiveParams and with EarliestOffsetRangeLimit as the default KafkaOffsetRangeLimit.

validateBatchOptions then matches the returned KafkaOffsetRangeLimit as follows (see the sketch after this list):

  1. EarliestOffsetRangeLimit is acceptable and validateBatchOptions simply does nothing

  2. LatestOffsetRangeLimit is not acceptable and validateBatchOptions throws an IllegalArgumentException:

  3. SpecificOffsetRangeLimit is acceptable unless one of the offsets is -1L for which validateBatchOptions throws an IllegalArgumentException:
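A sketch of the three cases above (the exception messages are paraphrased, not the exact text Spark uses):

  getKafkaOffsetRangeLimit(caseInsensitiveParams, "startingoffsets", EarliestOffsetRangeLimit) match {
    case EarliestOffsetRangeLimit =>
      // acceptable, nothing to do
    case LatestOffsetRangeLimit =>
      throw new IllegalArgumentException(
        "starting offsets can't be latest for batch queries on Kafka")  // paraphrased
    case SpecificOffsetRangeLimit(partitionOffsets) =>
      partitionOffsets.foreach {
        case (tp, offset) if offset == -1L =>  // -1L denotes latest
          throw new IllegalArgumentException(
            s"startingOffsets for $tp can't be latest for batch queries on Kafka")  // paraphrased
        case _ => // acceptable
      }
  }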

Note
validateBatchOptions is used exclusively when KafkaSourceProvider is requested to create a BaseRelation (as a RelationProvider).

Writing DataFrame to Kafka Topic — createRelation Method (from CreatableRelationProvider)

Note
createRelation is part of the CreatableRelationProvider Contract to write the rows of a structured query (a DataFrame) to an external data source.

createRelation gets the topic option from the input parameters.

createRelation gets the Kafka-specific options for writing from the input parameters.

createRelation then uses the KafkaWriter helper object to write the rows of the DataFrame to the Kafka topic.

In the end, createRelation creates a fake BaseRelation that simply throws an UnsupportedOperationException for all its methods.

createRelation supports Append and ErrorIfExists only. createRelation throws an AnalysisException for the other save modes:

sourceSchema Method

sourceSchema…​FIXME

Note
sourceSchema is part of Structured Streaming’s StreamSourceProvider Contract.

Getting Desired KafkaOffsetRangeLimit (for Offset Option) — getKafkaOffsetRangeLimit Object Method

getKafkaOffsetRangeLimit tries to find the given offsetOptionKey in the input params and converts the value found to a KafkaOffsetRangeLimit (see the sketch below).

When the input offsetOptionKey was not found, getKafkaOffsetRangeLimit returns the input defaultOffsets.
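A hedged sketch of the conversion, assuming latest and earliest are the two special values and anything else is parsed as JSON-encoded per-partition offsets (JsonUtils.partitionOffsets names the JSON helper and is an assumption here):

  def getKafkaOffsetRangeLimit(
      params: Map[String, String],
      offsetOptionKey: String,
      defaultOffsets: KafkaOffsetRangeLimit): KafkaOffsetRangeLimit =
    params.get(offsetOptionKey).map(_.trim) match {
      case Some("latest")   => LatestOffsetRangeLimit
      case Some("earliest") => EarliestOffsetRangeLimit
      case Some(json)       => SpecificOffsetRangeLimit(JsonUtils.partitionOffsets(json))
      case None             => defaultOffsets
    }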

Note

getKafkaOffsetRangeLimit is used when:

Getting ConsumerStrategy per Subscription Strategy Option — strategy Internal Method

strategy finds one of the strategy options: subscribe, subscribepattern and assign.

For assign, strategy uses the JsonUtils helper object to deserialize TopicPartitions from JSON (e.g. {"topicA":[0,1],"topicB":[0,1]}) and returns a new AssignStrategy.

For subscribe, strategy splits the value by , (comma) and returns a new SubscribeStrategy.

For subscribepattern, strategy returns a new SubscribePatternStrategy
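For reference, the three options as they could appear in a query (the topic names and the pattern are placeholders; spark is a SparkSession as in spark-shell):

  spark.read.format("kafka").option("subscribe", "topic1,topic2")                     // SubscribeStrategy
  spark.read.format("kafka").option("subscribepattern", "topic.*")                    // SubscribePatternStrategy
  spark.read.format("kafka").option("assign", """{"topicA":[0,1],"topicB":[0,1]}""")  // AssignStrategy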

Note

strategy is used when:

  • KafkaSourceProvider is requested to create a BaseRelation (as a RelationProvider)

  • (Spark Structured Streaming) KafkaSourceProvider is requested to createSource and createContinuousReader

failOnDataLoss Internal Method

failOnDataLoss…​FIXME

Note
failOnDataLoss is used when KafkaSourceProvider is requested to create a BaseRelation (and also in createSource and createContinuousReader for Spark Structured Streaming).

Setting Kafka Configuration Parameters for Driver — kafkaParamsForDriver Object Method

kafkaParamsForDriver simply sets the additional Kafka configuration parameters for the driver.

Table 1. Driver’s Kafka Configuration Parameters

key.deserializer (KEY_DESERIALIZER_CLASS_CONFIG)
  Value: org.apache.kafka.common.serialization.ByteArrayDeserializer
  Deserializer class for keys that implements the Kafka Deserializer interface.

value.deserializer (VALUE_DESERIALIZER_CLASS_CONFIG)
  Value: org.apache.kafka.common.serialization.ByteArrayDeserializer
  Deserializer class for values that implements the Kafka Deserializer interface.

auto.offset.reset (AUTO_OFFSET_RESET_CONFIG)
  Value: earliest
  What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
    • earliest: automatically reset the offset to the earliest offset
    • latest: automatically reset the offset to the latest offset
    • none: throw an exception to the Kafka consumer if no previous offset is found for the consumer’s group
    • anything else: throw an exception to the Kafka consumer

enable.auto.commit (ENABLE_AUTO_COMMIT_CONFIG)
  Value: false
  If true, the Kafka consumer’s offset is periodically committed in the background.

max.poll.records (MAX_POLL_RECORDS_CONFIG)
  Value: 1
  The maximum number of records returned in a single call to Consumer.poll().

receive.buffer.bytes (RECEIVE_BUFFER_CONFIG)
  Value: 65536
  Only set if not set already.
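A sketch of those overrides as a plain map, using Kafka's ConsumerConfig constants (the actual code applies them through an internal ConfigUpdater, and sets receive.buffer.bytes only when it is not already set):

  import org.apache.kafka.clients.consumer.ConsumerConfig
  import org.apache.kafka.common.serialization.ByteArrayDeserializer

  val driverKafkaOverrides = Map[String, Object](
    ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG   -> classOf[ByteArrayDeserializer].getName,
    ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[ByteArrayDeserializer].getName,
    ConsumerConfig.AUTO_OFFSET_RESET_CONFIG        -> "earliest",
    ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG       -> "false",
    ConsumerConfig.MAX_POLL_RECORDS_CONFIG         -> "1"
    // receive.buffer.bytes (RECEIVE_BUFFER_CONFIG) -> "65536", only if not set already
  )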

Tip

Enable DEBUG logging level for org.apache.spark.sql.kafka010.KafkaSourceProvider.ConfigUpdater logger to see updates of Kafka configuration parameters.

Add the following line to conf/log4j.properties:
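  log4j.logger.org.apache.spark.sql.kafka010.KafkaSourceProvider.ConfigUpdater=DEBUG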

Refer to Logging.

Note

kafkaParamsForDriver is used when:

kafkaParamsForExecutors Object Method

kafkaParamsForExecutors…​FIXME

Note
kafkaParamsForExecutors is used when…​FIXME

kafkaParamsForProducer Object Method

kafkaParamsForProducer…​FIXME

Note
kafkaParamsForProducer is used when…​FIXME


Kafka Data Source Options

Table 1. Kafka Data Source Options

assign
  One of the three subscription strategy options (with subscribe and subscribepattern).
  See KafkaSourceProvider.strategy

endingoffsets

failondataloss

kafkaConsumer.pollTimeoutMs
  See kafkaConsumer.pollTimeoutMs

startingoffsets

subscribe
  One of the three subscription strategy options (with subscribepattern and assign).
  See KafkaSourceProvider.strategy

subscribepattern
  One of the three subscription strategy options (with subscribe and assign).
  See KafkaSourceProvider.strategy

topic
  Required for writing a DataFrame to Kafka.
  Used when:


Kafka Data Source

Spark SQL supports reading data from or writing data to one or more topics in Apache Kafka.

Note

Apache Kafka is a distributed storage of records that is format-independent, fault-tolerant and durable.

Read up on Apache Kafka in the official documentation or in my other gitbook Mastering Apache Kafka.

Kafka Data Source supports options that can improve the performance of structured queries that use it.

Reading Data from Kafka Topics

As a Spark developer, you use DataFrameReader.format method to specify Apache Kafka as the external data source to load data from.

You use kafka (or org.apache.spark.sql.kafka010.KafkaSourceProvider) as the input data source format.
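For example, a minimal sketch (the broker address and topic name are placeholders; spark is a SparkSession as in spark-shell):

  val records = spark
    .read
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
    .option("subscribe", "topic1")                         // placeholder topic
    .load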

A query like this creates a DataFrame that represents the distributed process of loading data from one or many Kafka topics (with additional properties).

Writing Data to Kafka Topics

As a Spark developer,…​FIXME
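For now, a hedged sketch of a batch write to a Kafka topic (the value column must be string or binary; the broker address and topic name are placeholders):

  records
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
    .option("topic", "topic1")                             // placeholder topic
    .save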


JsonDataSource

Caution
FIXME


TextFileFormat

TextFileFormat is a TextBasedFileFormat for text format.

TextFileFormat uses text options while loading a dataset.

Table 1. TextFileFormat’s Options

compression
  Compression codec that can be either one of the known aliases or a fully-qualified class name.

wholetext (default: false)
  Enables loading a file as a single row (i.e. not splitting by "\n")
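For example, a sketch of the wholetext option in use (the path is a placeholder):

  // Each file becomes a single row instead of one row per line.
  val oneRowPerFile = spark
    .read
    .option("wholetext", true)
    .text("/tmp/text-files")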

prepareWrite Method

Note
prepareWrite is part of FileFormat Contract that is used when FileFormatWriter is requested to write the result of a structured query.

prepareWrite…​FIXME

Building Partitioned Data Reader — buildReader Method

Note
buildReader is part of FileFormat Contract to…​FIXME

buildReader…​FIXME

readToUnsafeMem Internal Method

readToUnsafeMem…​FIXME

Note
readToUnsafeMem is used exclusively when TextFileFormat is requested to buildReader


JsonFileFormat — Built-In Support for Files in JSON Format

JsonFileFormat is a TextBasedFileFormat for json format (i.e. registers itself to handle files in json format and convert them to Spark SQL rows).

JsonFileFormat comes with options to further customize JSON parsing.

Note
JsonFileFormat uses Jackson 2.6.7 as the JSON parser library and some options map directly to Jackson’s internal options (as JsonParser.Feature).
Table 1. JsonFileFormat’s Options

allowBackslashEscapingAnyCharacter (default: false)
  Note: internally, allowBackslashEscapingAnyCharacter becomes JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER.

allowComments (default: false)
  Note: internally, allowComments becomes JsonParser.Feature.ALLOW_COMMENTS.

allowNonNumericNumbers (default: true)
  Note: internally, allowNonNumericNumbers becomes JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS.

allowNumericLeadingZeros (default: false)
  Note: internally, allowNumericLeadingZeros becomes JsonParser.Feature.ALLOW_NUMERIC_LEADING_ZEROS.

allowSingleQuotes (default: true)
  Note: internally, allowSingleQuotes becomes JsonParser.Feature.ALLOW_SINGLE_QUOTES.

allowUnquotedControlChars (default: false)
  Note: internally, allowUnquotedControlChars becomes JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS.

allowUnquotedFieldNames (default: false)
  Note: internally, allowUnquotedFieldNames becomes JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES.

columnNameOfCorruptRecord

compression
  Compression codec that can be either one of the known aliases or a fully-qualified class name.

dateFormat (default: yyyy-MM-dd)
  Date format
  Note: internally, dateFormat is converted to Apache Commons Lang’s FastDateFormat.

multiLine (default: false)
  Controls whether…FIXME

mode (default: PERMISSIVE)
  Case-insensitive name of the parse mode:
    • PERMISSIVE
    • DROPMALFORMED
    • FAILFAST

prefersDecimal (default: false)

primitivesAsString (default: false)

samplingRatio (default: 1.0)

timestampFormat (default: yyyy-MM-dd'T'HH:mm:ss.SSSXXX)
  Timestamp format
  Note: internally, timestampFormat is converted to Apache Commons Lang’s FastDateFormat.

timeZone
  Java’s TimeZone
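For example, a sketch with a few of the options above (the path and option values are placeholders):

  val people = spark
    .read
    .option("multiLine", true)
    .option("mode", "PERMISSIVE")
    .option("dateFormat", "yyyy-MM-dd")
    .json("/tmp/people.json")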

isSplitable Method

Note
isSplitable is part of FileFormat Contract.

isSplitable…​FIXME

inferSchema Method

Note
inferSchema is part of FileFormat Contract.

inferSchema…​FIXME

Building Partitioned Data Reader — buildReader Method

Note
buildReader is part of the FileFormat Contract to build a PartitionedFile reader.

buildReader…​FIXME

Preparing Write Job — prepareWrite Method

Note
prepareWrite is part of the FileFormat Contract to prepare a write job.

prepareWrite…​FIXME


CSVFileFormat

CSVFileFormat is a TextBasedFileFormat for csv format (i.e. registers itself to handle files in csv format and converts them to Spark SQL rows).

CSVFileFormat uses CSV options (that in turn are used to configure the underlying CSV parser from uniVocity-parsers project).

Table 1. CSVFileFormat’s Options

charset (default: UTF-8)
  Alias of encoding

charToEscapeQuoteEscaping (default: \\)
  One character to…FIXME

codec
  Compression codec that can be either one of the known aliases or a fully-qualified class name.
  Alias of compression

columnNameOfCorruptRecord

comment (default: \u0000)

compression
  Compression codec that can be either one of the known aliases or a fully-qualified class name.
  Alias of codec

dateFormat (default: yyyy-MM-dd)
  Uses en_US locale

delimiter (default: , (comma))
  Alias of sep

encoding (default: UTF-8)
  Alias of charset

escape (default: \\)

escapeQuotes (default: true)

header

ignoreLeadingWhiteSpace (default: false for reading, true for writing)

ignoreTrailingWhiteSpace (default: false for reading, true for writing)

inferSchema

maxCharsPerColumn (default: -1)

maxColumns (default: 20480)

mode (default: PERMISSIVE)
  Possible values:
    • DROPMALFORMED
    • PERMISSIVE (default)
    • FAILFAST

multiLine (default: false)

nanValue (default: NaN)

negativeInf (default: -Inf)

nullValue (default: empty string)

positiveInf (default: Inf)

sep (default: , (comma))
  Alias of delimiter

timestampFormat (default: yyyy-MM-dd'T'HH:mm:ss.SSSXXX)
  Uses timeZone and en_US locale

timeZone (default: spark.sql.session.timeZone)

quote (default: \")

quoteAll (default: false)
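For example, a sketch with a few of the options above (the path and option values are placeholders):

  val cities = spark
    .read
    .option("header", true)
    .option("inferSchema", true)
    .option("sep", ",")
    .csv("/tmp/cities.csv")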

Preparing Write Job — prepareWrite Method

Note
prepareWrite is part of the FileFormat Contract to prepare a write job.

prepareWrite…​FIXME

Building Partitioned Data Reader — buildReader Method

Note
buildReader is part of the FileFormat Contract to build a PartitionedFile reader.

buildReader…​FIXME
