Avro Data Source

Spark SQL supports structured queries over Avro files as well as over Avro-encoded data in columns of a DataFrame.

Note

Apache Avro is a data serialization format and provides the following features:

  • Language-independent (with language bindings for popular programming languages, e.g. Java, Python)

  • Rich data structures

  • A compact, fast, binary data format (encoding)

  • A container file for sequences of Avro data (aka Avro data files)

  • Remote procedure call (RPC)

  • Optional code generation (optimization) to read or write data files, and implement RPC protocols

Avro data source is provided by the spark-avro external module. You should include it as a dependency in your Spark application (e.g. spark-submit --packages or in build.sbt).

The following shows how to include the spark-avro module in a spark-shell session.
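For example (the module version below is an assumption; it has to match your Spark and Scala versions):

  ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5

  // build.sbt equivalent
  libraryDependencies += "org.apache.spark" %% "spark-avro" % "2.4.5"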

Table 1. Functions for Avro
Name Description

from_avro

Parses an Avro-encoded binary column and converts to a Catalyst value per JSON-encoded Avro schema

to_avro

Converts a column to an Avro-encoded binary column

After the module is loaded, you should import the org.apache.spark.sql.avro package to have the from_avro and to_avro functions available.

Converting Column to Avro-Encoded Binary Column — to_avro Method

to_avro creates a Column with the CatalystDataToAvro unary expression (with the Catalyst expression of the given data column).

Converting Avro-Encoded Column to Catalyst Value — from_avro Method

from_avro creates a Column with the AvroDataToCatalyst unary expression (with the Catalyst expression of the given data column and the jsonFormatSchema JSON-encoded schema).
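A minimal round-trip sketch in spark-shell (assuming the spark-avro module is on the classpath; the column and the Avro schema below are illustrative):

  import org.apache.spark.sql.avro.{from_avro, to_avro}
  import spark.implicits._

  val df = spark.range(3)                          // a single non-nullable column: id: bigint

  // Column => Avro-encoded binary column (CatalystDataToAvro)
  val encoded = df.select(to_avro($"id") as "avro_id")

  // Avro-encoded binary column => Catalyst value (AvroDataToCatalyst)
  // from_avro needs the Avro schema as a JSON-encoded string
  val jsonFormatSchema = """{"type": "long"}"""
  val decoded = encoded.select(from_avro($"avro_id", jsonFormatSchema) as "id")
  decoded.show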

JsonUtils Helper Object

JsonUtils is a Scala object with methods for serializing and deserializing Kafka TopicPartitions to and from a single JSON text.

JsonUtils uses the json4s library, which provides a single AST, with the Jackson parser for parsing to the AST (via the json4s-jackson module).

Table 1. JsonUtils API
Name Description

partitionOffsets

Deserializing partition offsets (i.e. offsets per Kafka TopicPartition) from JSON, e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}

partitionOffsets

Serializing partition offsets (i.e. offsets per Kafka TopicPartition) to JSON

partitions

Deserializing TopicPartitions from JSON, e.g. {"topicA":[0,1],"topicB":[0,1]}

partitions

Serializing TopicPartitions to JSON

Deserializing Partition Offsets From JSON — partitionOffsets Method

partitionOffsets…​FIXME

Note

partitionOffsets is used when:

Serializing Partition Offsets to JSON — partitionOffsets Method

partitionOffsets…​FIXME

Note
partitionOffsets is used when…​FIXME

Serializing TopicPartitions to JSON — partitions Method

partitions…​FIXME

Note
partitions seems not to be used.

Deserializing TopicPartitions from JSON — partitions Method

partitions uses json4s-jackson's Serialization object to read a Map[String, Seq[Int]] from the input string that represents a Map of topics and partition numbers, e.g. {"topicA":[0,1],"topicB":[0,1]}.

For every pair of topic and partition number, partitions creates a new Kafka TopicPartition.

In case of any parsing issues, partitions throws a new IllegalArgumentException.
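A minimal sketch of that deserialization under the json4s API described above (the method body is an illustration, not the exact implementation):

  import org.apache.kafka.common.TopicPartition
  import org.json4s.NoTypeHints
  import org.json4s.jackson.Serialization
  import scala.util.control.NonFatal

  implicit val formats = Serialization.formats(NoTypeHints)

  // {"topicA":[0,1],"topicB":[0,1]} => Array(topicA-0, topicA-1, topicB-0, topicB-1)
  def partitions(str: String): Array[TopicPartition] =
    try {
      Serialization.read[Map[String, Seq[Int]]](str).flatMap { case (topic, parts) =>
        parts.map(new TopicPartition(topic, _))
      }.toArray
    } catch {
      case NonFatal(_) =>
        throw new IllegalArgumentException(
          s"""Expected e.g. {"topicA":[0,1],"topicB":[0,1]}, got $str""")
    }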

Note
partitions is used exclusively when KafkaSourceProvider is requested for a ConsumerStrategy (given assign option).

KafkaWriteTask

KafkaWriteTask is used to write rows (from a structured query) to Apache Kafka.

KafkaWriteTask is created exclusively when KafkaWriter is requested to write the rows of a structured query to a Kafka topic.

KafkaWriteTask writes keys and values in their binary format (as JVM's bytes) and so uses the raw-memory unsafe row format only (i.e. UnsafeRow). That is supposed to save the time otherwise spent reconstructing the rows into tiny JVM objects (i.e. byte arrays).

Table 1. KafkaWriteTask’s Internal Properties
Name Description

callback

failedWrite

projection

UnsafeProjection

Created once when KafkaWriteTask is created.

Writing Rows to Kafka Asynchronously — execute Method

execute uses Apache Kafka’s Producer API to create a KafkaProducer and ProducerRecord for every row in iterator, and sends the rows to Kafka in batches asynchronously.

Internally, execute creates a KafkaProducer using Array[Byte] for the keys and values, and producerConfiguration for the producer’s configuration.

Note
execute creates a single KafkaProducer for all rows.

For every row in the iterator, execute uses the internal UnsafeProjection to project (i.e. convert) the row in the internal binary row format to an UnsafeRow object, and takes the 0th, 1st and 2nd fields as the topic, key and value, respectively.

execute then creates a ProducerRecord and sends it to Kafka (using the KafkaProducer). execute registers an asynchronous Callback to monitor the writing.

Note

The send() method is asynchronous. When called it adds the record to a buffer of pending record sends and immediately returns. This allows the producer to batch together individual records for efficiency.
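A simplified, self-contained sketch of that loop (the method shape and the producerConfiguration, projection and failedWrite names follow the tables above; this is an illustration, not the exact implementation):

  import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.UnsafeProjection

  def writeRows(
      producerConfiguration: java.util.Map[String, Object],
      projection: UnsafeProjection,
      iterator: Iterator[InternalRow]): Unit = {
    var failedWrite: Exception = null
    // a single KafkaProducer (keys and values as Array[Byte]) for all rows
    val producer = new KafkaProducer[Array[Byte], Array[Byte]](producerConfiguration)
    val callback = new Callback {
      override def onCompletion(metadata: RecordMetadata, e: Exception): Unit =
        if (failedWrite == null && e != null) failedWrite = e  // remember the first failure
    }
    try {
      while (iterator.hasNext && failedWrite == null) {
        val row = projection(iterator.next())  // UnsafeRow: 0 = topic, 1 = key, 2 = value
        val record = new ProducerRecord[Array[Byte], Array[Byte]](
          row.getUTF8String(0).toString, row.getBinary(1), row.getBinary(2))
        producer.send(record, callback)        // asynchronous, batched send
      }
    } finally {
      producer.flush()
      producer.close()
      if (failedWrite != null) throw failedWrite
    }
  }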

Creating UnsafeProjection — createProjection Internal Method

createProjection creates a UnsafeProjection with topic, key and value expressions and the inputSchema.

createProjection makes sure that the following holds (and reports an IllegalStateException otherwise):

  • topic was defined (either as the input topic or in inputSchema) and is of type StringType

  • Optional key is of type StringType or BinaryType if defined

  • value was defined (in inputSchema) and is of type StringType or BinaryType

createProjection casts key and value expressions to BinaryType in UnsafeProjection.

Note
createProjection is used exclusively when KafkaWriteTask is created (as projection).
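A hypothetical sketch of the projection (the attribute lookups and error messages are illustrative; the type checks from the list above are omitted for brevity):

  import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, Literal, UnsafeProjection}
  import org.apache.spark.sql.types.BinaryType

  def createProjection(inputSchema: Seq[Attribute], topic: Option[String]): UnsafeProjection = {
    val topicExpr = topic.map(Literal(_))
      .orElse(inputSchema.find(_.name == "topic"))
      .getOrElse(throw new IllegalStateException(
        "topic option required when no 'topic' attribute is present"))
    val keyExpr = inputSchema.find(_.name == "key").getOrElse(Literal(null, BinaryType))
    val valueExpr = inputSchema.find(_.name == "value").getOrElse(
      throw new IllegalStateException("Required attribute 'value' not found"))

    // key and value are cast to BinaryType; topic stays a string expression
    UnsafeProjection.create(
      Seq(topicExpr, Cast(keyExpr, BinaryType), Cast(valueExpr, BinaryType)),
      inputSchema)
  }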

close Method

close…​FIXME

Note
close is used when…​FIXME

Creating KafkaWriteTask Instance

KafkaWriteTask takes the following when created:

  • Kafka Producer configuration (as Map[String, Object])

  • Input schema (as Seq[Attribute])

  • Topic name

KafkaWriteTask initializes the internal registries and counters.

KafkaWriter Helper Object — Writing Structured Queries to Kafka

KafkaWriter is a Scala object that is used to write the rows of a batch (or a streaming) structured query to Apache Kafka.

Figure 1. KafkaWriter (write) in web UI

KafkaWriter validates that the schema of a structured query contains the following columns (output schema attributes); a usage example follows the list:

  • Either a topic column of type StringType or the topic option is defined

  • Optional key of type StringType or BinaryType

  • Required value of type StringType or BinaryType
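For illustration, a batch query that satisfies these requirements (the broker address and topic name are made up):

  import spark.implicits._

  val df = Seq((0, "zero"), (1, "one")).toDF("id", "name")

  df.selectExpr("CAST(id AS STRING) AS key", "CAST(name AS STRING) AS value")
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "demo-topic")   // topic option instead of a topic column
    .save()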

Writing Rows of Structured Query to Kafka Topic — write Method

write gets the output schema of the analyzed logical plan of the input QueryExecution.

In the end, write requests the QueryExecution for the RDD[InternalRow] (that represents the structured query as an RDD) and executes the following function (sketched after the list) on every partition of the RDD (using the RDD.foreachPartition operation):

  1. Creates a KafkaWriteTask (for the input kafkaParameters, the schema and the input topic)

  2. Requests the KafkaWriteTask to write the rows (of the partition) to Kafka topic

  3. Requests the KafkaWriteTask to close
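A sketch of those steps (queryExecution, kafkaParameters and topic follow the names used above; schema validation and error handling are omitted):

  val schema = queryExecution.analyzed.output          // output schema attributes
  queryExecution.toRdd.foreachPartition { iter =>
    val writeTask = new KafkaWriteTask(kafkaParameters, schema, topic)
    try {
      writeTask.execute(iter)                          // write the partition's rows to Kafka
    } finally {
      writeTask.close()
    }
  }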

Note

write is used when:

Validating Schema (Attributes) of Structured Query and Topic Option Availability — validateQuery Method

validateQuery makes sure that the following attributes are in the input schema (or their alternatives) and of the right data types:

  • Either a topic attribute of type StringType or the topic option is defined

  • If key attribute is defined it is of type StringType or BinaryType

  • value attribute is of type StringType or BinaryType

If any of the requirements are not met, validateQuery throws an AnalysisException.

Note

validateQuery is used when:

InternalKafkaConsumer

InternalKafkaConsumer is…​FIXME

Getting Single Kafka ConsumerRecord — get Method

get…​FIXME

Note
get is used when…​FIXME

Getting Single AvailableOffsetRange — getAvailableOffsetRange Method

getAvailableOffsetRange…​FIXME

Note
getAvailableOffsetRange is used when…​FIXME

KafkaDataConsumer Contract

KafkaDataConsumer is the contract for KafkaDataConsumers that use an InternalKafkaConsumer for the following:

KafkaDataConsumer has to be released explicitly.

Table 1. KafkaDataConsumer Contract
Property Description

internalConsumer

Used when:

release

Used when:

  • KafkaSourceRDD is requested to compute a partition

  • (Spark Structured Streaming) KafkaContinuousDataReader is requested to close

Table 2. KafkaDataConsumers
KafkaDataConsumer Description

CachedKafkaDataConsumer

NonCachedKafkaDataConsumer

Note
KafkaDataConsumer is a Scala sealed trait which means that all the implementations are in the same compilation unit (a single file).
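A simplified sketch of the contract and the delegating get and getAvailableOffsetRange methods described below (the parameter lists and the AvailableOffsetRange shape are illustrative assumptions, not the exact signatures):

  import org.apache.kafka.clients.consumer.ConsumerRecord

  // illustrative shapes; the real ones live in org.apache.spark.sql.kafka010
  case class AvailableOffsetRange(earliest: Long, latest: Long)

  trait InternalKafkaConsumerApi {
    def get(offset: Long, untilOffset: Long, pollTimeoutMs: Long,
        failOnDataLoss: Boolean): ConsumerRecord[Array[Byte], Array[Byte]]
    def getAvailableOffsetRange(): AvailableOffsetRange
  }

  sealed trait KafkaDataConsumer {
    protected def internalConsumer: InternalKafkaConsumerApi
    def release(): Unit

    // both operations simply delegate to the underlying InternalKafkaConsumer
    def get(offset: Long, untilOffset: Long, pollTimeoutMs: Long,
        failOnDataLoss: Boolean): ConsumerRecord[Array[Byte], Array[Byte]] =
      internalConsumer.get(offset, untilOffset, pollTimeoutMs, failOnDataLoss)

    def getAvailableOffsetRange(): AvailableOffsetRange =
      internalConsumer.getAvailableOffsetRange()
  }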

Getting Single Kafka ConsumerRecord — get Method

Note

get is used when:

  • KafkaSourceRDD is requested to compute a partition

  • (Spark Structured Streaming) KafkaContinuousDataReader is requested to next

Getting Single AvailableOffsetRange — getAvailableOffsetRange Method

getAvailableOffsetRange simply requests the InternalKafkaConsumer to get a single AvailableOffsetRange.

Note

getAvailableOffsetRange is used when:

KafkaOffsetRangeLimit

KafkaOffsetRangeLimit is the desired offset range limits for starting, ending, and specific offsets.

Table 1. KafkaOffsetRangeLimits
KafkaOffsetRangeLimit Description

EarliestOffsetRangeLimit

Bind to the earliest offset

LatestOffsetRangeLimit

Bind to the latest offset

SpecificOffsetRangeLimit

Bind to specific offsets

Takes partitionOffsets (as Map[TopicPartition, Long]) when created.

KafkaOffsetRangeLimit is “created” (i.e. mapped to from a human-readable text representation) when KafkaSourceProvider is requested to getKafkaOffsetRangeLimit.

KafkaOffsetRangeLimit defines two constants to denote offset range limits that are resolved via Kafka (see the read example after the list):

  • -1L for the latest offset

  • -2L for the earliest offset
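For illustration, a batch read where the special values appear in per-partition offset JSON (the broker address and topics are made up; -2 binds to the earliest and -1 to the latest offset):

  val df = spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "topicA,topicB")
    .option("startingOffsets", """{"topicA":{"0":23,"1":-2},"topicB":{"0":-2}}""")
    .option("endingOffsets", """{"topicA":{"0":-1,"1":-1},"topicB":{"0":-1}}""")
    .load()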

Note
KafkaOffsetRangeLimit is a Scala sealed trait which means that all the extensions are in the same compilation unit (a single file).

KafkaOffsetReader

KafkaOffsetReader is used to query a Kafka cluster for partition offsets.

KafkaOffsetReader is created when:

When requested for the human-readable text representation (aka toString), KafkaOffsetReader simply requests the ConsumerStrategy for one.

Table 1. KafkaOffsetReader’s Options
Name Default Value Description

fetchOffset.numRetries

3

fetchOffset.retryIntervalMs

1000

How long (in milliseconds) to wait before retrying a failed offset fetch.
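A usage sketch of how these options would be set on the kafka source (the broker address and topic are made up; the values override the defaults above):

  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "topicA")
    .option("fetchOffset.numRetries", "5")
    .option("fetchOffset.retryIntervalMs", "2000")
    .load()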

Table 2. KafkaOffsetReader’s Internal Registries and Counters
Name Description

consumer

Kafka’s Consumer (with keys and values of Array[Byte] type)

Initialized when KafkaOffsetReader is created.

Used when KafkaOffsetReader:

execContext

groupId

kafkaReaderThread

maxOffsetFetchAttempts

nextId

offsetFetchAttemptIntervalMs

Tip

Enable INFO or DEBUG logging levels for org.apache.spark.sql.kafka010.KafkaOffsetReader to see what happens inside.

Add the following line to conf/log4j.properties:
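  log4j.logger.org.apache.spark.sql.kafka010.KafkaOffsetReader=DEBUG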

Refer to Logging.

Creating Kafka Consumer — createConsumer Internal Method

Note
createConsumer is used when KafkaOffsetReader is created (to initialize the consumer) and when resetConsumer is requested.

Creating KafkaOffsetReader Instance

KafkaOffsetReader takes the following when created:

  • ConsumerStrategy

  • Kafka parameters (as Map[String, Object])

  • Reader options (as Map[String, String])

  • Prefix for the group id

KafkaOffsetReader initializes the internal registries and counters.

close Method

close…​FIXME

Note
close is used when…​FIXME

fetchEarliestOffsets Method

fetchEarliestOffsets…​FIXME

Note
fetchEarliestOffsets is used when…​FIXME

fetchLatestOffsets Method

fetchLatestOffsets…​FIXME

Note
fetchLatestOffsets is used when…​FIXME

Fetching (and Pausing) Assigned Kafka TopicPartitions — fetchTopicPartitions Method

fetchTopicPartitions uses an UninterruptibleThread thread to do the following:

  1. Requests the Kafka Consumer to poll (fetch data) for the topics and partitions (with 0 timeout)

  2. Requests the Kafka Consumer to get the set of partitions currently assigned

  3. Requests the Kafka Consumer to suspend fetching from the partitions assigned

In the end, fetchTopicPartitions returns the TopicPartitions assigned (and paused).
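A sketch of that sequence, assuming a consumer: Consumer[Array[Byte], Array[Byte]] is in scope:

  import scala.collection.JavaConverters._

  consumer.poll(0)                            // 1. poll (fetch data) with a 0 timeout
  val partitions = consumer.assignment()      // 2. partitions currently assigned
  consumer.pause(partitions)                  // 3. suspend fetching from them
  partitions.asScala.toSet                    // the TopicPartitions assigned (and paused)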

Note
fetchTopicPartitions is used exclusively when KafkaRelation is requested to build a distributed data scan with column pruning (as a TableScan) through getPartitionOffsets.

nextGroupId Internal Method

nextGroupId…​FIXME

Note
nextGroupId is used when…​FIXME

resetConsumer Internal Method

resetConsumer…​FIXME

Note
resetConsumer is used when…​FIXME

runUninterruptibly Internal Method

runUninterruptibly…​FIXME

Note
runUninterruptibly is used when…​FIXME

withRetriesWithoutInterrupt Internal Method

withRetriesWithoutInterrupt…​FIXME

Note
withRetriesWithoutInterrupt is used when…​FIXME

ConsumerStrategy Contract — Kafka Consumer Providers

ConsumerStrategy is the contract for Kafka Consumer providers that can create a Kafka Consumer given Kafka parameters.

Table 1. ConsumerStrategy Contract
Property Description

createConsumer

Creates a Kafka Consumer (of keys and values of type Array[Byte])

Used exclusively when KafkaOffsetReader is requested to create a Kafka Consumer

Table 2. ConsumerStrategies
ConsumerStrategy createConsumer

AssignStrategy

Uses KafkaConsumer.assign(Collection<TopicPartition> partitions)

SubscribeStrategy

Uses KafkaConsumer.subscribe(Collection<String> topics)

SubscribePatternStrategy

Uses KafkaConsumer.subscribe(Pattern pattern, ConsumerRebalanceListener listener) with the topic subscription regex pattern

Tip
Refer to java.util.regex.Pattern for the format of supported topic subscription regex patterns.
Note
ConsumerStrategy is a Scala sealed trait which means that all the implementations are in the same compilation unit (a single file).
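A minimal sketch of the contract with the AssignStrategy implementation from Table 2 (a simplified rendering, not the exact source):

  import java.{util => ju}
  import org.apache.kafka.clients.consumer.{Consumer, KafkaConsumer}
  import org.apache.kafka.common.TopicPartition

  sealed trait ConsumerStrategy {
    // Creates a Kafka Consumer (of keys and values of type Array[Byte])
    def createConsumer(kafkaParams: ju.Map[String, Object]): Consumer[Array[Byte], Array[Byte]]
  }

  case class AssignStrategy(partitions: Array[TopicPartition]) extends ConsumerStrategy {
    override def createConsumer(
        kafkaParams: ju.Map[String, Object]): Consumer[Array[Byte], Array[Byte]] = {
      val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](kafkaParams)
      consumer.assign(ju.Arrays.asList(partitions: _*))
      consumer
    }
  }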
