Kafka Data Source
Note
Apache Kafka stores records in a format-independent and fault-tolerant durable way. Read up on Apache Kafka in the official documentation or in my other gitbook Mastering Apache Kafka.
Kafka Data Source supports options to tune the performance of structured queries that use it.
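For example, one such option in recent Spark versions is minPartitions, which asks the Kafka data source to divide topic partitions into more Spark partitions and so increase read parallelism. The sketch below is illustrative only; the broker address, topic name and option value are assumptions:

```scala
// A minimal sketch, assuming a broker at localhost:9092 and a topic named "topic1"
val records = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1")
  .option("minPartitions", "10") // illustrative value; requests at least 10 Spark partitions
  .load
```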
Reading Data from Kafka Topics
As a Spark developer, you use the DataFrameReader.format method to specify Apache Kafka as the external data source to load data from.
You use kafka (or org.apache.spark.sql.kafka010.KafkaSourceProvider) as the input data source format.
```scala
val kafka = spark.read.format("kafka").load

// Alternatively
val kafka = spark.read.format("org.apache.spark.sql.kafka010.KafkaSourceProvider").load
```
These one-liners create a DataFrame that represents the distributed process of loading data from one or more Kafka topics (with additional properties such as the Kafka brokers to connect to and the topics to load records from).
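In practice, the Kafka data source also requires the kafka.bootstrap.servers option and exactly one of the subscribe, subscribePattern or assign options before load. A minimal batch-read sketch, assuming a broker at localhost:9092 and a topic named topic1, could look as follows:

```scala
import spark.implicits._

val records = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // required: Kafka brokers to connect to
  .option("subscribe", "topic1")                        // required: one of subscribe, subscribePattern, assign
  .load

// Records come with a fixed schema: key and value (binary), topic (string),
// partition (integer), offset (long), timestamp (timestamp), timestampType (integer).
// key and value are binary, so they are usually cast to strings for further processing.
val values = records.select($"value" cast "string")
```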