DataSourceRDD — Input RDD Of DataSourceV2ScanExec Physical Operator

DataSourceRDD is an RDD that is created exclusively when DataSourceV2ScanExec physical operator is requested for the input RDD (when WholeStageCodegenExec physical operator is executed).

DataSourceRDD uses DataSourceRDDPartition partitions.

Requesting Preferred Locations Of DataReaderFactory (For Partition) — getPreferredLocations Method

Note
getPreferredLocations is part of Spark Core’s RDD Contract to specify placement preferences, i.e. the preferred locations (host names) to compute a partition on.

getPreferredLocations simply requests the preferred locations of the DataReaderFactory of the input DataSourceRDDPartition partition.
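The method boils down to a one-liner (a sketch modeled on the Spark 2.3 sources):

    override def getPreferredLocations(split: Partition): Seq[String] =
      // Delegate to the DataReaderFactory of the DataSourceRDDPartition
      split.asInstanceOf[DataSourceRDDPartition[T]].readerFactory.preferredLocations()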

getPartitions Method

Note
getPartitions is part of Spark Core’s RDD Contract to specify the set of partitions of the RDD.

getPartitions simply creates a DataSourceRDDPartition for every DataReaderFactory in the readerFactories.
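A sketch of the method (modeled on the Spark 2.3 sources):

    override protected def getPartitions: Array[Partition] =
      // One DataSourceRDDPartition per DataReaderFactory, indexed from 0
      readerFactories.zipWithIndex.map {
        case (readerFactory, index) => new DataSourceRDDPartition(index, readerFactory)
      }.toArray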

Creating DataSourceRDD Instance

DataSourceRDD takes the following when created:

  • SparkContext

  • Reader factories (a collection of DataReaderFactory[T]) — readerFactories
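In code, the declaration comes down to the following (a sketch; in the Spark sources the reader factories may be a Java list rather than a Scala Seq):

    class DataSourceRDD[T: ClassTag](
        sc: SparkContext,
        @transient private val readerFactories: Seq[DataReaderFactory[T]])
      extends RDD[T](sc, Nil)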

Computing Partition (in TaskContext) — compute Method

Note
compute is part of Spark Core’s RDD Contract to compute a partition (in a TaskContext).

compute requests the DataReaderFactory (of the DataSourceRDDPartition partition) to createDataReader.

compute registers a Spark Core TaskCompletionListener that requests the DataReader to close at task completion.

compute returns a Spark Core InterruptibleIterator that wraps an iterator over the rows produced by the DataReader (and supports task cancellation).
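Put together, compute looks roughly as follows (a sketch modeled on the Spark 2.3 sources):

    override def compute(split: Partition, context: TaskContext): Iterator[T] = {
      // Create the DataReader for this partition
      val reader = split.asInstanceOf[DataSourceRDDPartition[T]]
        .readerFactory.createDataReader()
      // Close the reader at task completion
      context.addTaskCompletionListener(_ => reader.close())
      // Adapt the DataReader's next()/get() calls to a Scala Iterator
      val iter = new Iterator[T] {
        private[this] var valuePrepared = false
        override def hasNext: Boolean = {
          if (!valuePrepared) {
            valuePrepared = reader.next()
          }
          valuePrepared
        }
        override def next(): T = {
          if (!hasNext) {
            throw new java.util.NoSuchElementException("End of stream")
          }
          valuePrepared = false
          reader.get()
        }
      }
      new InterruptibleIterator(context, iter)
    }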

RowToUnsafeRowDataReaderFactory

RowToUnsafeRowDataReaderFactory is a DataReaderFactory of UnsafeRows.

RowToUnsafeRowDataReaderFactory is created exclusively when DataSourceV2ScanExec physical operator is requested for reader factories.

preferredLocations Method

Note
preferredLocations is part of the DataReaderFactory Contract to specify the hosts on which the DataReader should preferably run.

preferredLocations simply requests rowReaderFactory for preferredLocations.

createDataReader Method

Note
createDataReader is part of the DataReaderFactory Contract to create a DataReader (that does the actual reading on an executor).

createDataReader requests the rowReaderFactory to createDataReader and creates a RowToUnsafeDataReader with it and a RowEncoder for the schema (resolved and bound).
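A sketch of the method (modeled on the Spark 2.3 sources):

    override def createDataReader(): DataReader[UnsafeRow] =
      // Wrap the row-based reader; the encoder converts every Row to UnsafeRow
      new RowToUnsafeDataReader(
        rowReaderFactory.createDataReader(),
        RowEncoder.apply(schema).resolveAndBind())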

Creating RowToUnsafeRowDataReaderFactory Instance

RowToUnsafeRowDataReaderFactory takes the following when created:

  • DataReaderFactory of Rows — rowReaderFactory

  • Schema of the rows (StructType)

DataReaderFactory

DataReaderFactory is the contract of reader factories that create DataReaders. A reader factory is serialized and sent to an executor, where the DataReader is created to do the actual reading (so the relationship between a DataReaderFactory and a DataReader resembles the one between an Iterable and an Iterator).

Note

DataReaderFactory is an Evolving contract: it is developing towards a stable API but is not stable yet and can change from one feature release to another.

In other words, using the contract is like treading on thin ice.

Table 1. DataReaderFactory Contract

  createDataReader: Creates a DataReader (to do the actual reading). Used when DataSourceRDD is requested to compute a partition and when RowToUnsafeRowDataReaderFactory is requested to createDataReader.
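DataReaderFactory is a Java interface in org.apache.spark.sql.sources.v2.reader. Rendered in Scala, the contract comes down to the following sketch:

    trait DataReaderFactory[T] extends java.io.Serializable {
      // Preferred executor hosts; empty by default, i.e. no location preference
      def preferredLocations(): Array[String] = Array.empty
      // Creates the DataReader that does the actual reading (on an executor)
      def createDataReader(): DataReader[T]
    }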

Specifying Preferred Locations —  preferredLocations Method

preferredLocations defaults to an empty collection of host names, which simply means that the task has no location preference.

Note

preferredLocations is used when:

  • DataSourceRDD is requested for getPreferredLocations

  • RowToUnsafeRowDataReaderFactory is requested for preferredLocations

  • Spark Structured Streaming’s ContinuousDataSourceRDD is requested for getPreferredLocations

SupportsPushDownFilters

SupportsPushDownFilters is the contract for DataSourceReaders that support pushing filters down to the data source (and hence reduce the size of the data to be read).

Note

SupportsPushDownFilters is an Evolving contract: it is developing towards a stable API but is not stable yet and can change from one feature release to another.

In other words, using the contract is like treading on thin ice.

Table 1. SupportsPushDownFilters Contract

  pushFilters: Pushes down the given filters to the data source and returns the filters the data source cannot handle (Spark has to evaluate those after scanning). Used when…​FIXME

  pushedFilters: The filters that have actually been pushed down to the data source. Used when…​FIXME
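Rendered in Scala, the Java interface comes down to the following sketch:

    trait SupportsPushDownFilters extends DataSourceReader {
      // Accepts the filters Spark SQL wants pushed down and returns the ones
      // the data source cannot handle (Spark evaluates those after scanning)
      def pushFilters(filters: Array[Filter]): Array[Filter]
      // The filters that have actually been pushed down to the data source
      def pushedFilters(): Array[Filter]
    }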

DataSourceReader

DataSourceReader is the contract for data source readers that know their read schema and create the DataReaderFactories to scan the data.

Note

DataSourceReader is an Evolving contract: it is developing towards a stable API but is not stable yet and can change from one feature release to another.

In other words, using the contract is like treading on thin ice.

Table 1. DataSourceReader Contract

  readSchema: The actual schema of the data to read (which can differ from the physical schema of the underlying storage, as column pruning or other optimizations may apply). Used when…​FIXME

  createDataReaderFactories: The DataReaderFactories of the scan, one per partition of the output RDD. Used when…​FIXME
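Rendered in Scala, the Java interface comes down to the following sketch:

    trait DataSourceReader {
      // The actual schema of the data to read (column pruning and other
      // optimizations may make it narrower than the physical schema)
      def readSchema(): StructType
      // The reader factories of the scan, one per partition of the output RDD
      def createDataReaderFactories(): java.util.List[DataReaderFactory[Row]]
    }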

Table 2. DataSourceReaders

  ContinuousReader: Used in Spark Structured Streaming

  MicroBatchReader: Used in Spark Structured Streaming

  SupportsPushDownCatalystFilters

  SupportsPushDownFilters

  SupportsPushDownRequiredColumns

  SupportsReportPartitioning

  SupportsReportStatistics

  SupportsScanColumnarBatch

  SupportsScanUnsafeRow

DataSourceV2

DataSourceV2 is the base contract of the Data Source API V2.

DataSourceV2 defines no methods or values and acts as a marker interface.

Note
Implementations should mix in at least one of the interfaces like ReadSupport or WriteSupport. Otherwise it is simply a dummy data source that is un-readable and un-writable.
Note

DataSourceV2 is an Evolving contract: it is developing towards a stable API but is not stable yet and can change from one feature release to another.

In other words, using the contract is like treading on thin ice.
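As an illustration, a minimal (entirely hypothetical) data source mixes DataSourceV2 with ReadSupport and hands Spark a DataSourceReader:

    import java.util.{Collections, List => JList}

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
    import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory, DataSourceReader}
    import org.apache.spark.sql.types.StructType

    // A readable but empty data source: a single-column schema, no partitions
    class MyDataSourceV2 extends DataSourceV2 with ReadSupport {
      override def createReader(options: DataSourceOptions): DataSourceReader =
        new DataSourceReader {
          override def readSchema(): StructType = new StructType().add("id", "long")
          override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
            Collections.emptyList[DataReaderFactory[Row]]()
        }
    }

Such a class could then be used with spark.read.format("org.example.MyDataSourceV2").load() (the package and class names are made up).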

Table 1. DataSourceV2s

  ConsoleSinkProvider: Used in Spark Structured Streaming

  ContinuousReadSupport: Used in Spark Structured Streaming

  MemorySinkV2: Used in Spark Structured Streaming

  MicroBatchReadSupport: Used in Spark Structured Streaming

  RateSourceProvider: Used in Spark Structured Streaming

  RateSourceProviderV2: Used in Spark Structured Streaming

  ReadSupport

  ReadSupportWithSchema

  SessionConfigSupport

  StreamWriteSupport

  WriteSupport

FileRelation Contract

FileRelation is the contract of relations that are backed by files.

Table 1. FileRelation Contract

  inputFiles: The list of files that will be read when scanning the relation. Used exclusively when Dataset is requested for inputFiles.
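For example (assuming a spark-shell session and a made-up path), inputFiles surfaces through the public Dataset API:

    // Dataset.inputFiles delegates to FileRelation.inputFiles
    // for file-based relations such as HadoopFsRelation
    val files = spark.read.parquet("/tmp/people").inputFiles
    files.foreach(println)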

Table 2. FileRelations

  HadoopFsRelation

Data Source Filter Predicate (For Filter Pushdown)

Filter is the contract for filter predicates that can be pushed down to a relation (aka data source).

Filter is used when:

  • BaseRelation is requested for unhandledFilters

  • PrunedFilteredScan is requested to buildScan (with filters)

  • SupportsPushDownFilters is requested to pushFilters

Table 1. Filter Contract

  references: Column references, i.e. the list of column names that are referenced by a filter. Used when And, Or and Not filters compute their own column references, and when findReferences finds the column references in a value.

Table 2. Filters
Filter Description

And

EqualNullSafe

EqualTo

GreaterThan

GreaterThanOrEqual

In

IsNotNull

IsNull

LessThan

LessThanOrEqual

Not

Or

StringContains

StringEndsWith

StringStartsWith
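Filters are plain case classes in org.apache.spark.sql.sources, so they can also be built by hand (Spark normally derives them from Catalyst predicates during planning):

    import org.apache.spark.sql.sources._

    val filter: Filter = And(EqualTo("city", "Warsaw"), GreaterThan("age", 21))
    // references collects the column names from both sides of the And
    assert(filter.references.sameElements(Array("city", "age")))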

Finding Column References in Any Value — findReferences Method

findReferences takes the references of the given value if it is a Filter itself, and returns an empty array otherwise.
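The method is a one-liner (after the Spark sources):

    private[sql] def findReferences(value: Any): Array[String] = value match {
      case f: Filter => f.references // a nested filter: use its references
      case _ => Array.empty          // a plain value: no column references
    }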

Note
findReferences is used when EqualTo, EqualNullSafe, GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual and In filters are requested for their column references.

FileFormatWriter Helper Object

FileFormatWriter is a Scala object that allows for writing the result of a structured query.

Tip

Enable ERROR, INFO or DEBUG logging levels for the org.apache.spark.sql.execution.datasources.FileFormatWriter logger to see what happens inside.

Add the following line to conf/log4j.properties:
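    log4j.logger.org.apache.spark.sql.execution.datasources.FileFormatWriter=DEBUG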

Refer to Logging.

Writing Query Result — write Method

write sets up a write job (with a FileCommitProtocol committer), runs a Spark job that executes executeTask for every partition of the structured query’s result, commits the write job when all tasks finish successfully, and in the end processes the write statistics (processStats).

Note

write is used when:

  • InsertIntoHadoopFsRelationCommand logical command is executed

  • Spark Structured Streaming’s FileStreamSink is requested to add a streaming batch

executeTask Internal Method

executeTask sets up a write task (using the committer), writes out the rows of a single partition, and commits the task.

Note
executeTask is used exclusively when FileFormatWriter is requested to write the result of a structured query.

processStats Internal Method

processStats…​FIXME

Note
processStats is used exclusively when FileFormatWriter is requested to write the result of a structured query.

TableScan Contract — Relations That Scan All Columns

TableScan is the contract of BaseRelations that produce an RDD containing all of their tuples as Row objects. Unlike PrunedScan, TableScan does not support column pruning, so a scan always reads all columns.

Table 1. TableScan Contract

  buildScan: Builds a distributed data scan, i.e. creates an RDD[Row] that represents scanning over all the data in the relation. Used exclusively when DataSourceStrategy execution planning strategy is requested to plan a LogicalRelation with a TableScan.

Note
KafkaRelation is the one and only known implementation of the TableScan Contract in Spark SQL.
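A bare-bones custom relation implementing the contract could look as follows (a hypothetical example, not from the Spark sources):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, TableScan}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // A relation producing ten rows with a single id column
    class TenRowsRelation(override val sqlContext: SQLContext)
        extends BaseRelation with TableScan {

      override def schema: StructType = StructType(StructField("id", LongType) :: Nil)

      // All tuples of the relation as an RDD[Row] (every column, every row)
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.parallelize(0L until 10L).map(Row(_))
    }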
