DataSourceRDD — Input RDD Of DataSourceV2ScanExec Physical Operator

DataSourceRDD is an RDD that is created exclusively when DataSourceV2ScanExec physical operator is requested for the input RDD (when WholeStageCodegenExec physical operator is executed).

DataSourceRDD uses DataSourceRDDPartition partitions.

Requesting Preferred Locations Of DataReaderFactory (For Partition) — getPreferredLocations Method

Note
getPreferredLocations is part of Spark Core’s RDD Contract to specify placement preferences, i.e. the preferred locations (host names) to compute a partition on.

getPreferredLocations simply requests the preferred locations of the DataReaderFactory of the input DataSourceRDDPartition partition.
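The method boils down to a one-liner (a sketch modeled on the Spark 2.3 sources):

    override def getPreferredLocations(split: Partition): Seq[String] =
      // Delegate to the DataReaderFactory of the DataSourceRDDPartition
      split.asInstanceOf[DataSourceRDDPartition[T]].readerFactory.preferredLocations()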

getPartitions Method

Note
getPartitions is part of Spark Core’s RDD Contract to specify the set of partitions of the RDD.

getPartitions simply creates a DataSourceRDDPartition for every DataReaderFactory in the readerFactories.
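A sketch of the method (modeled on the Spark 2.3 sources):

    override protected def getPartitions: Array[Partition] =
      // One DataSourceRDDPartition per DataReaderFactory, indexed from 0
      readerFactories.zipWithIndex.map {
        case (readerFactory, index) => new DataSourceRDDPartition(index, readerFactory)
      }.toArray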

Creating DataSourceRDD Instance

DataSourceRDD takes the following when created:

  • SparkContext

  • Reader factories (a collection of DataReaderFactory[T]) — readerFactories
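In code, the declaration comes down to the following (a sketch; in the Spark sources the reader factories may be a Java list rather than a Scala Seq):

    class DataSourceRDD[T: ClassTag](
        sc: SparkContext,
        @transient private val readerFactories: Seq[DataReaderFactory[T]])
      extends RDD[T](sc, Nil)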

Computing Partition (in TaskContext) — compute Method

Note
compute is part of Spark Core’s RDD Contract to compute a partition (in a TaskContext).

compute requests the DataReaderFactory (of the DataSourceRDDPartition partition) to createDataReader.

compute registers a Spark Core TaskCompletionListener that requests the DataReader to close at task completion.

compute returns a Spark Core InterruptibleIterator that wraps an iterator over the rows produced by the DataReader (and supports task cancellation).
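Put together, compute looks roughly as follows (a sketch modeled on the Spark 2.3 sources):

    override def compute(split: Partition, context: TaskContext): Iterator[T] = {
      // Create the DataReader for this partition
      val reader = split.asInstanceOf[DataSourceRDDPartition[T]]
        .readerFactory.createDataReader()
      // Close the reader at task completion
      context.addTaskCompletionListener(_ => reader.close())
      // Adapt the DataReader's next()/get() calls to a Scala Iterator
      val iter = new Iterator[T] {
        private[this] var valuePrepared = false
        override def hasNext: Boolean = {
          if (!valuePrepared) {
            valuePrepared = reader.next()
          }
          valuePrepared
        }
        override def next(): T = {
          if (!hasNext) {
            throw new java.util.NoSuchElementException("End of stream")
          }
          valuePrepared = false
          reader.get()
        }
      }
      new InterruptibleIterator(context, iter)
    }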

RowToUnsafeRowDataReaderFactory

RowToUnsafeRowDataReaderFactory is a DataReaderFactory of UnsafeRows.

RowToUnsafeRowDataReaderFactory is created exclusively when DataSourceV2ScanExec physical operator is requested for reader factories.

preferredLocations Method

Note
preferredLocations is part of the DataReaderFactory Contract to specify the hosts on which the DataReader should preferably run.

preferredLocations simply requests rowReaderFactory for preferredLocations.

createDataReader Method

Note
createDataReader is part of the DataReaderFactory Contract to create a DataReader (that does the actual reading on an executor).

createDataReader requests the rowReaderFactory to createDataReader and creates a RowToUnsafeDataReader with it and a RowEncoder for the schema (resolved and bound).
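A sketch of the method (modeled on the Spark 2.3 sources):

    override def createDataReader(): DataReader[UnsafeRow] =
      // Wrap the row-based reader; the encoder converts every Row to UnsafeRow
      new RowToUnsafeDataReader(
        rowReaderFactory.createDataReader(),
        RowEncoder.apply(schema).resolveAndBind())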

Creating RowToUnsafeRowDataReaderFactory Instance

RowToUnsafeRowDataReaderFactory takes the following when created:

  • DataReaderFactory of Rows — rowReaderFactory

  • Schema of the rows (StructType)

DataReaderFactory

DataReaderFactory is the contract of reader factories that create DataReaders. A reader factory is serialized and sent to an executor, where the DataReader is created to do the actual reading (so the relationship between a DataReaderFactory and a DataReader resembles the one between an Iterable and an Iterator).

Note

DataReaderFactory is an Evolving contract: it is developing towards a stable API but is not stable yet and can change from one feature release to another.

In other words, using the contract is like treading on thin ice.

Table 1. DataReaderFactory Contract

  createDataReader: Creates a DataReader (to do the actual reading). Used when DataSourceRDD is requested to compute a partition and when RowToUnsafeRowDataReaderFactory is requested to createDataReader.
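DataReaderFactory is a Java interface in org.apache.spark.sql.sources.v2.reader. Rendered in Scala, the contract comes down to the following sketch:

    trait DataReaderFactory[T] extends java.io.Serializable {
      // Preferred executor hosts; empty by default, i.e. no location preference
      def preferredLocations(): Array[String] = Array.empty
      // Creates the DataReader that does the actual reading (on an executor)
      def createDataReader(): DataReader[T]
    }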

Specifying Preferred Locations —  preferredLocations Method

preferredLocations defaults to an empty collection of host names, which simply means that the task has no location preference.

Note

preferredLocations is used when:

  • DataSourceRDD is requested for getPreferredLocations

  • RowToUnsafeRowDataReaderFactory is requested for preferredLocations

  • Spark Structured Streaming’s ContinuousDataSourceRDD is requested for getPreferredLocations

SupportsPushDownFilters

SupportsPushDownFilters is the contract for DataSourceReaders that support pushing filters down to the data source (and hence reduce the size of the data to be read).

Note

SupportsPushDownFilters is an Evolving contract: it is developing towards a stable API but is not stable yet and can change from one feature release to another.

In other words, using the contract is like treading on thin ice.

Table 1. SupportsPushDownFilters Contract

  pushFilters: Pushes down the given filters to the data source and returns the filters the data source cannot handle (Spark has to evaluate those after scanning). Used when…​FIXME

  pushedFilters: The filters that have actually been pushed down to the data source. Used when…​FIXME
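Rendered in Scala, the Java interface comes down to the following sketch:

    trait SupportsPushDownFilters extends DataSourceReader {
      // Accepts the filters Spark SQL wants pushed down and returns the ones
      // the data source cannot handle (Spark evaluates those after scanning)
      def pushFilters(filters: Array[Filter]): Array[Filter]
      // The filters that have actually been pushed down to the data source
      def pushedFilters(): Array[Filter]
    }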

DataSourceReader

DataSourceReader is the contract for data source readers that know their read schema and create the DataReaderFactories to scan the data.

Note

DataSourceReader is an Evolving contract: it is developing towards a stable API but is not stable yet and can change from one feature release to another.

In other words, using the contract is like treading on thin ice.

Table 1. DataSourceReader Contract

  readSchema: The actual schema of the data to read (which can differ from the physical schema of the underlying storage, as column pruning or other optimizations may apply). Used when…​FIXME

  createDataReaderFactories: The DataReaderFactories of the scan, one per partition of the output RDD. Used when…​FIXME
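Rendered in Scala, the Java interface comes down to the following sketch:

    trait DataSourceReader {
      // The actual schema of the data to read (column pruning and other
      // optimizations may make it narrower than the physical schema)
      def readSchema(): StructType
      // The reader factories of the scan, one per partition of the output RDD
      def createDataReaderFactories(): java.util.List[DataReaderFactory[Row]]
    }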

Table 2. DataSourceReaders

  ContinuousReader: Used in Spark Structured Streaming

  MicroBatchReader: Used in Spark Structured Streaming

  SupportsPushDownCatalystFilters

  SupportsPushDownFilters

  SupportsPushDownRequiredColumns

  SupportsReportPartitioning

  SupportsReportStatistics

  SupportsScanColumnarBatch

  SupportsScanUnsafeRow

DataSourceV2

DataSourceV2 is the base contract of the Data Source API V2.

DataSourceV2 defines no methods or values and acts as a marker interface.

Note
Implementations should mix in at least one of the interfaces like ReadSupport or WriteSupport. Otherwise it is simply a dummy data source that is un-readable and un-writable.
Note

DataSourceV2 is an Evolving contract: it is developing towards a stable API but is not stable yet and can change from one feature release to another.

In other words, using the contract is like treading on thin ice.
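As an illustration, a minimal (entirely hypothetical) data source mixes DataSourceV2 with ReadSupport and hands Spark a DataSourceReader:

    import java.util.{Collections, List => JList}

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
    import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory, DataSourceReader}
    import org.apache.spark.sql.types.StructType

    // A readable but empty data source: a single-column schema, no partitions
    class MyDataSourceV2 extends DataSourceV2 with ReadSupport {
      override def createReader(options: DataSourceOptions): DataSourceReader =
        new DataSourceReader {
          override def readSchema(): StructType = new StructType().add("id", "long")
          override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
            Collections.emptyList[DataReaderFactory[Row]]()
        }
    }

Such a class could then be used with spark.read.format("org.example.MyDataSourceV2").load() (the package and class names are made up).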

Table 1. DataSourceV2s

  ConsoleSinkProvider: Used in Spark Structured Streaming

  ContinuousReadSupport: Used in Spark Structured Streaming

  MemorySinkV2: Used in Spark Structured Streaming

  MicroBatchReadSupport: Used in Spark Structured Streaming

  RateSourceProvider: Used in Spark Structured Streaming

  RateSourceProviderV2: Used in Spark Structured Streaming

  ReadSupport

  ReadSupportWithSchema

  SessionConfigSupport

  StreamWriteSupport

  WriteSupport

FileRelation Contract

FileRelation is the contract of relations that are backed by files.

Table 1. FileRelation Contract

  inputFiles: The list of files that will be read when scanning the relation. Used exclusively when Dataset is requested for inputFiles.
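For example (assuming a spark-shell session and a made-up path), inputFiles surfaces through the public Dataset API:

    // Dataset.inputFiles delegates to FileRelation.inputFiles
    // for file-based relations such as HadoopFsRelation
    val files = spark.read.parquet("/tmp/people").inputFiles
    files.foreach(println)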

Table 2. FileRelations

  HadoopFsRelation

Data Source Filter Predicate (For Filter Pushdown)

Filter is the contract for filter predicates that can be pushed down to a relation (aka data source).

Filter is used when:

  • BaseRelation is requested for unhandledFilters

  • PrunedFilteredScan is requested to buildScan (with filters)

  • SupportsPushDownFilters is requested to pushFilters

Table 1. Filter Contract

  references: Column references, i.e. the list of column names that are referenced by a filter. Used when And, Or and Not filters compute their own column references, and when findReferences finds the column references in a value.

Table 2. Filters
Filter Description

And

EqualNullSafe

EqualTo

GreaterThan

GreaterThanOrEqual

In

IsNotNull

IsNull

LessThan

LessThanOrEqual

Not

Or

StringContains

StringEndsWith

StringStartsWith
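Filters are plain case classes in org.apache.spark.sql.sources, so they can also be built by hand (Spark normally derives them from Catalyst predicates during planning):

    import org.apache.spark.sql.sources._

    val filter: Filter = And(EqualTo("city", "Warsaw"), GreaterThan("age", 21))
    // references collects the column names from both sides of the And
    assert(filter.references.sameElements(Array("city", "age")))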

Finding Column References in Any Value — findReferences Method

findReferences takes the references of the given value if it is a Filter itself, and returns an empty array otherwise.
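The method is a one-liner (after the Spark sources):

    private[sql] def findReferences(value: Any): Array[String] = value match {
      case f: Filter => f.references // a nested filter: use its references
      case _ => Array.empty          // a plain value: no column references
    }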

Note
findReferences is used when EqualTo, EqualNullSafe, GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual and In filters are requested for their column references.

FileFormatWriter Helper Object

FileFormatWriter is a Scala object that allows for writing the result of a structured query.

Tip

Enable ERROR, INFO or DEBUG logging levels for the org.apache.spark.sql.execution.datasources.FileFormatWriter logger to see what happens inside.

Add the following line to conf/log4j.properties:
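    log4j.logger.org.apache.spark.sql.execution.datasources.FileFormatWriter=DEBUG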

Refer to Logging.

Writing Query Result — write Method

write sets up a write job (with a FileCommitProtocol committer), runs a Spark job that executes executeTask for every partition of the structured query’s result, commits the write job when all tasks finish successfully, and in the end processes the write statistics (processStats).

Note

write is used when:

  • InsertIntoHadoopFsRelationCommand logical command is executed

  • Spark Structured Streaming’s FileStreamSink is requested to add a streaming batch

executeTask Internal Method

executeTask sets up a write task (using the committer), writes out the rows of a single partition, and commits the task.

Note
executeTask is used exclusively when FileFormatWriter is requested to write the result of a structured query.

processStats Internal Method

processStats…​FIXME

Note
processStats is used exclusively when FileFormatWriter is requested to write the result of a structured query.

TableScan Contract — Relations That Scan All Columns

TableScan is the contract of BaseRelations that produce an RDD containing all of their tuples as Row objects. Unlike PrunedScan, TableScan does not support column pruning, so a scan always reads all columns.

Table 1. TableScan Contract

  buildScan: Builds a distributed data scan, i.e. creates an RDD[Row] that represents scanning over all the data in the relation. Used exclusively when DataSourceStrategy execution planning strategy is requested to plan a LogicalRelation with a TableScan.

Note
KafkaRelation is the one and only known implementation of the TableScan Contract in Spark SQL.
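A bare-bones custom relation implementing the contract could look as follows (a hypothetical example, not from the Spark sources):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, TableScan}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // A relation producing ten rows with a single id column
    class TenRowsRelation(override val sqlContext: SQLContext)
        extends BaseRelation with TableScan {

      override def schema: StructType = StructType(StructField("id", LongType) :: Nil)

      // All tuples of the relation as an RDD[Row] (every column, every row)
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.parallelize(0L until 10L).map(Row(_))
    }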
