SQLConf — Internal Configuration Store-spark技术分享

SQLConf — Internal Configuration Store

SQLConf is an internal key-value configuration store for parameters and hints used in Spark SQL.

Note

SQLConf is an internal part of Spark SQL and is not supposed to be used directly.

Spark SQL configuration is available through RuntimeConfig (the user-facing configuration management interface) that you can access using SparkSession.



scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.conf
org.apache.spark.sql.RuntimeConfig

scala> :type spark

org.apache.spark.sql.SparkSession

scala> :type spark.conf

org.apache.spark.sql.RuntimeConfig

You can access a SQLConf using:

SQLConf.get (preferred) – the SQLConf of the current active SparkSession
SessionState – direct access through SessionState of the SparkSession of your choice (that gives more flexibility on what SparkSession is used that can be different from the current active SparkSession)



import org.apache.spark.sql.internal.SQLConf

// Use type-safe access to configuration properties
// using SQLConf.get.getConf
val parallelFileListingInStatsComputation = SQLConf.get.getConf(SQLConf.PARALLEL_FILE_LISTING_IN_STATS_COMPUTATION)

// or even simpler
SQLConf.get.parallelFileListingInStatsComputation

import org.apache.spark.sql.internal.SQLConf

// Use type-safe access to configuration properties

// using SQLConf.get.getConf

val parallelFileListingInStatsComputation = SQLConf.get.getConf(SQLConf.PARALLEL_FILE_LISTING_IN_STATS_COMPUTATION)

// or even simpler

SQLConf.get.parallelFileListingInStatsComputation

SQLConf offers methods to get, set, unset or clear values of configuration properties, but has also the accessor methods to read the current value of a configuration property or hint.



scala> :type spark
org.apache.spark.sql.SparkSession

// Direct access to the session SQLConf
val sqlConf = spark.sessionState.conf
scala> :type sqlConf
org.apache.spark.sql.internal.SQLConf

scala> println(sqlConf.offHeapColumnVectorEnabled)
false

// Or simply import the conf value
import spark.sessionState.conf

// accessing properties through accessor methods
scala> conf.numShufflePartitions
res1: Int = 200

// Prefer SQLConf.get (over direct access)
import org.apache.spark.sql.internal.SQLConf
val cc = SQLConf.get
scala> cc == conf
res4: Boolean = true

// setting properties using aliases
import org.apache.spark.sql.internal.SQLConf.SHUFFLE_PARTITIONS
conf.setConf(SHUFFLE_PARTITIONS, 2)
scala> conf.numShufflePartitions
res2: Int = 2

// unset aka reset properties to the default value
conf.unsetConf(SHUFFLE_PARTITIONS)
scala> conf.numShufflePartitions
res3: Int = 200

scala> :type spark

org.apache.spark.sql.SparkSession

// Direct access to the session SQLConf

val sqlConf = spark.sessionState.conf

scala> :type sqlConf

org.apache.spark.sql.internal.SQLConf

scala> println(sqlConf.offHeapColumnVectorEnabled)

false

// Or simply import the conf value

import spark.sessionState.conf

// accessing properties through accessor methods

scala> conf.numShufflePartitions

res1: Int = 200

// Prefer SQLConf.get (over direct access)

import org.apache.spark.sql.internal.SQLConf

val cc = SQLConf.get

scala> cc == conf

res4: Boolean = true

// setting properties using aliases

import org.apache.spark.sql.internal.SQLConf.SHUFFLE_PARTITIONS

conf.setConf(SHUFFLE_PARTITIONS, 2)

scala> conf.numShufflePartitions

res2: Int = 2

// unset aka reset properties to the default value

conf.unsetConf(SHUFFLE_PARTITIONS)

scala> conf.numShufflePartitions

res3: Int = 200

Name Parameter Description

adaptiveExecutionEnabled

spark.sql.adaptive.enabled

Used exclusively when EnsureRequirements adds an ExchangeCoordinator (for adaptive query execution)

autoBroadcastJoinThreshold

spark.sql.autoBroadcastJoinThreshold

Used exclusively in JoinSelection execution planning strategy

autoSizeUpdateEnabled

spark.sql.statistics.size.autoUpdate.enabled

Used when:

CommandUtils is requested for updating existing table statistics
AlterTableAddPartitionCommand is executed

avroCompressionCodec

spark.sql.avro.compression.codec

Used exclusively when AvroOptions is requested for the compression configuration property (and it was not set explicitly)

broadcastTimeout

spark.sql.broadcastTimeout

Used exclusively in BroadcastExchangeExec (for broadcasting a table to executors).

bucketingEnabled

spark.sql.sources.bucketing.enabled

Used when FileSourceScanExec is requested for the input RDD and to determine output partitioning and ordering

cacheVectorizedReaderEnabled

spark.sql.inMemoryColumnarStorage.enableVectorizedReader

Used exclusively when InMemoryTableScanExec physical operator is requested for supportsBatch flag.

caseSensitiveAnalysis

spark.sql.caseSensitive

cboEnabled

spark.sql.cbo.enabled

Used in:

ReorderJoin logical plan optimization (and indirectly in StarSchemaDetection for reorderStarJoins)
CostBasedJoinReorder logical plan optimization

columnBatchSize

spark.sql.inMemoryColumnarStorage.batchSize

Used when…FIXME

dataFramePivotMaxValues

spark.sql.pivotMaxValues

Used exclusively in pivot operator.

dataFrameRetainGroupColumns

spark.sql.retainGroupColumns

Used exclusively in RelationalGroupedDataset when creating the result Dataset (after agg, count, mean, max, avg, min, and sum operators).

defaultSizeInBytes

spark.sql.defaultSizeInBytes

Used when:

DetermineTableStats logical resolution rule could not compute the table size or spark.sql.statistics.fallBackToHdfs is turned off
ExternalRDD, LogicalRDD and DataSourceV2Relation are requested for statistics (i.e. computeStats)
(Spark Structured Streaming) StreamingRelation, StreamingExecutionRelation, StreamingRelationV2 and ContinuousExecutionRelation are requested for statistics (i.e. computeStats)
DataSource creates a HadoopFsRelation for FileFormat data source (and builds a CatalogFileIndex when no table statistics are available)
BaseRelation is requested for an estimated size of this relation (in bytes)

enableRadixSort

spark.sql.sort.enableRadixSort

Used exclusively when SortExec physical operator is requested for a UnsafeExternalRowSorter.

exchangeReuseEnabled

spark.sql.exchange.reuse

Used when ReuseSubquery and ReuseExchange physical optimizations are executed

Note	When disabled (i.e. `false`), `ReuseSubquery` and `ReuseExchange` physical optimizations do no optimizations.

fallBackToHdfsForStatsEnabled

spark.sql.statistics.fallBackToHdfs

Used exclusively when DetermineTableStats logical resolution rule is executed.

fileCommitProtocolClass

spark.sql.sources.commitProtocolClass

Used (to instantiate a FileCommitProtocol) when:

SaveAsHiveFile is requested to saveAsHiveFile
InsertIntoHadoopFsRelationCommand logical command is executed

histogramEnabled

spark.sql.statistics.histogram.enabled

Used exclusively when AnalyzeColumnCommand logical command is executed.

histogramNumBins

spark.sql.statistics.histogram.numBins

Used exclusively when AnalyzeColumnCommand is executed with spark.sql.statistics.histogram.enabled turned on (and calculates percentiles).

hugeMethodLimit

spark.sql.codegen.hugeMethodLimit

Used exclusively when WholeStageCodegenExec unary physical operator is requested to execute (and generate a RDD[InternalRow]), i.e. when the compiled function exceeds this threshold, the whole-stage codegen is deactivated for this subtree of the query plan.

ignoreCorruptFiles

spark.sql.files.ignoreCorruptFiles

Used when:

FileScanRDD is created (and then to compute a partition)
OrcFileFormat is requested to inferSchema and buildReader
ParquetFileFormat is requested to mergeSchemasInParallel

ignoreMissingFiles

spark.sql.files.ignoreMissingFiles

Used exclusively when FileScanRDD is created (and then to compute a partition)

inMemoryPartitionPruning

spark.sql.inMemoryColumnarStorage.partitionPruning

Used exclusively when InMemoryTableScanExec physical operator is requested for filtered cached column batches (as a RDD[CachedBatch]).

isParquetBinaryAsString

spark.sql.parquet.binaryAsString

isParquetINT96AsTimestamp

spark.sql.parquet.int96AsTimestamp

isParquetINT96TimestampConversion

spark.sql.parquet.int96TimestampConversion

Used exclusively when ParquetFileFormat is requested to build a data reader with partition column values appended.

joinReorderEnabled

spark.sql.cbo.joinReorder.enabled

Used exclusively in CostBasedJoinReorder logical plan optimization

limitScaleUpFactor

spark.sql.limit.scaleUpFactor

Used exclusively when a physical operator is requested the first n rows as an array.

manageFilesourcePartitions

spark.sql.hive.manageFilesourcePartitions

Used in:

HiveMetastoreCatalog is requested to convert a HiveTableRelation to a LogicalRelation
CreateDataSourceTableCommand, CreateDataSourceTableAsSelectCommand and InsertIntoHadoopFsRelationCommand logical commands are executed
DDLUtils is requested to verifyPartitionProviderIsHive
DataSource is requested to resolve a relation (for file-based data source tables and creates a HadoopFsRelation)
FileStatusCache is requested to getOrCreate

numShufflePartitions

spark.sql.shuffle.partitions

Used in:

Dataset’s repartition operator (for a RepartitionByExpression logical operator)
SparkSqlAstBuilder (for a RepartitionByExpression logical operator)
JoinSelection execution planning strategy
SetCommand logical command
EnsureRequirements physical plan optimization

offHeapColumnVectorEnabled

spark.sql.columnVector.offheap.enabled

Used when:

InMemoryTableScanExec is requested for vectorTypes and createAndDecompressColumn
OrcFileFormat is requested to build a data reader with partition column values appended
ParquetFileFormat is requested for vectorTypes and build a data reader with partition column values appended

optimizerExcludedRules

spark.sql.optimizer.excludedRules

Used exclusively when Optimizer is requested for the optimization batches

optimizerInSetConversionThreshold

spark.sql.optimizer.inSetConversionThreshold

Used exclusively when OptimizeIn logical query optimization is applied to a logical plan (and replaces an In predicate expression with an InSet)

parallelFileListingInStatsComputation

spark.sql.statistics.parallelFileListingInStatsComputation.enabled

Used exclusively when CommandUtils helper object is requested to calculate the total size of a table (with partitions) (for AnalyzeColumnCommand and AnalyzeTableCommand commands)

parquetFilterPushDown

spark.sql.parquet.filterPushdown

Used exclusively when ParquetFileFormat is requested to build a data reader with partition column values appended.

parquetRecordFilterEnabled

spark.sql.parquet.recordLevelFilter.enabled

Used exclusively when ParquetFileFormat is requested to build a data reader with partition column values appended.

parquetVectorizedReaderEnabled

spark.sql.parquet.enableVectorizedReader

Used when:

FileSourceScanExec is requested for needsUnsafeRowConversion flag
ParquetFileFormat is requested for supportBatch flag and build a data reader with partition column values appended

partitionOverwriteMode

spark.sql.sources.partitionOverwriteMode

Used exclusively when InsertIntoHadoopFsRelationCommand logical command is executed

preferSortMergeJoin

spark.sql.join.preferSortMergeJoin

Used exclusively in JoinSelection execution planning strategy to prefer sort merge join over shuffle hash join.

runSQLonFile

spark.sql.runSQLOnFiles

Used when:

ResolveRelations does isRunningDirectlyOnFiles
ResolveSQLOnFile does maybeSQLFile

sessionLocalTimeZone

spark.sql.session.timeZone

starSchemaDetection

spark.sql.cbo.starSchemaDetection

Used exclusively in ReorderJoin logical plan optimization (and indirectly in StarSchemaDetection)

stringRedactionPattern

spark.sql.redaction.string.regex

Used when:

DataSourceScanExec is requested to redact sensitive information (in text representations)
QueryExecution is requested to redact sensitive information (in text representations)

subexpressionEliminationEnabled

spark.sql.subexpressionElimination.enabled

Used exclusively when SparkPlan is requested for subexpressionEliminationEnabled flag.

supportQuotedRegexColumnName

spark.sql.parser.quotedRegexColumnNames

Used when:

Dataset.col operator is used
AstBuilder is requested to parse a dereference and column reference in a SQL statement

useCompression

spark.sql.inMemoryColumnarStorage.compressed

Used when…FIXME

wholeStageEnabled

spark.sql.codegen.wholeStage

Used in:

CollapseCodegenStages to control codegen
ParquetFileFormat to control row batch reading

wholeStageFallback

spark.sql.codegen.fallback

Used exclusively when WholeStageCodegenExec is executed.

wholeStageMaxNumFields

spark.sql.codegen.maxFields

Used in:

CollapseCodegenStages to control codegen
ParquetFileFormat to control row batch reading

wholeStageSplitConsumeFuncByOperator

spark.sql.codegen.splitConsumeFuncByOperator

Used exclusively when CodegenSupport is requested to consume

wholeStageUseIdInClassName

spark.sql.codegen.useIdInClassName

Used exclusively when WholeStageCodegenExec is requested to generate the Java source code for the child physical plan subtree (when created)

windowExecBufferInMemoryThreshold

spark.sql.windowExec.buffer.in.memory.threshold

Used exclusively when WindowExec unary physical operator is executed.

windowExecBufferSpillThreshold

spark.sql.windowExec.buffer.spill.threshold

Used exclusively when WindowExec unary physical operator is executed.

useObjectHashAggregation

spark.sql.execution.useObjectHashAggregateExec

Used exclusively when Aggregation execution planning strategy is executed (and uses AggUtils to create an aggregation physical operator).

Getting Parameters and Hints

You can get the current parameters and hints using the following family of get methods.



getConf[T](entry: ConfigEntry[T], defaultValue: T): T
getConf[T](entry: ConfigEntry[T]): T
getConf[T](entry: OptionalConfigEntry[T]): Option[T]
getConfString(key: String): String
getConfString(key: String, defaultValue: String): String
getAllConfs: immutable.Map[String, String]
getAllDefinedConfs: Seq[(String, String, String)]

getConf[T](entry: ConfigEntry[T], defaultValue: T): T

getConf[T](entry: ConfigEntry[T]): T

getConf[T](entry: OptionalConfigEntry[T]): Option[T]

getConfString(key: String): String

getConfString(key: String, defaultValue: String): String

getAllConfs: immutable.Map[String, String]

getAllDefinedConfs: Seq[(String, String, String)]

Setting Parameters and Hints

You can set parameters and hints using the following family of set methods.



setConf(props: Properties): Unit
setConfString(key: String, value: String): Unit
setConf[T](entry: ConfigEntry[T], value: T): Unit

setConf(props: Properties): Unit

setConfString(key: String, value: String): Unit

setConf[T](entry: ConfigEntry[T], value: T): Unit

Unsetting Parameters and Hints

You can unset parameters and hints using the following family of unset methods.



unsetConf(key: String): Unit
unsetConf(entry: ConfigEntry[_]): Unit

unsetConf(key: String): Unit

unsetConf(entry: ConfigEntry[_]): Unit

Clearing All Parameters and Hints



clear(): Unit

clear(): Unit

You can use clear to remove all the parameters and hints in SQLConf.

Redacting Data Source Options with Sensitive Information — `redactOptions` Method



redactOptions(options: Map[String, String]): Map[String, String]

redactOptions(options: Map[String, String]): Map[String, String]

redactOptions takes the values of the spark.sql.redaction.options.regex and spark.redaction.regex configuration properties.

For every regular expression (in the order), redactOptions redacts sensitive information, i.e. finds the first match of a regular expression pattern in every option key or value and if either matches replaces the value with ***(redacted).

Note	`redactOptions` is used exclusively when `SaveIntoDataSourceCommand` logical command is requested for the simple description.

SQLConf — Internal Configuration Store