Configuration Properties
Configuration properties (aka settings) allow you to fine-tune a Spark SQL application.
You can set a configuration property when creating a new SparkSession instance using the config method.
```scala
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder
  .master("local[*]")
  .appName("My Spark Application")
  .config("spark.sql.warehouse.dir", "c:/Temp") // (1)
  .getOrCreate
```

(1) Sets spark.sql.warehouse.dir for the Spark SQL session
You can also set a property using the SQL SET command.
```
scala> spark.conf.getOption("spark.sql.hive.metastore.version")
res1: Option[String] = None

scala> spark.sql("SET spark.sql.hive.metastore.version=2.3.2").show(truncate = false)
+--------------------------------+-----+
|key                             |value|
+--------------------------------+-----+
|spark.sql.hive.metastore.version|2.3.2|
+--------------------------------+-----+

scala> spark.conf.get("spark.sql.hive.metastore.version")
res2: String = 2.3.2
```
The available properties are listed below, alphabetically by name, each with its default value and a description.
`spark.sql.adaptive.enabled` (default: `false`)

Enables adaptive query execution. Use `SQLConf.adaptiveExecutionEnabled` method to access the current value.
`spark.sql.allowMultipleContexts` (default: `true`)

Controls whether creating multiple SQLContexts/HiveContexts is allowed.
`spark.sql.autoBroadcastJoinThreshold` (default: `10485760`, i.e. 10 MB)

Maximum size (in bytes) for a table that will be broadcast to all worker nodes when performing a join. If the size of the statistics of the logical plan of a table is at most this setting, the DataFrame is broadcast for the join. Negative values or `0` disable broadcasting. Use `SQLConf.autoBroadcastJoinThreshold` method to access the current value.
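For example, you can inspect and change the threshold at runtime. A minimal sketch for spark-shell (where a `spark` session is in scope):

```scala
// Current value (returned as a String)
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

// Raise the threshold to 100 MB for this session
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)

// Disable automatic broadcast joins altogether
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```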
`spark.sql.avro.compression.codec` (default: `snappy`)

Compression codec used when writing Avro files. The supported codecs are `uncompressed`, `deflate`, `snappy`, `bzip2` and `xz`. Use `SQLConf.avroCompressionCodec` method to access the current value.
`spark.sql.broadcastTimeout` (default: `300`, i.e. 5 minutes)

Timeout in seconds for the broadcast wait time in broadcast joins. When negative, the timeout is assumed infinite (i.e. `Duration.Inf`). Use `SQLConf.broadcastTimeout` method to access the current value.
`spark.sql.caseSensitive` (default: `false`)

(internal) Controls whether the query analyzer should be case sensitive (`true`) or not (`false`). Use `SQLConf.caseSensitiveAnalysis` method to access the current value.
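A minimal spark-shell sketch of the effect (the column names are illustrative):

```scala
import spark.implicits._

val df = Seq((1, "a")).toDF("id", "value")

spark.conf.set("spark.sql.caseSensitive", false)
df.select("ID").show()  // resolves to the id column

spark.conf.set("spark.sql.caseSensitive", true)
// df.select("ID") would now fail analysis: cannot resolve 'ID'
```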
`spark.sql.cbo.enabled` (default: `false`)

Enables cost-based optimization (CBO) for estimation of plan statistics when `true`. Use `SQLConf.cboEnabled` method to access the current value.
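CBO relies on table and column statistics, which are collected with ANALYZE TABLE. A sketch, assuming a table named `sales` with `id` and `amount` columns exists:

```scala
spark.conf.set("spark.sql.cbo.enabled", true)

// Collect table-level and column-level statistics for the optimizer
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS id, amount")
```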
`spark.sql.cbo.joinReorder.enabled` (default: `false`)

Enables join reordering for cost-based optimization (CBO). Use `SQLConf.joinReorderEnabled` method to access the current value.
`spark.sql.cbo.starSchemaDetection` (default: `false`)

Enables join reordering based on star schema detection for cost-based optimization (CBO) in ReorderJoin logical plan optimization. Use `SQLConf.starSchemaDetection` method to access the current value.
`spark.sql.codegen.comments` (default: `false`)

Controls whether `CodegenContext` should register comments (`true`) or not (`false`).
`spark.sql.codegen.factoryMode` (default: `FALLBACK`)

(internal) Determines the codegen generator fallback behavior. Acceptable values: `CODEGEN_ONLY`, `FALLBACK` and `NO_CODEGEN`. Used when `CodeGeneratorWithInterpretedFallback` is requested to create an object.
`spark.sql.codegen.fallback` (default: `true`)

(internal) Whether whole-stage codegen could be temporarily disabled for the part of a query that has failed to compile generated code (`true`) or not (`false`). Use `SQLConf.wholeStageFallback` method to access the current value.
`spark.sql.codegen.hugeMethodLimit` (default: `65535`)

(internal) The maximum bytecode size of a single compiled Java function generated by whole-stage codegen. The default value `65535` is the largest bytecode size possible for a valid Java method. Use `SQLConf.hugeMethodLimit` method to access the current value.
`spark.sql.codegen.useIdInClassName` (default: `true`)

(internal) Controls whether to embed the (whole-stage) codegen stage ID into the class name of the generated class as a suffix (`true`) or not (`false`). Use `SQLConf.wholeStageUseIdInClassName` method to access the current value.
`spark.sql.codegen.maxFields` (default: `100`)

(internal) Maximum number of output fields (including nested fields) that whole-stage codegen supports. Going above the number deactivates whole-stage codegen. Use `SQLConf.wholeStageMaxNumFields` method to access the current value.
`spark.sql.codegen.splitConsumeFuncByOperator` (default: `true`)

(internal) Controls whether whole-stage codegen puts the logic of consuming rows of each physical operator into individual methods, instead of a single big method. This can be used to avoid an oversized function that would miss the opportunity for JIT optimization. Use `SQLConf.wholeStageSplitConsumeFuncByOperator` method to access the current value.
`spark.sql.codegen.wholeStage` (default: `true`)

(internal) Whether the whole stage (of multiple physical operators) will be compiled into a single Java method (`true`) or not (`false`). Use `SQLConf.wholeStageEnabled` method to access the current value.
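To see what whole-stage codegen produces for a query, you can use the debugging facilities in `org.apache.spark.sql.execution.debug`. A sketch for spark-shell (the query is illustrative):

```scala
import org.apache.spark.sql.execution.debug._

val q = spark.range(10).selectExpr("id % 3 AS g").groupBy("g").count()

// Prints the generated Java source per whole-stage codegen subtree
q.debugCodegen()
```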
`spark.sql.columnVector.offheap.enabled` (default: `false`)

(internal) Enables `OffHeapColumnVector` in `ColumnarBatch` (`true`) or not (`false`). Use `SQLConf.offHeapColumnVectorEnabled` method to access the current value.
`spark.sql.defaultSizeInBytes` (default: Java's `Long.MaxValue`)

(internal) Estimated size of a table or relation used in query planning. Set to Java's `Long.MaxValue`, which is larger than `spark.sql.autoBroadcastJoinThreshold`, to be more conservative. Used by the planner to decide when it is safe to broadcast a relation. By default, the system will assume that tables are too large to broadcast. Use `SQLConf.defaultSizeInBytes` method to access the current value.
`spark.sql.exchange.reuse` (default: `true`)

(internal) When enabled (i.e. `true`), the Spark planner finds duplicated exchanges and subqueries and reuses them. Use `SQLConf.exchangeReuseEnabled` method to access the current value.
`spark.sql.execution.useObjectHashAggregateExec` (default: `true`)

Enables `ObjectHashAggregateExec` when the Aggregation execution planning strategy is executed. Use `SQLConf.useObjectHashAggregation` method to access the current value.
`spark.sql.files.ignoreCorruptFiles` (default: `false`)

Controls whether to ignore corrupt files (`true`) or not (`false`). Use `SQLConf.ignoreCorruptFiles` method to access the current value.
`spark.sql.files.ignoreMissingFiles` (default: `false`)

Controls whether to ignore missing files (`true`) or not (`false`). Use `SQLConf.ignoreMissingFiles` method to access the current value.
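Both flags apply to file-based scans. A hedged sketch (the path is illustrative):

```scala
spark.conf.set("spark.sql.files.ignoreCorruptFiles", true)
spark.conf.set("spark.sql.files.ignoreMissingFiles", true)

// A corrupted or deleted file under the path no longer fails the whole scan;
// the rows that have been read successfully are still returned
val logs = spark.read.parquet("/data/logs")
```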
`spark.sql.hive.convertMetastoreOrc` (default: `true`)

(internal) When enabled (i.e. `true`), the built-in ORC reader and writer are used to process ORC tables created by using the HiveQL syntax (instead of Hive serde).
`spark.sql.hive.convertMetastoreParquet` (default: `true`)

Controls whether to use the built-in Parquet reader and writer to process parquet tables created by using the HiveQL syntax (instead of Hive serde).
`spark.sql.hive.convertMetastoreParquet.mergeSchema` (default: `false`)

Enables trying to merge possibly different but compatible Parquet schemas in different Parquet data files. This configuration is only effective when `spark.sql.hive.convertMetastoreParquet` is enabled.
`spark.sql.hive.manageFilesourcePartitions` (default: `true`)

Enables metastore partition management for file source tables. This includes both datasource and converted Hive tables. When enabled (`true`), datasource tables store partition metadata in the Hive metastore, and use the metastore to prune partitions during query planning. Use `SQLConf.manageFilesourcePartitions` method to access the current value.
`spark.sql.hive.metastore.barrierPrefixes` (default: (empty))

Comma-separated list of class prefixes that should explicitly be reloaded for each version of Hive that Spark SQL is communicating with, e.g. Hive UDFs that are declared in a prefix that typically would be shared (i.e. `org.apache.spark.*`).
`spark.sql.hive.metastore.jars` (default: `builtin`)

Location of the jars that should be used to create a HiveClientImpl. Supported locations:

- `builtin`: the Hive jars bundled with the Spark distribution (requires `spark.sql.hive.metastore.version` to match the bundled version)
- `maven`: Hive jars of the specified version downloaded from Maven repositories
- a classpath in the standard format for both Hive and Hadoop
`spark.sql.hive.metastore.sharedPrefixes` (default: `com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc`)

Comma-separated list of class prefixes that should be loaded using the classloader that is shared between Spark SQL and a specific version of Hive. An example of classes that should be shared are:

- JDBC drivers that are needed to talk to the metastore
- other classes that interact with classes that are already shared, e.g. custom appenders used by log4j
`spark.sql.hive.metastore.version` (default: `1.2.1`)

Version of the Hive metastore (and the client classes and jars). Supported versions range from `0.12.0` up to and including `2.3.2`.
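The metastore properties are usually set when building the SparkSession, before the Hive client is created. A sketch (the version choice is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("Hive metastore example")
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "2.3.2")
  .config("spark.sql.hive.metastore.jars", "maven") // download matching Hive jars
  .getOrCreate()
```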
`spark.sql.inMemoryColumnarStorage.batchSize` (default: `10000`)

(internal) Controls the size of batches for columnar caching. Use `SQLConf.columnBatchSize` method to access the current value.
`spark.sql.inMemoryColumnarStorage.compressed` (default: `true`)

(internal) When enabled, Spark SQL automatically selects a compression codec for each column based on statistics of the data. Use `SQLConf.useCompression` method to access the current value.
`spark.sql.inMemoryColumnarStorage.enableVectorizedReader` (default: `true`)

Enables vectorized reader for columnar caching. Use `SQLConf.cacheVectorizedReaderEnabled` method to access the current value.
`spark.sql.inMemoryColumnarStorage.partitionPruning` (default: `true`)

(internal) Enables partition pruning for in-memory columnar tables. Use `SQLConf.inMemoryPartitionPruning` method to access the current value.
`spark.sql.join.preferSortMergeJoin` (default: `true`)

(internal) Controls whether the JoinSelection execution planning strategy prefers sort merge join over shuffled hash join. Use `SQLConf.preferSortMergeJoin` method to access the current value.
`spark.sql.limit.scaleUpFactor` (default: `4`)

(internal) Minimal increase rate in the number of partitions between attempts when executing `take` on a structured query. Higher values lead to more partitions read. Use `SQLConf.limitScaleUpFactor` method to access the current value.
`spark.sql.optimizer.excludedRules` (default: (empty))

Comma-separated list of optimization rule names that should be disabled (excluded) in the optimizer. The optimizer will log the rules that have indeed been excluded. Use `SQLConf.optimizerExcludedRules` method to access the current value.
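For example, to exclude a single rule by its fully-qualified name (the rule choice is illustrative):

```scala
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")
```

Some rules are required for correctness and cannot be excluded; checking the optimizer's log shows which rules were actually disabled.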
`spark.sql.optimizer.inSetConversionThreshold` (default: `10`)

(internal) The threshold of set size for `InSet` conversion. Use `SQLConf.optimizerInSetConversionThreshold` method to access the current value.
`spark.sql.parquet.binaryAsString` (default: `false`)

Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems. Use `SQLConf.isParquetBinaryAsString` method to access the current value.
`spark.sql.parquet.int96AsTimestamp` (default: `true`)

Some Parquet-producing systems, in particular Impala, store Timestamp as INT96. Spark also stores Timestamp as INT96 to avoid precision loss in the nanoseconds field. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems. Use `SQLConf.isParquetINT96AsTimestamp` method to access the current value.
`spark.sql.parquet.enableVectorizedReader` (default: `true`)

Enables vectorized parquet decoding. Use `SQLConf.parquetVectorizedReaderEnabled` method to access the current value.
`spark.sql.parquet.filterPushdown` (default: `true`)

Controls the filter predicate push-down optimization for data sources using the parquet file format. Use `SQLConf.parquetFilterPushDown` method to access the current value.
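You can verify push-down in the physical plan: the Parquet scan node lists the pushed filters. A sketch (the path and column are illustrative):

```scala
import spark.implicits._

val events = spark.read.parquet("/data/events")
events.filter($"id" > 100).explain()
// The FileScan parquet node should show something like:
//   PushedFilters: [IsNotNull(id), GreaterThan(id,100)]
```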
`spark.sql.parquet.int96TimestampConversion` (default: `false`)

Controls whether timestamp adjustments should be applied to INT96 data when converting to timestamps, for data written by Impala. This is necessary because Impala stores INT96 data with a different timezone offset than Hive and Spark. Use `SQLConf.isParquetINT96TimestampConversion` method to access the current value.
`spark.sql.parquet.recordLevelFilter.enabled` (default: `false`)

Enables Parquet's native record-level filtering using the pushed-down filters. This configuration only has an effect when `spark.sql.parquet.filterPushdown` is enabled. Use `SQLConf.parquetRecordFilterEnabled` method to access the current value.
`spark.sql.parser.quotedRegexColumnNames` (default: `false`)

Controls whether quoted identifiers (using backticks) in SELECT statements should be interpreted as regular expressions. Use `SQLConf.supportQuotedRegexColumnName` method to access the current value.
`spark.sql.sort.enableRadixSort` (default: `true`)

(internal) Controls whether to use radix sort (`true`) or not (`false`). Radix sort is much faster but requires additional memory to be reserved up-front. The memory overhead may be significant when sorting very small rows (up to 50% more). Use `SQLConf.enableRadixSort` method to access the current value.
`spark.sql.sources.commitProtocolClass` (default: `SQLHadoopMapReduceCommitProtocol`)

(internal) Fully-qualified class name of the FileCommitProtocol to use for writing out data source tables. Use `SQLConf.fileCommitProtocolClass` method to access the current value.
`spark.sql.sources.partitionOverwriteMode` (default: `static`)

Enables dynamic partition inserts when `dynamic`. When inserting with `INSERT OVERWRITE` into a partitioned data source table with dynamic partition columns, Spark SQL supports two modes: `static` and `dynamic`. In `static` mode, Spark deletes all the partitions that match the partition specification (e.g. `PARTITION(a=1,b)`) in the INSERT statement before overwriting. In `dynamic` mode, Spark doesn't delete partitions ahead, and only overwrites those partitions that have data written into them at runtime. The default (`static`) keeps the behavior of Spark prior to 2.3. Use `SQLConf.partitionOverwriteMode` method to access the current value.
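A sketch of a dynamic-mode insert that overwrites only the partitions present in the incoming data (all names and paths are illustrative):

```scala
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

import spark.implicits._
val updates = Seq(("2018-01-02", 10)).toDF("date", "count")

// Only the date=2018-01-02 partition is overwritten; other partitions stay intact
updates.write
  .mode("overwrite")
  .partitionBy("date")
  .parquet("/data/counts")
```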
`spark.sql.pivotMaxValues` (default: `10000`)

Maximum number of (distinct) values that will be collected without error (when doing a pivot without specifying the values for the pivot column). Use `SQLConf.dataFramePivotMaxValues` method to access the current value.
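The limit only matters when the pivot values are not given explicitly, because Spark then collects the distinct values of the pivot column first. A quick sketch with illustrative data:

```scala
import spark.implicits._

val sales = Seq(("2018", "Q1", 100), ("2018", "Q2", 200)).toDF("year", "quarter", "amount")

// Spark collects up to spark.sql.pivotMaxValues distinct quarter values here
sales.groupBy("year").pivot("quarter").sum("amount").show()
```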
`spark.sql.redaction.options.regex` (default: `(?i)url`)

Regular expression to find options of a Spark SQL command with sensitive information. The values of the options matched will be redacted in the explain output. This redaction is applied on top of the global redaction configuration defined by `spark.redaction.regex`. Used exclusively when `SQLConf` is requested to redact options.
`spark.sql.redaction.string.regex` (default: (undefined))

When this regex matches a string part, that string part is replaced by a dummy value (i.e. `*********(redacted)`). This is currently used to redact the output of SQL explain commands. When this conf is not set, the value of `spark.redaction.string.regex` is used instead. Use `SQLConf.stringRedactionPattern` method to access the current value.
`spark.sql.retainGroupColumns` (default: `true`)

Controls whether to retain columns used for aggregation or not (in RelationalGroupedDataset operators). Use `SQLConf.dataFrameRetainGroupColumns` method to access the current value.
`spark.sql.runSQLOnFiles` (default: `true`)

(internal) Controls whether Spark SQL could use `datasource`.`path` as a table in SQL queries. Use `SQLConf.runSQLonFile` method to access the current value.
`spark.sql.selfJoinAutoResolveAmbiguity` (default: `true`)

Controls whether to resolve ambiguity in join conditions for self-joins automatically.
`spark.sql.session.timeZone` (default: Java's `TimeZone.getDefault.getID`)

The ID of session-local timezone, e.g. "GMT", "America/Los_Angeles", etc. Use `SQLConf.sessionLocalTimeZone` method to access the current value.
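For example, to render timestamps in UTC regardless of the JVM default:

```scala
spark.conf.set("spark.sql.session.timeZone", "UTC")

// Timestamps are now rendered in the session timezone
spark.sql("SELECT current_timestamp() AS now").show(truncate = false)
```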
`spark.sql.shuffle.partitions` (default: `200`)

Number of partitions to use by default when shuffling data for joins or aggregations. Corresponds to Apache Hive's `mapred.reduce.tasks` property, which Spark considers deprecated. Use `SQLConf.numShufflePartitions` method to access the current value.
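The setting takes effect on the next shuffle. A spark-shell sketch:

```scala
import spark.implicits._

spark.conf.set("spark.sql.shuffle.partitions", 8)

// The groupBy introduces a shuffle, so the result has 8 partitions
val counts = spark.range(100).groupBy($"id" % 10).count()
counts.rdd.getNumPartitions  // 8
```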
`spark.sql.sources.bucketing.enabled` (default: `true`)

Enables bucketing support. When disabled (i.e. `false`), bucketed tables are considered regular (non-bucketed) tables. Use `SQLConf.bucketingEnabled` method to access the current value.
`spark.sql.sources.default` (default: `parquet`)

Defines the default data source to use for DataFrameReader. Used when reading or writing datasets with no format specified explicitly (e.g. `DataFrameReader.load` and `DataFrameWriter.save`).
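With the default of `parquet`, the following two reads are equivalent (the path is illustrative):

```scala
// No format specified: spark.sql.sources.default is used
val d1 = spark.read.load("/data/events")

// Equivalent, with the format spelled out
val d2 = spark.read.format("parquet").load("/data/events")
```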
`spark.sql.statistics.fallBackToHdfs` (default: `false`)

Enables automatic calculation of the table size statistic by falling back to HDFS if the table statistics are not available from table metadata. This can be useful in determining if a table is small enough for auto broadcast joins in query planning. Use `SQLConf.fallBackToHdfsForStatsEnabled` method to access the current value.
`spark.sql.statistics.histogram.enabled` (default: `false`)

Enables generating histograms when computing column statistics. Use `SQLConf.histogramEnabled` method to access the current value.
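Histograms are computed as part of column-level statistics. A sketch, assuming a `sales` table with an `amount` column:

```scala
spark.conf.set("spark.sql.statistics.histogram.enabled", true)

// Column statistics (now including histograms) are computed here
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS amount")

// Inspect the collected column statistics
spark.sql("DESCRIBE EXTENDED sales amount").show(truncate = false)
```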
`spark.sql.statistics.histogram.numBins` (default: `254`)

(internal) The number of bins when generating histograms. Effective only when `spark.sql.statistics.histogram.enabled` is enabled. Use `SQLConf.histogramNumBins` method to access the current value.
`spark.sql.statistics.parallelFileListingInStatsComputation.enabled` (default: `true`)

(internal) Enables parallel file listing in SQL commands, e.g. `ANALYZE TABLE` (as opposed to single-threaded listing, which can be much slower). Use `SQLConf.parallelFileListingInStatsComputation` method to access the current value.
`spark.sql.statistics.size.autoUpdate.enabled` (default: `false`)

Enables automatic update of the table size statistic of a table after the table has changed. Use `SQLConf.autoSizeUpdateEnabled` method to access the current value.
`spark.sql.subexpressionElimination.enabled` (default: `true`)

(internal) Enables subexpression elimination. Use `subexpressionEliminationEnabled` method to access the current value.
`spark.sql.TungstenAggregate.testFallbackStartsAt` (default: (empty))

(internal) A comma-separated pair of numbers, e.g. `5,10`, that HashAggregateExec uses to inform TungstenAggregationIterator when to switch to a sort-based aggregation (when the hash-based approach is unable to acquire enough memory).
`spark.sql.ui.retainedExecutions` (default: `1000`)

The number of SQLExecutionUIData entries to keep in the `completedExecutions` internal registry. When a query execution finishes, the execution is removed from the internal `activeExecutions` registry and stored in `failedExecutions` or `completedExecutions`, given the end execution status.
`spark.sql.windowExec.buffer.in.memory.threshold` (default: `4096`)

(internal) Threshold for the number of rows guaranteed to be held in memory by the WindowExec physical operator. Use `windowExecBufferInMemoryThreshold` method to access the current value.
`spark.sql.windowExec.buffer.spill.threshold` (default: `4096`)

(internal) Threshold for the number of rows buffered in the WindowExec physical operator. Use `windowExecBufferSpillThreshold` method to access the current value.