spark-sql-spark技术分享-第50页

Bucketing

2012-03-29admin阅读(1318)

Bucketing

Bucketing is an optimization technique that uses buckets (and bucketing columns) to determine data partitioning and avoid data shuffle.

The motivation is to optimize performance of a join query by avoiding shuffles (aka exchanges) of tables participating in the join. Bucketing results in fewer exchanges (and so stages).

Note	Bucketing can show the biggest benefit when pre-shuffled bucketed tables are used more than once as bucketing itself takes time (that you will offset executing multiple join queries later).

Bucketing is enabled by default. Spark SQL uses spark.sql.sources.bucketing.enabled configuration property to control whether bucketing should be enabled and used for query optimization or not.

Bucketing is used exclusively in FileSourceScanExec physical operator (when it is requested for the input RDD and to determine the partitioning and ordering of the output).


Example: SortMergeJoin of two FileScans

import org.apache.spark.sql.SaveMode
spark.range(10e4.toLong).write.mode(SaveMode.Overwrite).saveAsTable("t10e4")
spark.range(10e6.toLong).write.mode(SaveMode.Overwrite).saveAsTable("t10e6")

// Bucketing is enabled by default
// Let's check it out anyway
assert(spark.sessionState.conf.bucketingEnabled, "Bucketing disabled?!")

// Make sure that you don't end up with a BroadcastHashJoin and a BroadcastExchange
// For that, let's disable auto broadcasting
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val tables = spark.catalog.listTables.where($"name" startsWith "t10e")
scala> tables.show
+-----+--------+-----------+---------+-----------+
| name|database|description|tableType|isTemporary|
+-----+--------+-----------+---------+-----------+
|t10e4| default|       null|  MANAGED|      false|
|t10e6| default|       null|  MANAGED|      false|
+-----+--------+-----------+---------+-----------+

val t4 = spark.table("t10e4")
val t6 = spark.table("t10e6")

assert(t4.count == 10e4)
assert(t6.count == 10e6)

// trigger execution of the join query
t4.join(t6, "id").foreach(_ => ())

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

Example: SortMergeJoin of two FileScans

import org.apache.spark.sql.SaveMode

spark.range(10e4.toLong).write.mode(SaveMode.Overwrite).saveAsTable("t10e4")

spark.range(10e6.toLong).write.mode(SaveMode.Overwrite).saveAsTable("t10e6")

// Bucketing is enabled by default

// Let's check it out anyway

assert(spark.sessionState.conf.bucketingEnabled, "Bucketing disabled?!")

// Make sure that you don't end up with a BroadcastHashJoin and a BroadcastExchange

// For that, let's disable auto broadcasting

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val tables = spark.catalog.listTables.where($"name" startsWith "t10e")

scala> tables.show

+-----+--------+-----------+---------+-----------+

+-----+--------+-----------+---------+-----------+

+-----+--------+-----------+---------+-----------+

val t4 = spark.table("t10e4")

val t6 = spark.table("t10e6")

assert(t4.count == 10e4)

assert(t6.count == 10e6)

// trigger execution of the join query

t4.join(t6, "id").foreach(_ => ())

The above join query is a fine example of a SortMergeJoinExec (aka SortMergeJoin) of two FileSourceScanExecs (aka Scan). The join query uses ShuffleExchangeExec physical operators (aka Exchange) to shuffle the table datasets for the SortMergeJoin.

spark sql bucketing sortmergejoin filescans.png

Figure 1. SortMergeJoin of FileScans (Details for Query)

One way to avoid the exchanges (and so optimize the join query) is to use table bucketing that is applicable for all file-based data sources, e.g. Parquet, ORC, JSON, CSV, that are saved as a table using DataFrameWrite.saveAsTable or simply available in a catalog by SparkSession.table.

Note	Bucketing is not supported for DataFrameWriter.save, DataFrameWriter.insertInto and DataFrameWriter.jdbc methods.

You use DataFrameWriter.bucketBy method to specify the number of buckets and the bucketing columns.

You can optionally sort the output rows in buckets using DataFrameWriter.sortBy method.



people.write
  .bucketBy(42, "name")
  .sortBy("age")
  .saveAsTable("people_bucketed")

1

2

3

4

5

6

7

8

people.write

.bucketBy(42, "name")

.sortBy("age")

.saveAsTable("people_bucketed")

Note	DataFrameWriter.bucketBy and DataFrameWriter.sortBy simply set respective internal properties that eventually become a bucketing specification.

Unlike bucketing in Apache Hive, Spark SQL creates the bucket files per the number of buckets and partitions. In other words, the number of bucketing files is the number of buckets multiplied by the number of task writers (one per partition).



val large = spark.range(10e6.toLong)
import org.apache.spark.sql.SaveMode
large.write
  .bucketBy(4, "id")
  .sortBy("id")
  .mode(SaveMode.Overwrite)
  .saveAsTable("bucketed_4_id")

scala> println(large.queryExecution.toRdd.getNumPartitions)
8

// That gives 8 (partitions/task writers) x 4 (buckets) = 32 files
// With _SUCCESS extra file and the ls -l header "total 794624" that gives 34 files
$ ls -tlr spark-warehouse/bucketed_4_id | wc -l
      34

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

val large = spark.range(10e6.toLong)

import org.apache.spark.sql.SaveMode

large.write

.bucketBy(4, "id")

.sortBy("id")

.mode(SaveMode.Overwrite)

.saveAsTable("bucketed_4_id")

scala> println(large.queryExecution.toRdd.getNumPartitions)

8

// That gives 8 (partitions/task writers) x 4 (buckets) = 32 files

// With _SUCCESS extra file and the ls -l header "total 794624" that gives 34 files

$ ls -tlr spark-warehouse/bucketed_4_id | wc -l

34

With bucketing, the Exchanges are no longer needed (as the tables are already pre-shuffled).



// Create bucketed tables
import org.apache.spark.sql.SaveMode
spark.range(10e4.toLong)
  .write
  .bucketBy(4, "id")
  .sortBy("id")
  .mode(SaveMode.Overwrite)
  .saveAsTable("bucketed_4_10e4")
spark.range(10e6.toLong)
  .write
  .bucketBy(4, "id")
  .sortBy("id")
  .mode(SaveMode.Overwrite)
  .saveAsTable("bucketed_4_10e6")

val bucketed_4_10e4 = spark.table("bucketed_4_10e4")
val bucketed_4_10e6 = spark.table("bucketed_4_10e6")

// trigger execution of the join query
bucketed_4_10e4.join(bucketed_4_10e6, "id").foreach(_ => ())

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

// Create bucketed tables

import org.apache.spark.sql.SaveMode

spark.range(10e4.toLong)

.write

.bucketBy(4, "id")

.sortBy("id")

.mode(SaveMode.Overwrite)

.saveAsTable("bucketed_4_10e4")

spark.range(10e6.toLong)

.write

.bucketBy(4, "id")

.sortBy("id")

.mode(SaveMode.Overwrite)

.saveAsTable("bucketed_4_10e6")

val bucketed_4_10e4 = spark.table("bucketed_4_10e4")

val bucketed_4_10e6 = spark.table("bucketed_4_10e6")

// trigger execution of the join query

bucketed_4_10e4.join(bucketed_4_10e6, "id").foreach(_ => ())

The above join query of the bucketed tables shows no ShuffleExchangeExec physical operators (aka Exchange) as the shuffling has already been executed (before the query was run).

spark sql bucketing sortmergejoin bucketed tables no exchanges.png

Figure 2. SortMergeJoin of Bucketed Tables (Details for Query)

The number of partitions of a bucketed table is exactly the number of buckets.



val bucketed_4_10e4 = spark.table("bucketed_4_10e4")
val numPartitions = bucketed_4_10e4.queryExecution.toRdd.getNumPartitions
assert(numPartitions == 4)

1

2

3

4

5

6

7

val bucketed_4_10e4 = spark.table("bucketed_4_10e4")

val numPartitions = bucketed_4_10e4.queryExecution.toRdd.getNumPartitions

assert(numPartitions == 4)

Use SessionCatalog or DESCRIBE EXTENDED SQL command to find the bucketing information.



val bucketed_tables = spark.catalog.listTables.where($"name" startsWith "bucketed_")
scala> bucketed_tables.show
+---------------+--------+-----------+---------+-----------+
|           name|database|description|tableType|isTemporary|
+---------------+--------+-----------+---------+-----------+
|bucketed_4_10e4| default|       null|  MANAGED|      false|
|bucketed_4_10e6| default|       null|  MANAGED|      false|
+---------------+--------+-----------+---------+-----------+

val demoTable = "bucketed_4_10e4"

// DESC EXTENDED or DESC FORMATTED would also work
val describeSQL = sql(s"DESCRIBE EXTENDED $demoTable")
scala> describeSQL.show(numRows = 21, truncate = false)
+----------------------------+---------------------------------------------------------------+-------+
|col_name                    |data_type                                                      |comment|
+----------------------------+---------------------------------------------------------------+-------+
|id                          |bigint                                                         |null   |
|                            |                                                               |       |
|# Detailed Table Information|                                                               |       |
|Database                    |default                                                        |       |
|Table                       |bucketed_4_10e4                                                |       |
|Owner                       |jacek                                                          |       |
|Created Time                |Tue Oct 02 10:50:50 CEST 2018                                  |       |
|Last Access                 |Thu Jan 01 01:00:00 CET 1970                                   |       |
|Created By                  |Spark 2.3.2                                                    |       |
|Type                        |MANAGED                                                        |       |
|Provider                    |parquet                                                        |       |
|Num Buckets                 |4                                                              |       |
|Bucket Columns              |[`id`]                                                         |       |
|Sort Columns                |[`id`]                                                         |       |
|Table Properties            |[transient_lastDdlTime=1538470250]                             |       |
|Statistics                  |413954 bytes                                                   |       |
|Location                    |file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_10e4|       |
|Serde Library               |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe             |       |
|InputFormat                 |org.apache.hadoop.mapred.SequenceFileInputFormat               |       |
|OutputFormat                |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat      |       |
|Storage Properties          |[serialization.format=1]                                       |       |
+----------------------------+---------------------------------------------------------------+-------+

import org.apache.spark.sql.catalyst.TableIdentifier
val metadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(demoTable))
scala> metadata.bucketSpec.foreach(println)
4 buckets, bucket columns: [id], sort columns: [id]

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

val bucketed_tables = spark.catalog.listTables.where($"name" startsWith "bucketed_")

scala> bucketed_tables.show

+---------------+--------+-----------+---------+-----------+

+---------------+--------+-----------+---------+-----------+

+---------------+--------+-----------+---------+-----------+

val demoTable = "bucketed_4_10e4"

// DESC EXTENDED or DESC FORMATTED would also work

val describeSQL = sql(s"DESCRIBE EXTENDED $demoTable")

scala> describeSQL.show(numRows = 21, truncate = false)

+----------------------------+---------------------------------------------------------------+-------+

|col_name |data_type |comment|

+----------------------------+---------------------------------------------------------------+-------+

|id |bigint |null |

| | | |

|# Detailed Table Information| | |

|Database |default | |

|Table |bucketed_4_10e4 | |

|Owner |jacek | |

|Created Time |Tue Oct 02 10:50:50 CEST 2018 | |

|Last Access |Thu Jan 01 01:00:00 CET 1970 | |

|Created By |Spark 2.3.2 | |

|Type |MANAGED | |

|Provider |parquet | |

|Num Buckets |4 | |

|Bucket Columns |[`id`] | |

|Sort Columns |[`id`] | |

|Table Properties |[transient_lastDdlTime=1538470250] | |

|Statistics |413954 bytes | |

|Location |file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_10e4| |

|Serde Library |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | |

|InputFormat |org.apache.hadoop.mapred.SequenceFileInputFormat | |

|OutputFormat |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat | |

|Storage Properties |[serialization.format=1] | |

+----------------------------+---------------------------------------------------------------+-------+

import org.apache.spark.sql.catalyst.TableIdentifier

val metadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(demoTable))

scala> metadata.bucketSpec.foreach(println)

4 buckets, bucket columns: [id], sort columns: [id]

The number of buckets has to be between 0 and 100000 exclusive or Spark SQL throws an AnalysisException:



Number of buckets should be greater than 0 but less than 100000. Got `[numBuckets]`

1

2

3

4

5

Number of buckets should be greater than 0 but less than 100000. Got `[numBuckets]`

There are however requirements that have to be met before Spark Optimizer gives a no-Exchange query plan:

The number of partitions on both sides of a join has to be exactly the same.
Both join operators have to use HashPartitioning partitioning scheme.

It is acceptable to use bucketing for one side of a join.



// Make sure that you don't end up with a BroadcastHashJoin and a BroadcastExchange
// For this, let's disable auto broadcasting
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val bucketedTableName = "bucketed_4_id"
val large = spark.range(10e5.toLong)
import org.apache.spark.sql.SaveMode
large.write
  .bucketBy(4, "id")
  .sortBy("id")
  .mode(SaveMode.Overwrite)
  .saveAsTable(bucketedTableName)
val bucketedTable = spark.table(bucketedTableName)

val t1 = spark
  .range(4)
  .repartition(4, $"id")  // Make sure that the number of partitions matches the other side

val q = t1.join(bucketedTable, "id")
scala> q.explain
== Physical Plan ==
*(4) Project [id#169L]
+- *(4) SortMergeJoin [id#169L], [id#167L], Inner
   :- *(2) Sort [id#169L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#169L, 4)
   :     +- *(1) Range (0, 4, step=1, splits=8)
   +- *(3) Sort [id#167L ASC NULLS FIRST], false, 0
      +- *(3) Project [id#167L]
         +- *(3) Filter isnotnull(id#167L)
            +- *(3) FileScan parquet default.bucketed_4_id[id#167L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>

q.foreach(_ => ())

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

// Make sure that you don't end up with a BroadcastHashJoin and a BroadcastExchange

// For this, let's disable auto broadcasting

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val bucketedTableName = "bucketed_4_id"

val large = spark.range(10e5.toLong)

import org.apache.spark.sql.SaveMode

large.write

.bucketBy(4, "id")

.sortBy("id")

.mode(SaveMode.Overwrite)

.saveAsTable(bucketedTableName)

val bucketedTable = spark.table(bucketedTableName)

val t1 = spark

.range(4)

.repartition(4, $"id") // Make sure that the number of partitions matches the other side

val q = t1.join(bucketedTable, "id")

scala> q.explain

== Physical Plan ==

*(4) Project [id#169L]

+- *(4) SortMergeJoin [id#169L], [id#167L], Inner

:- *(2) Sort [id#169L ASC NULLS FIRST], false, 0

: +- Exchange hashpartitioning(id#169L, 4)

: +- *(1) Range (0, 4, step=1, splits=8)

+- *(3) Sort [id#167L ASC NULLS FIRST], false, 0

+- *(3) Project [id#167L]

+- *(3) Filter isnotnull(id#167L)

+- *(3) FileScan parquet default.bucketed_4_id[id#167L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>

q.foreach(_ => ())

spark sql bucketing sortmergejoin one bucketed table.png

Figure 3. SortMergeJoin of One Bucketed Table (Details for Query)

Bucket Pruning — Optimizing Filtering on Bucketed Column (Reducing Bucket Files to Scan)

As of Spark 2.4, Spark SQL supports bucket pruning to optimize filtering on bucketed column (by reducing the number of bucket files to scan).

Bucket pruning supports the following predicate expressions:

EqualTo (=)
EqualNullSafe (<=>)
In
InSet
And and Or of the above

FileSourceStrategy execution planning strategy is responsible for selecting only LogicalRelations over HadoopFsRelation with the bucketing specification with the following:

There is exactly one bucketing column
The number of buckets is greater than 1


Example: Bucket Pruning

// Enable INFO logging level of FileSourceStrategy logger to see the details of the strategy
import org.apache.spark.sql.execution.datasources.FileSourceStrategy
val logger = FileSourceStrategy.getClass.getName.replace("$", "")
import org.apache.log4j.{Level, Logger}
Logger.getLogger(logger).setLevel(Level.INFO)

val q57 = q.where($"id" isin (50, 70))
scala> val sparkPlan57 = q57.queryExecution.executedPlan
18/11/17 23:18:04 INFO FileSourceStrategy: Pruning directories with:
18/11/17 23:18:04 INFO FileSourceStrategy: Pruned 2 out of 4 buckets.
18/11/17 23:18:04 INFO FileSourceStrategy: Post-Scan Filters: id#0L IN (50,70)
18/11/17 23:18:04 INFO FileSourceStrategy: Output Data Schema: struct<id: bigint>
18/11/17 23:18:04 INFO FileSourceScanExec: Pushed Filters: In(id, [50,70])
...

scala> println(sparkPlan57.numberedTreeString)
00 *(1) Filter id#0L IN (50,70)
01 +- *(1) FileScan parquet default.bucketed_4_id[id#0L,part#1L] Batched: true, Format: Parquet, Location: CatalogFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], PartitionCount: 2, PartitionFilters: [], PushedFilters: [In(id, [50,70])], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 2 out of 4

import org.apache.spark.sql.execution.FileSourceScanExec
val scan57 = sparkPlan57.collectFirst { case exec: FileSourceScanExec => exec }.get

import org.apache.spark.sql.execution.datasources.FileScanRDD
val rdd57 = scan57.inputRDDs.head.asInstanceOf[FileScanRDD]

import org.apache.spark.sql.execution.datasources.FilePartition
val bucketFiles57 = for {
  FilePartition(bucketId, files) <- rdd57.filePartitions
  f <- files
} yield s"Bucket $bucketId => $f"

scala> println(bucketFiles57.size)
24

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

Example: Bucket Pruning

// Enable INFO logging level of FileSourceStrategy logger to see the details of the strategy

import org.apache.spark.sql.execution.datasources.FileSourceStrategy

val logger = FileSourceStrategy.getClass.getName.replace("$", "")

import org.apache.log4j.{Level, Logger}

Logger.getLogger(logger).setLevel(Level.INFO)

val q57 = q.where($"id" isin (50, 70))

scala> val sparkPlan57 = q57.queryExecution.executedPlan

18/11/17 23:18:04 INFO FileSourceStrategy: Pruning directories with:

18/11/17 23:18:04 INFO FileSourceStrategy: Pruned 2 out of 4 buckets.

18/11/17 23:18:04 INFO FileSourceStrategy: Post-Scan Filters: id#0L IN (50,70)

18/11/17 23:18:04 INFO FileSourceStrategy: Output Data Schema: struct<id: bigint>

18/11/17 23:18:04 INFO FileSourceScanExec: Pushed Filters: In(id, [50,70])

...

scala> println(sparkPlan57.numberedTreeString)

00 *(1) Filter id#0L IN (50,70)

01 +- *(1) FileScan parquet default.bucketed_4_id[id#0L,part#1L] Batched: true, Format: Parquet, Location: CatalogFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], PartitionCount: 2, PartitionFilters: [], PushedFilters: [In(id, [50,70])], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 2 out of 4

import org.apache.spark.sql.execution.FileSourceScanExec

val scan57 = sparkPlan57.collectFirst { case exec: FileSourceScanExec => exec }.get

import org.apache.spark.sql.execution.datasources.FileScanRDD

val rdd57 = scan57.inputRDDs.head.asInstanceOf[FileScanRDD]

import org.apache.spark.sql.execution.datasources.FilePartition

val bucketFiles57 = for {

FilePartition(bucketId, files) <- rdd57.filePartitions

f <- files

} yield s"Bucket $bucketId => $f"

scala> println(bucketFiles57.size)

24

Sorting



// Make sure that you don't end up with a BroadcastHashJoin and a BroadcastExchange
// Disable auto broadcasting
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val bucketedTableName = "bucketed_4_id"
val large = spark.range(10e5.toLong)
import org.apache.spark.sql.SaveMode
large.write
  .bucketBy(4, "id")
  .sortBy("id")
  .mode(SaveMode.Overwrite)
  .saveAsTable(bucketedTableName)

// Describe the table and include bucketing spec only
val descSQL = sql(s"DESC FORMATTED $bucketedTableName")
  .filter($"col_name".contains("Bucket") || $"col_name" === "Sort Columns")
scala> descSQL.show
+--------------+---------+-------+
|      col_name|data_type|comment|
+--------------+---------+-------+
|   Num Buckets|        4|       |
|Bucket Columns|   [`id`]|       |
|  Sort Columns|   [`id`]|       |
+--------------+---------+-------+

val bucketedTable = spark.table(bucketedTableName)

val t1 = spark.range(4)
  .repartition(2, $"id")  // Use just 2 partitions
  .sortWithinPartitions("id") // sort partitions

val q = t1.join(bucketedTable, "id")
// Note two exchanges and sorts
scala> q.explain
== Physical Plan ==
*(5) Project [id#205L]
+- *(5) SortMergeJoin [id#205L], [id#203L], Inner
   :- *(3) Sort [id#205L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#205L, 4)
   :     +- *(2) Sort [id#205L ASC NULLS FIRST], false, 0
   :        +- Exchange hashpartitioning(id#205L, 2)
   :           +- *(1) Range (0, 4, step=1, splits=8)
   +- *(4) Sort [id#203L ASC NULLS FIRST], false, 0
      +- *(4) Project [id#203L]
         +- *(4) Filter isnotnull(id#203L)
            +- *(4) FileScan parquet default.bucketed_4_id[id#203L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>

q.foreach(_ => ())

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

// Make sure that you don't end up with a BroadcastHashJoin and a BroadcastExchange

// Disable auto broadcasting

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val bucketedTableName = "bucketed_4_id"

val large = spark.range(10e5.toLong)

import org.apache.spark.sql.SaveMode

large.write

.bucketBy(4, "id")

.sortBy("id")

.mode(SaveMode.Overwrite)

.saveAsTable(bucketedTableName)

// Describe the table and include bucketing spec only

val descSQL = sql(s"DESC FORMATTED $bucketedTableName")

.filter($"col_name".contains("Bucket") || $"col_name" === "Sort Columns")

scala> descSQL.show

+--------------+---------+-------+

| col_name|data_type|comment|

+--------------+---------+-------+

| Num Buckets| 4| |

|Bucket Columns| [`id`]| |

| Sort Columns| [`id`]| |

+--------------+---------+-------+

val bucketedTable = spark.table(bucketedTableName)

val t1 = spark.range(4)

.repartition(2, $"id") // Use just 2 partitions

.sortWithinPartitions("id") // sort partitions

val q = t1.join(bucketedTable, "id")

// Note two exchanges and sorts

scala> q.explain

== Physical Plan ==

*(5) Project [id#205L]

+- *(5) SortMergeJoin [id#205L], [id#203L], Inner

:- *(3) Sort [id#205L ASC NULLS FIRST], false, 0

: +- Exchange hashpartitioning(id#205L, 4)

: +- *(2) Sort [id#205L ASC NULLS FIRST], false, 0

: +- Exchange hashpartitioning(id#205L, 2)

: +- *(1) Range (0, 4, step=1, splits=8)

+- *(4) Sort [id#203L ASC NULLS FIRST], false, 0

+- *(4) Project [id#203L]

+- *(4) Filter isnotnull(id#203L)

+- *(4) FileScan parquet default.bucketed_4_id[id#203L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>

q.foreach(_ => ())

Warning

There are two exchanges and sorts which makes the above use case almost unusable. I filed an issue at SPARK-24025 Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side.

spark sql bucketing sortmergejoin sorted dataset and bucketed table.png

Figure 4. SortMergeJoin of Sorted Dataset and Bucketed Table (Details for Query)

spark.sql.sources.bucketing.enabled Spark SQL Configuration Property

Bucketing is enabled when spark.sql.sources.bucketing.enabled configuration property is turned on (true) and it is by default.

Tip	Use SQLConf.bucketingEnabled to access the current value of `spark.sql.sources.bucketing.enabled` property.



// Bucketing is on by default
assert(spark.sessionState.conf.bucketingEnabled, "Bucketing disabled?!")

1

2

3

4

5

6

// Bucketing is on by default

assert(spark.sessionState.conf.bucketingEnabled, "Bucketing disabled?!")

Dynamic Partition Inserts

2012-03-28admin阅读(1804)

Dynamic Partition Inserts

Partitioning uses partitioning columns to divide a dataset into smaller chunks (based on the values of certain columns) that will be written into separate directories.

With a partitioned dataset, Spark SQL can load only the parts (partitions) that are really needed (and avoid doing filtering out unnecessary data on JVM). That leads to faster load time and more efficient memory consumption which gives a better performance overall.

With a partitioned dataset, Spark SQL can also be executed over different subsets (directories) in parallel at the same time.


Partitioned table (with single partition p1)

spark.range(10)
  .withColumn("p1", 'id % 2)
  .write
  .mode("overwrite")
  .partitionBy("p1")
  .saveAsTable("partitioned_table")

1

2

3

4

5

6

7

8

9

10

11

Partitioned table (with single partition p1)

spark.range(10)

.withColumn("p1", 'id % 2)

.write

.mode("overwrite")

.partitionBy("p1")

.saveAsTable("partitioned_table")

Dynamic Partition Inserts is a feature of Spark SQL that allows for executing INSERT OVERWRITE TABLE SQL statements over partitioned HadoopFsRelations that limits what partitions are deleted to overwrite the partitioned table (and its partitions) with new data.

Dynamic partitions are the partition columns that have no values defined explicitly in the PARTITION clause of INSERT OVERWRITE TABLE SQL statements (in the partitionSpec part).

Static partitions are the partition columns that have values defined explicitly in the PARTITION clause of INSERT OVERWRITE TABLE SQL statements (in the partitionSpec part).



// Borrowed from https://medium.com/@anuvrat/writing-into-dynamic-partitions-using-spark-2e2b818a007a
// Note day dynamic partition
INSERT OVERWRITE TABLE stats
PARTITION(country = 'US', year = 2017, month = 3, day)
SELECT ad, SUM(impressions), SUM(clicks), log_day
FROM impression_logs
GROUP BY ad;

1

2

3

4

5

6

7

8

9

10

11

// Borrowed from https://medium.com/@anuvrat/writing-into-dynamic-partitions-using-spark-2e2b818a007a

// Note day dynamic partition

INSERT OVERWRITE TABLE stats

PARTITION(country = 'US', year = 2017, month = 3, day)

SELECT ad, SUM(impressions), SUM(clicks), log_day

FROM impression_logs

GROUP BY ad;

Note	`INSERT OVERWRITE TABLE` SQL statement is translated into InsertIntoTable logical operator.

Dynamic Partition Inserts is only supported in SQL mode (for INSERT OVERWRITE TABLE SQL statements).

Dynamic Partition Inserts is not supported for non-file-based data sources, i.e. InsertableRelations.

With Dynamic Partition Inserts, the behaviour of OVERWRITE keyword is controlled by spark.sql.sources.partitionOverwriteMode configuration property (default: static). The property controls whether Spark should delete all the partitions that match the partition specification regardless of whether there is data to be written to or not (static) or delete only those partitions that will have data written into (dynamic).

When the dynamic overwrite mode is enabled Spark will only delete the partitions for which it has data to be written to. All the other partitions remain intact.

From the Writing Into Dynamic Partitions Using Spark:

Spark now writes data partitioned just as Hive would — which means only the partitions that are touched by the INSERT query get overwritten and the others are not touched.

UDFRegistration — Session-Scoped FunctionRegistry

2012-03-27admin阅读(1829)

UDFRegistration — Session-Scoped FunctionRegistry

UDFRegistration is an interface to the session-scoped FunctionRegistry to register user-defined functions (UDFs) and user-defined aggregate functions (UDAFs).

UDFRegistration is available using SparkSession.



import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...
spark.udf

1

2

3

4

5

6

7

import org.apache.spark.sql.SparkSession

val spark: SparkSession = ...

spark.udf

UDFRegistration takes a FunctionRegistry when created.

UDFRegistration is created exclusively for SessionState.

Registering UserDefinedFunction (with FunctionRegistry) — `register` Method



register(name: String, func: Function0[RT]): UserDefinedFunction
register(name: String, func: Function1[A1, RT]): UserDefinedFunction
...
register(name: String, func: Function22[A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, A17, A18, A19, A20, A21, A22, RT]): UserDefinedFunction

1

2

3

4

5

6

7

8

register(name: String, func: Function0[RT]): UserDefinedFunction

register(name: String, func: Function1[A1, RT]): UserDefinedFunction

...

register(name: String, func: Function22[A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, A17, A18, A19, A20, A21, A22, RT]): UserDefinedFunction

register…FIXME

Note	`register` is used when…FIXME

Registering UserDefinedFunction (with FunctionRegistry) — `register` Method



register(name: String, udf: UserDefinedFunction): UserDefinedFunction

1

2

3

4

5

register(name: String, udf: UserDefinedFunction): UserDefinedFunction

register…FIXME

Note	`register` is used when…FIXME

Registering UserDefinedAggregateFunction (with FunctionRegistry) — `register` Method



register(
  name: String,
  udaf: UserDefinedAggregateFunction): UserDefinedAggregateFunction

1

2

3

4

5

6

7

register(

name: String,

udaf: UserDefinedAggregateFunction): UserDefinedAggregateFunction

register registers a UserDefinedAggregateFunction under name with FunctionRegistry.

register creates a ScalaUDAF internally to register a UDAF.

Note	`register` gives the input `udaf` aggregate function back after the function has been registered with FunctionRegistry.

CatalystConf

2012-03-26admin阅读(1646)

CatalystConf

CatalystConf is…FIXME

Note	The default `CatalystConf` is SQLConf that is…FIXME

Table 1. CatalystConf’s Internal Properties
Name	Initial Value	Description
`caseSensitiveAnalysis`
`cboEnabled`		Enables cost-based optimizations (CBO) for estimation of plan statistics when enabled. Used in CostBasedJoinReorder logical plan optimization and `Project`, `Filter`, `Join` and `Aggregate` logical operators.
`optimizerMaxIterations`	spark.sql.optimizer.maxIterations	Maximum number of iterations for Analyzer and Optimizer.
`sessionLocalTimeZone`

`resolver` Method

resolver gives case-sensitive or case-insensitive Resolvers per caseSensitiveAnalysis setting.

Note	`Resolver` is a mere function of two `String` parameters that returns `true` if both refer to the same entity (i.e. for case insensitive equality).

StaticSQLConf — Cross-Session, Immutable and Static SQL Configuration

2012-03-25admin阅读(2385)

StaticSQLConf — Cross-Session, Immutable and Static SQL Configuration

StaticSQLConf holds cross-session, immutable and static SQL configuration properties.

Name Default Value Scala Value Description

spark.sql.catalogImplementation

in-memory

CATALOG_IMPLEMENTATION

Selects the active catalog implementation from the available ExternalCatalogs:

Note	Use Builder.enableHiveSupport to enable Hive support (that sets `spark.sql.catalogImplementation` configuration property to `hive` when the Hive classes are available).

Tip	Read ExternalCatalog — Base Metastore of Permanent Relational Entities.

Used when:

SparkSession is requested for the SessionState
SharedState is requested for the ExternalCatalog
SetCommand command is executed (with hive keys)
SparkSession is created with Hive support

spark.sql.debug

false

DEBUG_MODE

(internal) Only used for internal debugging when HiveExternalCatalog is requested to restoreTableMetadata.

Not all functions are supported when enabled.

spark.sql.extensions

(empty)

SPARK_SESSION_EXTENSIONS

Name of the SQL extension configuration class that is used to configure SparkSession extensions (when Builder is requested to get or create a SparkSession). The class should implement Function1[SparkSessionExtensions, Unit], and must have a no-args constructor.

spark.sql.filesourceTableRelationCacheSize

1000

FILESOURCE_TABLE_RELATION_CACHE_SIZE

(internal) The maximum size of the cache that maps qualified table names to table relation plans. Must not be negative.

spark.sql.globalTempDatabase

global_temp

GLOBAL_TEMP_DATABASE

(internal) Name of the Spark-owned internal database of global temporary views

Used exclusively to create a GlobalTempViewManager when SharedState is first requested for the GlobalTempViewManager.

Note	The name of the internal database cannot conflict with the names of any database that is already available in ExternalCatalog.

spark.sql.hive.thriftServer.singleSession

false

HIVE_THRIFT_SERVER_SINGLESESSION

When set to true, Hive Thrift server is running in a single session mode. All the JDBC/ODBC connections share the temporary views, function registries, SQL configuration and the current database.

spark.sql.queryExecutionListeners

(empty)

QUERY_EXECUTION_LISTENERS

List of class names that implement QueryExecutionListener that will be automatically registered to new SparkSessions.

The classes should have either a no-arg constructor, or a constructor that expects a SparkConf argument.

spark.sql.sources.schemaStringLengthThreshold

4000

SCHEMA_STRING_LENGTH_THRESHOLD

(internal) The maximum length allowed in a single cell when storing additional schema information in Hive’s metastore

spark.sql.ui.retainedExecutions

1000

UI_RETAINED_EXECUTIONS

Number of executions to retain in the Spark UI.

spark.sql.warehouse.dir

spark-warehouse

WAREHOUSE_PATH

The directory of a Hive warehouse (using Derby) with managed databases and tables (aka Spark warehouse)

Tip	Read the official Hive Metastore Administration document to learn more.

The properties in StaticSQLConf can only be queried and can never be changed once the first SparkSession is created.



import org.apache.spark.sql.internal.StaticSQLConf
scala> val metastoreName = spark.conf.get(StaticSQLConf.CATALOG_IMPLEMENTATION.key)
metastoreName: String = hive

scala> spark.conf.set(StaticSQLConf.CATALOG_IMPLEMENTATION.key, "hive")
org.apache.spark.sql.AnalysisException: Cannot modify the value of a static config: spark.sql.catalogImplementation;
  at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:144)
  at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:41)
  ... 50 elided

1

2

3

4

5

6

7

8

9

10

11

12

13

import org.apache.spark.sql.internal.StaticSQLConf

scala> val metastoreName = spark.conf.get(StaticSQLConf.CATALOG_IMPLEMENTATION.key)

metastoreName: String = hive

scala> spark.conf.set(StaticSQLConf.CATALOG_IMPLEMENTATION.key, "hive")

org.apache.spark.sql.AnalysisException: Cannot modify the value of a static config: spark.sql.catalogImplementation;

at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:144)

at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:41)

... 50 elided

SQLConf — Internal Configuration Store

2012-03-24admin阅读(1958)

SQLConf — Internal Configuration Store

SQLConf is an internal key-value configuration store for parameters and hints used in Spark SQL.

Note

SQLConf is an internal part of Spark SQL and is not supposed to be used directly.

Spark SQL configuration is available through RuntimeConfig (the user-facing configuration management interface) that you can access using SparkSession.



scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.conf
org.apache.spark.sql.RuntimeConfig

1

2

3

4

5

6

7

8

9

scala> :type spark

org.apache.spark.sql.SparkSession

scala> :type spark.conf

org.apache.spark.sql.RuntimeConfig

You can access a SQLConf using:

SQLConf.get (preferred) – the SQLConf of the current active SparkSession
SessionState – direct access through SessionState of the SparkSession of your choice (that gives more flexibility on what SparkSession is used that can be different from the current active SparkSession)



import org.apache.spark.sql.internal.SQLConf

// Use type-safe access to configuration properties
// using SQLConf.get.getConf
val parallelFileListingInStatsComputation = SQLConf.get.getConf(SQLConf.PARALLEL_FILE_LISTING_IN_STATS_COMPUTATION)

// or even simpler
SQLConf.get.parallelFileListingInStatsComputation

1

2

3

4

5

6

7

8

9

10

11

12

import org.apache.spark.sql.internal.SQLConf

// Use type-safe access to configuration properties

// using SQLConf.get.getConf

val parallelFileListingInStatsComputation = SQLConf.get.getConf(SQLConf.PARALLEL_FILE_LISTING_IN_STATS_COMPUTATION)

// or even simpler

SQLConf.get.parallelFileListingInStatsComputation

SQLConf offers methods to get, set, unset or clear values of configuration properties, but has also the accessor methods to read the current value of a configuration property or hint.



scala> :type spark
org.apache.spark.sql.SparkSession

// Direct access to the session SQLConf
val sqlConf = spark.sessionState.conf
scala> :type sqlConf
org.apache.spark.sql.internal.SQLConf

scala> println(sqlConf.offHeapColumnVectorEnabled)
false

// Or simply import the conf value
import spark.sessionState.conf

// accessing properties through accessor methods
scala> conf.numShufflePartitions
res1: Int = 200

// Prefer SQLConf.get (over direct access)
import org.apache.spark.sql.internal.SQLConf
val cc = SQLConf.get
scala> cc == conf
res4: Boolean = true

// setting properties using aliases
import org.apache.spark.sql.internal.SQLConf.SHUFFLE_PARTITIONS
conf.setConf(SHUFFLE_PARTITIONS, 2)
scala> conf.numShufflePartitions
res2: Int = 2

// unset aka reset properties to the default value
conf.unsetConf(SHUFFLE_PARTITIONS)
scala> conf.numShufflePartitions
res3: Int = 200

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

scala> :type spark

org.apache.spark.sql.SparkSession

// Direct access to the session SQLConf

val sqlConf = spark.sessionState.conf

scala> :type sqlConf

org.apache.spark.sql.internal.SQLConf

scala> println(sqlConf.offHeapColumnVectorEnabled)

false

// Or simply import the conf value

import spark.sessionState.conf

// accessing properties through accessor methods

scala> conf.numShufflePartitions

res1: Int = 200

// Prefer SQLConf.get (over direct access)

import org.apache.spark.sql.internal.SQLConf

val cc = SQLConf.get

scala> cc == conf

res4: Boolean = true

// setting properties using aliases

import org.apache.spark.sql.internal.SQLConf.SHUFFLE_PARTITIONS

conf.setConf(SHUFFLE_PARTITIONS, 2)

scala> conf.numShufflePartitions

res2: Int = 2

// unset aka reset properties to the default value

conf.unsetConf(SHUFFLE_PARTITIONS)

scala> conf.numShufflePartitions

res3: Int = 200

Name Parameter Description

adaptiveExecutionEnabled

spark.sql.adaptive.enabled

Used exclusively when EnsureRequirements adds an ExchangeCoordinator (for adaptive query execution)

autoBroadcastJoinThreshold

spark.sql.autoBroadcastJoinThreshold

Used exclusively in JoinSelection execution planning strategy

autoSizeUpdateEnabled

spark.sql.statistics.size.autoUpdate.enabled

Used when:

CommandUtils is requested for updating existing table statistics
AlterTableAddPartitionCommand is executed

avroCompressionCodec

spark.sql.avro.compression.codec

Used exclusively when AvroOptions is requested for the compression configuration property (and it was not set explicitly)

broadcastTimeout

spark.sql.broadcastTimeout

Used exclusively in BroadcastExchangeExec (for broadcasting a table to executors).

bucketingEnabled

spark.sql.sources.bucketing.enabled

Used when FileSourceScanExec is requested for the input RDD and to determine output partitioning and ordering

cacheVectorizedReaderEnabled

spark.sql.inMemoryColumnarStorage.enableVectorizedReader

Used exclusively when InMemoryTableScanExec physical operator is requested for supportsBatch flag.

caseSensitiveAnalysis

spark.sql.caseSensitive

cboEnabled

spark.sql.cbo.enabled

Used in:

ReorderJoin logical plan optimization (and indirectly in StarSchemaDetection for reorderStarJoins)
CostBasedJoinReorder logical plan optimization

columnBatchSize

spark.sql.inMemoryColumnarStorage.batchSize

Used when…FIXME

dataFramePivotMaxValues

spark.sql.pivotMaxValues

Used exclusively in pivot operator.

dataFrameRetainGroupColumns

spark.sql.retainGroupColumns

Used exclusively in RelationalGroupedDataset when creating the result Dataset (after agg, count, mean, max, avg, min, and sum operators).

defaultSizeInBytes

spark.sql.defaultSizeInBytes

Used when:

DetermineTableStats logical resolution rule could not compute the table size or spark.sql.statistics.fallBackToHdfs is turned off
ExternalRDD, LogicalRDD and DataSourceV2Relation are requested for statistics (i.e. computeStats)
(Spark Structured Streaming) StreamingRelation, StreamingExecutionRelation, StreamingRelationV2 and ContinuousExecutionRelation are requested for statistics (i.e. computeStats)
DataSource creates a HadoopFsRelation for FileFormat data source (and builds a CatalogFileIndex when no table statistics are available)
BaseRelation is requested for an estimated size of this relation (in bytes)

enableRadixSort

spark.sql.sort.enableRadixSort

Used exclusively when SortExec physical operator is requested for a UnsafeExternalRowSorter.

exchangeReuseEnabled

spark.sql.exchange.reuse

Used when ReuseSubquery and ReuseExchange physical optimizations are executed

Note	When disabled (i.e. `false`), `ReuseSubquery` and `ReuseExchange` physical optimizations do no optimizations.

fallBackToHdfsForStatsEnabled

spark.sql.statistics.fallBackToHdfs

Used exclusively when DetermineTableStats logical resolution rule is executed.

fileCommitProtocolClass

spark.sql.sources.commitProtocolClass

Used (to instantiate a FileCommitProtocol) when:

SaveAsHiveFile is requested to saveAsHiveFile
InsertIntoHadoopFsRelationCommand logical command is executed

histogramEnabled

spark.sql.statistics.histogram.enabled

Used exclusively when AnalyzeColumnCommand logical command is executed.

histogramNumBins

spark.sql.statistics.histogram.numBins

Used exclusively when AnalyzeColumnCommand is executed with spark.sql.statistics.histogram.enabled turned on (and calculates percentiles).

hugeMethodLimit

spark.sql.codegen.hugeMethodLimit

Used exclusively when WholeStageCodegenExec unary physical operator is requested to execute (and generate a RDD[InternalRow]), i.e. when the compiled function exceeds this threshold, the whole-stage codegen is deactivated for this subtree of the query plan.

ignoreCorruptFiles

spark.sql.files.ignoreCorruptFiles

Used when:

FileScanRDD is created (and then to compute a partition)
OrcFileFormat is requested to inferSchema and buildReader
ParquetFileFormat is requested to mergeSchemasInParallel

ignoreMissingFiles

spark.sql.files.ignoreMissingFiles

Used exclusively when FileScanRDD is created (and then to compute a partition)

inMemoryPartitionPruning

spark.sql.inMemoryColumnarStorage.partitionPruning

Used exclusively when InMemoryTableScanExec physical operator is requested for filtered cached column batches (as a RDD[CachedBatch]).

isParquetBinaryAsString

spark.sql.parquet.binaryAsString

isParquetINT96AsTimestamp

spark.sql.parquet.int96AsTimestamp

isParquetINT96TimestampConversion

spark.sql.parquet.int96TimestampConversion

Used exclusively when ParquetFileFormat is requested to build a data reader with partition column values appended.

joinReorderEnabled

spark.sql.cbo.joinReorder.enabled

Used exclusively in CostBasedJoinReorder logical plan optimization

limitScaleUpFactor

spark.sql.limit.scaleUpFactor

Used exclusively when a physical operator is requested the first n rows as an array.

manageFilesourcePartitions

spark.sql.hive.manageFilesourcePartitions

Used in:

HiveMetastoreCatalog is requested to convert a HiveTableRelation to a LogicalRelation
CreateDataSourceTableCommand, CreateDataSourceTableAsSelectCommand and InsertIntoHadoopFsRelationCommand logical commands are executed
DDLUtils is requested to verifyPartitionProviderIsHive
DataSource is requested to resolve a relation (for file-based data source tables and creates a HadoopFsRelation)
FileStatusCache is requested to getOrCreate

numShufflePartitions

spark.sql.shuffle.partitions

Used in:

Dataset’s repartition operator (for a RepartitionByExpression logical operator)
SparkSqlAstBuilder (for a RepartitionByExpression logical operator)
JoinSelection execution planning strategy
SetCommand logical command
EnsureRequirements physical plan optimization

offHeapColumnVectorEnabled

spark.sql.columnVector.offheap.enabled

Used when:

InMemoryTableScanExec is requested for vectorTypes and createAndDecompressColumn
OrcFileFormat is requested to build a data reader with partition column values appended
ParquetFileFormat is requested for vectorTypes and build a data reader with partition column values appended

optimizerExcludedRules

spark.sql.optimizer.excludedRules

Used exclusively when Optimizer is requested for the optimization batches

optimizerInSetConversionThreshold

spark.sql.optimizer.inSetConversionThreshold

Used exclusively when OptimizeIn logical query optimization is applied to a logical plan (and replaces an In predicate expression with an InSet)

parallelFileListingInStatsComputation

spark.sql.statistics.parallelFileListingInStatsComputation.enabled

Used exclusively when CommandUtils helper object is requested to calculate the total size of a table (with partitions) (for AnalyzeColumnCommand and AnalyzeTableCommand commands)

parquetFilterPushDown

spark.sql.parquet.filterPushdown

Used exclusively when ParquetFileFormat is requested to build a data reader with partition column values appended.

parquetRecordFilterEnabled

spark.sql.parquet.recordLevelFilter.enabled

Used exclusively when ParquetFileFormat is requested to build a data reader with partition column values appended.

parquetVectorizedReaderEnabled

spark.sql.parquet.enableVectorizedReader

Used when:

FileSourceScanExec is requested for needsUnsafeRowConversion flag
ParquetFileFormat is requested for supportBatch flag and build a data reader with partition column values appended

partitionOverwriteMode

spark.sql.sources.partitionOverwriteMode

Used exclusively when InsertIntoHadoopFsRelationCommand logical command is executed

preferSortMergeJoin

spark.sql.join.preferSortMergeJoin

Used exclusively in JoinSelection execution planning strategy to prefer sort merge join over shuffle hash join.

runSQLonFile

spark.sql.runSQLOnFiles

Used when:

ResolveRelations does isRunningDirectlyOnFiles
ResolveSQLOnFile does maybeSQLFile

sessionLocalTimeZone

spark.sql.session.timeZone

starSchemaDetection

spark.sql.cbo.starSchemaDetection

Used exclusively in ReorderJoin logical plan optimization (and indirectly in StarSchemaDetection)

stringRedactionPattern

spark.sql.redaction.string.regex

Used when:

DataSourceScanExec is requested to redact sensitive information (in text representations)
QueryExecution is requested to redact sensitive information (in text representations)

subexpressionEliminationEnabled

spark.sql.subexpressionElimination.enabled

Used exclusively when SparkPlan is requested for subexpressionEliminationEnabled flag.

supportQuotedRegexColumnName

spark.sql.parser.quotedRegexColumnNames

Used when:

Dataset.col operator is used
AstBuilder is requested to parse a dereference and column reference in a SQL statement

useCompression

spark.sql.inMemoryColumnarStorage.compressed

Used when…FIXME

wholeStageEnabled

spark.sql.codegen.wholeStage

Used in:

CollapseCodegenStages to control codegen
ParquetFileFormat to control row batch reading

wholeStageFallback

spark.sql.codegen.fallback

Used exclusively when WholeStageCodegenExec is executed.

wholeStageMaxNumFields

spark.sql.codegen.maxFields

Used in:

CollapseCodegenStages to control codegen
ParquetFileFormat to control row batch reading

wholeStageSplitConsumeFuncByOperator

spark.sql.codegen.splitConsumeFuncByOperator

Used exclusively when CodegenSupport is requested to consume

wholeStageUseIdInClassName

spark.sql.codegen.useIdInClassName

Used exclusively when WholeStageCodegenExec is requested to generate the Java source code for the child physical plan subtree (when created)

windowExecBufferInMemoryThreshold

spark.sql.windowExec.buffer.in.memory.threshold

Used exclusively when WindowExec unary physical operator is executed.

windowExecBufferSpillThreshold

spark.sql.windowExec.buffer.spill.threshold

Used exclusively when WindowExec unary physical operator is executed.

useObjectHashAggregation

spark.sql.execution.useObjectHashAggregateExec

Used exclusively when Aggregation execution planning strategy is executed (and uses AggUtils to create an aggregation physical operator).

Getting Parameters and Hints

You can get the current parameters and hints using the following family of get methods.



getConf[T](entry: ConfigEntry[T], defaultValue: T): T
getConf[T](entry: ConfigEntry[T]): T
getConf[T](entry: OptionalConfigEntry[T]): Option[T]
getConfString(key: String): String
getConfString(key: String, defaultValue: String): String
getAllConfs: immutable.Map[String, String]
getAllDefinedConfs: Seq[(String, String, String)]

1

2

3

4

5

6

7

8

9

10

11

getConf[T](entry: ConfigEntry[T], defaultValue: T): T

getConf[T](entry: ConfigEntry[T]): T

getConf[T](entry: OptionalConfigEntry[T]): Option[T]

getConfString(key: String): String

getConfString(key: String, defaultValue: String): String

getAllConfs: immutable.Map[String, String]

getAllDefinedConfs: Seq[(String, String, String)]

Setting Parameters and Hints

You can set parameters and hints using the following family of set methods.



setConf(props: Properties): Unit
setConfString(key: String, value: String): Unit
setConf[T](entry: ConfigEntry[T], value: T): Unit

1

2

3

4

5

6

7

setConf(props: Properties): Unit

setConfString(key: String, value: String): Unit

setConf[T](entry: ConfigEntry[T], value: T): Unit

Unsetting Parameters and Hints

You can unset parameters and hints using the following family of unset methods.



unsetConf(key: String): Unit
unsetConf(entry: ConfigEntry[_]): Unit

1

2

3

4

5

6

unsetConf(key: String): Unit

unsetConf(entry: ConfigEntry[_]): Unit

Clearing All Parameters and Hints



clear(): Unit

1

2

3

4

5

clear(): Unit

You can use clear to remove all the parameters and hints in SQLConf.

Redacting Data Source Options with Sensitive Information — `redactOptions` Method



redactOptions(options: Map[String, String]): Map[String, String]

1

2

3

4

5

redactOptions(options: Map[String, String]): Map[String, String]

redactOptions takes the values of the spark.sql.redaction.options.regex and spark.redaction.regex configuration properties.

For every regular expression (in the order), redactOptions redacts sensitive information, i.e. finds the first match of a regular expression pattern in every option key or value and if either matches replaces the value with ***(redacted).

Note	`redactOptions` is used exclusively when `SaveIntoDataSourceCommand` logical command is requested for the simple description.

RuntimeConfig — Management Interface of Runtime Configuration

2012-03-23admin阅读(1539)

RuntimeConfig — Management Interface of Runtime Configuration

RuntimeConfig is the management interface of the runtime configuration.

Method Description

get



get(key: String): String
get(key: String, default: String): String

1

2

3

4

5

6

get(key: String): String

get(key: String, default: String): String

getAll



getAll: Map[String, String]

1

2

3

4

5

getAll: Map[String, String]

getOption



getOption(key: String): Option[String]

1

2

3

4

5

getOption(key: String): Option[String]

isModifiable



isModifiable(key: String): Boolean

1

2

3

4

5

isModifiable(key: String): Boolean

(New in 2.4.0)

set



set(key: String, value: Boolean): Unit
set(key: String, value: Long): Unit
set(key: String, value: String): Unit

1

2

3

4

5

6

7

set(key: String, value: Boolean): Unit

set(key: String, value: Long): Unit

set(key: String, value: String): Unit

unset



unset(key: String): Unit

1

2

3

4

5

unset(key: String): Unit

RuntimeConfig is available using the conf attribute of a SparkSession.



scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.conf
org.apache.spark.sql.RuntimeConfig

1

2

3

4

5

6

7

8

9

scala> :type spark

org.apache.spark.sql.SparkSession

scala> :type spark.conf

org.apache.spark.sql.RuntimeConfig

Figure 1. RuntimeConfig, SparkSession and SQLConf

RuntimeConfig is created exclusively when SparkSession is requested for one.

RuntimeConfig takes a SQLConf when created.

`get` Method



get(key: String): String
get(key: String, default: String): String

1

2

3

4

5

6

get(key: String): String

get(key: String, default: String): String

get…FIXME

Note	`get` is used when…FIXME

`getAll` Method



getAll: Map[String, String]

1

2

3

4

5

getAll: Map[String, String]

getAll…FIXME

Note	`getAll` is used when…FIXME

`getOption` Method



getOption(key: String): Option[String]

1

2

3

4

5

getOption(key: String): Option[String]

getOption…FIXME

Note	`getOption` is used when…FIXME

`set` Method



set(key: String, value: Boolean): Unit
set(key: String, value: Long): Unit
set(key: String, value: String): Unit

1

2

3

4

5

6

7

set(key: String, value: Boolean): Unit

set(key: String, value: Long): Unit

set(key: String, value: String): Unit

set…FIXME

Note	`set` is used when…FIXME

`unset` Method



unset(key: String): Unit

1

2

3

4

5

unset(key: String): Unit

unset…FIXME

Note	`unset` is used when…FIXME

CacheManager — In-Memory Cache for Tables and Views

2012-03-22admin阅读(4733)

CacheManager — In-Memory Cache for Tables and Views

CacheManager is an in-memory cache for tables and views (as logical plans). It uses the internal cachedData collection of CachedData to track logical plans and their cached InMemoryRelation representation.

CacheManager is shared across SparkSessions through SharedState.



val spark: SparkSession = ...
spark.sharedState.cacheManager

1

2

3

4

5

6

val spark: SparkSession = ...

spark.sharedState.cacheManager

Note	A Spark developer can use `CacheManager` to cache `Dataset`s using cache or persist operators.

Cached Queries — `cachedData` Internal Registry

cachedData is a collection of CachedData with logical plans and their cached InMemoryRelation representation.

A new CachedData is added when a Dataset is cached and removed when a Dataset is uncached or when invalidating cache data with a resource path.

cachedData is cleared when…FIXME

`invalidateCachedPath` Method

Caution

FIXME

`invalidateCache` Method

Caution

FIXME

`lookupCachedData` Method

Caution

FIXME

`uncacheQuery` Method

Caution

FIXME

`isEmpty` Method

Caution

FIXME

Caching Dataset (Registering Analyzed Logical Plan as InMemoryRelation) — `cacheQuery` Method



cacheQuery(
  query: Dataset[_],
  tableName: Option[String] = None,
  storageLevel: StorageLevel = MEMORY_AND_DISK): Unit

1

2

3

4

5

6

7

8

cacheQuery(

query: Dataset[_],

tableName: Option[String] = None,

storageLevel: StorageLevel = MEMORY_AND_DISK): Unit

cacheQuery adds the analyzed logical plan of the input query to the cachedData internal registry of cached queries.

Internally, cacheQuery firstly requests the input query for the analyzed logical plan and creates a InMemoryRelation with the following properties:

spark.sql.inMemoryColumnarStorage.compressed (enabled by default)
spark.sql.inMemoryColumnarStorage.batchSize (default: 10000)
Input storageLevel storage level
Optimized physical query plan (after requesting SessionState to execute the analyzed logical plan)
Input tableName
Statistics of the analyzed query plan

cacheQuery then creates a CachedData (for the analyzed query plan and the InMemoryRelation) and adds it to the cachedData internal registry.

If the input query has already been cached, cacheQuery simply prints the following WARN message to the logs and exits (i.e. does nothing but printing out the WARN message):



WARN CacheManager: Asked to cache already cached data.

1

2

3

4

5

WARN CacheManager: Asked to cache already cached data.

Note	`cacheQuery` is used when: Dataset.persist basic action is used `CatalogImpl` is requested to cache a table or view in-memory or refresh a table

Removing All Cached Tables From In-Memory Cache — `clearCache` Method



clearCache(): Unit

1

2

3

4

5

clearCache(): Unit

clearCache acquires a write lock and unpersists RDD[CachedBatch]s of the queries in cachedData before removing them altogether.

Note	`clearCache` is used when the `CatalogImpl` is requested to clearCache.

CachedData

Caution

FIXME

`recacheByCondition` Internal Method



recacheByCondition(spark: SparkSession, condition: LogicalPlan => Boolean): Unit

1

2

3

4

5

recacheByCondition(spark: SparkSession, condition: LogicalPlan => Boolean): Unit

recacheByCondition…FIXME

Note	`recacheByCondition` is used when `CacheManager` is requested to recacheByPlan or recacheByPath.

`recacheByPlan` Method



recacheByPlan(spark: SparkSession, plan: LogicalPlan): Unit

1

2

3

4

5

recacheByPlan(spark: SparkSession, plan: LogicalPlan): Unit

recacheByPlan…FIXME

Note	`recacheByPlan` is used exclusively when `InsertIntoDataSourceCommand` logical command is executed.

`recacheByPath` Method



recacheByPath(spark: SparkSession, resourcePath: String): Unit

1

2

3

4

5

recacheByPath(spark: SparkSession, resourcePath: String): Unit

recacheByPath…FIXME

Note	`recacheByPath` is used exclusively when `CatalogImpl` is requested to refreshByPath.

Replacing Logical Query Segments With Cached Query Plans — `useCachedData` Method



useCachedData(plan: LogicalPlan): LogicalPlan

1

2

3

4

5

useCachedData(plan: LogicalPlan): LogicalPlan

useCachedData…FIXME

Note	`useCachedData` is used exclusively when `QueryExecution` is requested for a cached logical query plan.

SharedState — State Shared Across SparkSessions

2012-03-21admin阅读(1795)

SharedState — State Shared Across SparkSessions

SharedState holds the shared state across multiple SparkSessions.

Name Type Description

cacheManager

CacheManager

externalCatalog

ExternalCatalog

Metastore of permanent relational entities, i.e. databases, tables, partitions, and functions.

Note	`externalCatalog` is initialized lazily on the first access.

globalTempViewManager

GlobalTempViewManager

Management interface of global temporary views

jarClassLoader

NonClosableMutableURLClassLoader

sparkContext

SparkContext

Spark Core’s SparkContext

statusStore

SQLAppStatusStore

warehousePath

String

Warehouse path

SharedState is available as the sharedState property of a SparkSession.



scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sharedState
org.apache.spark.sql.internal.SharedState

1

2

3

4

5

6

7

8

9

scala> :type spark

org.apache.spark.sql.SparkSession

scala> :type spark.sharedState

org.apache.spark.sql.internal.SharedState

SharedState is shared across SparkSessions.



scala> spark.newSession.sharedState == spark.sharedState
res1: Boolean = true

1

2

3

4

5

6

scala> spark.newSession.sharedState == spark.sharedState

res1: Boolean = true

SharedState is created exclusively when accessed using sharedState property of SparkSession.

Tip

Enable INFO logging level for org.apache.spark.sql.internal.SharedState logger to see what happens inside.

Add the following line to conf/log4j.properties:



log4j.logger.org.apache.spark.sql.internal.SharedState=INFO

1

2

3

4

5

log4j.logger.org.apache.spark.sql.internal.SharedState=INFO

Refer to Logging.

`warehousePath` Property



warehousePath: String

1

2

3

4

5

warehousePath: String

warehousePath is the warehouse path with the value of:

hive.metastore.warehouse.dir if defined and spark.sql.warehouse.dir is not
spark.sql.warehouse.dir if hive.metastore.warehouse.dir is undefined

You should see the following INFO message in the logs when SharedState is created:



INFO Warehouse path is '[warehousePath]'.

1

2

3

4

5

INFO Warehouse path is '[warehousePath]'.

warehousePath is used exclusively when SharedState initializes ExternalCatalog (and creates the default database in the metastore).

While initialized, warehousePath does the following:

Loads hive-site.xml if available on CLASSPATH, i.e. adds it as a configuration resource to Hadoop’s Configuration (of SparkContext).
Removes hive.metastore.warehouse.dir from SparkConf (of SparkContext) and leaves it off if defined using any of the Hadoop configuration resources.

Sets spark.sql.warehouse.dir or hive.metastore.warehouse.dir in the Hadoop configuration (of SparkContext)

If hive.metastore.warehouse.dir has been defined in any of the Hadoop configuration resources but spark.sql.warehouse.dir has not, spark.sql.warehouse.dir becomes the value of hive.metastore.warehouse.dir.

You should see the following INFO message in the logs:



spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('[hiveWarehouseDir]').

1

2

3

4

5

spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('[hiveWarehouseDir]').

Otherwise, the Hadoop configuration’s hive.metastore.warehouse.dir is set to spark.sql.warehouse.dir

You should see the following INFO message in the logs:

Setting hive.metastore.warehouse.dir ('[hiveWarehouseDir]') to the value of spark.sql.warehouse.dir ('[sparkWarehouseDir]').

1
2
3
4
5

Setting hive.metastore.warehouse.dir ('[hiveWarehouseDir]') to the value of spark.sql.warehouse.dir ('[sparkWarehouseDir]').

`externalCatalog` Property



externalCatalog: ExternalCatalog

1

2

3

4

5

externalCatalog: ExternalCatalog

externalCatalog is created reflectively per spark.sql.catalogImplementation internal configuration property (with the current Hadoop’s Configuration as SparkContext.hadoopConfiguration):

HiveExternalCatalog for hive
InMemoryCatalog for in-memory (default)

While initialized:

Creates the default database (with default database description and warehousePath location) if it doesn’t exist.
Registers a ExternalCatalogEventListener that propagates external catalog events to the Spark listener bus.

`externalCatalogClassName` Internal Method



externalCatalogClassName(conf: SparkConf): String

1

2

3

4

5

externalCatalogClassName(conf: SparkConf): String

externalCatalogClassName gives the name of the class of the ExternalCatalog per spark.sql.catalogImplementation, i.e.

org.apache.spark.sql.hive.HiveExternalCatalog for hive
org.apache.spark.sql.catalyst.catalog.InMemoryCatalog for in-memory

Note	`externalCatalogClassName` is used exclusively when `SharedState` is requested for the ExternalCatalog.

Accessing Management Interface of Global Temporary Views — `globalTempViewManager` Property



globalTempViewManager: GlobalTempViewManager

1

2

3

4

5

globalTempViewManager: GlobalTempViewManager

When accessed for the very first time, globalTempViewManager gets the name of the global temporary view database (as the value of spark.sql.globalTempDatabase internal static configuration property).

In the end, globalTempViewManager creates a new GlobalTempViewManager (with the database name).

globalTempViewManager throws a SparkException when the global temporary view database exist in the ExternalCatalog.



[globalTempDB] is a system preserved database, please rename your existing database to resolve the name conflict, or set a different value for spark.sql.globalTempDatabase, and launch your Spark application again.

1

2

3

4

5

[globalTempDB] is a system preserved database, please rename your existing database to resolve the name conflict, or set a different value for spark.sql.globalTempDatabase, and launch your Spark application again.

Note	`globalTempViewManager` is used when BaseSessionStateBuilder and HiveSessionStateBuilder are requested for the SessionCatalog.

HiveSessionStateBuilder — Builder of Hive-Specific SessionState

2012-03-20admin阅读(1842)

HiveSessionStateBuilder — Builder of Hive-Specific SessionState

HiveSessionStateBuilder is a BaseSessionStateBuilder that has Hive-specific Analyzer, SparkPlanner, HiveSessionCatalog, HiveExternalCatalog and HiveSessionResourceLoader.

Figure 1. HiveSessionStateBuilder’s Hive-Specific Properties

HiveSessionStateBuilder is created (using newBuilder) exclusively when…FIXME

spark sql HiveSessionStateBuilder SessionState.png

Figure 2. HiveSessionStateBuilder and SessionState (in SparkSession)

Name Description

analyzer

Hive-specific logical query plan analyzer with the Hive-specific rules.

catalog

HiveSessionCatalog with the following:

HiveExternalCatalog
GlobalTempViewManager from the session-specific SharedState
New HiveMetastoreCatalog
FunctionRegistry
SQLConf
New Hadoop Configuration
ParserInterface
HiveSessionResourceLoader

Note	If parentState is defined, the state is copied to `catalog`

Used to create Hive-specific Analyzer and a RelationConversions logical evaluation rule (as part of Hive-Specific Analyzer’s PostHoc Resolution Rules)

externalCatalog

HiveExternalCatalog

planner

SparkPlanner with Hive-specific strategies.

resourceLoader

HiveSessionResourceLoader

SparkPlanner with Hive-Specific Strategies — `planner` Property



planner: SparkPlanner

1

2

3

4

5

planner: SparkPlanner

Note	`planner` is part of BaseSessionStateBuilder Contract to create a query planner.

planner is a SparkPlanner with…FIXME

planner uses the Hive-specific strategies.

Table 2. Hive-Specific SparkPlanner’s Hive-Specific Strategies
Strategy	Description
HiveTableScans
`Scripts`

Logical Query Plan Analyzer with Hive-Specific Rules — `analyzer` Property



analyzer: Analyzer

1

2

3

4

5

analyzer: Analyzer

Note	`analyzer` is part of BaseSessionStateBuilder Contract to create a logical query plan analyzer.

analyzer is a Analyzer with Hive-specific SessionCatalog (and SQLConf).

analyzer uses the Hive-specific extended resolution, postHoc resolution and extended check rules.

Table 3. Hive-Specific Analyzer’s Extended Resolution Rules (in the order of execution)
Logical Rule	Description
`ResolveHiveSerdeTable`
FindDataSourceTable
`ResolveSQLOnFile`

Table 4. Hive-Specific Analyzer’s PostHoc Resolution Rules
Logical Rule	Description
DetermineTableStats
RelationConversions
PreprocessTableCreation
`PreprocessTableInsertion`
DataSourceAnalysis
HiveAnalysis

Table 5. Hive-Specific Analyzer’s Extended Check Rules
Logical Rule	Description
`PreWriteCheck`
`PreReadCheck`

Builder Function to Create HiveSessionStateBuilder — newBuilder Factory Method



newBuilder: NewBuilder

1

2

3

4

5

newBuilder: NewBuilder

Note	`newBuilder` is part of BaseSessionStateBuilder Contract to…FIXME.

newBuilder…FIXME

Creating HiveSessionStateBuilder Instance

HiveSessionStateBuilder takes the following when created:

SparkSession
Optional SessionState (None by default)

spark-sql 第50页

Bucketing

Bucket Pruning — Optimizing Filtering on Bucketed Column (Reducing Bucket Files to Scan)

Sorting

spark.sql.sources.bucketing.enabled Spark SQL Configuration Property

Dynamic Partition Inserts

UDFRegistration — Session-Scoped FunctionRegistry

Registering UserDefinedFunction (with FunctionRegistry) — register Method

Registering UserDefinedFunction (with FunctionRegistry) — register Method

Registering UserDefinedAggregateFunction (with FunctionRegistry) — register Method

CatalystConf

resolver Method

StaticSQLConf — Cross-Session, Immutable and Static SQL Configuration

SQLConf — Internal Configuration Store

Getting Parameters and Hints

Setting Parameters and Hints

Unsetting Parameters and Hints

Clearing All Parameters and Hints

Redacting Data Source Options with Sensitive Information — redactOptions Method

RuntimeConfig — Management Interface of Runtime Configuration

get Method

getAll Method

getOption Method

set Method

unset Method

CacheManager — In-Memory Cache for Tables and Views

Cached Queries — cachedData Internal Registry

invalidateCachedPath Method

invalidateCache Method

lookupCachedData Method

uncacheQuery Method

isEmpty Method

Caching Dataset (Registering Analyzed Logical Plan as InMemoryRelation) — cacheQuery Method

Removing All Cached Tables From In-Memory Cache — clearCache Method

CachedData

recacheByCondition Internal Method

recacheByPlan Method

recacheByPath Method

Replacing Logical Query Segments With Cached Query Plans — useCachedData Method

SharedState — State Shared Across SparkSessions

warehousePath Property

externalCatalog Property

externalCatalogClassName Internal Method

Accessing Management Interface of Global Temporary Views — globalTempViewManager Property

HiveSessionStateBuilder — Builder of Hive-Specific SessionState

SparkPlanner with Hive-Specific Strategies — planner Property

Logical Query Plan Analyzer with Hive-Specific Rules — analyzer Property

Builder Function to Create HiveSessionStateBuilder — newBuilder Factory Method

Creating HiveSessionStateBuilder Instance

欢迎关注：spark技术分享

关注公众号：spark技术分享

QQ咨询

回顶部

Registering UserDefinedFunction (with FunctionRegistry) — `register` Method

Registering UserDefinedFunction (with FunctionRegistry) — `register` Method

Registering UserDefinedAggregateFunction (with FunctionRegistry) — `register` Method

`resolver` Method

Redacting Data Source Options with Sensitive Information — `redactOptions` Method

`get` Method

`getAll` Method

`getOption` Method

`set` Method

`unset` Method

Cached Queries — `cachedData` Internal Registry

`invalidateCachedPath` Method

`invalidateCache` Method

`lookupCachedData` Method

`uncacheQuery` Method

`isEmpty` Method

Caching Dataset (Registering Analyzed Logical Plan as InMemoryRelation) — `cacheQuery` Method

Removing All Cached Tables From In-Memory Cache — `clearCache` Method

`recacheByCondition` Internal Method

`recacheByPlan` Method

`recacheByPath` Method

Replacing Logical Query Segments With Cached Query Plans — `useCachedData` Method

`warehousePath` Property

`externalCatalog` Property

`externalCatalogClassName` Internal Method

Accessing Management Interface of Global Temporary Views — `globalTempViewManager` Property

SparkPlanner with Hive-Specific Strategies — `planner` Property

Logical Query Plan Analyzer with Hive-Specific Rules — `analyzer` Property