CommandUtils — Utilities for Table Statistics
CommandUtils is a helper class that logical commands, e.g. InsertInto*, AlterTable*Command, LoadDataCommand, and CBO’s Analyze*, use to manage table statistics.
CommandUtils defines the following utilities: updateTableStats, calculateTotalSize, calculateLocationSize, and compareAndGetNewStats.
Tip
|
Enable INFO logging level for the CommandUtils logger to see what happens inside.
Refer to Logging. |
Updating Existing Table Statistics — updateTableStats
Method
updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit
updateTableStats updates the table statistics of the input CatalogTable (only if the statistics are already available in the metastore).
updateTableStats requests SessionCatalog to alterTableStats with the current total size (when the spark.sql.statistics.size.autoUpdate.enabled property is turned on) or with empty statistics (which effectively removes the recorded statistics completely).
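The decision described above can be sketched in plain Scala without a Spark dependency. This is only a simplified model, not Spark's code: `Stats`, `Catalog`, and the `autoSizeUpdateEnabled` and `currentTotalSize` parameters are hypothetical stand-ins for CatalogStatistics, SessionCatalog, the configuration property, and the result of calculateTotalSize.

```scala
// Simplified model of the updateTableStats decision.
// Stats and Catalog are hypothetical stand-ins, not Spark classes.
object UpdateTableStatsSketch {
  final case class Stats(sizeInBytes: BigInt)

  // Hypothetical in-memory catalog keyed by table name.
  final class Catalog {
    private var stats = Map.empty[String, Stats]

    // Some(stats) records statistics, None removes them.
    def alterTableStats(table: String, newStats: Option[Stats]): Unit =
      newStats match {
        case Some(s) => stats = stats.updated(table, s)
        case None    => stats = stats - table
      }

    def tableStats(table: String): Option[Stats] = stats.get(table)
  }

  def updateTableStats(
      catalog: Catalog,
      table: String,
      autoSizeUpdateEnabled: Boolean,
      currentTotalSize: => BigInt): Unit = {
    // Only tables that already have recorded statistics are touched.
    if (catalog.tableStats(table).nonEmpty) {
      if (autoSizeUpdateEnabled) {
        // The (possibly expensive) size is computed only on this branch.
        catalog.alterTableStats(table, Some(Stats(currentTotalSize)))
      } else {
        // Drop the recorded statistics, since they may now be stale.
        catalog.alterTableStats(table, None)
      }
    }
  }
}
```

The size is passed by name so that, in this sketch, it is computed only when auto-update is turned on.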
Important
|
When the spark.sql.statistics.size.autoUpdate.enabled property is turned on, updateTableStats recalculates the total size of a table after every data change, which can be expensive (and slow down data change commands) if the total number of files of the table is very large.
|
Note
|
updateTableStats uses SparkSession to access the current SessionState that it then uses to access the session-scoped SessionCatalog.
|
Note
|
updateTableStats is used when InsertIntoHiveTable, InsertIntoHadoopFsRelationCommand, AlterTableDropPartitionCommand, AlterTableSetLocationCommand and LoadDataCommand commands are executed.
|
Calculating Total Size of Table (with Partitions) — calculateTotalSize
Method
calculateTotalSize(sessionState: SessionState, catalogTable: CatalogTable): BigInt
calculateTotalSize calculates the total file size of the entire input CatalogTable (when it has no partitions defined) or of all its partitions (which it looks up through the session-scoped SessionCatalog).
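The two cases above can be illustrated with a small model in plain Scala. The `Table` case class and the `sizeOf` function are assumptions made for illustration: `Table` stands in for CatalogTable plus its partitions, and `sizeOf` stands in for a per-location size calculation such as calculateLocationSize.

```scala
// Simplified model of calculateTotalSize; Table is a hypothetical stand-in.
object TotalSizeSketch {
  // partitionLocations is None for an unpartitioned table.
  final case class Table(location: String, partitionLocations: Option[Seq[String]])

  def calculateTotalSize(table: Table, sizeOf: String => Long): BigInt =
    table.partitionLocations match {
      // No partitions defined: measure the table location itself.
      case None => BigInt(sizeOf(table.location))
      // Partitioned: sum the sizes of all partition locations.
      case Some(locations) => locations.map(l => BigInt(sizeOf(l))).sum
    }
}
```

The result is a BigInt so that the sum over many partitions cannot overflow a Long.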
Note
|
calculateTotalSize uses the input SessionState to access the SessionCatalog.
|
Calculating Total File Size Under Path — calculateLocationSize
Method
calculateLocationSize(sessionState: SessionState, identifier: TableIdentifier, locationUri: Option[URI]): Long
calculateLocationSize reads the hive.exec.stagingdir configuration property for the staging directory (with .hive-staging being the default).
You should see the following INFO message in the logs:
INFO CommandUtils: Starting to calculate the total file size under path [locationUri].
calculateLocationSize calculates the sum of the lengths of all the files under the input locationUri.
Note
|
calculateLocationSize uses Hadoop’s FileSystem.getFileStatus and FileStatus.getLen to access a file and the length of the file (in bytes), respectively.
|
In the end, you should see the following INFO message in the logs:
INFO CommandUtils: It took [durationInMs] ms to calculate the total file size under path [locationUri].
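The size calculation can be approximated on a local file system as follows. This is a minimal sketch, not the Spark implementation: it swaps Hadoop's FileSystem for java.nio.file, uses println in place of the logger, and assumes that entries whose names start with the staging directory name are skipped. It needs Scala 2.13 or later for scala.jdk.CollectionConverters.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Local-file-system sketch of calculateLocationSize (java.nio instead of Hadoop).
object LocationSizeSketch {
  def calculateLocationSize(location: Path, stagingDir: String = ".hive-staging"): Long = {
    // Recursively sums file lengths, skipping staging entries (an assumption).
    def sizeOf(path: Path): Long =
      if (Files.isDirectory(path)) {
        val children = Files.list(path)
        try {
          children.iterator().asScala
            .filterNot(_.getFileName.toString.startsWith(stagingDir))
            .map(sizeOf)
            .sum
        } finally children.close()
      } else {
        Files.size(path) // length of a single file in bytes
      }

    println(s"Starting to calculate the total file size under path $location.")
    val start = System.nanoTime()
    val size = sizeOf(location)
    val durationInMs = (System.nanoTime() - start) / (1000 * 1000)
    println(s"It took $durationInMs ms to calculate the total file size under path $location.")
    size
  }
}
```

The walk visits every file under the location, which is why the calculation gets slow for tables with a very large number of files.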
Creating CatalogStatistics with Current Statistics — compareAndGetNewStats
Method
compareAndGetNewStats(oldStats: Option[CatalogStatistics], newTotalSize: BigInt, newRowCount: Option[BigInt]): Option[CatalogStatistics]
compareAndGetNewStats creates a new CatalogStatistics with the input newTotalSize and newRowCount only when they are different from the oldStats.
Note
|
compareAndGetNewStats is used when AnalyzePartitionCommand and AnalyzeTableCommand are executed.
|
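The comparison performed by compareAndGetNewStats can be modelled in a few lines of plain Scala. This sketch follows only the behaviour described above and may differ from Spark's code in edge cases; `Stats` is a hypothetical stand-in for CatalogStatistics.

```scala
// Simplified model of compareAndGetNewStats; Stats is a hypothetical stand-in.
object CompareStatsSketch {
  final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

  def compareAndGetNewStats(
      oldStats: Option[Stats],
      newTotalSize: BigInt,
      newRowCount: Option[BigInt]): Option[Stats] = {
    // The size counts as changed when there are no old stats or the sizes differ.
    val sizeChanged = !oldStats.exists(_.sizeInBytes == newTotalSize)
    // The row count counts as changed only when a new one is given and it differs.
    val rowCountChanged =
      newRowCount.exists(n => !oldStats.flatMap(_.rowCount).contains(n))
    // None tells the caller that nothing needs to be written back.
    if (sizeChanged || rowCountChanged) Some(Stats(newTotalSize, newRowCount)) else None
  }
}
```

Returning None when nothing changed lets the caller skip altering the table statistics altogether.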