
CommandUtils — Utilities for Table Statistics

CommandUtils is a helper class that logical commands, e.g. InsertInto*, AlterTable*Command, LoadDataCommand, and CBO’s Analyze*, use to manage table statistics.

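CommandUtils defines the following utilities, each described below: updateTableStats, calculateTotalSize, calculateLocationSize, and compareAndGetNewStats.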

Tip

Enable INFO logging level for org.apache.spark.sql.execution.command.CommandUtils logger to see what happens inside.

Add the following line to conf/log4j.properties:
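Assuming the standard log4j 1.x properties syntax (which the conf/log4j.properties file name suggests), the entry for the logger named above would look like this:

```properties
# Assumption: log4j 1.x syntax; the logger name is the one given in the tip above
log4j.logger.org.apache.spark.sql.execution.command.CommandUtils=INFO
```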

Refer to Logging.

Updating Existing Table Statistics — updateTableStats Method

updateTableStats updates the table statistics of the input CatalogTable (only if the statistics are available in the metastore already).

updateTableStats requests SessionCatalog to alterTableStats with the current total size (when spark.sql.statistics.size.autoUpdate.enabled property is turned on) or empty statistics (that effectively removes the recorded statistics completely).
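The decision described above can be sketched in plain Scala. This is a simplified model, not Spark's code: Stats, StatsUpdate, and planUpdate are hypothetical names that stand in for CatalogStatistics and the call to alterTableStats, and the autoUpdateEnabled flag stands in for spark.sql.statistics.size.autoUpdate.enabled.

```scala
object UpdateTableStatsSketch {
  // Hypothetical stand-in for Spark's CatalogStatistics (size only)
  final case class Stats(sizeInBytes: BigInt)

  // Hypothetical description of what would be sent to alterTableStats
  sealed trait StatsUpdate
  case object NoUpdate extends StatsUpdate                    // table had no statistics
  final case class Replace(stats: Stats) extends StatsUpdate  // record the current total size
  case object Clear extends StatsUpdate                       // empty statistics, i.e. remove them

  // currentTotalSize is passed by name, so the (potentially expensive)
  // size calculation only runs when auto-update is enabled.
  def planUpdate(
      existing: Option[Stats],
      autoUpdateEnabled: Boolean,
      currentTotalSize: => BigInt): StatsUpdate =
    if (existing.isEmpty) NoUpdate
    else if (autoUpdateEnabled) Replace(Stats(currentTotalSize))
    else Clear
}
```

Passing the size by name mirrors the cost concern: when auto-update is off, the files of the table are never listed in this model.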

Important
updateTableStats uses spark.sql.statistics.size.autoUpdate.enabled property to auto-update table statistics and can be expensive (and slow down data change commands) if the total number of files of a table is very large.
Note
updateTableStats uses SparkSession to access the current SessionState that it then uses to access the session-scoped SessionCatalog.
Note
updateTableStats is used when InsertIntoHiveTable, InsertIntoHadoopFsRelationCommand, AlterTableDropPartitionCommand, AlterTableSetLocationCommand and LoadDataCommand commands are executed.

Calculating Total Size of Table (with Partitions) — calculateTotalSize Method

calculateTotalSize calculates total file size for the entire input CatalogTable (when it has no partitions defined) or all its partitions (through the session-scoped SessionCatalog).
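A minimal sketch of the two branches, assuming a simplified table model. Table, listPartitionLocations, and locationSize are hypothetical stand-ins for CatalogTable, the SessionCatalog partition listing, and calculateLocationSize, respectively.

```scala
object TotalSizeSketch {
  // Hypothetical stand-in for CatalogTable: partition column names and table location
  final case class Table(partitionColumns: Seq[String], location: Option[String])

  def totalSize(
      table: Table,
      listPartitionLocations: () => Seq[Option[String]], // stand-in for the catalog lookup
      locationSize: Option[String] => Long               // stand-in for calculateLocationSize
  ): BigInt =
    if (table.partitionColumns.isEmpty) {
      // No partitions defined: measure the table location itself
      BigInt(locationSize(table.location))
    } else {
      // Partitioned table: sum the sizes of all partition locations
      listPartitionLocations().map(loc => BigInt(locationSize(loc))).sum
    }
}
```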

Note
calculateTotalSize uses the input SessionState to access the SessionCatalog.
Note

calculateTotalSize is used when:

Calculating Total File Size Under Path — calculateLocationSize Method

calculateLocationSize reads hive.exec.stagingdir configuration property for the staging directory (with .hive-staging being the default).

You should see the following INFO message in the logs:

calculateLocationSize calculates the sum of the length of all the files under the input locationUri.

Note
calculateLocationSize uses Hadoop’s FileSystem.getFileStatus and FileStatus.getLen to access a file and the length of the file (in bytes), respectively.
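The recursive summation can be illustrated with a self-contained sketch. Two assumptions apply: java.io.File is swapped in for Hadoop's FileSystem so the example runs without Hadoop on the classpath, and entries whose names start with the staging directory name are skipped, which is presumably why the staging directory property is read. LocationSizeSketch and pathSize are hypothetical names.

```scala
import java.io.File

object LocationSizeSketch {
  // Sums the length (in bytes) of all files under `file`.
  // Assumption: entries whose name starts with `stagingDir` are ignored.
  def pathSize(file: File, stagingDir: String): Long =
    if (file.isDirectory) {
      // listFiles returns null on an I/O error; treat that as an empty directory
      Option(file.listFiles).getOrElse(Array.empty[File])
        .filterNot(_.getName.startsWith(stagingDir))
        .map(pathSize(_, stagingDir))
        .sum
    } else {
      file.length()
    }
}
```

For example, LocationSizeSketch.pathSize(new File("/path/to/table"), ".hive-staging") would return the total size of the data files under a (hypothetical) table directory, leaving out anything inside staging directories.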

In the end, you should see the following INFO message in the logs:

Note

calculateLocationSize is used when:

Creating CatalogStatistics with Current Statistics — compareAndGetNewStats Method

compareAndGetNewStats creates a new CatalogStatistics with the input newTotalSize and newRowCount only when they are different from the oldStats.
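The comparison can be modelled as follows. This is a sketch under stated assumptions rather than Spark's implementation: Stats is a hypothetical stand-in for CatalogStatistics, and missing old values are treated as -1 so that any non-negative new value counts as a change.

```scala
object StatsCompareSketch {
  // Hypothetical stand-in for Spark's CatalogStatistics
  final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

  def compareAndGetNewStats(
      oldStats: Option[Stats],
      newTotalSize: BigInt,
      newRowCount: Option[BigInt]): Option[Stats] = {
    // Assumption: absent old values are modelled as -1
    val oldTotalSize = oldStats.map(_.sizeInBytes).getOrElse(BigInt(-1))
    val oldRowCount = oldStats.flatMap(_.rowCount).getOrElse(BigInt(-1))

    val sizeChanged = newTotalSize >= 0 && newTotalSize != oldTotalSize
    val rowCountChanged = newRowCount.exists(c => c >= 0 && c != oldRowCount)

    if (sizeChanged && rowCountChanged) Some(Stats(newTotalSize, newRowCount))
    else if (sizeChanged) Some(Stats(newTotalSize))            // size only; the old row count is dropped
    else if (rowCountChanged) Some(Stats(oldTotalSize, newRowCount))
    else None                                                  // nothing changed, nothing to record
  }
}
```

Returning None when nothing changed lets the caller skip the metastore update entirely.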

Note
compareAndGetNewStats is used when AnalyzePartitionCommand and AnalyzeTableCommand are executed.