HiveTableRelation - Spark Technology Sharing (spark技术分享)

HiveTableRelation Leaf Logical Operator — Representing Hive Tables in Logical Plan

HiveTableRelation is a leaf logical operator that represents a Hive table in a logical query plan.

HiveTableRelation is created exclusively when the FindDataSourceTable logical evaluation rule is requested to resolve UnresolvedCatalogRelations in a logical plan (for Hive tables).



val tableName = "h1"

// Make the example reproducible
val db = spark.catalog.currentDatabase
import spark.sharedState.{externalCatalog => extCatalog}
extCatalog.dropTable(
  db, table = tableName, ignoreIfNotExists = true, purge = true)

// sql("CREATE TABLE h1 (id LONG) USING hive")
import org.apache.spark.sql.types.StructType
spark.catalog.createTable(
  tableName,
  source = "hive",
  schema = new StructType().add($"id".long),
  options = Map.empty[String, String])

val h1meta = extCatalog.getTable(db, tableName)
scala> println(h1meta.provider.get)
hive

// Looks like we've got the testing space ready for the experiment
val h1 = spark.table(tableName)

import org.apache.spark.sql.catalyst.dsl.plans._
val plan = table(tableName).insertInto("t2", overwrite = true)
scala> println(plan.numberedTreeString)
00 'InsertIntoTable 'UnresolvedRelation `t2`, true, false
01 +- 'UnresolvedRelation `h1`

// ResolveRelations logical rule first to resolve UnresolvedRelations
import spark.sessionState.analyzer.ResolveRelations
val rrPlan = ResolveRelations(plan)
scala> println(rrPlan.numberedTreeString)
00 'InsertIntoTable 'UnresolvedRelation `t2`, true, false
01 +- 'SubqueryAlias h1
02    +- 'UnresolvedCatalogRelation `default`.`h1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

// FindDataSourceTable logical rule next to resolve UnresolvedCatalogRelations
import org.apache.spark.sql.execution.datasources.FindDataSourceTable
val findTablesRule = new FindDataSourceTable(spark)
val planWithTables = findTablesRule(rrPlan)

// At long last...
// Note HiveTableRelation in the logical plan
scala> println(planWithTables.numberedTreeString)
00 'InsertIntoTable 'UnresolvedRelation `t2`, true, false
01 +- SubqueryAlias h1
02    +- HiveTableRelation `default`.`h1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#13L]

HiveTableRelation is partitioned when it has at least one partition column.

The metadata of a HiveTableRelation (in a catalog) has to meet the requirements:

The database is defined
The partition schema is of the same type as partitionCols
The data schema is of the same type as dataCols

HiveTableRelation has the output attributes made up of data followed by partition columns.
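The requirements and the output ordering above can be sketched with a small, Spark-free model. This is only an illustration: `Column`, `TableMeta` and `HiveTableRelationModel` are hypothetical stand-ins (not Spark's classes), and the "same type" check is simplified to comparing column data types in order.

```scala
// Hypothetical, simplified stand-ins for Spark's attribute and table metadata types.
final case class Column(name: String, dataType: String)

final case class TableMeta(
    database: Option[String],
    table: String,
    dataSchema: Seq[Column],
    partitionSchema: Seq[Column])

// A toy model of the operator: table metadata, data columns, partition columns.
final case class HiveTableRelationModel(
    tableMeta: TableMeta,
    dataCols: Seq[Column],
    partitionCols: Seq[Column]) {

  // Simplified "same type" check (an assumption): compare data types position by position.
  private def sameType(a: Seq[Column], b: Seq[Column]): Boolean =
    a.map(_.dataType) == b.map(_.dataType)

  // The three requirements on the table metadata.
  require(tableMeta.database.isDefined, "the database must be defined")
  require(sameType(tableMeta.partitionSchema, partitionCols),
    "the partition schema must be of the same type as partitionCols")
  require(sameType(tableMeta.dataSchema, dataCols),
    "the data schema must be of the same type as dataCols")

  // Partitioned means at least one partition column.
  def isPartitioned: Boolean = partitionCols.nonEmpty

  // Output attributes: data columns first, then partition columns.
  def output: Seq[Column] = dataCols ++ partitionCols
}
```

Building the model with a table that has no database, or with columns whose types differ from the schema, fails with an `IllegalArgumentException` from `require`.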

Note

HiveTableRelation is removed from a logical plan when the HiveAnalysis logical rule is executed (and transforms an InsertIntoTable with a HiveTableRelation to an InsertIntoHiveTable).

HiveTableRelation is converted to a LogicalRelation when the RelationConversions rule is executed (and converts HiveTableRelations to LogicalRelations).

HiveTableRelation is resolved to the HiveTableScanExec physical operator when the HiveTableScans strategy is executed.

Computing Statistics — `computeStats` Method



computeStats(): Statistics

Note

`computeStats` is part of the LeafNode Contract to compute statistics for the cost-based optimizer.

computeStats takes the table statistics from the table metadata if defined and converts them to Spark statistics (with output columns).

If the table statistics are not available, computeStats reports an IllegalStateException.



table stats must be specified.
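The behaviour described above (use the table statistics when defined, otherwise fail) can be sketched as follows. This is a minimal model and not Spark's implementation: `CatalogStats`, `PlanStats` and the `StatsSketch` object are hypothetical names, and the conversion is reduced to copying the size and row count and attaching the output column names.

```scala
// Hypothetical, simplified stand-ins for catalog-level and plan-level statistics.
final case class CatalogStats(sizeInBytes: BigInt, rowCount: Option[BigInt])
final case class PlanStats(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt],
    outputColumns: Seq[String])

object StatsSketch {
  // Convert the table statistics (if defined) to plan statistics for the given
  // output columns; otherwise report an IllegalStateException.
  def computeStats(tableStats: Option[CatalogStats], output: Seq[String]): PlanStats =
    tableStats
      .map(stats => PlanStats(stats.sizeInBytes, stats.rowCount, output))
      .getOrElse(throw new IllegalStateException("table stats must be specified."))
}
```

Modelling the statistics as an `Option` keeps the "not available" case explicit, so the failure happens at the single point where the statistics are requested.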

Creating HiveTableRelation Instance

HiveTableRelation takes the following when created:

Table metadata
Columns (as a collection of AttributeReferences)
Partitions (as a collection of AttributeReferences)
