CacheManager — In-Memory Cache for Tables and Views
CacheManager is an in-memory cache for tables and views (as logical plans). It uses the internal cachedData collection of CachedData to track logical plans and their cached InMemoryRelation representation.
CacheManager is shared across SparkSessions through SharedState.
|
1 2 3 4 5 6 |
val spark: SparkSession = ... spark.sharedState.cacheManager |
Cached Queries — cachedData Internal Registry
cachedData is a collection of CachedData with logical plans and their cached InMemoryRelation representation.
A new CachedData is added when a Dataset is cached and removed when a Dataset is uncached or when invalidating cache data with a resource path.
cachedData is cleared when…FIXME
Caching Dataset (Registering Analyzed Logical Plan as InMemoryRelation) — cacheQuery Method
|
1 2 3 4 5 6 7 8 |
cacheQuery( query: Dataset[_], tableName: Option[String] = None, storageLevel: StorageLevel = MEMORY_AND_DISK): Unit |
cacheQuery adds the analyzed logical plan of the input query to the cachedData internal registry of cached queries.
Internally, cacheQuery firstly requests the input query for the analyzed logical plan and creates a InMemoryRelation with the following properties:
-
spark.sql.inMemoryColumnarStorage.compressed (enabled by default)
-
spark.sql.inMemoryColumnarStorage.batchSize (default:
10000) -
Input
storageLevelstorage level -
Optimized physical query plan (after requesting
SessionStateto execute the analyzed logical plan) -
Input
tableName -
Statistics of the analyzed query plan
cacheQuery then creates a CachedData (for the analyzed query plan and the InMemoryRelation) and adds it to the cachedData internal registry.
If the input query has already been cached, cacheQuery simply prints the following WARN message to the logs and exits (i.e. does nothing but printing out the WARN message):
|
1 2 3 4 5 |
WARN CacheManager: Asked to cache already cached data. |
|
Note
|
|
Removing All Cached Tables From In-Memory Cache — clearCache Method
|
1 2 3 4 5 |
clearCache(): Unit |
clearCache acquires a write lock and unpersists RDD[CachedBatch]s of the queries in cachedData before removing them altogether.
|
Note
|
clearCache is used when the CatalogImpl is requested to clearCache.
|
recacheByCondition Internal Method
|
1 2 3 4 5 |
recacheByCondition(spark: SparkSession, condition: LogicalPlan => Boolean): Unit |
recacheByCondition…FIXME
|
Note
|
recacheByCondition is used when CacheManager is requested to recacheByPlan or recacheByPath.
|
recacheByPlan Method
|
1 2 3 4 5 |
recacheByPlan(spark: SparkSession, plan: LogicalPlan): Unit |
recacheByPlan…FIXME
|
Note
|
recacheByPlan is used exclusively when InsertIntoDataSourceCommand logical command is executed.
|
recacheByPath Method
|
1 2 3 4 5 |
recacheByPath(spark: SparkSession, resourcePath: String): Unit |
recacheByPath…FIXME
|
Note
|
recacheByPath is used exclusively when CatalogImpl is requested to refreshByPath.
|
Replacing Logical Query Segments With Cached Query Plans — useCachedData Method
|
1 2 3 4 5 |
useCachedData(plan: LogicalPlan): LogicalPlan |
useCachedData…FIXME
|
Note
|
useCachedData is used exclusively when QueryExecution is requested for a cached logical query plan.
|
spark技术分享