CacheManager — In-Memory Cache for Tables and Views
CacheManager
is an in-memory cache for tables and views (as logical plans). It uses the internal cachedData collection of CachedData to track logical plans and their cached InMemoryRelation representation.
CacheManager
is shared across SparkSessions
through SharedState.
1 2 3 4 5 6 |
val spark: SparkSession = ... spark.sharedState.cacheManager |
Cached Queries — cachedData
Internal Registry
cachedData
is a collection of CachedData with logical plans and their cached InMemoryRelation representation.
A new CachedData is added when a Dataset is cached and removed when a Dataset is uncached or when invalidating cache data with a resource path.
cachedData
is cleared when…FIXME
Caching Dataset (Registering Analyzed Logical Plan as InMemoryRelation) — cacheQuery
Method
1 2 3 4 5 6 7 8 |
cacheQuery( query: Dataset[_], tableName: Option[String] = None, storageLevel: StorageLevel = MEMORY_AND_DISK): Unit |
cacheQuery
adds the analyzed logical plan of the input query
to the cachedData internal registry of cached queries.
Internally, cacheQuery
firstly requests the input query
for the analyzed logical plan and creates a InMemoryRelation with the following properties:
-
spark.sql.inMemoryColumnarStorage.compressed (enabled by default)
-
spark.sql.inMemoryColumnarStorage.batchSize (default:
10000
) -
Input
storageLevel
storage level -
Optimized physical query plan (after requesting
SessionState
to execute the analyzed logical plan) -
Input
tableName
-
Statistics of the analyzed query plan
cacheQuery
then creates a CachedData (for the analyzed query plan and the InMemoryRelation
) and adds it to the cachedData internal registry.
If the input query
has already been cached, cacheQuery
simply prints the following WARN message to the logs and exits (i.e. does nothing but printing out the WARN message):
1 2 3 4 5 |
WARN CacheManager: Asked to cache already cached data. |
Note
|
|
Removing All Cached Tables From In-Memory Cache — clearCache
Method
1 2 3 4 5 |
clearCache(): Unit |
clearCache
acquires a write lock and unpersists RDD[CachedBatch]
s of the queries in cachedData before removing them altogether.
Note
|
clearCache is used when the CatalogImpl is requested to clearCache.
|
recacheByCondition
Internal Method
1 2 3 4 5 |
recacheByCondition(spark: SparkSession, condition: LogicalPlan => Boolean): Unit |
recacheByCondition
…FIXME
Note
|
recacheByCondition is used when CacheManager is requested to recacheByPlan or recacheByPath.
|
recacheByPlan
Method
1 2 3 4 5 |
recacheByPlan(spark: SparkSession, plan: LogicalPlan): Unit |
recacheByPlan
…FIXME
Note
|
recacheByPlan is used exclusively when InsertIntoDataSourceCommand logical command is executed.
|
recacheByPath
Method
1 2 3 4 5 |
recacheByPath(spark: SparkSession, resourcePath: String): Unit |
recacheByPath
…FIXME
Note
|
recacheByPath is used exclusively when CatalogImpl is requested to refreshByPath.
|
Replacing Logical Query Segments With Cached Query Plans — useCachedData
Method
1 2 3 4 5 |
useCachedData(plan: LogicalPlan): LogicalPlan |
useCachedData
…FIXME
Note
|
useCachedData is used exclusively when QueryExecution is requested for a cached logical query plan.
|