StorageLevel
StorageLevel describes how an RDD is persisted (and addresses the following concerns):
- Does RDD use disk?
- Does RDD use memory to store data?
- How much of RDD is in memory?
- Does RDD use off-heap memory?
- Should an RDD be serialized or not (while storing the data)?
- How many replicas (default: 1) to use (can only be less than 40)?
There are the following StorageLevels (the `_2` suffix in the name denotes 2 replicas):

- DISK_ONLY
- DISK_ONLY_2
- MEMORY_ONLY (default for the cache operation for RDDs)
- MEMORY_ONLY_2
- MEMORY_ONLY_SER
- MEMORY_ONLY_SER_2
- MEMORY_AND_DISK
- MEMORY_AND_DISK_2
- MEMORY_AND_DISK_SER
- MEMORY_AND_DISK_SER_2
- OFF_HEAP
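Each named level is just a fixed combination of the flags asked about above; a minimal sketch inspecting one predefined level (assuming spark-core is on the classpath):

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY_SER_2: serialized blocks kept in memory only, replicated twice
val level = StorageLevel.MEMORY_ONLY_SER_2

println(level.useMemory)     // true  -- data is stored in memory
println(level.useDisk)       // false -- never spilled to disk
println(level.useOffHeap)    // false -- on-heap storage
println(level.deserialized)  // false -- _SER levels store serialized bytes
println(level.replication)   // 2     -- the _2 suffix
```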
You can check out the storage level using the getStorageLevel() operation.

```scala
val lines = sc.textFile("README.md")

scala> lines.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(disk=false, memory=false, offheap=false, deserialized=false, replication=1)
```
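A level other than the default can be requested with persist; a sketch using a local SparkContext (the master URL and app name here are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setMaster("local[*]").setAppName("storage-level-demo")
val sc = new SparkContext(conf)

val nums = sc.parallelize(1 to 100)
// request the level explicitly instead of the cache default
nums.persist(StorageLevel.MEMORY_AND_DISK)

// getStorageLevel now reflects the requested level
println(nums.getStorageLevel)

sc.stop()
```

Note that a storage level cannot be changed once it has been assigned; call unpersist() first if a different level is needed.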
StorageLevel can indicate to use memory for data storage using the useMemory flag.

```scala
useMemory: Boolean
```
StorageLevel can indicate to use disk for data storage using the useDisk flag.

```scala
useDisk: Boolean
```
StorageLevel can indicate to store data in deserialized format using the deserialized flag.

```scala
deserialized: Boolean
```
StorageLevel can indicate to replicate the data to other block managers using the replication property.

```scala
replication: Int
```
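Beyond the predefined constants, a level can be built from the flags plus replication with the StorageLevel factory; a sketch (the particular combination chosen here is illustrative):

```scala
import org.apache.spark.storage.StorageLevel

// Deserialized objects in memory, spill to disk, one extra replica --
// the same combination of flags as the predefined MEMORY_AND_DISK_2.
val custom = StorageLevel(
  useDisk = true,
  useMemory = true,
  useOffHeap = false,
  deserialized = true,
  replication = 2)

println(custom.replication)  // 2
```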