StorageLevel

StorageLevel describes how an RDD is persisted (and addresses the following concerns):

- Does the RDD use disk?
- Does the RDD use memory to store data?
- How much of the RDD is in memory?
- Does the RDD use off-heap memory?
- Should the RDD be serialized or not (while storing the data)?
- How many replicas (default: 1) to use (can only be less than 40)?
There are the following StorageLevels (the number _2 in the name denotes 2 replicas):

- DISK_ONLY
- DISK_ONLY_2
- MEMORY_ONLY (default for the cache operation on RDDs)
- MEMORY_ONLY_2
- MEMORY_ONLY_SER
- MEMORY_ONLY_SER_2
- MEMORY_AND_DISK
- MEMORY_AND_DISK_2
- MEMORY_AND_DISK_SER
- MEMORY_AND_DISK_SER_2
- OFF_HEAP
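To show how the named levels are just combinations of the flags above, here is a minimal, self-contained sketch (not Spark's actual implementation, which lives in org.apache.spark.storage) modelling a few of them, including the replication bound:

```scala
// Sketch of StorageLevel's flags as a plain case class (illustration only).
case class Level(
    useDisk: Boolean,
    useMemory: Boolean,
    useOffHeap: Boolean,
    deserialized: Boolean,
    replication: Int = 1) {
  // Spark enforces the same upper bound on replicas.
  require(replication < 40, "Replication restricted to be less than 40")
}

// A few named levels expressed as flag combinations:
val MEMORY_ONLY         = Level(useDisk = false, useMemory = true,  useOffHeap = false, deserialized = true)
val MEMORY_ONLY_2       = MEMORY_ONLY.copy(replication = 2)  // _2 = two replicas
val MEMORY_AND_DISK_SER = Level(useDisk = true,  useMemory = true,  useOffHeap = false, deserialized = false)
val DISK_ONLY           = Level(useDisk = true,  useMemory = false, useOffHeap = false, deserialized = false)
```

Note how the `_SER` variants simply flip `deserialized` to `false`, and the `_2` variants only change `replication`.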
You can check the storage level of an RDD using the getStorageLevel operation:
```scala
val lines = sc.textFile("README.md")

scala> lines.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(disk=false, memory=false, offheap=false, deserialized=false, replication=1)
```
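After persisting the RDD with an explicit level, getStorageLevel reflects the chosen flags. A sketch of such a spark-shell session, assuming MEMORY_AND_DISK (the exact REPL output may vary by Spark version):

```scala
scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel

scala> lines.persist(StorageLevel.MEMORY_AND_DISK)

scala> lines.getStorageLevel
res1: org.apache.spark.storage.StorageLevel = StorageLevel(disk=true, memory=true, offheap=false, deserialized=true, replication=1)
```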
StorageLevel can indicate that memory is used for data storage using the useMemory flag.

```scala
useMemory: Boolean
```
StorageLevel can indicate that disk is used for data storage using the useDisk flag.

```scala
useDisk: Boolean
```
StorageLevel can indicate that the data is stored in deserialized format using the deserialized flag.

```scala
deserialized: Boolean
```
StorageLevel can indicate that the data is replicated to other block managers using the replication property.

```scala
replication: Int
```
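These four flags plus replication are exactly what the REPL output shown earlier renders. As a sketch, a hypothetical describe helper (not part of Spark) could produce that format from the individual properties:

```scala
// Hypothetical helper (illustration only) rendering the flags in the same
// format as StorageLevel's REPL output.
def describe(
    useDisk: Boolean,
    useMemory: Boolean,
    useOffHeap: Boolean,
    deserialized: Boolean,
    replication: Int): String =
  s"StorageLevel(disk=$useDisk, memory=$useMemory, offheap=$useOffHeap, " +
    s"deserialized=$deserialized, replication=$replication)"

// The level of a freshly-created, not-yet-persisted RDD:
describe(false, false, false, false, 1)
// → StorageLevel(disk=false, memory=false, offheap=false, deserialized=false, replication=1)
```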