ALS — Estimator for ALSModel
ALS is an Estimator that generates an ALSModel.
ALS uses als_[random-characters] for the default identifier.
ALS can be fine-tuned using the following parameters.
Parameter | Default Value | Description |
---|---|---|
alpha | 1.0 | Alpha constant in the implicit preference formulation. Must be non-negative, i.e. at least 0. Used when ALS trains a model (and computes factors for users and items datasets) with implicit preference enabled (which is disabled by default) |
checkpointInterval | 10 | Checkpoint interval, i.e. how many iterations between checkpoints. Must be at least 1 |
coldStartStrategy | nan | Strategy for dealing with unknown or new users/items at prediction time, i.e. what happens for user or item ids the model has not seen in the training data. Supported values: nan, drop |
finalStorageLevel | MEMORY_AND_DISK | StorageLevel for ALS model factors |
implicitPrefs | false | Flag to turn implicit preference on (true) or off (false) |
intermediateStorageLevel | MEMORY_AND_DISK | StorageLevel for intermediate datasets. Must not be NONE |
itemCol | item | Column name for item ids. Must be all integers or numerics within the integer value range |
maxIter | 10 | Maximum number of iterations. Must be non-negative, i.e. at least 0 |
nonnegative | Disabled (false) | Flag to decide whether to apply nonnegativity constraints for least squares |
numUserBlocks | 10 | Number of user blocks. Has to be at least 1 |
numItemBlocks | 10 | Number of item blocks. Has to be at least 1 |
predictionCol | prediction | Column name for predictions |
rank | 10 | Rank of the matrix factorization. Has to be at least 1 |
ratingCol | rating | Column name for ratings |
regParam | 0.1 | Regularization parameter. Must be non-negative, i.e. at least 0 |
seed | Randomly-generated | Random seed |
userCol | user | Column name for user ids. Must be all integers or numerics within the integer value range |
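
The parameters are typically set through the estimator's setter methods. The following is a minimal sketch of such a configuration; the column names userId and movieId and the chosen values are assumptions made up for illustration, not recommended settings.

```scala
import org.apache.spark.ml.recommendation.ALS

// Column names and values below are illustrative assumptions.
val als = new ALS()
  .setRank(20)                  // rank of the matrix factorization
  .setMaxIter(15)               // maximum number of iterations
  .setRegParam(0.05)            // regularization parameter
  .setImplicitPrefs(true)       // turn implicit preference on
  .setAlpha(2.0)                // only used with implicit preference
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setColdStartStrategy("drop") // drop rows with unseen users/items at prediction time

// Lists every parameter with its documentation, default and current value
println(als.explainParams())
```

Creating and configuring the estimator does not require a running SparkSession; one is only needed once fit is called.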
computeFactors
Internal Method
    computeFactors[ID](
      srcFactorBlocks: RDD[(Int, FactorBlock)],
      srcOutBlocks: RDD[(Int, OutBlock)],
      dstInBlocks: RDD[(Int, InBlock[ID])],
      rank: Int,
      regParam: Double,
      srcEncoder: LocalIndexEncoder,
      implicitPrefs: Boolean = false,
      alpha: Double = 1.0,
      solver: LeastSquaresNESolver): RDD[(Int, FactorBlock)]

computeFactors …FIXME
Note: computeFactors is used when…FIXME
Fitting ALSModel — fit
Method
    fit(dataset: Dataset[_]): ALSModel

Internally, fit validates the schema of the dataset (to make sure that the column types are correct and that the prediction column does not exist yet).
fit casts the rating column (as defined using the ratingCol parameter) to FloatType.
fit selects the user, item and rating columns (from the dataset) and converts them to an RDD of Rating instances.
Note: fit converts the dataset to an RDD using the rdd operator.
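
The cast-select-convert steps above can be approximated as follows. This is only a sketch under assumptions: it uses the default column names (user, item, rating), a made-up two-row dataset, and a plain cast to IntegerType for the id columns, which may differ from how fit actually validates ids.

```scala
import org.apache.spark.ml.recommendation.ALS.Rating
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{FloatType, IntegerType}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ratings-rdd-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up input with the default column names
val dataset = Seq((0, 10, 4.0), (1, 11, 2.5)).toDF("user", "item", "rating")

val ratings = dataset
  .select(
    col("user").cast(IntegerType),   // assumption: simple cast for ids
    col("item").cast(IntegerType),
    col("rating").cast(FloatType))   // rating column cast to FloatType
  .rdd                               // Dataset[Row] to RDD[Row]
  .map(row => Rating(row.getInt(0), row.getInt(1), row.getFloat(2)))
```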
fit prints out the training parameters as an INFO message to the logs:

    INFO ...FIXME

fit trains a model, i.e. generates a pair of RDDs of user and item factors.
fit converts the RDDs with user and item factors to the corresponding DataFrames with id and features columns.
fit creates an ALSModel.
fit prints out the following INFO message to the logs:

    INFO training finished

Caution: FIXME Check out the log
In the end, fit copies parameter values to the ALSModel.
Caution: FIXME Why is the copying necessary?
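
To see fit from the caller's side, here is a small end-to-end sketch. The four ratings are made up, and the low rank and iteration count are chosen only to keep the run short. It relies on the default column names user, item and rating.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("als-fit-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up ratings: (user, item, rating)
val ratings = Seq(
  (0, 0, 4.0f),
  (0, 1, 2.0f),
  (1, 0, 3.0f),
  (1, 1, 5.0f)
).toDF("user", "item", "rating")

val als = new ALS()
  .setRank(2)
  .setMaxIter(5)
  .setSeed(42L)

// fit returns an ALSModel holding the user and item factor DataFrames
val model = als.fit(ratings)

model.userFactors.printSchema()
model.itemFactors.show(truncate = false)
```

The two factor DataFrames expose the id and features columns mentioned above, and the model's rank matches the value set on the estimator.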
partitionRatings
Internal Method
    partitionRatings[ID](
      ratings: RDD[Rating[ID]],
      srcPart: Partitioner,
      dstPart: Partitioner): RDD[((Int, Int), RatingBlock[ID])]

partitionRatings …FIXME
Note: partitionRatings is used when…FIXME
makeBlocks
Internal Method
    makeBlocks[ID](
      prefix: String,
      ratingBlocks: RDD[((Int, Int), RatingBlock[ID])],
      srcPart: Partitioner,
      dstPart: Partitioner,
      storageLevel: StorageLevel)(
      implicit srcOrd: Ordering[ID]): (RDD[(Int, InBlock[ID])], RDD[(Int, OutBlock)])

makeBlocks …FIXME
Note: makeBlocks is used when…FIXME
train
Method
    train[ID](
      ratings: RDD[Rating[ID]],
      rank: Int = 10,
      numUserBlocks: Int = 10,
      numItemBlocks: Int = 10,
      maxIter: Int = 10,
      regParam: Double = 0.1,
      implicitPrefs: Boolean = false,
      alpha: Double = 1.0,
      nonnegative: Boolean = false,
      intermediateRDDStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
      finalRDDStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
      checkpointInterval: Int = 10,
      seed: Long = 0L)(
      implicit ord: Ordering[ID]): (RDD[(ID, Array[Float])], RDD[(ID, Array[Float])])

train
first creates
train partitions the ratings RDD (using two HashPartitioners with numUserBlocks and numItemBlocks partitions) and immediately persists the RDD per the intermediateRDDStorageLevel storage level.
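
As a rough illustration of how two HashPartitioners can map a rating to a (user block, item block) pair, consider the sketch below. It assumes that block ids are obtained by applying each partitioner to the user id and the item id respectively. The ids 23 and 42 are arbitrary.

```scala
import org.apache.spark.HashPartitioner

val numUserBlocks = 10
val numItemBlocks = 10

// One partitioner per side of the ratings matrix
val userPart = new HashPartitioner(numUserBlocks)
val itemPart = new HashPartitioner(numItemBlocks)

// HashPartitioner uses a non-negative modulo of the key's hashCode
val userBlockId = userPart.getPartition(23)
val itemBlockId = itemPart.getPartition(42)

println(s"rating (user=23, item=42) belongs to block ($userBlockId, $itemBlockId)")
```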
train creates a pair of user in and out block RDDs for blockRatings.
train triggers caching.
Note: train uses a Spark idiom to trigger caching by counting the elements of an RDD.
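
The caching idiom mentioned in the note looks roughly like the sketch below: persist only marks an RDD for caching, and a subsequent action such as count forces it to be computed and stored. The numbers RDD is a made-up stand-in for the block RDDs that train works with.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cache-by-count-sketch")
  .getOrCreate()

// persist is lazy: nothing is computed or stored at this point
val numbers = spark.sparkContext
  .parallelize(1 to 100)
  .map(_ * 2)
  .persist(StorageLevel.MEMORY_AND_DISK)

// count is an action, so every partition is computed and cached as a side effect
val total = numbers.count()
```

Later jobs that reuse numbers read the cached partitions instead of recomputing the map step.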
train swaps users and items to create a swappedBlockRatings RDD.
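
Conceptually, the swap turns every ((user block, item block), ratings) entry into ((item block, user block), ratings) with the id arrays exchanged. The sketch below shows the idea on a plain Scala Map. RatingBlockSketch is a hypothetical stand-in for Spark's internal RatingBlock, and the data is made up.

```scala
// Hypothetical stand-in for Spark's internal RatingBlock (illustration only)
final case class RatingBlockSketch(
  srcIds: Vector[Int],
  dstIds: Vector[Int],
  ratings: Vector[Float])

// Keys are (userBlockId, itemBlockId) pairs
val blockRatings: Map[(Int, Int), RatingBlockSketch] = Map(
  (0, 1) -> RatingBlockSketch(Vector(10, 12), Vector(7, 9), Vector(4.0f, 2.5f))
)

// Swap the block ids in the key and the source/destination ids in the value
val swappedBlockRatings: Map[(Int, Int), RatingBlockSketch] = blockRatings.map {
  case ((userBlockId, itemBlockId), RatingBlockSketch(userIds, itemIds, ratings)) =>
    ((itemBlockId, userBlockId), RatingBlockSketch(itemIds, userIds, ratings))
}
```

After the swap, items play the role of the source side, so the same block-building logic can be reused for them.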
train creates a pair of item in and out block RDDs for the swappedBlockRatings RDD.
train triggers caching.
train creates LocalIndexEncoders for the user and item HashPartitioner partitioners.
Caution: FIXME train gets too “heavy”, i.e. advanced. Gave up for now. Sorry.
train throws an IllegalArgumentException when ratings is empty.

    requirement failed: No ratings available from [ratings]

train throws an IllegalArgumentException when intermediateRDDStorageLevel is NONE.

    requirement failed: ALS is not designed to run without persisting intermediate RDDs.

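
The requirement failed: prefix in both messages is what Scala's require prepends when its condition is false, which suggests the checks are written with require. The sketch below reproduces the second message with a hypothetical helper, checkIntermediateStorageLevel, that is not part of Spark and takes the storage level as a plain string to stay dependency-free.

```scala
// Hypothetical helper (not part of Spark) mirroring the storage-level check
def checkIntermediateStorageLevel(levelName: String): Unit =
  require(
    levelName != "NONE",
    "ALS is not designed to run without persisting intermediate RDDs.")

// require throws IllegalArgumentException when the condition does not hold
val failureMessage: Option[String] =
  try {
    checkIntermediateStorageLevel("NONE")
    None
  } catch {
    case e: IllegalArgumentException => Some(e.getMessage)
  }

failureMessage.foreach(println)
```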
Note: train is used when…FIXME
validateAndTransformSchema
Internal Method
    validateAndTransformSchema(schema: StructType): StructType

validateAndTransformSchema …FIXME
Note: validateAndTransformSchema is used exclusively when ALS is requested to transform a dataset schema.
Transforming Dataset Schema — transformSchema
Method
    transformSchema(schema: StructType): StructType

Internally, transformSchema …FIXME