ALS — Estimator for ALSModel
ALS is an Estimator that generates an ALSModel.
ALS uses als_[random-suffix] for the default identifier (uid).
ALS can be fine-tuned using parameters.
| Parameter | Default Value | Description |
|---|---|---|
| alpha | 1.0 | Alpha constant in the implicit preference formulation. Must be non-negative, i.e. at least 0. Used when ALS trains a model (and computes factors for users and items datasets) with implicit preference enabled (which is disabled by default). |
| checkpointInterval | 10 | Checkpoint interval, i.e. how many iterations between checkpoints. Must be at least 1. |
| coldStartStrategy | nan | Strategy for dealing with unknown or new users/items at prediction time, i.e. what happens for user or item ids the model has not seen in the training data. Supported values: nan, drop. |
| finalStorageLevel | MEMORY_AND_DISK | StorageLevel for ALS model factors. |
| implicitPrefs | false | Flag to turn implicit preference on (true) or off (false). |
| intermediateStorageLevel | MEMORY_AND_DISK | StorageLevel for intermediate datasets. Must not be NONE. |
| itemCol | item | Column name for item ids. Must be all integers or numerics within the integer value range. |
| maxIter | 10 | Maximum number of iterations. Must be non-negative, i.e. at least 0. |
| nonnegative | Disabled (false) | Flag to decide whether to apply nonnegativity constraints for least squares. |
| numUserBlocks | 10 | Number of user blocks. Has to be at least 1. |
| numItemBlocks | 10 | Number of item blocks. Has to be at least 1. |
| predictionCol | prediction | Column name for predictions. |
| rank | 10 | Rank of the matrix factorization. Has to be at least 1. |
| ratingCol | rating | Column name for ratings. |
| regParam | 0.1 | Regularization parameter. Must be non-negative, i.e. at least 0. |
| seed | Randomly-generated | Random seed. |
| userCol | user | Column name for user ids. Must be all integers or numerics within the integer value range. |
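As a rough illustration of how the parameters above can be set, the following sketch configures an ALS estimator through its public setters. The concrete values (including the seed 42 and the drop cold-start strategy) are arbitrary choices for the example, not recommendations.

```scala
import org.apache.spark.ml.recommendation.ALS

// All values below are illustrative; most simply repeat the defaults.
val als = new ALS()
  .setRank(10)
  .setMaxIter(10)
  .setRegParam(0.1)
  .setImplicitPrefs(false)
  .setAlpha(1.0)
  .setNonnegative(false)
  .setNumUserBlocks(10)
  .setNumItemBlocks(10)
  .setCheckpointInterval(10)
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("rating")
  .setPredictionCol("prediction")
  .setColdStartStrategy("drop")
  .setIntermediateStorageLevel("MEMORY_AND_DISK")
  .setFinalStorageLevel("MEMORY_AND_DISK")
  .setSeed(42L)

// explainParams lists every parameter with its documentation and current value.
println(als.explainParams())
```

Calling explainParams is a convenient way to check the parameter names and defaults against the Spark version in use.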
computeFactors Internal Method
```scala
computeFactors[ID](
  srcFactorBlocks: RDD[(Int, FactorBlock)],
  srcOutBlocks: RDD[(Int, OutBlock)],
  dstInBlocks: RDD[(Int, InBlock[ID])],
  rank: Int,
  regParam: Double,
  srcEncoder: LocalIndexEncoder,
  implicitPrefs: Boolean = false,
  alpha: Double = 1.0,
  solver: LeastSquaresNESolver): RDD[(Int, FactorBlock)]
```
computeFactors…FIXME
Note: computeFactors is used when…FIXME
Fitting ALSModel — fit Method
```scala
fit(dataset: Dataset[_]): ALSModel
```
Internally, fit validates the schema of the dataset (to make sure that the types of the columns are correct and that the prediction column does not exist yet).
fit casts the rating column (as defined using the ratingCol parameter) to FloatType.
fit selects the user, item and rating columns (from the dataset) and converts them to an RDD of Rating instances.
Note: fit converts the dataset to an RDD using the rdd operator.
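The column selection and conversion described above can be approximated with public APIs as follows. This is a sketch only, not the code fit runs internally: the sample data, the local SparkSession and the column names user, item and rating are assumptions made for the example.

```scala
import org.apache.spark.ml.recommendation.ALS.Rating
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{FloatType, IntegerType}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("als-ratings-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical input with a Double rating column.
val dataset = Seq((0, 10, 4.0), (0, 11, 1.5), (1, 10, 5.0))
  .toDF("user", "item", "rating")

// Cast the rating column to FloatType, select the three columns,
// and turn every row into a Rating through the rdd operator.
val ratings = dataset
  .select(
    col("user").cast(IntegerType),
    col("item").cast(IntegerType),
    col("rating").cast(FloatType))
  .rdd
  .map(row => Rating(row.getInt(0), row.getInt(1), row.getFloat(2)))

ratings.collect().foreach(println)
```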
fit prints out the training parameters as an INFO message to the logs:

```
INFO ...FIXME
```
fit trains a model, i.e. generates a pair of RDDs of user and item factors.
fit converts the RDDs with user and item factors to corresponding DataFrames with id and features columns.
fit creates an ALSModel.
fit prints out the following INFO message to the logs:
```
INFO training finished
```
Caution: FIXME Check out the log
In the end, fit copies the parameter values to the ALSModel.
Caution: FIXME Why is the copying necessary?
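To see the outcome of fit from the caller's side, a small end-to-end sketch may help. The ratings, the local SparkSession and the hyperparameter values are made up for illustration.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("als-fit-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical ratings using the default column names.
val ratings = Seq(
  (0, 0, 4.0f), (0, 1, 2.0f),
  (1, 1, 3.0f), (1, 2, 5.0f),
  (2, 0, 1.0f), (2, 2, 4.0f)
).toDF("user", "item", "rating")

val als = new ALS()
  .setRank(2)
  .setMaxIter(5)
  .setRegParam(0.1)
  .setSeed(1L)

val model = als.fit(ratings)

// The factor DataFrames expose the id and features columns.
model.userFactors.printSchema()
model.itemFactors.show(truncate = false)

// Parameter values set on the estimator are readable on the model as well.
println(s"rank = ${model.rank}, userCol = ${model.getUserCol}")
```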
partitionRatings Internal Method
```scala
partitionRatings[ID](
  ratings: RDD[Rating[ID]],
  srcPart: Partitioner,
  dstPart: Partitioner): RDD[((Int, Int), RatingBlock[ID])]
```
partitionRatings…FIXME
Note: partitionRatings is used when…FIXME
makeBlocks Internal Method
```scala
makeBlocks[ID](
  prefix: String,
  ratingBlocks: RDD[((Int, Int), RatingBlock[ID])],
  srcPart: Partitioner,
  dstPart: Partitioner,
  storageLevel: StorageLevel)(
  implicit srcOrd: Ordering[ID]): (RDD[(Int, InBlock[ID])], RDD[(Int, OutBlock)])
```
makeBlocks…FIXME
Note: makeBlocks is used when…FIXME
train Method
```scala
train[ID](
  ratings: RDD[Rating[ID]],
  rank: Int = 10,
  numUserBlocks: Int = 10,
  numItemBlocks: Int = 10,
  maxIter: Int = 10,
  regParam: Double = 0.1,
  implicitPrefs: Boolean = false,
  alpha: Double = 1.0,
  nonnegative: Boolean = false,
  intermediateRDDStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
  finalRDDStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
  checkpointInterval: Int = 10,
  seed: Long = 0L)(
  implicit ord: Ordering[ID]): (RDD[(ID, Array[Float])], RDD[(ID, Array[Float])])
```
train first creates
train partitions the ratings RDD (using two HashPartitioners with numUserBlocks and numItemBlocks partitions) and immediately persists the RDD at the intermediateRDDStorageLevel storage level.
train creates a pair of user in and out block RDDs for blockRatings.
train triggers caching.
Note: train uses a Spark idiom to trigger caching by counting the elements of an RDD.
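The partition-persist-count sequence can be imitated with public RDD operations. The sketch below is not the internal partitionRatings implementation: it groups hypothetical (user, item, rating) triples by a (user block, item block) key with groupByKey purely to illustrate the idea, and the block counts are arbitrary.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("als-blocks-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical (user, item, rating) triples.
val ratings = sc.parallelize(Seq(
  (0, 10, 4.0f), (1, 11, 3.0f), (2, 10, 5.0f), (3, 12, 1.0f)))

val numUserBlocks = 2
val numItemBlocks = 2
val userPart = new HashPartitioner(numUserBlocks)
val itemPart = new HashPartitioner(numItemBlocks)

// Key every rating by its (user block, item block) pair and persist the result.
val blockRatings = ratings
  .map { case (user, item, rating) =>
    ((userPart.getPartition(user), itemPart.getPartition(item)), (user, item, rating))
  }
  .groupByKey()
  .persist(StorageLevel.MEMORY_AND_DISK)

// persist is lazy: counting the elements forces evaluation and fills the cache.
blockRatings.count()
```

Because persist only marks an RDD for caching, an action such as count is what actually materializes it.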
train swaps users and items to create a swappedBlockRatings RDD.
train creates a pair of user in and out block RDDs for the swappedBlockRatings RDD.
train triggers caching.
train creates LocalIndexEncoders for user and item HashPartitioner partitioners.
Caution: FIXME train gets too “heavy”, i.e. advanced. Gave up for now. Sorry.
train throws an IllegalArgumentException when ratings is empty.

```
requirement failed: No ratings available from [ratings]
```
train throws an IllegalArgumentException when intermediateRDDStorageLevel is NONE.

```
requirement failed: ALS is not designed to run without persisting intermediate RDDs.
```
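Both failure modes can be observed by calling ALS.train directly and wrapping the call in Try. This is a sketch under the assumption that the Spark version in use performs the two checks described above; the sample ratings are made up.

```scala
import scala.util.Try
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.recommendation.ALS.Rating
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("als-train-checks")
  .getOrCreate()
val sc = spark.sparkContext

// Case 1: an empty ratings RDD.
val noRatings = sc.emptyRDD[Rating[Int]]
val emptyAttempt = Try(ALS.train(noRatings))
println(emptyAttempt)

// Case 2: some ratings, but intermediate RDDs are not allowed to be persisted.
val someRatings = sc.parallelize(Seq(Rating(0, 0, 1.0f), Rating(1, 1, 2.0f)))
val unpersistedAttempt = Try(
  ALS.train(someRatings, intermediateRDDStorageLevel = StorageLevel.NONE))
println(unpersistedAttempt)
```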
Note: train is used when…FIXME
validateAndTransformSchema Internal Method
```scala
validateAndTransformSchema(schema: StructType): StructType
```
validateAndTransformSchema…FIXME
Note: validateAndTransformSchema is used exclusively when ALS is requested to transform a dataset schema.
Transforming Dataset Schema — transformSchema Method
```scala
transformSchema(schema: StructType): StructType
```
Internally, transformSchema…FIXME
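Although the internals are left open here, transformSchema is part of the public API and can be tried on a hand-built schema. The sketch below assumes the default column names (user, item, rating, prediction) and only prints whatever schema comes back; no SparkSession is needed.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.types.{FloatType, IntegerType, StructField, StructType}

// Hypothetical input schema matching the default column names.
val schema = StructType(Seq(
  StructField("user", IntegerType),
  StructField("item", IntegerType),
  StructField("rating", FloatType)))

val als = new ALS()

// Validate the input schema and obtain the schema of the transformed dataset.
val transformed = als.transformSchema(schema)
println(transformed.treeString)
```

Comparing the printed schema with the input shows which columns ALS adds.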