ALS — Estimator for ALSModel

ALS is an Estimator that generates an ALSModel.

ALS uses als_[random-suffix] for the default identifier (uid).

ALS can be fine-tuned using parameters.

Table 1. ALS’s Parameters (aka ALSParams)
Parameter	Default Value	Description
`alpha`	`1.0`	Alpha constant in the implicit preference formulation. Must be non-negative, i.e. at least `0`. Used only when ALS trains a model (and computes factors for the users and items datasets) with implicit preference enabled (it is disabled by default).
`checkpointInterval`	`10`	Checkpoint interval, i.e. how many iterations between checkpoints. Must be at least `1`, or exactly `-1` to disable checkpointing.
`coldStartStrategy`	`nan`	Strategy for dealing with unknown or new users/items at prediction time, i.e. what happens for user or item ids the model has not seen in the training data. Supported values: `nan` (the predicted value for unknown ids is NaN) and `drop` (rows in the input DataFrame that contain unknown ids are dropped from the output DataFrame with predictions).
`finalStorageLevel`	`MEMORY_AND_DISK`	StorageLevel for ALS model factors.
`implicitPrefs`	`false`	Flag to turn implicit preference on (`true`) or off (`false`).
`intermediateStorageLevel`	`MEMORY_AND_DISK`	StorageLevel for intermediate datasets. Must not be `NONE`.
`itemCol`	`item`	Column name for item ids. Must be all integers or numerics within the integer value range.
`maxIter`	`10`	Maximum number of iterations. Must be non-negative, i.e. at least `0`.
`nonnegative`	`false`	Flag to decide whether to apply nonnegativity constraints for least squares.
`numUserBlocks`	`10`	Number of user blocks. Must be at least `1`.
`numItemBlocks`	`10`	Number of item blocks. Must be at least `1`.
`predictionCol`	`prediction`	Column name for predictions (the main purpose of the estimator). Of type `FloatType`.
`rank`	`10`	Rank of the matrix factorization. Must be at least `1`.
`ratingCol`	`rating`	Column name for ratings. Must be all integers or numerics within the integer value range. Cast to `FloatType`. Ratings are set to `1.0` when the column is undefined.
`regParam`	`0.1`	Regularization parameter. Must be non-negative, i.e. at least `0`.
`seed`	Randomly-generated	Random seed.
`userCol`	`user`	Column name for user ids. Must be all integers or numerics within the integer value range.
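
The parameters in Table 1 are set through the corresponding setters on ALS. The following is a minimal sketch of overriding a few defaults; the column names `userId` and `movieId` are hypothetical and only illustrate `userCol` and `itemCol`.

```scala
import org.apache.spark.ml.recommendation.ALS

// Override a few defaults from Table 1; all other parameters keep their defaults.
val als = new ALS()
  .setRank(20)
  .setMaxIter(15)
  .setRegParam(0.05)
  .setUserCol("userId")          // hypothetical column name
  .setItemCol("movieId")         // hypothetical column name
  .setRatingCol("rating")
  .setColdStartStrategy("drop")  // drop rows with unknown ids at prediction time

// explainParams lists every parameter with its description and current value
println(als.explainParams())
```

No SparkSession is needed to create and configure the estimator; one is only required once `fit` is called.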

`computeFactors` Internal Method



computeFactors[ID](
  srcFactorBlocks: RDD[(Int, FactorBlock)],
  srcOutBlocks: RDD[(Int, OutBlock)],
  dstInBlocks: RDD[(Int, InBlock[ID])],
  rank: Int,
  regParam: Double,
  srcEncoder: LocalIndexEncoder,
  implicitPrefs: Boolean = false,
  alpha: Double = 1.0,
  solver: LeastSquaresNESolver): RDD[(Int, FactorBlock)]

computeFactors…FIXME

Note	`computeFactors` is used when…FIXME

Fitting ALSModel — `fit` Method



fit(dataset: Dataset[_]): ALSModel

Internally, fit validates the schema of the dataset (to make sure that the types of the columns are correct and the prediction column is not available yet).

fit casts the rating column (as defined using ratingCol parameter) to FloatType.

fit selects the user, item and rating columns (from the dataset) and converts them to an RDD of Rating instances.

Note	`fit` converts the `dataset` to an `RDD` using the `rdd` operator.
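
The conversion described above can be approximated outside of ALS as follows. This is a sketch only: it uses plain `cast` calls as a simplification and does not claim to reproduce the exact range checks that `fit` applies to user and item ids. The sample data is made up.

```scala
import org.apache.spark.ml.recommendation.ALS.Rating
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{FloatType, IntegerType}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("als-ratings-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up (user, item, rating) rows using the default column names
val dataset = Seq((0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0)).toDF("user", "item", "rating")

// Select the three columns, cast the rating to FloatType,
// then drop down to the RDD API with the rdd operator
val ratings = dataset
  .select(
    col("user").cast(IntegerType),
    col("item").cast(IntegerType),
    col("rating").cast(FloatType))
  .rdd
  .map(row => Rating(row.getInt(0), row.getInt(1), row.getFloat(2)))

ratings.collect().foreach(println)
```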

fit prints out the training parameters as an INFO message to the logs:



INFO ...FIXME

fit trains a model, i.e. generates a pair of RDDs of user and item factors.

fit converts the RDDs with user and item factors to corresponding DataFrames with id and features columns.
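
The two steps above (training a pair of factor RDDs and turning them into DataFrames with `id` and `features` columns) can be sketched by calling `ALS.train` directly. The ratings, `rank`, `maxIter` and `seed` values are arbitrary choices for illustration, not what `fit` passes.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.recommendation.ALS.Rating
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("als-train-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up ratings for two users and two items
val ratings = spark.sparkContext.parallelize(Seq(
  Rating(0, 10, 4.0f), Rating(0, 11, 1.0f),
  Rating(1, 10, 5.0f), Rating(1, 11, 2.0f)))

// train returns a pair of RDDs: (user id, factors) and (item id, factors)
val (userFactors, itemFactors) = ALS.train(ratings, rank = 2, maxIter = 5, seed = 42L)

// Convert both RDDs to DataFrames with id and features columns
val userDF = userFactors.toDF("id", "features")
val itemDF = itemFactors.toDF("id", "features")

userDF.printSchema()
itemDF.show(false)
```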

fit creates an ALSModel.

fit prints out the following INFO message to the logs:



INFO training finished

Caution

FIXME Check out the log

In the end, fit copies parameter values to the ALSModel model.

Caution

FIXME Why is the copying necessary?
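
From the caller's side, the whole `fit` flow looks like the sketch below. The training data and the parameter values are made up; the last lines only illustrate that a parameter set on the estimator (here `coldStartStrategy`) is also visible on the resulting ALSModel.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("als-fit-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up training data with the default column names
val training = Seq(
  (0, 10, 4.0f), (0, 11, 1.0f),
  (1, 10, 5.0f), (1, 11, 2.0f),
  (2, 10, 3.0f)).toDF("user", "item", "rating")

val als = new ALS()
  .setRank(2)
  .setMaxIter(5)
  .setSeed(42L)
  .setColdStartStrategy("drop")

// fit validates the schema, trains the factors and creates the model
val model = als.fit(training)

// The factors are available as DataFrames with id and features columns
model.userFactors.show(false)

// The model adds the prediction column when transforming a dataset
val predictions = model.transform(training)
predictions.printSchema()

// Parameter values set on the estimator are available on the model
println(model.getColdStartStrategy)
```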

`partitionRatings` Internal Method



partitionRatings[ID](
  ratings: RDD[Rating[ID]],
  srcPart: Partitioner,
  dstPart: Partitioner): RDD[((Int, Int), RatingBlock[ID])]

partitionRatings…FIXME

Note	`partitionRatings` is used when…FIXME

`makeBlocks` Internal Method



makeBlocks[ID](
  prefix: String,
  ratingBlocks: RDD[((Int, Int), RatingBlock[ID])],
  srcPart: Partitioner,
  dstPart: Partitioner,
  storageLevel: StorageLevel)(
  implicit srcOrd: Ordering[ID]): (RDD[(Int, InBlock[ID])], RDD[(Int, OutBlock)])

makeBlocks…FIXME

Note	`makeBlocks` is used when…FIXME

`train` Method



train[ID](
  ratings: RDD[Rating[ID]],
  rank: Int = 10,
  numUserBlocks: Int = 10,
  numItemBlocks: Int = 10,
  maxIter: Int = 10,
  regParam: Double = 0.1,
  implicitPrefs: Boolean = false,
  alpha: Double = 1.0,
  nonnegative: Boolean = false,
  intermediateRDDStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
  finalRDDStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
  checkpointInterval: Int = 10,
  seed: Long = 0L)(
  implicit ord: Ordering[ID]): (RDD[(ID, Array[Float])], RDD[(ID, Array[Float])])

train first creates

train partitions the ratings RDD (using two HashPartitioners with numUserBlocks and numItemBlocks partitions) and immediately persists the RDD at the intermediateRDDStorageLevel storage level.
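
To get a feel for how two HashPartitioners assign a rating to a (user block, item block) pair, consider the sketch below. It only illustrates the partitioners themselves on a local collection with made-up ratings; it is not the internal `partitionRatings` code.

```scala
import org.apache.spark.HashPartitioner

val numUserBlocks = 10
val numItemBlocks = 10
val userPart = new HashPartitioner(numUserBlocks)
val itemPart = new HashPartitioner(numItemBlocks)

// Made-up (user, item, rating) triples
val ratings = Seq((0, 10, 4.0f), (3, 27, 1.0f), (13, 10, 5.0f))

// For Int keys HashPartitioner uses the non-negative remainder of the key
// divided by the number of partitions
val blockIds = ratings.map { case (user, item, _) =>
  (userPart.getPartition(user), itemPart.getPartition(item))
}

println(blockIds) // List((0,0), (3,7), (3,0))
```

Users 3 and 13 end up in the same user block (3), while items 10 and 27 fall into different item blocks.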

train creates a pair of user in and out block RDDs for blockRatings.

train triggers caching.

Note	`train` uses a Spark idiom to trigger caching by counting the elements of an RDD.
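
The idiom mentioned in the note can be sketched on any RDD: `persist` is lazy and only marks the RDD, so an action such as `count` is what actually computes and stores the partitions. The RDD below is a made-up stand-in for the block RDDs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cache-by-count-sketch")
  .getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 1000).map(_ * 2)

// Lazy: only records the requested storage level, nothing is computed yet
rdd.persist(StorageLevel.MEMORY_AND_DISK)

// The action computes all partitions and materializes them at that storage level
val n = rdd.count()
println(s"Cached $n elements")
```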

train swaps users and items to create a swappedBlockRatings RDD.

train creates a pair of user in and out block RDDs for the swappedBlockRatings RDD.

train triggers caching.

train creates LocalIndexEncoders for user and item HashPartitioner partitioners.

Caution

FIXME train gets too “heavy”, i.e. advanced. Gave up for now. Sorry.

train throws an IllegalArgumentException when ratings is empty.



requirement failed: No ratings available from [ratings]

train throws an IllegalArgumentException when intermediateRDDStorageLevel is NONE.



requirement failed: ALS is not designed to run without persisting intermediate RDDs.
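
Both requirements can be observed by calling `ALS.train` directly and capturing the failure. This is a sketch with made-up input; `Try` is used only to keep the exceptions from terminating the session.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.recommendation.ALS.Rating
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import scala.util.Try

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("als-requirements-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Case 1: no ratings at all
val noRatings = sc.emptyRDD[Rating[Int]]
val emptyAttempt = Try(ALS.train(noRatings))

// Case 2: some ratings, but intermediate RDDs are not allowed to be persisted
val someRatings = sc.parallelize(Seq(Rating(0, 10, 4.0f)))
val noPersistAttempt =
  Try(ALS.train(someRatings, intermediateRDDStorageLevel = StorageLevel.NONE))

// Both attempts are expected to fail with an IllegalArgumentException
emptyAttempt.failed.foreach(e => println(e.getMessage))
noPersistAttempt.failed.foreach(e => println(e.getMessage))
```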

Note	`train` is used when…FIXME

`validateAndTransformSchema` Internal Method



validateAndTransformSchema(schema: StructType): StructType

validateAndTransformSchema…FIXME

Note	`validateAndTransformSchema` is used exclusively when `ALS` is requested to transform a dataset schema.

Transforming Dataset Schema — `transformSchema` Method



transformSchema(schema: StructType): StructType

Internally, transformSchema…FIXME
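
Although the internals are not described here, `transformSchema` can be tried out without a SparkSession. The sketch below assumes an input schema with the default column names and numeric types, and checks that the output schema contains the `prediction` column of type `FloatType` (as listed for `predictionCol` in Table 1).

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.types.{FloatType, IntegerType, StructField, StructType}

// Input schema using the default userCol, itemCol and ratingCol names
val inputSchema = StructType(Seq(
  StructField("user", IntegerType),
  StructField("item", IntegerType),
  StructField("rating", FloatType)))

val outputSchema = new ALS().transformSchema(inputSchema)

// Print the resulting schema to inspect the added prediction column
println(outputSchema.treeString)
```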
