ALS — Estimator for ALSModel

ALS uses als-[random-numbers] for the default identifier.

ALS can be fine-tuned using parameters.

Table 1. ALS’s Parameters (aka ALSParams)

| Parameter | Default Value | Description |
|---|---|---|
| alpha | 1.0 | Alpha constant in the implicit preference formulation. Must be non-negative, i.e. at least 0. Used when ALS trains a model (and computes factors for users and items datasets) with implicit preference enabled (which is disabled by default). |
| checkpointInterval | 10 | Checkpoint interval, i.e. how many iterations between checkpoints. Must be at least 1, or exactly -1 to disable checkpointing. |
| coldStartStrategy | nan | Strategy for dealing with unknown or new users/items at prediction time, i.e. what happens for user or item ids the model has not seen in the training data. Supported values: nan (the predicted value for unknown ids is NaN) and drop (rows in the input DataFrame containing unknown ids are dropped from the output DataFrame with predictions). |
| finalStorageLevel | MEMORY_AND_DISK | StorageLevel for ALS model factors. |
| implicitPrefs | false | Flag to turn implicit preference on (true) or off (false). |
| intermediateStorageLevel | MEMORY_AND_DISK | StorageLevel for intermediate datasets. Must not be NONE. |
| itemCol | item | Column name for item ids. Must be all integers or numerics within the integer value range. |
| maxIter | 10 | Maximum number of iterations. Must be non-negative, i.e. at least 0. |
| nonnegative | false (disabled) | Flag to decide whether to apply nonnegativity constraints for least squares. |
| numUserBlocks | 10 | Number of user blocks. Has to be at least 1. |
| numItemBlocks | 10 | Number of item blocks. Has to be at least 1. |
| predictionCol | prediction | Column name for predictions, which are the main purpose of the estimator. Of type FloatType. |
| rank | 10 | Rank of the matrix factorization. Has to be at least 1. |
| ratingCol | rating | Column name for ratings. Must be numeric. Cast to FloatType. Set to 1.0 when undefined. |
| regParam | 0.1 | Regularization parameter. Must be non-negative, i.e. at least 0. |
| seed | Randomly-generated | Random seed. |
| userCol | user | Column name for user ids. Must be all integers or numerics within the integer value range. |
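
The two coldStartStrategy values are easiest to compare side by side. The following is a minimal plain-Python sketch, not Spark code: the predict helper and the dictionaries of factors are hypothetical stand-ins for a trained model, and a prediction is assumed to be the dot product of a user factor vector and an item factor vector.

```python
import math


def predict(user_factors, item_factors, rows, cold_start_strategy="nan"):
    """Score (user, item) pairs; unknown ids follow the cold-start strategy.

    user_factors / item_factors: hypothetical dicts mapping id -> list of floats.
    """
    if cold_start_strategy not in ("nan", "drop"):
        raise ValueError(f"unsupported coldStartStrategy: {cold_start_strategy}")
    out = []
    for user, item in rows:
        u = user_factors.get(user)
        v = item_factors.get(item)
        if u is None or v is None:
            # An id the model has not seen during training yields NaN.
            prediction = math.nan
        else:
            prediction = sum(a * b for a, b in zip(u, v))
        out.append((user, item, prediction))
    if cold_start_strategy == "drop":
        # "drop" removes the rows that would otherwise carry NaN.
        out = [row for row in out if not math.isnan(row[2])]
    return out


user_factors = {1: [1.0, 2.0]}
item_factors = {10: [0.5, 0.25]}
rows = [(1, 10), (1, 99), (7, 10)]  # item 99 and user 7 are unknown

with_nan = predict(user_factors, item_factors, rows, "nan")
dropped = predict(user_factors, item_factors, rows, "drop")
```

With nan, all three rows come back and the two unknown ones carry NaN. With drop, only the row for user 1 and item 10 survives.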

computeFactors Internal Method

computeFactors…​FIXME

Note
computeFactors is used when…​FIXME

Fitting ALSModel — fit Method

Internally, fit validates the schema of the dataset (to make sure that the column types are correct and that the prediction column does not exist yet).
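
The schema check described above can be sketched in plain Python. This is a model under stated assumptions rather than Spark's implementation: a schema is represented as a list of (name, type) pairs, the set of numeric type names is illustrative, and ValueError stands in for whatever exception Spark raises.

```python
# Illustrative set of numeric type names (an assumption, not Spark's type system).
NUMERIC_TYPES = {"byte", "short", "int", "long", "float", "double", "decimal"}


def validate_and_transform_schema(schema, user_col="user", item_col="item",
                                  rating_col="rating", prediction_col="prediction"):
    """Check the input columns and return the schema with a prediction column."""
    types = dict(schema)
    for col in (user_col, item_col, rating_col):
        if col not in types:
            raise ValueError(f"column {col!r} is missing")
        if types[col] not in NUMERIC_TYPES:
            raise ValueError(f"column {col!r} must be numeric, got {types[col]!r}")
    if prediction_col in types:
        # The prediction column must not be available yet.
        raise ValueError(f"column {prediction_col!r} already exists")
    # Predictions are of float type.
    return list(schema) + [(prediction_col, "float")]


input_schema = [("user", "int"), ("item", "int"), ("rating", "double")]
output_schema = validate_and_transform_schema(input_schema)
```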

fit casts the rating column (as defined using ratingCol parameter) to FloatType.

fit selects the user, item and rating columns (from the dataset) and converts them to an RDD of Rating instances.

Note
fit converts the dataset to an RDD using the rdd operator.
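
As a rough illustration of the conversion, the sketch below turns dictionary rows into Rating tuples in plain Python. The helper names (to_float32, checked_id, to_ratings) are hypothetical. It assumes that ids must be whole numbers within the 32-bit integer range, that ratings are narrowed to single precision, and that an empty rating column name means every rating is 1.0.

```python
import struct
from collections import namedtuple

Rating = namedtuple("Rating", ["user", "item", "rating"])
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def to_float32(x):
    """Round a Python float to single precision, mimicking a cast to FloatType."""
    return struct.unpack("f", struct.pack("f", x))[0]


def checked_id(value):
    """Accept integers, or numerics with an integral value, within the int range."""
    if value != int(value) or not INT_MIN <= value <= INT_MAX:
        raise ValueError(f"id {value!r} is not an integer within the integer range")
    return int(value)


def to_ratings(rows, user_col="user", item_col="item", rating_col="rating"):
    """Convert dict rows to Rating tuples; an empty rating_col means rating 1.0."""
    ratings = []
    for row in rows:
        raw = row[rating_col] if rating_col else 1.0
        ratings.append(
            Rating(checked_id(row[user_col]), checked_id(row[item_col]), to_float32(raw))
        )
    return ratings


explicit = to_ratings([{"user": 1, "item": 2.0, "rating": 4.5}])
implicit = to_ratings([{"user": 1, "item": 2}], rating_col="")
```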

fit prints out the training parameters as an INFO message to the logs.

fit trains a model, i.e. generates a pair of RDDs of user and item factors.

fit converts the RDDs with user and item factors to corresponding DataFrames with id and features columns.

fit creates an ALSModel.

fit prints out the following INFO message to the logs:

Caution
FIXME Check out the log

In the end, fit copies parameter values to the ALSModel.

Caution
FIXME Why is the copying necessary?

partitionRatings Internal Method

partitionRatings…​FIXME

Note
partitionRatings is used when…​FIXME

makeBlocks Internal Method

makeBlocks…​FIXME

Note
makeBlocks is used when…​FIXME

train Method

train first creates

train partitions the ratings RDD (using two HashPartitioners with numUserBlocks and numItemBlocks partitions) and immediately persists the RDD at the intermediateRDDStorageLevel storage level.
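
The partitioning step can be pictured with a small plain-Python sketch. It is only a model: it assumes that hash-partitioning an integer id amounts to a non-negative modulo of the id by the number of blocks, and partition_ratings here is a hypothetical helper, not the internal method of the same name.

```python
from collections import defaultdict


def partition_ratings(ratings, num_user_blocks, num_item_blocks):
    """Group (user, item, rating) triples by (user block, item block)."""
    blocks = defaultdict(list)
    for user, item, rating in ratings:
        # Python's % is non-negative for a positive modulus, so negative ids
        # still land in a valid block.
        key = (user % num_user_blocks, item % num_item_blocks)
        blocks[key].append((user, item, rating))
    return dict(blocks)


ratings = [(0, 0, 5.0), (1, 3, 3.0), (2, 2, 4.0), (3, 1, 1.0)]
blocks = partition_ratings(ratings, num_user_blocks=2, num_item_blocks=2)
```

Every rating ends up in exactly one block, keyed by the pair of block ids of its user and its item.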

train triggers caching.

Note
train uses a Spark idiom to trigger caching by counting the elements of an RDD.

train swaps users and items to create a swappedBlockRatings RDD.
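
Swapping users and items can be sketched as follows, assuming the block-keyed layout {(user block, item block): [(user, item, rating), ...]}. The function name swap_block_ratings is hypothetical, and plain dictionaries stand in for the RDD.

```python
def swap_block_ratings(block_ratings):
    """Swap the roles of users and items in both the block keys and the triples."""
    return {
        (item_block, user_block): [(item, user, rating) for user, item, rating in triples]
        for (user_block, item_block), triples in block_ratings.items()
    }


block_ratings = {(0, 1): [(2, 5, 4.0)], (1, 0): [(3, 8, 1.0)]}
swapped = swap_block_ratings(block_ratings)
```

After the swap, the same block-building logic that was written for users can be reused for items.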

train creates a pair of item in and out block RDDs for the swappedBlockRatings RDD.

train triggers caching.

train creates LocalIndexEncoders for user and item HashPartitioner partitioners.
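
The text does not say what a LocalIndexEncoder does. One way such an encoder could work, shown purely as an assumption, is to pack a block id and an index local to that block into a single 32-bit value: the high bits hold the block id and the remaining bits hold the local index. The class below is a hypothetical plain-Python model of that idea.

```python
class LocalIndexEncoder:
    """Packs (block id, local index) into one integer (hypothetical model)."""

    def __init__(self, num_blocks):
        if num_blocks <= 0:
            raise ValueError("num_blocks must be positive")
        # Bits not needed for block ids in a 32-bit word go to the local index.
        leading_zeros = 32 - (num_blocks - 1).bit_length()
        self.local_index_bits = min(leading_zeros, 31)
        self.local_index_mask = (1 << self.local_index_bits) - 1

    def encode(self, block_id, local_index):
        return (block_id << self.local_index_bits) | local_index

    def block_id(self, encoded):
        return encoded >> self.local_index_bits

    def local_index(self, encoded):
        return encoded & self.local_index_mask


encoder = LocalIndexEncoder(num_blocks=10)
encoded = encoder.encode(block_id=3, local_index=42)
```

Decoding with block_id and local_index recovers the original pair, so a single integer is enough to address a factor inside a given block.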

Caution
FIXME train gets too “heavy”, i.e. advanced. Gave up for now. Sorry.

train throws an IllegalArgumentException when ratings is empty.

train throws an IllegalArgumentException when intermediateRDDStorageLevel is NONE.
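
The two failure cases can be mirrored in a short plain-Python sketch. ValueError stands in for IllegalArgumentException, the storage level is modelled as a plain string, and both the function name and the error messages are illustrative rather than Spark's own.

```python
def check_train_preconditions(ratings, intermediate_storage_level):
    """Reject empty ratings and a NONE intermediate storage level."""
    if len(ratings) == 0:
        raise ValueError("ratings must not be empty")  # illustrative message
    if intermediate_storage_level == "NONE":
        raise ValueError("intermediate storage level must not be NONE")  # illustrative message


# Passes silently when both preconditions hold.
check_train_preconditions([(1, 10, 4.0)], "MEMORY_AND_DISK")
```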

Note
train is used when…​FIXME

validateAndTransformSchema Internal Method

validateAndTransformSchema…​FIXME

Note
validateAndTransformSchema is used exclusively when ALS is requested to transform a dataset schema.

Transforming Dataset Schema — transformSchema Method

Internally, transformSchema…​FIXME
