
Models

ML Pipeline Models

The Model abstract class is a Transformer with an optional Estimator that has produced it (as a transient parent field).

Note
The parent Estimator is optional; it is available only for a model that is the result of executing fit on an Estimator.

As a Transformer it takes a DataFrame and transforms it to a result DataFrame with a prediction column added.
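For example, scoring with any fitted model is just a call to transform. A minimal sketch (model and dataset stand in for any fitted Model and a DataFrame with the columns that model expects):

    import org.apache.spark.ml.Model
    import org.apache.spark.sql.DataFrame

    // a fitted model is a Transformer: transform returns the input DataFrame
    // with the prediction column appended
    def score(model: Model[_], dataset: DataFrame): DataFrame =
      model.transform(dataset)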

There are two direct implementations of the Model class that are not directly related to a concrete ML algorithm:

  • PipelineModel

  • PredictionModel

PipelineModel

Caution
PipelineModel is a private[ml] class.

PipelineModel is a Model of Pipeline estimator.

Once fit, you can use the result model as any other model to transform datasets (as DataFrames).

A very interesting use case of PipelineModel is when a Pipeline is made up of Transformer instances.
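A minimal sketch of that case (assuming a Spark 2.x SparkSession named spark; column names and data are illustrative): a Pipeline whose stages are Transformer instances only (Tokenizer and HashingTF) still has to be fit, and the result is a PipelineModel you can use to transform other DataFrames.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val df = spark.createDataFrame(Seq(
      (0L, "spark ml pipeline"),
      (1L, "pipeline model example")
    )).toDF("id", "text")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

    // fit returns a PipelineModel even though no stage is an Estimator
    val pipelineModel = new Pipeline().setStages(Array(tokenizer, hashingTF)).fit(df)

    pipelineModel.transform(df).show(truncate = false)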

PredictionModel

PredictionModel is an abstract class to represent a model for prediction algorithms like regression and classification (that have their own specialized models – details coming up below).

PredictionModel is basically a Transformer with predict method to calculate predictions (that end up in prediction column).

PredictionModel belongs to the org.apache.spark.ml package.

The contract of PredictionModel class requires that every custom implementation defines predict method (with FeaturesType type being the type of features).
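In other words, the contract looks roughly as follows (a simplified paraphrase for illustration, not the actual source):

    import org.apache.spark.ml.Model

    // simplified shape of the contract: a concrete PredictionModel supplies
    // predict for its FeaturesType; the result lands in the prediction column
    abstract class PredictionModel[FeaturesType, M <: PredictionModel[FeaturesType, M]]
      extends Model[M] {

      protected def predict(features: FeaturesType): Double
    }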

The direct less-algorithm-specific extensions of the PredictionModel class are:

  • ClassificationModel

  • RegressionModel

  • RandomForestRegressionModel

As a custom Transformer, PredictionModel comes with its own transform method.

Internally, transform first ensures that the type of the features column matches the type of the model and adds the prediction column of type Double to the schema of the result DataFrame.

It then creates the result DataFrame and adds the prediction column with a predictUDF function applied to the values of the features column.
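Conceptually, that step looks like the sketch below (illustrative only; the real logic lives inside PredictionModel and uses the model's own protected predict, stood in for here by a predictRow parameter; Spark 2.x ml.linalg assumed):

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, udf}

    // roughly what transform does: wrap predict in a UDF and apply it to the
    // features column, producing the Double-typed prediction column
    def addPredictionColumn(dataset: DataFrame, predictRow: Vector => Double): DataFrame = {
      val predictUDF = udf(predictRow)
      dataset.withColumn("prediction", predictUDF(col("features")))
    }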

Caution
FIXME A diagram to show the transformation from a dataframe (on the left) and another (on the right) with an arrow to represent the transformation method.
Tip

Enable DEBUG logging level for a PredictionModel implementation, e.g. LinearRegressionModel, to see what happens inside.

Add the following line to conf/log4j.properties:
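For LinearRegressionModel, for example, that is (assuming the log4j 1.x properties syntax Spark ships with):

    log4j.logger.org.apache.spark.ml.regression.LinearRegressionModel=DEBUG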

Refer to Logging.

ClassificationModel

ClassificationModel is a PredictionModel that transforms a DataFrame with mandatory features, label, and rawPrediction (of type Vector) columns to a DataFrame with a prediction column added.

Note
A Model with ClassifierParams parameters, e.g. ClassificationModel, requires that a DataFrame have the mandatory features, label (of type Double), and rawPrediction (of type Vector) columns.

ClassificationModel comes with its own transform (as Transformer) and predict (as PredictionModel).
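A short sketch (assuming a SparkSession named spark; the data is illustrative): LogisticRegressionModel is one such ClassificationModel, and its transform adds rawPrediction, probability, and prediction columns.

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors

    val training = spark.createDataFrame(Seq(
      (0.0, Vectors.dense(0.0, 1.1)),
      (1.0, Vectors.dense(2.0, 1.0)),
      (0.0, Vectors.dense(0.1, 1.3)),
      (1.0, Vectors.dense(1.9, 0.8))
    )).toDF("label", "features")

    val lrModel = new LogisticRegression().setMaxIter(10).fit(training)  // LogisticRegressionModel

    lrModel.transform(training)
      .select("features", "rawPrediction", "probability", "prediction")
      .show(truncate = false)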

The following is a list of the known ClassificationModel custom implementations (as of March 24th):

  • ProbabilisticClassificationModel (the abstract parent of the following classification models)

    • DecisionTreeClassificationModel (final)

    • LogisticRegressionModel

    • NaiveBayesModel

    • RandomForestClassificationModel (final)

RegressionModel

RegressionModel is a PredictionModel that transforms a DataFrame with mandatory label, features, and prediction columns.

It comes with no methods or values of its own and so is more of a marker abstract class (to combine different regression models under one type).

LinearRegressionModel

LinearRegressionModel represents a model produced by a LinearRegression estimator. It transforms the required features column of type org.apache.spark.mllib.linalg.Vector.

Note
It is a private[ml] class, so what you, a developer, will eventually work with is the more general RegressionModel; and since RegressionModel is just a marker abstract class with no methods of its own, in practice that means a PredictionModel.

As a linear regression model that extends LinearRegressionParams, it works with the following input DataFrame columns and parameters:

  • label (required)

  • features (required)

  • prediction

  • regParam

  • elasticNetParam

  • maxIter (Int)

  • tol (Double)

  • fitIntercept (Boolean)

  • standardization (Boolean)

  • weightCol (String)

  • solver (String)

(New in 1.6.0) LinearRegressionModel is also an MLWritable (so you can save it to persistent storage for later reuse).

With DEBUG logging enabled (see above) you can see debug messages in the logs when transform is called and transforms the schema.

The implementation of predict for LinearRegressionModel calculates dot(v1, v2) of two Vectors – features and coefficients – (of DenseVector or SparseVector types) of the same size and adds intercept.

Note
The coefficients Vector and intercept Double are an integral part of LinearRegressionModel, as they are the required constructor parameters.
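As a back-of-the-envelope check of that computation (illustrative numbers; Spark 2.x ml.linalg assumed):

    import org.apache.spark.ml.linalg.Vectors

    val coefficients = Vectors.dense(0.5, -0.3)
    val intercept    = 1.2
    val features     = Vectors.dense(2.0, 4.0)

    // dot(features, coefficients) + intercept
    val prediction =
      features.toArray.zip(coefficients.toArray).map { case (x, w) => x * w }.sum + intercept
    // 2.0 * 0.5 + 4.0 * (-0.3) + 1.2 = 1.0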

LinearRegressionModel Example
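A hedged end-to-end sketch (assuming a SparkSession named spark; the data is illustrative): fit a LinearRegression estimator, inspect the resulting LinearRegressionModel, and use it as a Transformer.

    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.ml.linalg.Vectors

    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(1.0)),
      (2.0, Vectors.dense(2.0)),
      (3.0, Vectors.dense(3.0)),
      (4.1, Vectors.dense(4.0))
    )).toDF("label", "features")

    val lr = new LinearRegression().setMaxIter(10).setRegParam(0.0)
    val model = lr.fit(training)  // LinearRegressionModel

    println(s"coefficients: ${model.coefficients} intercept: ${model.intercept}")

    model.transform(training).select("features", "label", "prediction").show()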

RandomForestRegressionModel

RandomForestRegressionModel is a PredictionModel with features column of type Vector.

Interestingly, DataFrame transformation (as part of the Transformer contract) uses SparkContext.broadcast to send itself to the nodes in a Spark cluster and calculates predictions (as the prediction column) on the features.
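A usage sketch (assuming a SparkSession named spark; the data is illustrative): the model comes out of the RandomForestRegressor estimator and is then used as any other PredictionModel.

    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.ml.regression.RandomForestRegressor

    val data = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(1.0, 0.0)),
      (2.0, Vectors.dense(2.0, 1.0)),
      (3.0, Vectors.dense(3.0, 0.5)),
      (4.0, Vectors.dense(4.0, 1.5))
    )).toDF("label", "features")

    val rfModel = new RandomForestRegressor().setNumTrees(5).fit(data)  // RandomForestRegressionModel
    rfModel.transform(data).select("features", "prediction").show()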

KMeansModel

KMeansModel is a Model of KMeans algorithm.

It belongs to the org.apache.spark.ml.clustering package.
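A short sketch (assuming a SparkSession named spark; the data is illustrative): fitting KMeans produces a KMeansModel whose transform adds a prediction column holding the cluster id.

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors

    val points = spark.createDataFrame(Seq(
      Tuple1(Vectors.dense(0.0, 0.0)),
      Tuple1(Vectors.dense(0.1, 0.1)),
      Tuple1(Vectors.dense(9.0, 9.0)),
      Tuple1(Vectors.dense(9.1, 9.1))
    )).toDF("features")

    val kmeansModel = new KMeans().setK(2).setSeed(1L).fit(points)  // KMeansModel

    kmeansModel.transform(points).select("features", "prediction").show()
    kmeansModel.clusterCenters.foreach(println)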
