Example — Linear Regression

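The DataFrame used for Linear Regression must have a `features` column of type `org.apache.spark.mllib.linalg.VectorUDT`, that is, a column that holds feature vectors. (In Spark 2.0 and later the corresponding type is `org.apache.spark.ml.linalg.VectorUDT`.)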

Note: You can change the name of the column using the `featuresCol` parameter.
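As a minimal sketch of the note above, the column name can be changed with the `setFeaturesCol` setter. The column name `my_features` is made up for illustration, and the snippet assumes it is pasted into spark-shell.

```scala
import org.apache.spark.ml.regression.LinearRegression

// "my_features" is a hypothetical column name used only for this sketch.
val lrCustom = new LinearRegression().setFeaturesCol("my_features")

// The estimator now looks for the feature vectors in "my_features".
println(lrCustom.getFeaturesCol)
```

The DataFrame passed to `fit` would then need a vector column named `my_features` instead of `features`.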

Use `explainParams` to list the parameters of LinearRegression (here `lr` is a `LinearRegression` instance):



scala> println(lr.explainParams)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0)
solver: the solver algorithm for optimization. If this is not set or empty, default value is 'auto' (default: auto)
standardization: whether to standardize the training features before fitting the model (default: true)
tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0 (default: )
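To make the interplay of `regParam` and `elasticNetParam` more concrete, here is a plain Scala sketch of the elastic net penalty as described in the Spark MLlib documentation: regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2), where alpha is `elasticNetParam`. The helper `elasticNetPenalty` and the weight vector are hypothetical and exist only for illustration; this is not how Spark computes the penalty internally.

```scala
// Hypothetical helper, not part of Spark: evaluates
//   regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)
def elasticNetPenalty(w: Seq[Double], regParam: Double, alpha: Double): Double = {
  require(regParam >= 0.0, "regParam must be >= 0")
  require(alpha >= 0.0 && alpha <= 1.0, "alpha must be in [0, 1]")
  val l1 = w.map(x => math.abs(x)).sum   // L1 norm of the weights
  val l2Squared = w.map(x => x * x).sum  // squared L2 norm of the weights
  regParam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2Squared)
}

// Made-up weights for illustration.
val w = Seq(3.0, -4.0)
val l2Only = elasticNetPenalty(w, regParam = 0.1, alpha = 0.0) // pure L2 penalty
val l1Only = elasticNetPenalty(w, regParam = 0.1, alpha = 1.0) // pure L1 penalty
```

On the estimator itself the two parameters are set with `setRegParam` and `setElasticNetParam`, for example `new LinearRegression().setRegParam(0.1).setElasticNetParam(0.5)`.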


Caution: FIXME The following example is a work in progress.



import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline("my_pipeline")

import org.apache.spark.ml.regression._
val lr = new LinearRegression

val df = sc.parallelize(0 to 9).toDF("num")
val stages = Array(lr)
val model = pipeline.setStages(stages).fit(df)

// fit fails because df has no vector-typed features column (only an integer num column):
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually IntegerType.
  at scala.Predef$.require(Predef.scala:219)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
  at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
  at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:72)
  at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:117)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:182)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:182)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
  at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:182)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:66)
  at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:133)
  ... 51 elided
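The exception above is raised because `df` has only an integer `num` column, while `LinearRegression` expects a vector-typed `features` column and a `label` column. One possible way to get a pipeline that fits is to put a `VectorAssembler` stage in front of the regression, as sketched below. The sketch assumes a spark-shell session (so `sc` and the `toDF` implicits are in scope), and the label values (2 * num + 1) are made up for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Made-up training data: label = 2 * num + 1, both columns as Double.
val df = sc.parallelize(0 to 9)
  .map(n => (n.toDouble, 2.0 * n + 1.0))
  .toDF("num", "label")

// Packs the numeric "num" column into a vector column named "features".
val assembler = new VectorAssembler()
  .setInputCols(Array("num"))
  .setOutputCol("features")

val lr = new LinearRegression()

// The assembler runs first, so lr sees a vector-typed features column.
val pipeline = new Pipeline().setStages(Array(assembler, lr))
val model = pipeline.fit(df)

model.transform(df).select("num", "label", "prediction").show()
```

Placing the assembler inside the pipeline, rather than transforming `df` by hand, means the same feature preparation is applied again when the fitted model transforms new data.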

