关注 spark技术分享,
撸spark源码 玩spark最佳实践

Example — Text Classification

Example — Text Classification

Problem: Given a text document, classify it as a scientific or non-scientific one.

Note
The example uses a case class LabeledText to have the schema described nicely.

It is then tokenized and transformed into another DataFrame with an additional column called features that is a Vector of numerical values.

Note
Paste the code below into Spark Shell using :paste mode.

Now, the tokenization part comes that maps the input text of each text document into tokens (a Seq[String]) and then into a Vector of numerical values that can only then be understood by a machine learning algorithm (that operates on Vector instances).

The train a model phase uses the logistic regression machine learning algorithm to build a model and predict label for future input text documents (and hence classify them as scientific or non-scientific).

It uses two columns, namely label and features vector to build a logistic regression model to make predictions.

Let’s tune the model’s hyperparameters (using “tools” from org.apache.spark.ml.tuning package).

Caution
FIXME Review the available classes in the org.apache.spark.ml.tuning package.

Let’s use the cross-validated model to calculate predictions and evaluate their precision.

You can eventually save the model for later use.

Congratulations! You’re done.

赞(0) 打赏
未经允许不得转载:spark技术分享 » Example — Text Classification
分享到: 更多 (0)

关注公众号:spark技术分享

联系我们联系我们

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏