Tokenizer
Tokenizer is a unary transformer that converts a column of String values to lowercase and then splits each value on white space.
import org.apache.spark.ml.feature.Tokenizer
val tok = new Tokenizer()

// dataset to transform
val df = Seq(
  (1, "Hello world!"),
  (2, "Here is yet another sentence.")).toDF("id", "sentence")

val tokenized = tok.setInputCol("sentence").setOutputCol("tokens").transform(df)

scala> tokenized.show(truncate = false)
+---+-----------------------------+-----------------------------------+
|id |sentence                     |tokens                             |
+---+-----------------------------+-----------------------------------+
|1  |Hello world!                 |[hello, world!]                    |
|2  |Here is yet another sentence.|[here, is, yet, another, sentence.]|
+---+-----------------------------+-----------------------------------+
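Note that splitting on white space alone leaves punctuation attached to tokens (e.g. world! and sentence. above). A minimal sketch of one way around this, using the related RegexTokenizer transformer with an assumed delimiter pattern of one or more non-word characters (reusing the df dataset from above):

import org.apache.spark.ml.feature.RegexTokenizer

// RegexTokenizer splits on a configurable regular expression
// (the default pattern is "\\s+", i.e. white space) and also
// lowercases the input by default.
val regexTok = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("tokens")
  .setPattern("\\W+") // assumption: treat runs of non-word chars as delimiters

scala> regexTok.transform(df).show(truncate = false)
+---+-----------------------------+----------------------------------+
|id |sentence                     |tokens                            |
+---+-----------------------------+----------------------------------+
|1  |Hello world!                 |[hello, world]                    |
|2  |Here is yet another sentence.|[here, is, yet, another, sentence]|
+---+-----------------------------+----------------------------------+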