Tokenizer
Tokenizer is a unary transformer that converts a column of String values to lowercase and then splits each value on white space.
import org.apache.spark.ml.feature.Tokenizer
val tok = new Tokenizer()

// dataset to transform
val df = Seq(
  (1, "Hello world!"),
  (2, "Here is yet another sentence.")).toDF("id", "sentence")

val tokenized = tok.setInputCol("sentence").setOutputCol("tokens").transform(df)

scala> tokenized.show(truncate = false)
+---+-----------------------------+-----------------------------------+
|id |sentence                     |tokens                             |
+---+-----------------------------+-----------------------------------+
|1  |Hello world!                 |[hello, world!]                    |
|2  |Here is yet another sentence.|[here, is, yet, another, sentence.]|
+---+-----------------------------+-----------------------------------+
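Note that splitting on white space alone leaves punctuation attached to tokens (e.g. world! and sentence. above). A minimal sketch of one way around this, using the related RegexTokenizer transformer with an assumed delimiter pattern of one or more non-word characters (reusing the df dataset from above):

import org.apache.spark.ml.feature.RegexTokenizer

// RegexTokenizer splits on a configurable regular expression
// (the default pattern is "\\s+", i.e. white space) and also
// lowercases the input by default.
val regexTok = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("tokens")
  .setPattern("\\W+") // assumption: treat runs of non-word chars as delimiters

scala> regexTok.transform(df).show(truncate = false)
+---+-----------------------------+----------------------------------+
|id |sentence                     |tokens                            |
+---+-----------------------------+----------------------------------+
|1  |Hello world!                 |[hello, world]                    |
|2  |Here is yet another sentence.|[here, is, yet, another, sentence]|
+---+-----------------------------+----------------------------------+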