Transformers

A transformer is an ML Pipeline component that transforms a DataFrame into another DataFrame (both are also called datasets).

Transformers prepare a dataset for a machine learning algorithm to work with. They are also very helpful for transforming DataFrames in general (even outside the machine learning space).

Transformers are instances of the org.apache.spark.ml.Transformer abstract class, which offers the transform family of methods.
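
In Spark 2.x the family looks like this (per the scaladoc; the paramMap variants apply extra parameters for that single call):

    def transform(dataset: Dataset[_]): DataFrame
    def transform(dataset: Dataset[_], paramMap: ParamMap): DataFrame
    def transform(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DataFrame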

A Transformer is a PipelineStage and thus can be a part of a Pipeline.

A few available implementations of Transformer are described in the sections that follow (StopWordsRemover, Binarizer, SQLTransformer, VectorAssembler and others).

See Custom UnaryTransformer section for a custom Transformer implementation.

StopWordsRemover

StopWordsRemover is a machine learning feature transformer that takes a string array column and outputs a string array column with all defined stop words removed. The transformer comes with a standard set of English stop words by default (the same ones scikit-learn uses, i.e. from the Glasgow Information Retrieval Group).

The StopWordsRemover class belongs to the org.apache.spark.ml.feature package.

It accepts the following parameters:

  • stopWords: the words to filter out (defaults to the English stop word list)

  • caseSensitive: whether to do a case-sensitive comparison over the stop words (default: false)

Note
null values from the input array are preserved unless you explicitly add null to stopWords.
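
A minimal usage sketch (runnable in spark-shell, where toDF is in scope; column names are illustrative):

    import org.apache.spark.ml.feature.StopWordsRemover

    val remover = new StopWordsRemover()
      .setInputCol("raw")
      .setOutputCol("filtered")

    val df = Seq(
      (0, Seq("I", "saw", "the", "red", "balloon")),
      (1, Seq("Mary", "had", "a", "little", "lamb"))
    ).toDF("id", "raw")

    // drops "I", "the", "had" and "a" as default English stop words
    remover.transform(df).show(false)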

Binarizer

Binarizer is a Transformer that splits the values in the input column into two groups – “ones” for values larger than the threshold and “zeros” for the others.

It works with DataFrames with the input column of DoubleType or VectorUDT. The type of the result output column matches the type of the input column, i.e. DoubleType or VectorUDT.
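A sketch with a DoubleType input column (the threshold and column names are illustrative):

    import org.apache.spark.ml.feature.Binarizer

    val binarizer = new Binarizer()
      .setInputCol("feature")
      .setOutputCol("binarized")
      .setThreshold(0.5)

    val df = Seq((0, 0.1), (1, 0.8), (2, 0.5)).toDF("id", "feature")

    // only values strictly greater than the threshold become ones:
    // 0.8 -> 1.0, while 0.1 and 0.5 -> 0.0
    binarizer.transform(df).show()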

SQLTransformer

SQLTransformer is a Transformer that does transformations by executing SELECT …​ FROM __THIS__ with __THIS__ being the underlying temporary table registered for the input dataset.

Note
It has been available since Spark 1.6.0.

It requires that the SELECT query uses __THIS__ (which corresponds to a temporary table with the input dataset) and simply executes the statement using the sql method.

You have to specify the mandatory statement parameter using the setStatement method.
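
A sketch (the statement is illustrative; note the literal __THIS__ placeholder):

    import org.apache.spark.ml.feature.SQLTransformer

    val df = Seq((0, 1.0, 3.0), (2, 2.0, 5.0)).toDF("id", "v1", "v2")

    val sqlTrans = new SQLTransformer()
      .setStatement("SELECT *, (v1 + v2) AS v3 FROM __THIS__")

    // __THIS__ is replaced by the temporary view registered for the input dataset
    sqlTrans.transform(df).show()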

VectorAssembler

VectorAssembler is a feature transformer that assembles (merges) multiple columns into a (feature) vector column.

It supports columns of the types NumericType, BooleanType, and VectorUDT. Doubles are passed on untouched; other numeric types and booleans are cast to doubles.
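
A sketch that merges a double, a boolean and a vector column (column names are illustrative):

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.linalg.Vectors

    val df = Seq(
      (0, 18.0, true, Vectors.dense(0.0, 10.0))
    ).toDF("id", "hour", "clicked", "userFeatures")

    val assembler = new VectorAssembler()
      .setInputCols(Array("hour", "clicked", "userFeatures"))
      .setOutputCol("features")

    // -> [18.0, 1.0, 0.0, 10.0]: the boolean is cast to a double, the vector is flattened in
    assembler.transform(df).select("features").show(false)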

UnaryTransformers

The UnaryTransformer abstract class is a specialized Transformer that applies a transformation to one input column and writes the result to another (by appending a new column).

Each UnaryTransformer defines the input and output columns using the following “chain” methods (they return the transformer on which they were executed and so are chainable):

  • setInputCol(value: String)

  • setOutputCol(value: String)

Each UnaryTransformer calls validateInputType while executing transformSchema(schema: StructType) (that is part of PipelineStage contract).

Note
A UnaryTransformer is a PipelineStage.

When transform is called, it first calls transformSchema (with DEBUG logging enabled) and then adds the column as a result of calling a protected abstract createTransformFunc.

Note
createTransformFunc function is abstract and defined by concrete UnaryTransformer objects.

Internally, transform method uses Spark SQL’s udf to define a function (based on createTransformFunc function described above) that will create the new output column (with appropriate outputDataType). The UDF is later applied to the input column of the input DataFrame and the result becomes the output column (using DataFrame.withColumn method).

Note
Using udf and withColumn methods from Spark SQL demonstrates an excellent integration between the Spark modules: MLlib and SQL.
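
That pattern is easy to reproduce standalone, outside any Transformer (a sketch, runnable in spark-shell):

    import org.apache.spark.sql.functions.udf

    // wrap a plain Scala function in a UDF and append its result as a new column;
    // essentially what UnaryTransformer.transform does under the covers
    val toUpper = udf((s: String) => s.toUpperCase)

    val df = Seq((0, "spark")).toDF("id", "word")
    df.withColumn("upperWord", toUpper($"word")).show()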

The following sections describe UnaryTransformer implementations in spark.ml, such as RegexTokenizer and NGram.

RegexTokenizer

RegexTokenizer is a UnaryTransformer that tokenizes a String into a collection of Strings.

Note
Read the official scaladoc for org.apache.spark.ml.feature.RegexTokenizer.

It supports the minTokenLength parameter, i.e. the minimum token length, that you can change using the setMinTokenLength method. Tokens shorter than that are simply filtered out. It defaults to 1.

It has a gaps parameter that indicates whether the regex splits on gaps (true) or matches tokens (false). You can set it using setGaps. It defaults to true.

When set to true (i.e. splitting on gaps) it uses Regex.split, while for false it uses Regex.findAllIn.

It has a pattern parameter that is the regex for tokenizing. It uses Scala’s .r method to convert the string to a regex. Use setPattern to set it. It defaults to \\s+.

It has a toLowercase parameter that indicates whether to convert all characters to lowercase before tokenizing. Use setToLowercase to change it. It defaults to true.
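
A sketch exercising the parameters above (the pattern and minimum length are illustrative):

    import org.apache.spark.ml.feature.RegexTokenizer

    val tokenizer = new RegexTokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .setPattern("\\W+")     // split on runs of non-word characters instead of the \s+ default
      .setMinTokenLength(2)   // filter out tokens shorter than 2 characters

    val df = Seq((0, "Hi, I heard about Spark")).toDF("id", "sentence")

    // -> [hi, heard, about, spark]; lowercased by default, and "I" is dropped by minTokenLength
    tokenizer.transform(df).show(false)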

NGram

org.apache.spark.ml.feature.NGram converts the input collection of strings into a collection of n-grams (of n words), as in the following sketch.
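
A sketch producing bigrams (n = 2):

    import org.apache.spark.ml.feature.NGram

    val ngram = new NGram()
      .setN(2)
      .setInputCol("words")
      .setOutputCol("ngrams")

    val df = Seq((0, Seq("hi", "i", "heard", "about", "spark"))).toDF("id", "words")

    // -> [hi i, i heard, heard about, about spark]
    ngram.transform(df).show(false)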

HashingTF

Another example of a transformer is org.apache.spark.ml.feature.HashingTF that works on a Column of ArrayType.

It transforms the rows for the input column into a sparse term frequency vector.

The name of the output column is optional, and if not specified, it becomes the identifier of a HashingTF object with the __output suffix.
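
A sketch (the number of features is illustrative; it defaults to 2^18):

    import org.apache.spark.ml.feature.HashingTF

    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(1024)

    val df = Seq((0, Seq("spark", "ml", "pipeline", "spark"))).toDF("id", "words")

    // each term is hashed to an index in the 1024-dimensional space; "spark" gets a count of 2.0
    hashingTF.transform(df).show(false)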

OneHotEncoder

OneHotEncoder is a Transformer that maps a numeric input column of label indices onto a column of binary vectors.
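
A sketch against the Spark 2.x API, where OneHotEncoder is a plain Transformer (in Spark 3.x it became an Estimator that must be fit first):

    import org.apache.spark.ml.feature.OneHotEncoder

    val df = Seq((0, 0.0), (1, 1.0), (2, 2.0)).toDF("id", "categoryIndex")

    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")

    // each index becomes a sparse binary vector; with the default dropLast = true
    // the last category (2.0) maps to the all-zeros vector
    encoder.transform(df).show(false)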

Custom UnaryTransformer

The following class is a custom UnaryTransformer that transforms words to upper case.
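
The original listing is not preserved here, so what follows is a minimal reconstruction (the class name is illustrative):

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{DataType, StringType}

    class UpperTransformer(override val uid: String)
        extends UnaryTransformer[String, String, UpperTransformer] {

      def this() = this(Identifiable.randomUID("upper"))

      // reject non-String input columns early, when transformSchema runs
      override protected def validateInputType(inputType: DataType): Unit =
        require(inputType == StringType, s"Input type must be StringType but got $inputType")

      // the function that transform wraps in a UDF and applies to every input value
      protected def createTransformFunc: String => String = _.toUpperCase

      protected def outputDataType: DataType = StringType
    }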

Given a DataFrame you could use it as follows:
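
A sketch (in spark-shell, with the class above on the classpath):

    val df = Seq((0, "hello"), (1, "world")).toDF("id", "word")

    val upper = new UpperTransformer()
      .setInputCol("word")
      .setOutputCol("upperWord")

    // appends the upperWord column: HELLO, WORLD
    upper.transform(df).show()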
