DataFrameStatFunctions — Working With Statistic Functions

DataFrameStatFunctions is used to work with statistic functions in a structured query (a DataFrame).

Table 1. DataFrameStatFunctions API
Method Description

approxQuantile

bloomFilter

corr

countMinSketch

cov

crosstab

freqItems

sampleBy

DataFrameStatFunctions is available using stat untyped transformation.
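
For illustration, a minimal sketch of the stat API, assuming a spark-shell session (the DataFrame and column names are made up):

```scala
import org.apache.spark.sql.functions.col

// Hypothetical DataFrame with two numeric columns
val nums = spark.range(0, 100).toDF("id").withColumn("v", col("id") % 10)

// Approximate 25th/50th/75th percentiles of "v" with a 10% relative error
val quantiles = nums.stat.approxQuantile("v", Array(0.25, 0.5, 0.75), 0.1)

// Pearson correlation and sample covariance between two columns
val correlation = nums.stat.corr("id", "v")
val covariance  = nums.stat.cov("id", "v")
```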

approxQuantile Method

approxQuantile…​FIXME

bloomFilter Method

bloomFilter…​FIXME

buildBloomFilter Internal Method

buildBloomFilter…​FIXME

Note
buildBloomFilter is used when…​FIXME

corr Method

corr…​FIXME

countMinSketch Method

countMinSketch…​FIXME

cov Method

cov…​FIXME

crosstab Method

crosstab…​FIXME

freqItems Method

freqItems…​FIXME

sampleBy Method

sampleBy…​FIXME

DataFrameNaFunctions — Working With Missing Data

DataFrameNaFunctions is used to work with missing data in a structured query (a DataFrame).

Table 1. DataFrameNaFunctions API
Method Description

drop

fill

replace

DataFrameNaFunctions is available using na untyped transformation.
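
For illustration, a minimal sketch of the na API, assuming a spark-shell session (the DataFrame and values are made up):

```scala
import spark.implicits._

// Hypothetical DataFrame with a missing value
val people = Seq(("alice", Some(30)), ("bob", None)).toDF("name", "age")

people.na.drop()                                   // drop rows containing any null values
people.na.fill(Map("age" -> 0))                    // fill nulls in "age" with 0
people.na.replace("name", Map("alice" -> "Alice")) // value-for-value replacement
```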

convertToDouble Internal Method

convertToDouble…​FIXME

Note
convertToDouble is used when…​FIXME

drop Method

drop…​FIXME

fill Method

fill…​FIXME

fillCol Internal Method

fillCol…​FIXME

Note
fillCol is used when…​FIXME

fillMap Internal Method

fillMap…​FIXME

Note
fillMap is used when…​FIXME

fillValue Internal Method

fillValue…​FIXME

Note
fillValue is used when…​FIXME

replace0 Internal Method

replace0…​FIXME

Note
replace0 is used when…​FIXME

replace Method

replace…​FIXME

replaceCol Internal Method

replaceCol…​FIXME

Note
replaceCol is used when…​FIXME

Dataset API — Actions

Actions are part of the Dataset API for…​FIXME

Note
Actions are the methods in the Dataset Scala class that are grouped in action group name, i.e. @group action.
Table 1. Dataset API’s Actions
Action Description

collect

count

describe

first

foreach

foreachPartition

head

reduce

show

summary

Computes specified statistics for numeric and string columns. The default statistics are: count, mean, stddev, min, max and 25%, 50%, 75% percentiles.

Note
summary is an extended version of the describe action that simply calculates count, mean, stddev, min and max statistics.

take

toLocalIterator

collect Action

collect…​FIXME

count Action

count…​FIXME

Calculating Basic Statistics — describe Action

describe…​FIXME

first Action

first…​FIXME

foreach Action

foreach…​FIXME

foreachPartition Action

foreachPartition…​FIXME

head Action

  1. Calls the other head with n as 1 and takes the first element

head…​FIXME

reduce Action

reduce…​FIXME

show Action

show…​FIXME

Calculating Statistics — summary Action

summary calculates specified statistics for numeric and string columns.

The default statistics are: count, mean, stddev, min, max and 25%, 50%, 75% percentiles.

Note
summary accepts arbitrary approximate percentiles specified as a percentage (e.g. 10%).

Internally, summary uses the StatFunctions to calculate the requested summaries for the Dataset.
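
A minimal sketch of the default and custom statistics, assuming a spark-shell session:

```scala
val ds = spark.range(100)
ds.summary().show()                              // count, mean, stddev, min, 25%, 50%, 75%, max
ds.summary("count", "min", "10%", "90%").show()  // arbitrary percentiles given as percentages
```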

Taking First Records — take Action

take is an action on a Dataset that returns a collection of n records.

Warning
take loads all the data into the memory of the Spark application’s driver process and for a large n could result in OutOfMemoryError.

Internally, take creates a new Dataset with a Limit logical operator (over a Literal expression for n and the current LogicalPlan). It then runs the SparkPlan that produces an Array[InternalRow], which is in turn decoded to Array[T] using a bound encoder.
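
A minimal sketch, assuming a spark-shell session:

```scala
// take collects at most n records to the driver (keep n small for large Datasets)
val ds = spark.range(1000000)
val first3 = ds.take(3)   // Array with the first 3 records
```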

toLocalIterator Action

toLocalIterator…​FIXME

Dataset API — Basic Actions

Basic actions are part of the Dataset API for transforming a Dataset into a session-scoped or global temporary view and other basic actions (FIXME).

Note
Basic actions are the methods in the Dataset Scala class that are grouped in basic group name, i.e. @group basic.
Table 1. Dataset API’s Basic Actions
Action Description

cache

Caches the Dataset

checkpoint

Checkpoints the Dataset in a reliable way (using a reliable HDFS-compliant file system, e.g. Hadoop HDFS or Amazon S3)

columns

createGlobalTempView

createOrReplaceGlobalTempView

createOrReplaceTempView

createTempView

dtypes

explain

Displays the logical and physical plans of the Dataset (with optional cost and codegen summaries) to the standard output

hint

inputFiles

isEmpty

(New in 2.4.0)

isLocal

localCheckpoint

Checkpoints the Dataset locally on executors (and therefore unreliably)

persist

Persists the Dataset

printSchema

rdd

schema

storageLevel

toDF

unpersist

Unpersists the Dataset

write

Returns a DataFrameWriter for saving the content of the (non-streaming) Dataset out to an external storage

Caching Dataset — cache Basic Action

cache merely executes the no-argument persist basic action.

Reliably Checkpointing Dataset — checkpoint Basic Action

  1. eager and reliableCheckpoint flags enabled

  2. reliableCheckpoint flag enabled

Note
checkpoint is an experimental operator and the API is evolving towards becoming stable.

checkpoint simply requests the Dataset to checkpoint with the given eager flag and the reliableCheckpoint flag enabled.
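
A minimal sketch, assuming a spark-shell session (the checkpoint directory path is made up):

```scala
// A checkpoint directory on an HDFS-compliant file system is required
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

val ds = spark.range(100)
val eagerly = ds.checkpoint()              // eager = true by default
val lazily  = ds.checkpoint(eager = false) // checkpointed on the first action
```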

createTempView Basic Action

createTempView…​FIXME

Note
createTempView is used when…​FIXME

createOrReplaceTempView Basic Action

createOrReplaceTempView…​FIXME

Note
createOrReplaceTempView is used when…​FIXME

createGlobalTempView Basic Action

createGlobalTempView…​FIXME

Note
createGlobalTempView is used when…​FIXME

createOrReplaceGlobalTempView Basic Action

createOrReplaceGlobalTempView…​FIXME

Note
createOrReplaceGlobalTempView is used when…​FIXME

createTempViewCommand Internal Method

createTempViewCommand…​FIXME

Note
createTempViewCommand is used when the following Dataset operators are used: Dataset.createTempView, Dataset.createOrReplaceTempView, Dataset.createGlobalTempView and Dataset.createOrReplaceGlobalTempView.

Displaying Logical and Physical Plans, Their Cost and Codegen — explain Basic Action

  1. Turns the extended flag on

explain prints the logical and (with extended flag enabled) physical plans, their cost and codegen to the console.

Tip
Use explain to review the structured queries and optimizations applied.

Internally, explain creates an ExplainCommand logical command and requests SessionState to execute it (to get a QueryExecution back).

Note
explain uses ExplainCommand logical command that, when executed, gives different text representations of QueryExecution (for the Dataset’s LogicalPlan) depending on the flags (e.g. extended, codegen, and cost which are disabled by default).

explain then requests QueryExecution for the optimized physical query plan and collects the records (as InternalRow objects).

Note

explain uses Dataset’s SparkSession to access the current SessionState.

In the end, explain goes over the InternalRow records and converts them to lines to display to console.

Note
explain “converts” an InternalRow record to a line using getString at position 0.
Tip
If you are serious about query debugging you could also use the Debugging Query Execution facility.
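
A minimal sketch, assuming a spark-shell session:

```scala
val q = spark.range(10).where("id > 5")
q.explain()                 // physical plan only
q.explain(extended = true)  // parsed, analyzed and optimized logical plans plus the physical plan
```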

Specifying Hint — hint Basic Action

hint operator is part of Hint Framework to specify a hint (by name and parameters) for a Dataset.

Internally, hint simply attaches UnresolvedHint unary logical operator to an “analyzed” Dataset (i.e. the analyzed logical plan of a Dataset).

Note
hint adds an UnresolvedHint unary logical operator to an analyzed logical plan, which indirectly triggers the analysis phase that executes logical commands and their unions as well as resolves all hints that have already been added to the logical plan.
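
A minimal sketch of a hint by name, assuming a spark-shell session (the table sizes are made up):

```scala
val large = spark.range(1000000).toDF("id")
val small = spark.range(100).toDF("id")
val joined = large.join(small.hint("broadcast"), "id")
joined.explain()   // the physical plan should pick a broadcast join strategy
```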

Locally Checkpointing Dataset — localCheckpoint Basic Action

  1. eager flag enabled

localCheckpoint simply uses Dataset.checkpoint operator with the input eager flag and reliableCheckpoint flag disabled (false).

checkpoint Internal Method

checkpoint requests QueryExecution (of the Dataset) to generate an RDD of internal binary rows (aka internalRdd) and then requests the RDD to make a copy of all the rows (by adding a MapPartitionsRDD).

Depending on reliableCheckpoint flag, checkpoint marks the RDD for (reliable) checkpointing (true) or local checkpointing (false).

With eager flag on, checkpoint counts the number of records in the RDD (by executing RDD.count) that gives the effect of immediate eager checkpointing.

checkpoint requests QueryExecution (of the Dataset) for optimized physical query plan (the plan is used to get the outputPartitioning and outputOrdering for the result Dataset).

Note
checkpoint is used in the Dataset untyped transformations, i.e. checkpoint and localCheckpoint.

Persisting Dataset — persist Basic Action

persist caches the Dataset using the default storage level MEMORY_AND_DISK or newLevel and returns it.

Internally, persist requests CacheManager to cache the structured query (that is accessible through SharedState of the current SparkSession).

Caution
FIXME

Generating RDD of Internal Binary Rows — rdd Basic Action

Whenever you need to convert a Dataset into an RDD, executing the rdd method gives you an RDD of the proper input object type (not Row as in DataFrames) that sits behind the Dataset.

Internally, it looks up the ExpressionEncoder (for the Dataset) and accesses the deserializer expression. That gives the DataType of the result of evaluating the expression.

Note
A deserializer expression is used to decode an InternalRow to an object of type T. See ExpressionEncoder.

It then executes a DeserializeToObject logical operator that will produce an RDD[InternalRow] that is converted into the proper RDD[T] using the DataType and T.

Note
It is a lazy operation that “produces” an RDD[T].
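
A minimal sketch, assuming a spark-shell session (the domain class is made up):

```scala
import spark.implicits._
import org.apache.spark.rdd.RDD

case class Person(name: String, age: Int)
val people = Seq(Person("alice", 30), Person("bob", 25)).toDS
val rdd: RDD[Person] = people.rdd   // lazy; elements are Person, not Row
```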

Accessing Schema — schema Basic Action

A Dataset has a schema.

Tip

You may also use the following methods to learn about the schema:

Converting Typed Dataset to Untyped DataFrame — toDF Basic Action

toDF converts a Dataset into a DataFrame.

Internally, the empty-argument toDF creates a Dataset[Row] using the Dataset‘s SparkSession and QueryExecution with the encoder being RowEncoder.

Caution
FIXME Describe toDF(colNames: String*)

Unpersisting Cached Dataset — unpersist Basic Action

unpersist uncaches the Dataset, possibly blocking the call (per the blocking flag).

Internally, unpersist requests CacheManager to uncache the query.

Caution
FIXME

Accessing DataFrameWriter (to Describe Writing Dataset) — write Basic Action

write gives DataFrameWriter for records of type T.

isEmpty Basic Action

isEmpty…​FIXME

isLocal Basic Action

isLocal…​FIXME

Dataset API — Untyped Transformations

Untyped transformations are part of the Dataset API for transforming a Dataset to a DataFrame, a Column, a RelationalGroupedDataset, a DataFrameNaFunctions or a DataFrameStatFunctions (and hence untyped).

Note
Untyped transformations are the methods in the Dataset Scala class that are grouped in untypedrel group name, i.e. @group untypedrel.
Table 1. Dataset API’s Untyped Transformations
Transformation Description

agg

apply

Selects a column based on the column name (i.e. maps a Dataset onto a Column)

col

Selects a column based on the column name (i.e. maps a Dataset onto a Column)

colRegex

Selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column)

crossJoin

cube

drop

groupBy

join

na

rollup

select

selectExpr

stat

withColumn

withColumnRenamed

agg Untyped Transformation

agg…​FIXME

apply Untyped Transformation

apply selects a column based on the column name (i.e. maps a Dataset onto a Column).

col Untyped Transformation

col selects a column based on the column name (i.e. maps a Dataset onto a Column).

Internally, col branches off per the input column name.

If the column name is * (a star), col simply creates a Column with ResolvedStar expression (with the schema output attributes of the analyzed logical plan of the QueryExecution).

Otherwise, col uses colRegex untyped transformation when spark.sql.parser.quotedRegexColumnNames configuration property is enabled.

In the case when the column name is not * and spark.sql.parser.quotedRegexColumnNames configuration property is disabled, col creates a Column with the column name resolved (as a NamedExpression).
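
A minimal sketch, assuming a spark-shell session (the DataFrame is made up):

```scala
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
val idCol   = df.col("id")   // Column resolved by name
val sameCol = df("id")       // apply is equivalent
val allCols = df.col("*")    // star: all output attributes of the analyzed plan
df.select(allCols).show()
```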

colRegex Untyped Transformation

colRegex selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column).

Note
colRegex is used in col when spark.sql.parser.quotedRegexColumnNames configuration property is enabled (and the column name is not *).

Internally, colRegex matches the input column name to different regular expressions (in the order):

  1. For column names with quotes without a qualifier, colRegex simply creates a Column with an UnresolvedRegex (with no table)

  2. For column names with quotes with a qualifier, colRegex simply creates a Column with an UnresolvedRegex (with a table specified)

  3. For other column names, colRegex (behaves like col and) creates a Column with the column name resolved (as a NamedExpression)

crossJoin Untyped Transformation

crossJoin…​FIXME

cube Untyped Transformation

cube…​FIXME

Dropping One or More Columns — drop Untyped Transformation

drop…​FIXME

groupBy Untyped Transformation

groupBy…​FIXME

join Untyped Transformation

join…​FIXME

na Untyped Transformation

na simply creates a DataFrameNaFunctions to work with missing data.

rollup Untyped Transformation

rollup…​FIXME

select Untyped Transformation

select…​FIXME

Projecting Columns using SQL Statements — selectExpr Untyped Transformation

selectExpr is like select, but accepts SQL expressions.

Internally, it executes select with every expression in exprs mapped to Column (using SparkSqlParser.parseExpression).
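
A minimal sketch, assuming a spark-shell session (the DataFrame is made up):

```scala
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.selectExpr("id * 2 AS doubled", "upper(name) AS name").show()
```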

stat Untyped Transformation

stat simply creates a DataFrameStatFunctions to work with statistic functions.

withColumn Untyped Transformation

withColumn…​FIXME

withColumnRenamed Untyped Transformation

withColumnRenamed…​FIXME

Dataset API — Typed Transformations

Typed transformations are part of the Dataset API for transforming a Dataset with an Encoder (except the RowEncoder).

Note
Typed transformations are the methods in the Dataset Scala class that are grouped in typedrel group name, i.e. @group typedrel.
Table 1. Dataset API’s Typed Transformations
Transformation Description

alias

as

as

coalesce

Repartitions a Dataset

distinct

dropDuplicates

except

filter

flatMap

groupByKey

intersect

joinWith

limit

map

mapPartitions

orderBy

randomSplit

repartition

repartitionByRange

sample

select

sort

sortWithinPartitions

toJSON

transform

union

unionByName

where

as Typed Transformation

as…​FIXME

Enforcing Type — as Typed Transformation

as[T] allows for converting from a weakly-typed Dataset of Rows to Dataset[T] with T being a domain class (that can enforce a stronger schema).
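
A minimal sketch, assuming a spark-shell session (the domain class and data are made up):

```scala
import spark.implicits._

case class Person(name: String, age: Long)
val df = Seq(("alice", 30L), ("bob", 25L)).toDF("name", "age")
val people = df.as[Person]   // Dataset[Person]; columns are matched by name
```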

Repartitioning Dataset with Shuffle Disabled — coalesce Typed Transformation

coalesce operator repartitions the Dataset to exactly numPartitions partitions.

Internally, coalesce creates a Repartition logical operator with shuffle disabled (which is marked as false in the below explain‘s output).
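
A minimal sketch, assuming a spark-shell session:

```scala
val ds = spark.range(0, 100, 1, 8)   // start with 8 partitions
val fewer = ds.coalesce(2)
fewer.rdd.getNumPartitions           // 2
fewer.explain()                      // note the repartitioning step with shuffle disabled
```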

dropDuplicates Typed Transformation

dropDuplicates…​FIXME

except Typed Transformation

except…​FIXME

exceptAll Typed Transformation

exceptAll…​FIXME

filter Typed Transformation

filter…​FIXME

Creating Zero or More Records — flatMap Typed Transformation

flatMap returns a new Dataset (of type U) with all records (of type T) mapped over using the function func and then flattening the results.

Note
flatMap can create new records. It deprecated explode.

Internally, flatMap calls mapPartitions with the partitions flatMap(ped).

intersect Typed Transformation

intersect…​FIXME

intersectAll Typed Transformation

intersectAll…​FIXME

joinWith Typed Transformation

joinWith…​FIXME

limit Typed Transformation

limit…​FIXME

map Typed Transformation

map…​FIXME

mapPartitions Typed Transformation

mapPartitions…​FIXME

Randomly Split Dataset Into Two or More Datasets Per Weight — randomSplit Typed Transformation

randomSplit randomly splits the Dataset per weights.

weights doubles should sum up to 1 and will be normalized if they do not.

You can define a seed; if you don’t, a random seed will be used.

Note
randomSplit is commonly used in Spark MLlib to split an input Dataset into two datasets for training and validation.
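
A minimal sketch, assuming a spark-shell session:

```scala
// An 80/20 split with a fixed seed for reproducibility
val ds = spark.range(1000)
val Array(training, validation) = ds.randomSplit(Array(0.8, 0.2), seed = 42)
```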

Repartitioning Dataset (Shuffle Enabled) — repartition Typed Transformation

repartition operators repartition the Dataset to exactly numPartitions partitions or using partitionExprs expressions.

Internally, repartition creates a Repartition or RepartitionByExpression logical operator with shuffle enabled (which is true in the below explain‘s output beside Repartition).

Note
repartition methods correspond to SQL’s DISTRIBUTE BY or CLUSTER BY clauses.
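
A minimal sketch, assuming a spark-shell session (the DataFrame is made up):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

val df = Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "grp")
df.repartition(4)             // Repartition logical operator (number of partitions only)
df.repartition(col("grp"))    // RepartitionByExpression, like SQL's DISTRIBUTE BY
df.repartition(4, col("grp")) // both number of partitions and expressions
```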

repartitionByRange Typed Transformation

  1. Uses spark.sql.shuffle.partitions configuration property for the number of partitions to use

repartitionByRange simply creates a Dataset with a RepartitionByExpression logical operator.

repartitionByRange uses a SortOrder with the Ascending sort order, i.e. ascending nulls first, when no explicit sort order is specified.

repartitionByRange throws an IllegalArgumentException when no partitionExprs partition-by expression is specified.

sample Typed Transformation

sample…​FIXME

select Typed Transformation

select…​FIXME

sort Typed Transformation

sort…​FIXME

sortWithinPartitions Typed Transformation

sortWithinPartitions simply calls the internal sortInternal method with the global flag disabled (false).

toJSON Typed Transformation

toJSON maps the content of Dataset to a Dataset of strings in JSON format.

Internally, toJSON grabs the RDD[InternalRow] (of the QueryExecution of the Dataset) and maps the records (per RDD partition) into JSON.

Note
toJSON uses Jackson’s JSON parser — jackson-module-scala.

Transforming Datasets — transform Typed Transformation

transform applies the t function to the source Dataset[T] to produce a result Dataset[U]. It is meant for chaining custom transformations.

Internally, transform executes t function on the current Dataset[T].

union Typed Transformation

union…​FIXME

unionByName Typed Transformation

unionByName creates a new Dataset that is a union of the rows in this and the other Dataset column-wise, i.e. the order of columns in the Datasets does not matter as long as their names and number match.

Internally, unionByName creates a Union logical operator for this Dataset and Project logical operator with the other Dataset.

In the end, unionByName applies the CombineUnions logical optimization to the Union logical operator and requests the result LogicalPlan to wrap the child operators with AnalysisBarriers.

unionByName throws an AnalysisException if there are duplicate columns in either Dataset.

unionByName throws an AnalysisException if this Dataset has a column that is not available in the other Dataset.
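
A minimal sketch, assuming a spark-shell session (the DataFrames are made up):

```scala
import spark.implicits._

// Columns are matched by name, so the column order may differ
val left  = Seq((1, "a")).toDF("id", "name")
val right = Seq(("b", 2)).toDF("name", "id")
left.unionByName(right).show()
```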

where Typed Transformation

where is simply a synonym of the filter operator, i.e. passes the input parameters along to filter.

Creating Streaming Dataset with EventTimeWatermark Logical Operator — withWatermark Streaming Typed Transformation

Internally, withWatermark creates a Dataset with EventTimeWatermark logical plan for streaming Datasets.

Note
withWatermark uses EliminateEventTimeWatermark logical rule to eliminate EventTimeWatermark logical plan for non-streaming batch Datasets.

Note

delayThreshold is parsed using CalendarInterval.fromString with interval formatted as described in TimeWindow unary expression.

Note
delayThreshold must not be negative (i.e. milliseconds and months should both be equal to or greater than 0).
Note
withWatermark is used when…​FIXME

Dataset API — Dataset Operators

Dataset API is a set of operators with typed and untyped transformations, and actions to work with a structured query (as a Dataset) as a whole.

Table 1. Dataset Operators (Transformations and Actions)
Operator Description

agg

An untyped transformation

alias

A typed transformation that is a mere synonym of as.

apply

An untyped transformation to select a column based on the column name (i.e. maps a Dataset onto a Column)

as

A typed transformation

as

A typed transformation to enforce a type, i.e. marking the records in the Dataset as of a given data type (data type conversion). as simply changes the view of the data that is passed into typed operations (e.g. map) and does not eagerly project away any columns that are not present in the specified class.

cache

A basic action that is a mere synonym of persist.

checkpoint

A basic action to checkpoint the Dataset in a reliable way (using a reliable HDFS-compliant file system, e.g. Hadoop HDFS or Amazon S3)

coalesce

A typed transformation to repartition a Dataset

col

An untyped transformation to create a column (reference) based on the column name

collect

An action

colRegex

An untyped transformation to create a column (reference) based on the column name specified as a regex

columns

A basic action

count

An action to count the number of rows

createGlobalTempView

A basic action

createOrReplaceGlobalTempView

A basic action

createOrReplaceTempView

A basic action

createTempView

A basic action

crossJoin

An untyped transformation

cube

An untyped transformation

describe

An action

distinct

A typed transformation that is a mere synonym of dropDuplicates (with all the columns of the Dataset)

drop

An untyped transformation

dropDuplicates

A typed transformation

dtypes

A basic action

except

A typed transformation

exceptAll

(New in 2.4.0) A typed transformation

explain

A basic action to display the logical and physical plans of the Dataset, i.e. displays the logical and physical plans (with optional cost and codegen summaries) to the standard output

filter

A typed transformation

first

An action that is a mere synonym of head

flatMap

A typed transformation

foreach

An action

foreachPartition

An action

groupBy

An untyped transformation

groupByKey

A typed transformation

head

  1. Uses 1 for n

An action

hint

A basic action to specify a hint (and optional parameters)

inputFiles

A basic action

intersect

A typed transformation

intersectAll

(New in 2.4.0) A typed transformation

isEmpty

(New in 2.4.0) A basic action

isLocal

A basic action

isStreaming

join

An untyped transformation

joinWith

A typed transformation

limit

A typed transformation

localCheckpoint

A basic action to checkpoint the Dataset locally on executors (and therefore unreliably)

map

A typed transformation

mapPartitions

A typed transformation

na

An untyped transformation

orderBy

A typed transformation

persist

A basic action to persist the Dataset

printSchema

A basic action

randomSplit

A typed transformation to split a Dataset randomly into two Datasets

rdd

A basic action

reduce

An action to reduce the records of the Dataset using the specified binary function.

repartition

A typed transformation to repartition a Dataset

repartitionByRange

A typed transformation

rollup

An untyped transformation

sample

A typed transformation

schema

A basic action

select

An (untyped and typed) transformation

selectExpr

An untyped transformation

show

An action

sort

A typed transformation to sort elements globally (across partitions). Use sortWithinPartitions transformation for partition-local sort

sortWithinPartitions

A typed transformation to sort elements within partitions (aka local sort). Use sort transformation for global sort (across partitions)

stat

An untyped transformation

storageLevel

A basic action

summary

An action to calculate statistics (e.g. count, mean, stddev, min, max and 25%, 50%, 75% percentiles)

take

An action to take the first records of a Dataset

toDF

A basic action to convert a Dataset to a DataFrame

toJSON

A typed transformation

toLocalIterator

An action that returns an iterator with all rows in the Dataset. The iterator will consume as much memory as the largest partition in the Dataset.

transform

A typed transformation for chaining custom transformations

union

A typed transformation

unionByName

A typed transformation

unpersist

  1. Uses unpersist with blocking disabled (false)

A basic action to unpersist the Dataset

where

A typed transformation

withColumn

An untyped transformation

withColumnRenamed

An untyped transformation

write

A basic action that returns a DataFrameWriter for saving the content of the (non-streaming) Dataset out to an external storage

DataFrameWriter — Saving Data To External Data Sources

DataFrameWriter is the interface to describe how data (as the result of executing a structured query) should be saved to an external data source.

Table 1. DataFrameWriter API / Writing Operators
Method Description

bucketBy

csv

format

insertInto

Inserts (the results of) a DataFrame into a table

jdbc

json

mode

option

options

orc

parquet

partitionBy

save

saveAsTable

sortBy

text

DataFrameWriter is available using Dataset.write operator.

DataFrameWriter supports many file formats and JDBC databases. It also allows for plugging in new formats.

DataFrameWriter defaults to parquet data source format. You can change the default format using spark.sql.sources.default configuration property or format or the format-specific methods.

In the end, you trigger the actual saving of the content of a Dataset (i.e. the result of executing a structured query) using save method.
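
A minimal sketch of the write path, assuming a spark-shell session (the path and column names are made up):

```scala
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.write
  .format("parquet")           // the default unless spark.sql.sources.default says otherwise
  .mode("overwrite")
  .partitionBy("name")
  .save("/tmp/example-output") // triggers the actual write
```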

DataFrameWriter uses internal mutable attributes to build a properly-defined “write specification” for insertInto, save and saveAsTable methods.

Table 2. Internal Attributes and Corresponding Setters
Attribute Setters

source

format

mode

mode

extraOptions

option, options, save

partitioningColumns

partitionBy

bucketColumnNames

bucketBy

numBuckets

bucketBy

sortColumnNames

sortBy

Note
DataFrameWriter is a type constructor in Scala that keeps an internal reference to the source DataFrame for the whole lifecycle (starting right from the moment it was created).
Note
Spark Structured Streaming’s DataStreamWriter is responsible for writing the content of streaming Datasets in a streaming fashion.

Executing Logical Command(s) — runCommand Internal Method

runCommand uses the input SparkSession to access the SessionState that is in turn requested to execute the logical command (that simply creates a QueryExecution).

runCommand records the current time (start time) and uses the SQLExecution helper object to execute the action (under a new execution id) that simply requests the QueryExecution for the RDD[InternalRow] (and triggers execution of logical commands).

Tip
Use web UI’s SQL tab to see the execution or a SparkListener to be notified when the execution is started and finished. The SparkListener should intercept SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd events.

runCommand records the current time (end time).

In the end, runCommand uses the input SparkSession to access the ExecutionListenerManager and requests it to onSuccess (with the input name, the QueryExecution and the duration).

In case of any exceptions, runCommand requests the ExecutionListenerManager to onFailure (with the exception) and (re)throws it.

Saving Rows of Structured Query (DataFrame) to Table — saveAsTable Method

saveAsTable saves the content of a DataFrame to the tableName table.

Internally, saveAsTable requests the current ParserInterface to parse the input table name.

Note
saveAsTable uses the internal DataFrame to access the SparkSession that is used to access the SessionState and in the end the ParserInterface.

saveAsTable then requests the SessionCatalog to check whether the table exists or not.

Note
saveAsTable uses the internal DataFrame to access the SparkSession that is used to access the SessionState and in the end the SessionCatalog.

In the end, saveAsTable branches off per whether the table exists or not and the save mode.

Table 3. saveAsTable’s Behaviour per Save Mode
Does table exist? Save Mode Behaviour

yes

Ignore

Does nothing

yes

ErrorIfExists

Reports an AnalysisException with Table [tableIdent] already exists. error message

yes

Overwrite

FIXME

anything

anything

createTable
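
A minimal sketch of saveAsTable with an explicit save mode, assuming a spark-shell session (the table name is made up):

```scala
import spark.implicits._
import org.apache.spark.sql.SaveMode

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.write
  .mode(SaveMode.Overwrite)
  .saveAsTable("demo_people")
spark.table("demo_people").show()
```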

Saving Rows of Structured Query (DataFrame) to Data Source — save Method

save saves the rows of a structured query (a Dataset) to a data source.

Internally, save uses DataSource to look up the class of the requested data source (for the source option and the SQLConf).

Note

save uses SparkSession to access the SessionState that is in turn used to access the SQLConf.

If the class is a DataSourceV2…​FIXME

Otherwise, if not a DataSourceV2, save simply delegates to saveToV1Source.

save does not support saving to Hive (i.e. the source is hive) and throws an AnalysisException when requested so.

save does not support bucketing (i.e. when the numBuckets or sortColumnNames options are defined) and throws an AnalysisException when requested so.

Saving Data to Table Using JDBC Data Source — jdbc Method

jdbc method saves the content of the DataFrame to an external database table via JDBC.

You can use mode to control save mode, i.e. what happens when an external table exists when save is executed.

It is assumed that the jdbc save pipeline is neither partitioned nor bucketed.

All options are overridden by the input connectionProperties.

The required options are:

  • driver which is the class name of the JDBC driver (that is passed to Spark’s own DriverRegistry.register and later used to connect(url, properties)).

When the table exists and the Overwrite save mode is in use, DROP TABLE table is executed.

It creates the input table (using CREATE TABLE table (schema) where schema is the schema of the DataFrame).
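
A minimal sketch, assuming a spark-shell session with the JDBC driver on the classpath (the URL, table name, credentials and driver are all made up):

```scala
import spark.implicits._
import java.util.Properties

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
val props = new Properties()
props.setProperty("user", "sa")
props.setProperty("password", "secret")
props.setProperty("driver", "org.postgresql.Driver") // required: JDBC driver class name

df.write
  .mode("append")
  .jdbc("jdbc:postgresql://localhost/testdb", "people", props)
```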

bucketBy Method

bucketBy simply sets the internal numBuckets and bucketColumnNames to the input numBuckets and colName with colNames, respectively.

partitionBy Method

Caution
FIXME

Specifying Save Mode — mode Method

mode defines the behaviour of save when an external file or table (Spark writes to) already exists, i.e. SaveMode.

Table 4. Types of SaveMode
Name Description

Append

Records are appended to existing data.

ErrorIfExists

Exception is thrown.

Ignore

Does not save the records and does not change the existing data in any way.

Overwrite

Existing data is overwritten by new records.

Specifying Sorting Columns — sortBy Method

sortBy simply sets sorting columns to the input colName and colNames column names.

Note
sortBy must be used together with bucketBy or DataFrameWriter reports an IllegalArgumentException.
Note
assertNotBucketed asserts that bucketing is not used by some methods.
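
A minimal sketch of bucketBy with sortBy, assuming a spark-shell session (the table name is made up; bucketing requires saveAsTable, as save would throw an AnalysisException):

```scala
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.write
  .bucketBy(4, "id")
  .sortBy("name")       // valid only together with bucketBy
  .mode("overwrite")
  .saveAsTable("bucketed_people")
```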

Specifying Writer Configuration — option Method

option…​FIXME

Specifying Writer Configuration — options Method

options…​FIXME

Writing DataFrames to Files

Caution
FIXME

Specifying Data Source (by Alias or Fully-Qualified Class Name) — format Method

format simply sets the source internal property.

Parquet

Caution
FIXME
Note
Parquet is the default data source format.

Inserting Rows of Structured Query (DataFrame) into Table — insertInto Method

  1. Parses tableName and calls the other insertInto with a TableIdentifier

insertInto inserts the content of the DataFrame to the specified tableName table.

Note
insertInto ignores column names and just uses a position-based resolution, i.e. the order (not the names!) of the columns in (the output of) the Dataset matters.

Internally, insertInto creates an InsertIntoTable logical operator (with UnresolvedRelation operator as the only child) and executes it right away (that submits a Spark job).

Figure 1. DataFrameWriter.insertInto Executes SQL Command (as a Spark job)

insertInto reports an AnalysisException for bucketed DataFrames, i.e. when buckets or sortColumnNames are defined.

insertInto reports an AnalysisException for partitioned DataFrames, i.e. when partitioningColumns is defined.

getBucketSpec Internal Method

getBucketSpec returns a new BucketSpec if numBuckets was defined (with bucketColumnNames and sortColumnNames).

getBucketSpec throws an IllegalArgumentException when sortColumnNames are defined but numBuckets is not.

Note
getBucketSpec is used exclusively when DataFrameWriter is requested to create a table.

Creating Table — createTable Internal Method

createTable assumes CatalogTableType.EXTERNAL when location URI of CatalogStorageFormat is defined and CatalogTableType.MANAGED otherwise.

createTable creates a CatalogTable (with the bucketSpec per getBucketSpec).

In the end, createTable creates a CreateTable logical command (with the CatalogTable, mode and the logical query plan of the dataset) and runs it.

Note
createTable is used when DataFrameWriter is requested to saveAsTable.

assertNotBucketed Internal Method

assertNotBucketed simply throws an AnalysisException if either the numBuckets or sortColumnNames internal property is defined.

Note
assertNotBucketed is used when DataFrameWriter is requested to save, insertInto and jdbc.

Executing Logical Command for Writing to Data Source V1 — saveToV1Source Internal Method

saveToV1Source creates a DataSource (for the source class name, the partitioningColumns and the extraOptions) and requests it for the logical command for writing (with the mode and the analyzed logical plan of the structured query).

Note
While requesting the analyzed logical plan of the structured query, saveToV1Source triggers execution of logical commands.

In the end, saveToV1Source runs the logical command for writing.

Note
saveToV1Source is used exclusively when DataFrameWriter is requested to save the rows of a structured query (a DataFrame) to a data source (for all but DataSourceV2 writers with WriteSupport).

assertNotPartitioned Internal Method

assertNotPartitioned…​FIXME

Note
assertNotPartitioned is used when…​FIXME

csv Method

csv…​FIXME

json Method

json…​FIXME

orc Method

orc…​FIXME

parquet Method

parquet…​FIXME

text Method

text…​FIXME

partitionBy Method

partitionBy simply sets the partitioningColumns internal property.

DataFrameReader — Loading Data From External Data Sources

DataFrameReader is the public interface to describe how to load data from an external data source (e.g. files, tables, JDBC or Dataset[String]).

Table 1. DataFrameReader API
Method Description

csv

format

jdbc

json

load

option

options

orc

parquet

schema

table

text

textFile

DataFrameReader is available using SparkSession.read.

DataFrameReader supports many file formats natively and offers the interface to define custom formats.

Note
DataFrameReader assumes parquet data source file format by default that you can change using spark.sql.sources.default Spark property.

After you have described the loading pipeline (i.e. the “Extract” part of ETL in Spark SQL), you eventually “trigger” the loading using format-agnostic load or format-specific (e.g. json, csv, jdbc) operators.

Note
All methods of DataFrameReader merely describe a process of loading data and do not trigger a Spark job (until an action is called).
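
A minimal sketch of the loading pipeline, assuming a spark-shell session (the paths are made up):

```scala
val csvDF = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/people.csv")       // format-agnostic load

val jsonDF = spark.read.json("/tmp/people.json")   // format-specific operator
```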

DataFrameReader can read text files using textFile methods that return typed Datasets.

Note
Loading datasets using textFile methods allows for additional preprocessing before final processing of the string values as json or csv lines.

(New in Spark 2.2) DataFrameReader can load datasets from Dataset[String] (with lines being complete “files”) using format-specific csv and json operators.

Table 2. DataFrameReader’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

extraOptions

Used when…​FIXME

source

Name of the input data source (aka format or provider) with the default format per spark.sql.sources.default configuration property (default: parquet).

source can be changed using format method.

Used exclusively when DataFrameReader is requested to load.

userSpecifiedSchema

Optional user-specified schema (default: None, i.e. undefined)

Set when DataFrameReader is requested to set a schema, load data from an external data source, loadV1Source (when creating a DataSource), and load data using the json and csv file formats

Used when DataFrameReader is requested to assertNoSpecifiedSchema (while loading data using jdbc, table and textFile)

Specifying Format Of Input Data Source — format method

You use format to configure DataFrameReader to use appropriate source format.

Supported data formats:

  • json

  • csv (since 2.0.0)

  • parquet (see Parquet)

  • orc

  • text

  • jdbc

  • libsvm — only when used in format("libsvm")

Note
Spark SQL allows for developing custom data source formats.

Specifying Schema — schema method

schema allows for specifying the schema of a data source (that the DataFrameReader is about to read a dataset from).

Note
Some formats can infer schema from datasets (e.g. csv or json) using inferSchema option.
Tip
Read up on Schema.

Specifying Load Options — option and options Methods

You can also use options method to describe different options in a single Map.

Loading Datasets from Files (into DataFrames) Using Format-Specific Load Operators

DataFrameReader supports the following file formats:

json method

New in 2.0.0: prefersDecimal

csv method

parquet method

The supported options:

New in 2.0.0: snappy is the default Parquet codec. See [SPARK-14482][SQL] Change default Parquet codec from gzip to snappy.

The compressions supported:

  • none or uncompressed

  • snappy – the default codec in Spark 2.0.0.

  • gzip – the default codec in Spark before 2.0.0

  • lzo

orc method

Optimized Row Columnar (ORC) file format is a highly efficient columnar format to store Hive data with more than 1,000 columns and improve performance. ORC format was introduced in Hive version 0.11 to use and retain the type information from the table definition.

Tip
Read ORC Files document to learn about the ORC file format.

text method

text method loads a text file.

Example

Loading Table to DataFrame — table Method

table loads the content of the tableName table into an untyped DataFrame.

Note
table simply passes the call to SparkSession.table after making sure that a user-defined schema has not been specified.

Loading Data From External Table using JDBC Data Source — jdbc Method

jdbc loads data from an external table using the JDBC data source.

Internally, jdbc creates a JDBCOptions from the input url, table and extraOptions with connectionProperties.

jdbc then creates one JDBCPartition per predicate.

In the end, jdbc requests the SparkSession to create a DataFrame for a JDBCRelation (with JDBCPartitions and JDBCOptions created earlier).

Note

jdbc does not support a custom schema and throws an AnalysisException if one is defined.

Note
jdbc method uses java.util.Properties (and appears overly Java-centric). Use format("jdbc") instead.

Loading Datasets From Text Files — textFile Method

textFile loads one or many text files into a typed Dataset[String].

Note
textFile methods are similar to the text family of methods in that they both read text files, but text methods return an untyped DataFrame while textFile methods return a typed Dataset[String].

Internally, textFile passes calls on to the text method and selects the only value column before applying the Encoders.STRING encoder.
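
A minimal sketch, assuming a spark-shell session (the path is made up):

```scala
import org.apache.spark.sql.Dataset

val lines: Dataset[String] = spark.read.textFile("/tmp/notes.txt")
lines.filter(_.nonEmpty).count()
// compare: spark.read.text(...) returns an untyped DataFrame with a single "value" column
```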

Creating DataFrameReader Instance

DataFrameReader takes the following when created:

Loading Dataset (Data Source API V1) — loadV1Source Internal Method

loadV1Source creates a DataSource and requests it to resolve the underlying relation (as a BaseRelation).

In the end, loadV1Source requests SparkSession to create a DataFrame from the BaseRelation.

Note
loadV1Source is used when DataFrameReader is requested to load (and the data source is not of DataSourceV2 type, or a DataSourceReader could not be created).

Loading Dataset from Data Source — load Method

load loads a dataset from a data source (with optional support for multiple paths) as an untyped DataFrame.

Internally, load looks up the data source class (lookupDataSource) for the source. load then branches off per its type (i.e. whether it is of the DataSourceV2 marker type or not).

For a “Data Source V2” data source, load…​FIXME

Otherwise, if the source is not a “Data Source V2” data source, load simply delegates to loadV1Source.

load throws an AnalysisException when the source format is hive.

assertNoSpecifiedSchema Internal Method

assertNoSpecifiedSchema throws an AnalysisException if the userSpecifiedSchema is defined.

Note
assertNoSpecifiedSchema is used when DataFrameReader is requested to load data using jdbc, table and textFile.

verifyColumnNameOfCorruptRecord Internal Method

verifyColumnNameOfCorruptRecord…​FIXME

Note
verifyColumnNameOfCorruptRecord is used when DataFrameReader is requested to load data using json and csv.

DataSource API — Managing Datasets in External Data Sources

Reading Datasets

Spark SQL can read data from external storage systems like files, Hive tables and JDBC databases through DataFrameReader interface.

You use SparkSession to access DataFrameReader using read operation.

DataFrameReader is an interface to create DataFrames (aka Dataset[Row]) from files, Hive tables or tables using JDBC.

As of Spark 2.0, DataFrameReader can read text files using textFile methods that return Dataset[String] (not DataFrames).

There are two operation modes in Spark SQL, i.e. batch and streaming (part of Spark Structured Streaming).

You can access DataStreamReader for reading streaming datasets through SparkSession.readStream method.

The available methods in DataStreamReader are similar to DataFrameReader.

Saving Datasets

Spark SQL can save data to external storage systems like files, Hive tables and JDBC databases through DataFrameWriter interface.

You use write method on a Dataset to access DataFrameWriter.

DataFrameWriter is an interface to persist a Dataset to an external storage system in a batch fashion.

You can access DataStreamWriter for writing streaming datasets through Dataset.writeStream method.

The available methods in DataStreamWriter are similar to DataFrameWriter.
