
Row


Row is a generic row object with an ordered collection of fields that can be accessed by an ordinal / index (aka generic access by ordinal), by a name (aka native primitive access), or using Scala’s pattern matching.

Note
Row is also called Catalyst Row.

Row may have an optional schema.

The traits of Row:

  • length or size — Row knows the number of elements (columns)

  • schema — Row knows the schema

Row belongs to the org.apache.spark.sql package.

Creating Row — apply Factory Method

Caution
FIXME

Field Access by Index — apply and get methods

Fields of a Row instance can be accessed by index (starting from 0) using apply or get.

Note
Generic access by ordinal (using apply or get) returns a value of type Any.
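
A minimal sketch of generic access by ordinal (the sample values are illustrative):

```scala
import org.apache.spark.sql.Row

val row = Row(1, "hello")

// Generic access by ordinal returns Any
val first: Any  = row(0)      // same as row.apply(0)
val second: Any = row.get(1)
```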

Get Field As Type — getAs method

You can query for fields with their proper types using getAs with an index.

Note

FIXME
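
A hedged sketch of getAs with an index (the row below is illustrative):

```scala
import org.apache.spark.sql.Row

val row = Row(1, "hello")

// getAs returns the field with the requested type
// (a ClassCastException is thrown at runtime if the type does not match)
val id   = row.getAs[Int](0)
val word = row.getAs[String](1)
```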

Schema

A Row instance can have a schema defined.

Note
Unless you are instantiating Row yourself (using Row Object), a Row always has a schema.
Note
It is RowEncoder that takes care of assigning a schema to a Row when toDF is called on a Dataset or when a DataFrame is created through DataFrameReader.

Row Object

Row companion object offers factory methods to create Row instances from a collection of elements (apply), a sequence of elements (fromSeq) and tuples (fromTuple).

Row object can merge Row instances.

It can also return an empty Row instance.
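
A minimal sketch of the Row companion object’s factory methods:

```scala
import org.apache.spark.sql.Row

val r1 = Row(1, "a")               // apply: from a collection of elements
val r2 = Row.fromSeq(Seq(2, "b"))  // from a sequence of elements
val r3 = Row.fromTuple((3, "c"))   // from a tuple

val merged = Row.merge(r1, r2)     // concatenates the fields of both rows
val empty  = Row.empty             // an empty Row instance
```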

Pattern Matching on Row

Row can be used in pattern matching (since Row Object comes with unapplySeq).
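
A short sketch of pattern matching on Row (the field types are assumptions about the sample row):

```scala
import org.apache.spark.sql.Row

def describe(row: Row): String = row match {
  case Row(id: Int, name: String) => s"$id -> $name"   // matched via Row.unapplySeq
  case _                          => "unknown shape"
}

describe(Row(1, "hello"))   // "1 -> hello"
```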

DataFrame — Dataset of Rows with RowEncoder


Spark SQL introduces a tabular functional data abstraction called DataFrame. It is designed to ease developing Spark applications for processing large amounts of structured tabular data on Spark infrastructure.

DataFrame is a data abstraction or a domain-specific language (DSL) for working with structured and semi-structured data, i.e. datasets that you can specify a schema for.

DataFrame is a collection of rows with a schema that is the result of executing a structured query (once it has been executed).

DataFrame uses the immutable, in-memory, resilient, distributed and parallel capabilities of RDD, and applies a structure called schema to the data.

Note

In Spark 2.0.0 DataFrame is a mere type alias for Dataset[Row].
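
The alias as it appears in the org.apache.spark.sql package object:

```scala
type DataFrame = Dataset[Row]
```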

DataFrame is a distributed collection of tabular data organized into rows and named columns. It is conceptually equivalent to a table in a relational database with operations to project (select), filter, intersect, join, group, sort, aggregate, or convert to an RDD (consult DataFrame API).

Spark SQL borrowed the concept of DataFrame from pandas’ DataFrame and made it immutable, parallel (one machine, perhaps with many processors and cores) and distributed (many machines, perhaps with many processors and cores).

Note
Hey, big data consultants, time to help teams migrate the code from pandas’ DataFrame into Spark’s DataFrames (at least to PySpark’s DataFrame) and offer services to set up large clusters!

DataFrames in Spark SQL strongly rely on the features of RDD – a DataFrame is basically an RDD exposed as a structured DataFrame by appropriate operations to handle very big data from day one. So, petabytes of data should not scare you (unless you’re the administrator who has to create such a clustered Spark environment – contact me when you feel alone with the task).

You can create DataFrames by loading data from structured files (JSON, Parquet, CSV), RDDs, tables in Hive, or external databases (JDBC). You can also create DataFrames from scratch and build upon them (as in the sections below). See DataFrame API. You can read any format as long as an appropriate Spark SQL extension of DataFrameReader is available to parse the dataset.

Caution
FIXME Diagram of reading data from sources to create DataFrame

You can execute queries over DataFrames using two approaches:

  • the good ol’ SQL – helps with migrating from the “SQL databases” world into the world of DataFrame in Spark SQL

  • Query DSL – an API that helps ensure proper syntax at compile time.

DataFrame also allows you to do the following tasks:

DataFrames use the Catalyst query optimizer to produce efficient queries (and so they are supposed to be faster than corresponding RDD-based queries).

Note
Your DataFrames can also be type-safe and moreover further improve their performance through specialized encoders that can significantly cut serialization and deserialization times.

You can enforce types on generic rows and hence bring type safety (at compile time) by encoding rows into a type-safe Dataset object. As of Spark 2.0 it is the preferred way of developing Spark applications.

Features of DataFrame

A DataFrame is a collection of “generic” Row instances (as RDD[Row]) and a schema.

Note
Regardless of how you create a DataFrame, it will always be a pair of RDD[Row] and StructType.

SQLContext, spark, and Spark shell

You use org.apache.spark.sql.SQLContext to build DataFrames and execute SQL queries.

The quickest and easiest way to work with Spark SQL is to use Spark shell and spark object.

As you may have noticed, spark in the Spark shell is actually a SparkSession with Hive support enabled, which integrates the Spark SQL execution engine with data stored in Apache Hive.

The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage.

Creating DataFrames from Scratch

Use Spark shell as described in Spark shell.

Using toDF

After you import spark.implicits._ (which is done for you by Spark shell) you may apply toDF method to convert objects to DataFrames.
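
A minimal sketch (assuming a SparkSession named spark and that spark.implicits._ has been imported):

```scala
import spark.implicits._

// Convert a local Seq of tuples to a DataFrame with named columns
val df = Seq((1, "one"), (2, "two")).toDF("id", "name")
df.show()
```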

Creating DataFrame using Case Classes in Scala

This method assumes the data comes from a Scala case class that will describe the schema.
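
A hedged sketch in which the schema is inferred from a case class (Person is a hypothetical example):

```scala
case class Person(id: Long, name: String)

import spark.implicits._
val people = Seq(Person(0, "Jacek"), Person(1, "Agata")).toDF
people.printSchema()   // id: bigint, name: string
```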

Custom DataFrame Creation using createDataFrame

SQLContext offers a family of createDataFrame operations.
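
A sketch of one createDataFrame variant that takes an RDD[Row] and an explicit schema (shown here on a SparkSession; SQLContext offers the same operation):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType)))

val rows = spark.sparkContext.parallelize(Seq(Row(1, "one"), Row(2, "two")))
val df = spark.createDataFrame(rows, schema)
```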

Loading data from structured files

Creating DataFrame from CSV file

Let’s start with an example in which schema inference relies on a custom case class in Scala.

Creating DataFrame from CSV files using spark-csv module

You’re going to use the spark-csv module to load data from a CSV data source; it handles proper parsing and loading.

Note
Support for CSV data sources is available by default in Spark 2.0.0. No need for an external module.

Start the Spark shell using --packages option as follows:

Reading Data from External Data Sources (read method)

You can create DataFrames by loading data from structured files (JSON, Parquet, CSV), RDDs, tables in Hive, or external databases (JDBC) using SQLContext.read method.

read returns a DataFrameReader instance.

Among the supported structured data (file) formats are (consult Specifying Data Format (format method) for DataFrameReader):

  • JSON

  • parquet

  • JDBC

  • ORC

  • Tables in Hive and any JDBC-compliant database

  • libsvm

Querying DataFrame

Note
Spark SQL offers a Pandas-like Query DSL.

Using Query DSL

You can select specific columns using select method.

Note
This variant (in which you use stringified column names) can only select existing columns, i.e. you cannot create new ones using select expressions.

In the following example you query for the top 5 most active bidders.

Note the tiny $ and desc together with the column name to sort the rows by.
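
A hedged sketch of such a query (the auction DataFrame and its bidder column are assumed from the surrounding example; spark.implicits._ is assumed to be imported for the $ syntax):

```scala
auction
  .groupBy("bidder")
  .count()
  .sort($"count".desc)   // note the $ and desc on the count column
  .show(5)
```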

Using SQL

Register a DataFrame as a named temporary table to run SQL.

  1. Register a temporary table so SQL queries make sense

You can execute a SQL query on a DataFrame using sql operation, but before the query is executed it is optimized by Catalyst query optimizer. You can print the physical plan for a DataFrame using the explain operation.
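
A minimal sketch (df is an assumed DataFrame; createOrReplaceTempView is the Spark 2.0 way of registering a temporary view):

```scala
// 1. Register a temporary view so SQL queries make sense
df.createOrReplaceTempView("auctions")

val top5 = spark.sql(
  "SELECT bidder, count(*) AS cnt FROM auctions GROUP BY bidder ORDER BY cnt DESC LIMIT 5")

top5.explain()   // prints the physical plan
top5.show()
```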

Filtering

Handling data in Avro format

Use custom serializer using spark-avro.

Run the Spark shell with --packages com.databricks:spark-avro_2.11:2.0.0 (see 2.0.0 artifact is not in any public maven repo for why --repositories is required).

And then…​

See org.apache.spark.sql.SaveMode (and perhaps org.apache.spark.sql.SaveMode from Scala’s perspective).

Dataset — Structured Query with Data Encoder


Dataset is a strongly-typed data structure in Spark SQL that represents a structured query.

Note
A structured query can be written using SQL or Dataset API.

The following figure shows the relationship between different entities of Spark SQL that all together give the Dataset data structure.

spark sql Dataset.png
Figure 1. Dataset’s Internals

It is therefore fair to say that Dataset consists of the following three elements:

  1. QueryExecution (with the parsed unanalyzed LogicalPlan of a structured query)

  2. Encoder (of the type of the records for fast serialization and deserialization to and from InternalRow)

  3. SparkSession

When created, Dataset takes such a 3-element tuple with a SparkSession, a QueryExecution and an Encoder.

Dataset is created when:

Datasets are lazy: structured query operators and expressions are only triggered when an action is invoked.

The Dataset API offers declarative and type-safe operators that make for an improved data processing experience (compared to DataFrames, which were a set of index- or column-name-based Rows).

Note

Dataset was first introduced in Apache Spark 1.6.0 as an experimental feature, and has since turned itself into a fully supported API.

As of Spark 2.0.0, DataFrame – the flagship data abstraction of previous versions of Spark SQL – is currently a mere type alias for Dataset[Row]:

Dataset offers the convenience of RDDs with the performance optimizations of DataFrames and the strong static type-safety of Scala. The last feature, bringing strong type-safety to DataFrame, is what makes Dataset so appealing. All the features together give you a more functional programming interface to work with structured data.

It is only with Datasets that you have syntax and analysis checks at compile time (which was not possible using DataFrame, regular SQL queries or even RDDs).

Using Dataset objects turns DataFrames of Row instances into DataFrames of case classes with proper names and types (following their equivalents in the case classes). Instead of using indices to access fields in a DataFrame and casting them to a type, all this is handled automatically by Datasets and checked by the Scala compiler.
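
A hedged sketch of moving from a DataFrame of Rows to a typed Dataset (Person is a hypothetical case class):

```scala
case class Person(id: Long, name: String)

import spark.implicits._
val df = Seq((0L, "Jacek"), (1L, "Agata")).toDF("id", "name")

val people = df.as[Person]    // checked by the Scala compiler from here on
people.map(_.name).show()
```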

If however a LogicalPlan is used to create a Dataset, the logical plan is first executed (using the current SessionState in the SparkSession) that yields the QueryExecution plan.

A Dataset is Queryable and Serializable, i.e. can be saved to persistent storage.

Note
SparkSession and QueryExecution are transient attributes of a Dataset and therefore do not participate in Dataset serialization. The only firmly-tied feature of a Dataset is the Encoder.

You can request the “untyped” view of a Dataset or access the RDD that is generated after executing the query. It is supposed to give you a more pleasant experience while transitioning from the legacy RDD-based or DataFrame-based APIs you may have used in the earlier versions of Spark SQL or encourage migrating from Spark Core’s RDD API to Spark SQL’s Dataset API.

The default storage level for Datasets is MEMORY_AND_DISK because recomputing the in-memory columnar representation of the underlying table is expensive. You can however persist a Dataset.
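
A minimal sketch of persisting a Dataset:

```scala
val q = spark.range(10).filter(_ % 2 == 0)

q.cache()       // same as q.persist() -- MEMORY_AND_DISK by default
q.count()       // materializes (and caches) the result
q.unpersist()   // frees the storage when no longer needed
```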

Note
Spark 2.0 has introduced a new query model called Structured Streaming for continuous incremental execution of structured queries. That made it possible to consider Datasets both static and bounded as well as streaming and unbounded data sets with a single unified API for different execution models.

A Dataset is local if it was created from local collections using SparkSession.emptyDataset or SparkSession.createDataset methods and their derivatives like toDF. If so, the queries on the Dataset can be optimized and run locally, i.e. without using Spark executors.

Note
Dataset makes sure that the underlying QueryExecution is analyzed and checked.
Table 1. Dataset’s Properties
Name Description

boundEnc

ExpressionEncoder

Used when…​FIXME

deserializer

Deserializer expression to convert internal rows to objects of type T

Created lazily by requesting the ExpressionEncoder to resolveAndBind

Used when:

exprEnc

Implicit ExpressionEncoder

Used when…​FIXME

logicalPlan

Analyzed logical plan with all logical commands executed and turned into a LocalRelation.

When initialized, logicalPlan requests the QueryExecution for analyzed logical plan. If the plan is a logical command or a union thereof, logicalPlan executes the QueryExecution (using executeCollect).

planWithBarrier

rdd

(lazily-created) RDD of JVM objects of type T (as converted from rows in Dataset in the internal binary row format).

Note
rdd gives an RDD with the extra execution step to convert rows from their internal binary row format to JVM objects that will impact the JVM memory as the objects are inside the JVM (while they were outside before). You should not use rdd directly.

Internally, rdd first creates a new logical plan that deserializes the Dataset’s logical plan.

rdd then requests SessionState to execute the logical plan to get the corresponding RDD of binary rows.

Note
rdd uses SparkSession to access SessionState.

rdd then requests the Dataset’s ExpressionEncoder for the data type of the rows (using deserializer expression) and maps over them (per partition) to create records of the expected type T.

Note
rdd is at the “boundary” between the internal binary row format and the JVM type of the dataset. Avoid the extra deserialization step to lower JVM memory requirements of your Spark application.

sqlContext

Lazily-created SQLContext

Used when…​FIXME

Getting Input Files of Relations (in Structured Query) — inputFiles Method

inputFiles requests QueryExecution for optimized logical plan and collects the following logical operators:

inputFiles then requests the logical operators for their underlying files:

resolve Internal Method

Caution
FIXME

Creating Dataset Instance

Dataset takes the following when created:

Note
You can also create a Dataset using LogicalPlan that is immediately executed using SessionState.

Internally, Dataset requests QueryExecution to analyze itself.

Dataset initializes the internal registries and counters.

Is Dataset Local? — isLocal Method

isLocal flag is enabled (i.e. true) when operators like collect or take could be run locally, i.e. without using executors.

Internally, isLocal checks whether the logical query plan of a Dataset is LocalRelation.

Is Dataset Streaming? — isStreaming method

isStreaming is enabled (i.e. true) when the logical plan is streaming.

Internally, isStreaming takes the Dataset’s logical plan and gives whether the plan is streaming or not.

Queryable

Caution
FIXME

withNewRDDExecutionId Internal Method

withNewRDDExecutionId executes the input body action under new execution id.

Caution
FIXME What’s the difference between withNewRDDExecutionId and withNewExecutionId?
Note
withNewRDDExecutionId is used when Dataset.foreach and Dataset.foreachPartition actions are used.

Creating DataFrame (For Logical Query Plan and SparkSession) — ofRows Internal Factory Method

Note
ofRows is part of the Dataset Scala object that is marked as private[sql] and so can only be accessed from code in the org.apache.spark.sql package.

ofRows returns DataFrame (which is the type alias for Dataset[Row]). ofRows uses RowEncoder to convert the schema (based on the input logicalPlan logical plan).

Internally, ofRows prepares the input logicalPlan for execution and creates a Dataset[Row] with the current SparkSession, the QueryExecution and RowEncoder.

Note

ofRows is used when:

Tracking Multi-Job Structured Query Execution (PySpark) — withNewExecutionId Internal Method

withNewExecutionId executes the input body action under new execution id.

Note
withNewExecutionId sets a unique execution id so that all Spark jobs belong to the Dataset action execution.
Note

withNewExecutionId is used exclusively when Dataset is executing Python-based actions (i.e. collectToPython, collectAsArrowToPython and toPythonIterator) that are not of much interest in this gitbook.

Feel free to contact me at jacek@japila.pl if you think I should re-consider my decision.

Executing Action Under New Execution ID — withAction Internal Method

withAction requests QueryExecution for the optimized physical query plan and resets the metrics of every physical operator (in the physical plan).

withAction requests SQLExecution to execute the input action with the executable physical plan (tracked under a new execution id).

In the end, withAction notifies ExecutionListenerManager that the name action has finished successfully or with an exception.

Note
withAction uses SparkSession to access ExecutionListenerManager.
Note

withAction is used when Dataset is requested for the following:

Creating Dataset Instance (For LogicalPlan and SparkSession) — apply Internal Factory Method

Note
apply is part of the Dataset Scala object that is marked as private[sql] and so can only be accessed from code in the org.apache.spark.sql package.

apply…​FIXME

Note

apply is used when:

Collecting All Rows From Spark Plan — collectFromPlan Internal Method

collectFromPlan…​FIXME

Note
collectFromPlan is used for Dataset.head, Dataset.collect and Dataset.collectAsList operators.

selectUntyped Internal Method

selectUntyped…​FIXME

Note
selectUntyped is used exclusively when Dataset.select typed transformation is used.

Helper Method for Typed Transformations — withTypedPlan Internal Method

withTypedPlan…​FIXME

Note
withTypedPlan is annotated with Scala’s @inline annotation that requests the Scala compiler to try especially hard to inline it.
Note
withTypedPlan is used in the Dataset typed transformations, i.e. withWatermark, joinWith, hint, as, filter, limit, sample, dropDuplicates, filter, map, repartition, repartitionByRange, coalesce and sort with sortWithinPartitions (through the sortInternal internal method).

Helper Method for Set-Based Typed Transformations — withSetOperator Internal Method

withSetOperator…​FIXME

Note
withSetOperator is annotated with Scala’s @inline annotation that requests the Scala compiler to try especially hard to inline it.
Note
withSetOperator is used in the Dataset typed transformations, i.e. union, unionByName, intersect and except.

sortInternal Internal Method

sortInternal creates a Dataset with Sort unary logical operator (and the logicalPlan as the child logical plan).

Internally, sortInternal first builds ordering expressions for the given sortExprs columns, i.e. takes the sortExprs columns and makes sure that they are already SortOrder expressions (leaving them untouched) or wraps them into SortOrder expressions with the Ascending sort direction.

In the end, sortInternal creates a Dataset with Sort unary logical operator (with the ordering expressions, the given global flag, and the logicalPlan as the child logical plan).

Note
sortInternal is used for the sort and sortWithinPartitions typed transformations in the Dataset API (with the only change of the global flag being enabled and disabled, respectively).

Helper Method for Untyped Transformations and Basic Actions — withPlan Internal Method

withPlan simply uses ofRows internal factory method to create a DataFrame for the input LogicalPlan and the current SparkSession.

Note
withPlan is annotated with Scala’s @inline annotation that requests the Scala compiler to try especially hard to inline it.

Further Reading and Watching

SparkSessionExtensions


SparkSessionExtensions is an interface that a Spark developer can use to extend a SparkSession with custom query execution rules and a relational entity parser.

As a Spark developer, you use Builder.withExtensions method (while building a new SparkSession) to access the session-bound SparkSessionExtensions.
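
A hedged sketch of registering a custom optimizer rule through Builder.withExtensions (MyRule is a hypothetical no-op rule, used only for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object MyRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan   // no-op
}

val spark = SparkSession.builder
  .withExtensions { extensions =>
    extensions.injectOptimizerRule(_ => MyRule)
  }
  .getOrCreate()
```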

Table 1. SparkSessionExtensions API
Method Description

injectCheckRule

injectOptimizerRule

Registering a custom operator optimization rule

injectParser

injectPlannerStrategy

injectPostHocResolutionRule

injectResolutionRule

SparkSessionExtensions is an integral part of SparkSession (and is indirectly required to create one).

Table 2. SparkSessionExtensions’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

optimizerRules

Collection of RuleBuilder functions (i.e. SparkSession ⇒ Rule[LogicalPlan])

Used when SparkSessionExtensions is requested to:

Associating Custom Operator Optimization Rules with SparkSession — buildOptimizerRules Method

buildOptimizerRules gives the optimizerRules logical rules that are associated with the input SparkSession.

Note
buildOptimizerRules is used exclusively when BaseSessionStateBuilder is requested for the custom operator optimization rules to add to the base Operator Optimization batch.

Registering Custom Check Analysis Rule (Builder) — injectCheckRule Method

injectCheckRule…​FIXME

Registering Custom Operator Optimization Rule (Builder) — injectOptimizerRule Method

injectOptimizerRule simply registers a custom operator optimization rule (as a RuleBuilder function) to the optimizerRules internal registry.

Registering Custom Parser (Builder) — injectParser Method

injectParser…​FIXME

Registering Custom Planner Strategy (Builder) — injectPlannerStrategy Method

injectPlannerStrategy…​FIXME

Registering Custom Post-Hoc Resolution Rule (Builder) — injectPostHocResolutionRule Method

injectPostHocResolutionRule…​FIXME

Registering Custom Resolution Rule (Builder) — injectResolutionRule Method

injectResolutionRule…​FIXME

implicits Object — Implicits Conversions


implicits object gives implicit conversions for converting Scala objects (incl. RDDs) into a Dataset, DataFrame or Columns, and supports such conversions (through Encoders).

Table 1. implicits API
Name Description

localSeqToDatasetHolder

Creates a DatasetHolder with the input Seq[T] converted to a Dataset[T] (using SparkSession.createDataset).

Encoders

Encoders for primitive and object types in Scala and Java (aka boxed types)

StringToColumn

Converts $"name" into a Column

rddToDatasetHolder

symbolToColumn

implicits object is defined inside SparkSession and hence requires that you build a SparkSession instance first before importing implicits conversions.
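
A minimal sketch (the session name spark is a convention; the conversions come from that instance’s implicits):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

val ds  = Seq(1, 2, 3).toDS()   // localSeqToDatasetHolder + toDS
val col = $"name"               // StringToColumn turns $"name" into a Column
```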

Tip

In Scala REPL-based environments, e.g. spark-shell, use :imports to know what imports are in scope.

implicits object extends SQLImplicits abstract class.

DatasetHolder Scala Case Class

DatasetHolder is a Scala case class that, when created, takes a Dataset[T].

DatasetHolder is created (implicitly) when rddToDatasetHolder and localSeqToDatasetHolder implicit conversions are used.

DatasetHolder has toDS and toDF methods that simply return the Dataset[T] (it was created with) or a DataFrame (using Dataset.toDF operator), respectively.

Builder — Building SparkSession using Fluent API


Builder is the fluent API to create a SparkSession.

Table 1. Builder API
Method Description

appName

config

enableHiveSupport

Enables Hive support

getOrCreate

Gets the current SparkSession or creates a new one.

master

withExtensions

Access to the SparkSessionExtensions

Builder is available using the builder object method of a SparkSession.
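
A sketch of the fluent API (the configuration values are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("My Spark App")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "4")
  .enableHiveSupport()   // optional; requires Hive classes on the CLASSPATH
  .getOrCreate()
```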

Note
You can have multiple SparkSessions in a single Spark application for different data catalogs (through relational entities).
Table 2. Builder’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

extensions

SparkSessionExtensions

Used when…​FIXME

options

Used when…​FIXME

Getting Or Creating SparkSession Instance — getOrCreate Method

getOrCreate…​FIXME

Enabling Hive Support — enableHiveSupport Method

enableHiveSupport enables Hive support, i.e. running structured queries on Hive tables (and a persistent Hive metastore, support for Hive serdes and Hive user-defined functions).

Note

You do not need any existing Hive installation to use Spark’s Hive support. SparkSession context will automatically create metastore_db in the current directory of a Spark application and a directory configured by spark.sql.warehouse.dir.

Refer to SharedState.

Internally, enableHiveSupport makes sure that the Hive classes are on CLASSPATH, i.e. Spark SQL’s org.apache.hadoop.hive.conf.HiveConf, and sets spark.sql.catalogImplementation internal configuration property to hive.

withExtensions Method

withExtensions simply executes the input f function with the SparkSessionExtensions.

appName Method

appName…​FIXME

config Method

config…​FIXME

master Method

master…​FIXME

SparkSession — The Entry Point to Spark SQL


SparkSession is the entry point to Spark SQL. It is one of the very first objects you create while developing a Spark SQL application.

As a Spark developer, you create a SparkSession using the SparkSession.builder method (that gives you access to Builder API that you use to configure the session).

Once created, SparkSession allows for creating a DataFrame (based on an RDD or a Scala Seq), creating a Dataset, accessing the Spark SQL services (e.g. ExperimentalMethods, ExecutionListenerManager, UDFRegistration), executing a SQL query, loading a table and, last but not least, accessing the DataFrameReader interface to load a dataset in the format of your choice (to some extent).

Note

spark object in spark-shell (the instance of SparkSession that is auto-created) has Hive support enabled.

In order to disable the pre-configured Hive support in the spark object, set the spark.sql.catalogImplementation internal configuration property to in-memory (which uses the InMemoryCatalog external catalog instead).

You can have as many SparkSessions as you want in a single Spark application. The common use case is to keep relational entities separate logically in catalogs per SparkSession.

In the end, you stop a SparkSession using SparkSession.stop method.

Table 1. SparkSession API (Object and Instance Methods)
Method Description

active

(New in 2.4.0)

builder

Object method to create a Builder to get the current SparkSession instance or create a new one.

catalog

Access to the current metadata catalog of relational entities, e.g. database(s), tables, functions, table columns, and temporary views.

clearActiveSession

Object method

clearDefaultSession

Object method

close

conf

Access to the current runtime configuration

createDataFrame

createDataset

emptyDataFrame

emptyDataset

experimental

Access to the current ExperimentalMethods

getActiveSession

Object method

getDefaultSession

Object method

implicits

listenerManager

Access to the current ExecutionListenerManager

newSession

Creates a new SparkSession

range

Creates a Dataset[java.lang.Long]

read

Access to the current DataFrameReader to load data from external data sources

sessionState

Access to the current SessionState

Internally, sessionState clones the optional parent SessionState (if given when creating the SparkSession) or creates a new SessionState using BaseSessionStateBuilder as defined by spark.sql.catalogImplementation configuration property:

setActiveSession

Object method

setDefaultSession

Object method

sharedState

Access to the current SharedState

sparkContext

Access to the underlying SparkContext

sql

“Executes” a SQL query

sqlContext

Access to the underlying SQLContext

stop

Stops the associated SparkContext

table

Loads data from a table

time

Executes a code block and prints out (to standard output) the time taken to execute it

udf

Access to the current UDFRegistration

version

Returns the version of Apache Spark

Note
baseRelationToDataFrame acts as a mechanism to plug the BaseRelation object hierarchy into the LogicalPlan object hierarchy that SparkSession uses to bridge them.

Creating SparkSession Using Builder Pattern — builder Object Method

builder creates a new Builder that you use to build a fully-configured SparkSession using a fluent API.

Tip
Read about Fluent interface design pattern in Wikipedia, the free encyclopedia.

Accessing Version of Spark — version Method

version returns the version of Apache Spark in use.

Internally, version uses spark.SPARK_VERSION value that is the version property in spark-version-info.properties properties file on CLASSPATH.

Creating Empty Dataset (Given Encoder) — emptyDataset Operator

emptyDataset creates an empty Dataset (assuming that future records will be of type T).

emptyDataset creates a LocalRelation logical query plan.
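
A minimal sketch (an implicit Encoder[T] must be in scope, e.g. via spark.implicits._):

```scala
import spark.implicits._

val strings = spark.emptyDataset[String]
strings.printSchema()   // a single column named "value" of type string
```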

Creating Dataset from Local Collections or RDDs — createDataset methods

createDataset is an experimental API to create a Dataset from a local Scala collection, i.e. Seq[T], Java’s List[T], or a distributed RDD[T].

createDataset creates a LocalRelation (for the input data collection) or LogicalRDD (for the input RDD[T]) logical operators.

Tip

You may want to consider implicits object and toDS method instead.

Internally, createDataset first looks up the implicit expression encoder in scope to access the AttributeReferences (of the schema).

Note
Only unresolved expression encoders are currently supported.

The expression encoder is then used to map elements (of the input Seq[T]) into a collection of InternalRows. With the references and rows, createDataset returns a Dataset with a LocalRelation logical query plan.
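
A minimal sketch of both variants:

```scala
import spark.implicits._

val fromSeq = spark.createDataset(Seq(1, 2, 3))                                   // local collection
val fromRdd = spark.createDataset(spark.sparkContext.parallelize(Seq("a", "b")))  // RDD[T]
```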

Creating Dataset With Single Long Column — range Operator

range family of methods create a Dataset of Long numbers.

Note
The first three variants (that do not specify numPartitions explicitly) use SparkContext.defaultParallelism for the number of partitions numPartitions.

Internally, range creates a new Dataset[Long] with Range logical plan and Encoders.LONG encoder.
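
A minimal sketch of the range variants:

```scala
spark.range(5)             // 0, 1, 2, 3, 4
spark.range(0, 10, 2)      // 0, 2, 4, 6, 8
spark.range(0, 10, 2, 2)   // same values, explicitly in 2 partitions
```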

Creating Empty DataFrame — emptyDataFrame method

emptyDataFrame creates an empty DataFrame (with no rows and columns).

It calls createDataFrame with an empty RDD[Row] and an empty schema StructType(Nil).

Creating DataFrames from Local Collections or RDDs — createDataFrame Method

createDataFrame creates a DataFrame using RDD[Row] and the input schema. It is assumed that the rows in rowRDD all match the schema.

Caution
FIXME

Executing SQL Queries (aka SQL Mode) — sql Method

sql executes the sqlText SQL statement and creates a DataFrame.

Note

sql is imported in spark-shell so you can execute SQL statements as if sql were a part of the environment.

Internally, sql requests the current ParserInterface to execute a SQL query that gives a LogicalPlan.

Note
sql uses SessionState to access the current ParserInterface.

sql then creates a DataFrame using the current SparkSession (itself) and the LogicalPlan.
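
A minimal sketch (the view name is illustrative):

```scala
spark.range(3).createOrReplaceTempView("nums")

val doubled = spark.sql("SELECT id * 2 AS doubled FROM nums")
doubled.show()
```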

Tip

spark-sql is the main SQL environment in Spark to work with pure SQL statements (where you do not have to use Scala to execute them).

Accessing UDFRegistration — udf Attribute

udf attribute gives access to UDFRegistration that allows registering user-defined functions for SQL-based queries.

Internally, it is simply an alias for SessionState.udfRegistration.

Loading Data From Table — table Method

  1. Parses tableName to a TableIdentifier and calls the other table variant

table creates a DataFrame (wrapper) from the input tableName table (but only if available in the session catalog).
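
A minimal sketch (the temporary view below stands in for any table available in the session catalog):

```scala
spark.range(3).createOrReplaceTempView("nums")

val nums = spark.table("nums")   // a DataFrame over the nums table
```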

Accessing Metastore — catalog Attribute

catalog attribute is a (lazy) interface to the current metastore, i.e. data catalog (of relational entities like databases, tables, functions, table columns, and views).

Tip
All methods in Catalog return Datasets.

Internally, catalog creates a CatalogImpl (that uses the current SparkSession).

Accessing DataFrameReader — read method

read method returns a DataFrameReader that is used to read data from external storage systems and load it into a DataFrame.
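
A sketch of typical DataFrameReader usage (the file paths and options are illustrative):

```scala
val people = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("people.csv")

val events = spark.read.json("events.json")
```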

Getting Runtime Configuration — conf Attribute

conf returns the current RuntimeConfig.

Internally, conf creates a RuntimeConfig (when requested the very first time and cached afterwards) with the SQLConf of the SessionState.

readStream method

readStream returns a new DataStreamReader.

streams Attribute

streams attribute gives access to StreamingQueryManager (through SessionState).

experimentalMethods Attribute

experimentalMethods is an extension point with ExperimentalMethods that is a per-session collection of extra strategies and Rule[LogicalPlan]s.

Note
experimental is used in SparkPlanner and SparkOptimizer. Hive and Structured Streaming use it for their own extra strategies and optimization rules.

Creating SparkSession Instance — newSession method

newSession creates (starts) a new SparkSession (with the current SparkContext and SharedState).

Stopping SparkSession — stop Method

stop stops the SparkSession, i.e. stops the underlying SparkContext.

Create DataFrame from BaseRelation — baseRelationToDataFrame Method

Internally, baseRelationToDataFrame creates a DataFrame from the input BaseRelation wrapped inside LogicalRelation.

Note
LogicalRelation is a logical plan adapter for BaseRelation (so BaseRelation can be part of a logical plan).
Note

baseRelationToDataFrame is used when:

Creating SessionState Instance — instantiateSessionState Internal Method

instantiateSessionState finds the className that is then used to create and build a BaseSessionStateBuilder.

instantiateSessionState may report an IllegalArgumentException while instantiating the class of a SessionState:

Note
instantiateSessionState is used exclusively when SparkSession is requested for SessionState per spark.sql.catalogImplementation configuration property (and one is not available yet).

sessionStateClassName Internal Method

sessionStateClassName gives the name of the class of the SessionState per spark.sql.catalogImplementation, i.e.

Note
sessionStateClassName is used exclusively when SparkSession is requested for the SessionState (and one is not available yet).

Creating DataFrame From RDD Of Internal Binary Rows and Schema — internalCreateDataFrame Internal Method

internalCreateDataFrame creates a DataFrame with a LogicalRDD.

Note

internalCreateDataFrame is used when:

Creating SparkSession Instance

SparkSession takes the following when created:

clearActiveSession Object Method

clearActiveSession…​FIXME

clearDefaultSession Object Method

clearDefaultSession…​FIXME

Accessing ExperimentalMethods — experimental Method

experimental…​FIXME

getActiveSession Object Method

getActiveSession…​FIXME

getDefaultSession Object Method

getDefaultSession…​FIXME

Accessing ExecutionListenerManager — listenerManager Method

listenerManager…​FIXME

Accessing SessionState — sessionState Lazy Attribute

sessionState…​FIXME

setActiveSession Object Method

setActiveSession…​FIXME

setDefaultSession Object Method

setDefaultSession…​FIXME

Accessing SharedState — sharedState Method

sharedState…​FIXME

Measuring Duration of Executing Code Block — time Method

time…​FIXME

Fundamentals of Spark SQL Application Development


Development of a Spark SQL application requires the following steps:

  1. Setting up Development Environment (IntelliJ IDEA, Scala and sbt)

  2. Specifying Library Dependencies

  3. Creating SparkSession

  4. Loading Data from Data Sources

  5. Processing Data Using Dataset API

  6. Saving Data to Persistent Storage

  7. Deploying Spark Application to Cluster (using spark-submit)

Dataset API vs SQL


Spark SQL supports two “modes” to write structured queries: Dataset API and SQL.

It turns out that some structured queries can be expressed more easily using the Dataset API, while some are only possible in SQL. In other words, you may find mixing the Dataset API and SQL modes challenging yet rewarding.

You could at some point consider writing structured queries using Catalyst data structures directly, hoping to avoid the differences and focus on what is supported in Spark SQL, but that could quickly become unwieldy to maintain (i.e. finding Spark SQL developers who are comfortable with it, as well as it being fairly low-level and therefore possibly too dependent on a specific Spark SQL version).

This section describes the differences between Spark SQL features to develop Spark applications using Dataset API and SQL mode.

  1. RuntimeReplaceable Expressions are only available using SQL mode by means of SQL functions like nvl, nvl2, ifnull, nullif, etc. (see the sketch after this list)

  2. Column.isin and SQL IN predicate with a subquery (and In Predicate Expression)
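
A hedged sketch for the first point: nvl is a RuntimeReplaceable expression and so is reachable only through SQL, while the Dataset API reaches for coalesce instead:

```scala
import org.apache.spark.sql.functions.{coalesce, lit}

// SQL mode: nvl is available as a SQL function
spark.sql("SELECT nvl(NULL, 'fallback') AS value").show()

// Dataset API: no nvl Column function -- use coalesce
spark.range(1)
  .select(coalesce(lit(null).cast("string"), lit("fallback")).as("value"))
  .show()
```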

Datasets vs DataFrames vs RDDs


Many may have been asking themselves why they should use Datasets rather than the foundation of all Spark – RDDs with case classes.

This document collects advantages of Dataset vs RDD[CaseClass] to answer the question Dan has asked on twitter:

“In #Spark, what is the advantage of a DataSet over an RDD[CaseClass]?”

Saving to or Writing from Data Sources

With the Dataset API, loading data from a data source or saving it to one is as simple as using the SparkSession.read or Dataset.write methods, respectively.

Accessing Fields / Columns

You select columns in a Dataset without worrying about the positions of the columns.

With an RDD, you have to do an additional hop over a case class and access fields by name.
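
A sketch contrasting the two (Token is a hypothetical case class):

```scala
case class Token(name: String, productId: Int, score: Double)

import spark.implicits._
val ds = Seq(Token("aaa", 100, 0.12), Token("bbb", 200, 0.29)).toDS
ds.select("name").show()    // no need to know the column's position

val rdd = spark.sparkContext.parallelize(Seq(Token("aaa", 100, 0.12)))
rdd.map(_.name).collect()   // the extra hop goes through the case class
```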
