
DataSource API — Managing Datasets in External Data Sources

Reading Datasets

Spark SQL can read data from external storage systems such as files, Hive tables and JDBC databases through the DataFrameReader interface.

You use a SparkSession to access a DataFrameReader via the read operation.

DataFrameReader is an interface for creating DataFrames (aka Dataset[Row]) from files, Hive tables or tables accessed over JDBC.
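A minimal sketch of two of these access paths, assuming placeholder file paths and JDBC connection details:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("datasource-read-demo")
  .getOrCreate()

// Load a JSON file into a DataFrame (Dataset[Row]); the path is a placeholder.
val people = spark.read
  .format("json")
  .load("data/people.json")

// Load a table over JDBC; the URL, table name and credentials are placeholders.
val fromDb = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")
  .option("dbtable", "public.people")
  .option("user", "spark")
  .option("password", "secret")
  .load()
```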

As of Spark 2.0, DataFrameReader can also read text files using the textFile methods, which return a Dataset[String] (not a DataFrame).
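For example (the input path is a placeholder; spark is the SparkSession from the sketch above):

```scala
import org.apache.spark.sql.Dataset

// textFile returns Dataset[String], one element per line of the input file.
val lines: Dataset[String] = spark.read.textFile("data/notes.txt")
```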

Spark SQL supports two operation modes: batch and streaming (the latter is part of Spark Structured Streaming).

You can access a DataStreamReader for reading streaming datasets through the SparkSession.readStream method.

The available methods in DataStreamReader are similar to those in DataFrameReader.
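A sketch of the streaming counterpart, assuming a placeholder directory that new CSV files arrive in:

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// File-based streaming sources typically require an explicit schema.
val schema = new StructType()
  .add("id", LongType)
  .add("name", StringType)

// Watch the (placeholder) directory for new CSV files as an unbounded Dataset.
val streamingPeople = spark.readStream
  .format("csv")
  .schema(schema)
  .load("data/incoming/")
```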

Saving Datasets

Spark SQL can save data to external storage systems such as files, Hive tables and JDBC databases through the DataFrameWriter interface.

You use the write method on a Dataset to access its DataFrameWriter.

DataFrameWriter is an interface for persisting a Dataset to an external storage system in a batch fashion.
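A minimal sketch of a batch write, reusing the people DataFrame from above; the output path and table name are placeholders:

```scala
import org.apache.spark.sql.SaveMode

// Persist the DataFrame as Parquet, replacing any existing output.
people.write
  .format("parquet")
  .mode(SaveMode.Overwrite)
  .save("output/people.parquet")

// Alternatively, save it as a (hypothetical) managed table:
// people.write.saveAsTable("db.people")
```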

You can access a DataStreamWriter for writing streaming datasets through the Dataset.writeStream method.

The available methods in DataStreamWriter are similar to those in DataFrameWriter.
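A sketch of the streaming write, continuing from the streamingPeople Dataset above; the console sink and checkpoint location are illustrative choices:

```scala
// Print each micro-batch of new rows to the console.
val query = streamingPeople.writeStream
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", "checkpoints/people")
  .start()

// Block until the streaming query is stopped.
query.awaitTermination()
```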
