
Dynamic Partition Inserts

Partitioning divides a dataset into smaller chunks based on the values of one or more partitioning columns; each chunk is written into its own directory.

With a partitioned dataset, Spark SQL can load only the parts (partitions) that are actually needed and avoid filtering out unnecessary data on the JVM. That leads to faster load times and lower memory consumption, which gives better performance overall.

With a partitioned dataset, Spark SQL queries can also run over different subsets (directories) in parallel.
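As a quick illustration, here is a minimal Scala sketch (the sample data, column names, and the /tmp/sales path are made up for this example) that writes a dataset partitioned by one column and then reads back a single partition:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitioning-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Sample data; `year` will be the partitioning column.
val sales = Seq(
  ("2023", "laptop", 100),
  ("2023", "phone",  200),
  ("2024", "laptop", 300)
).toDF("year", "product", "amount")

// Each distinct value of `year` becomes its own directory:
//   /tmp/sales/year=2023/  and  /tmp/sales/year=2024/
sales.write
  .partitionBy("year")
  .mode("overwrite")
  .parquet("/tmp/sales")

// The filter on the partition column is applied at planning time
// (partition pruning), so only /tmp/sales/year=2024/ is scanned.
spark.read.parquet("/tmp/sales")
  .where($"year" === "2024")
  .show()
```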

Dynamic Partition Inserts is a feature of Spark SQL for executing INSERT OVERWRITE TABLE SQL statements over partitioned HadoopFsRelations that limits which partitions are deleted when the partitioned table is overwritten with new data.

Dynamic partitions are the partition columns that have no values defined explicitly in the PARTITION clause of INSERT OVERWRITE TABLE SQL statements (in the partitionSpec part).

Static partitions are the partition columns that have values defined explicitly in the PARTITION clause of INSERT OVERWRITE TABLE SQL statements (in the partitionSpec part).
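For illustration, a hypothetical statement mixing both kinds of partition columns might look as follows (the table and column names are made up; `country` is a static partition with an explicit value, while `year` is a dynamic partition whose values come from the query result at runtime):

```scala
// Assumes sales_by_region is partitioned by (country, year) and that
// staging_sales exists; both table names are hypothetical.
spark.sql("""
  INSERT OVERWRITE TABLE sales_by_region
  PARTITION (country = 'US', year)   -- country: static, year: dynamic
  SELECT product, amount, year
  FROM staging_sales
  WHERE country = 'US'
""")
```

Note that the dynamic partition column (`year`) appears last in the SELECT list, matching Hive's convention for dynamic partition inserts.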

Note
The INSERT OVERWRITE TABLE SQL statement is translated into an InsertIntoTable logical operator.

Dynamic Partition Inserts is only supported in SQL mode (for INSERT OVERWRITE TABLE SQL statements).

Dynamic Partition Inserts is not supported for non-file-based data sources, i.e. InsertableRelations.

With Dynamic Partition Inserts, the behaviour of the OVERWRITE keyword is controlled by the spark.sql.sources.partitionOverwriteMode configuration property (default: static). The property controls whether Spark should delete all the partitions that match the partition specification regardless of whether there is data to be written to them (static), or delete only those partitions that will have data written into them (dynamic).

When the dynamic overwrite mode is enabled, Spark only deletes the partitions for which it has data to write. All other partitions remain intact.

Spark then writes partitioned data just as Hive does: only the partitions touched by the INSERT query are overwritten, and the others are left untouched.
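A minimal sketch of switching between the two modes, reusing the hypothetical tables from the example above:

```scala
// Default behaviour: "static". All partitions matching the PARTITION spec
// are deleted before writing, whether or not the query produces data for them.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "static")

// "dynamic": only the partitions that actually receive rows from the query
// are overwritten; every other existing partition is left intact.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.sql("""
  INSERT OVERWRITE TABLE sales_by_region
  PARTITION (country, year)   -- both partition columns are dynamic here
  SELECT product, amount, country, year
  FROM staging_sales
""")
```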
