Using Spark SQL to update data in Hive using ORC files

The example showed up on Spark’s users mailing list.

Caution
  • FIXME Offer a complete working solution in Scala

  • FIXME Load ORC files into dataframe (a minimal sketch follows this list)

    • val df = hiveContext.read.format("orc").load("to/path")
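A minimal sketch of the load step, assuming the Spark 1.x HiveContext API used in the snippet above; the app name and path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Placeholder Spark setup (in spark-shell, sc and a HiveContext already exist)
    val conf = new SparkConf().setAppName("orc-load")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // Load a directory of ORC files into a DataFrame (placeholder path)
    val df = hiveContext.read.format("orc").load("/path/to/orc")
    df.printSchema()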

The solution was to use Hive in ORC format with partitions (a Scala sketch follows this list):

  • A table in Hive stored as an ORC file (using partitioning)

  • Using SQLContext.sql to insert data into the table

  • Using SQLContext.sql to periodically run ALTER TABLE ... CONCATENATE to merge your many small files into larger files optimized for your HDFS block size

    • Since the CONCATENATE command operates on files in place, it is transparent to any downstream processing

  • The Hive solution is just to concatenate the files

    • it does not alter or change records.

    • it’s possible to update data in Hive using ORC format

    • With transactional tables in Hive, together with insert, update, and delete, Hive does the “concatenate” for you automatically at regular intervals. Currently this works only with tables in ORC format (stored as ORC)

    • Alternatively, use HBase with Phoenix as the SQL layer on top

    • Hive was originally not designed for updates, because it was purely warehouse-focused; the most recent versions can do updates, deletes, etc. in a transactional way.
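A rough end-to-end Scala sketch of the steps above, assuming Spark 1.x with a HiveContext; the table, column, and partition names are illustrative only, and the ALTER TABLE ... CONCATENATE statement is passed through to Hive as described in the list:

    // 1. A Hive table stored as ORC, using partitioning
    hiveContext.sql(
      """CREATE TABLE IF NOT EXISTS events (
        |  id STRING,
        |  payload STRING)
        |PARTITIONED BY (dt STRING)
        |STORED AS ORC""".stripMargin)

    // 2. Register the incoming batch (a DataFrame) and insert it via SQL
    df.registerTempTable("incoming_events")
    hiveContext.sql(
      """INSERT INTO TABLE events PARTITION (dt = '2016-01-01')
        |SELECT id, payload FROM incoming_events""".stripMargin)

    // 3. Periodically merge the partition's many small files in place;
    //    CONCATENATE only rewrites files, so it is transparent downstream
    hiveContext.sql("ALTER TABLE events PARTITION (dt = '2016-01-01') CONCATENATE")

With transactional (ACID) tables in Hive, mentioned in the list above, Hive runs this compaction for you automatically, so step 3 would not be needed.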

Criteria:

  • Spark Streaming jobs are receiving a lot of small events (avg 10kb)

  • Events are stored to HDFS, e.g. for Pig jobs

  • There are a lot of small files in HDFS (several millions)
