Using Spark SQL to update data in Hive using ORC files
This example showed up on Spark's users mailing list.
The solution was to use Hive in ORC format with partitions:
- A table in Hive stored as an ORC file (using partitioning)
- Using `SQLContext.sql` to insert data into the table
- Using `SQLContext.sql` to periodically run `ALTER TABLE...CONCATENATE` to merge your many small files into larger files optimized for your HDFS block size (see the sketch after this list)
  - Since the `CONCATENATE` command operates on files in place, it is transparent to any downstream processing
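A minimal sketch of that approach, assuming an existing `SparkContext` (`sc`) and a `HiveContext`-backed `SQLContext`; the `events` table, its `dt` partition column, and the `incoming_events` source table are hypothetical names used only for illustration:

```scala
import org.apache.spark.sql.hive.HiveContext

// Assumes an existing SparkContext `sc`; table, column and partition names are illustrative
val sqlContext = new HiveContext(sc)

// A partitioned Hive table stored as ORC
sqlContext.sql("""
  CREATE TABLE IF NOT EXISTS events (id STRING, payload STRING)
  PARTITIONED BY (dt STRING)
  STORED AS ORC""")

// Insert a (small) batch of incoming events into a partition
sqlContext.sql("""
  INSERT INTO TABLE events PARTITION (dt = '2016-01-01')
  SELECT id, payload FROM incoming_events""")

// Periodically merge the many small ORC files of a partition into larger files
// sized for the HDFS block size; CONCATENATE rewrites the files in place, so
// downstream readers are unaffected
sqlContext.sql("ALTER TABLE events PARTITION (dt = '2016-01-01') CONCATENATE")
```

In a Spark Streaming job, `incoming_events` could be a temporary table registered from each micro-batch's DataFrame before the insert.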
- The Hive solution is just to concatenate the files
  - it does not alter or change records
- It is possible to update data in Hive using the ORC format
  - With transactional tables in Hive, together with insert, update and delete, Hive does the "concatenate" for you automatically at regular intervals. Currently this works only with tables stored as ORC (`STORED AS ORC`); see the sketch after this list.
- Alternatively, use HBase with Phoenix as the SQL layer on top
- Hive was originally not designed for updates because it was purely warehouse-focused; the most recent versions can do updates, deletes, etc. in a transactional way.
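A minimal sketch of the transactional-table variant, assuming Hive 0.14+ with ACID enabled (`hive.support.concurrency`, `DbTxnManager`, and the compactor configured) and the Hive JDBC driver on the classpath. Spark's `SQLContext` cannot issue `UPDATE` or `DELETE`, so the statements below go straight to HiveServer2 over JDBC; the connection URL, table and column names are illustrative:

```scala
import java.sql.DriverManager

// Hypothetical HiveServer2 URL; requires the Hive JDBC driver on the classpath
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "")
val stmt = conn.createStatement()

// Transactional tables must be bucketed and stored as ORC
stmt.execute("""
  CREATE TABLE IF NOT EXISTS events_txn (id STRING, payload STRING)
  CLUSTERED BY (id) INTO 8 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional' = 'true')""")

// Updates and deletes are allowed on such tables; Hive's compactor merges the
// resulting delta files into larger base files automatically at regular intervals
stmt.execute("UPDATE events_txn SET payload = 'updated' WHERE id = '42'")
stmt.execute("DELETE FROM events_txn WHERE id = '43'")

stmt.close()
conn.close()
```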
Criteria:
- Spark Streaming jobs are receiving a lot of small events (avg 10 kB)
- Events are stored to HDFS, e.g. for Pig jobs
- There are a lot of small files in HDFS (several million)