
Using Spark SQL to update data in Hive using ORC files

This example came up on the Spark users mailing list.

  • FIXME Offer a complete working solution in Scala

  • FIXME Load ORC files into dataframe

    • val df = hiveContext.read.format("orc").load("to/path")

The solution was to use a Hive table in ORC format with partitions:

  • A table in Hive stored as an ORC file (using partitioning)

  • Using SQLContext.sql to insert data into the table

  • Using SQLContext.sql to periodically run ALTER TABLE ... CONCATENATE to merge the many small files into larger files sized for your HDFS block size

    • Since the CONCATENATE command operates on files in place, it is transparent to any downstream processing

  • The Hive solution is just to concatenate the files

    • It does not alter or change records.

    • It is possible to update data in Hive using the ORC format.

    • With transactional tables in Hive, together with insert, update, and delete, Hive does the “concatenate” for you automatically at regular intervals. Currently this works only with tables stored as ORC.

    • Alternatively, use HBase with Phoenix as the SQL layer on top.

    • Hive was not originally designed for updates because it was purely warehouse-focused; recent versions can do updates, deletes, etc. in a transactional way.
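The three steps in the list above (a partitioned ORC table, inserts, periodic CONCATENATE) can be sketched as plain HiveQL strings built in Scala. This is only a minimal sketch under stated assumptions: the table `events`, its columns, the partition column `dt`, and the source table `staging_events` are hypothetical names, and `runSql` stands in for whatever executes the statement (for example `hiveContext.sql(_)`). Whether ALTER TABLE ... CONCATENATE is accepted depends on the Spark version; some releases may reject it, in which case the statement would have to be run through Hive itself.

```scala
// Sketch only: every table and column name here is hypothetical.
object OrcCompactionSketch {

  // Hive DDL for a partitioned table stored as ORC.
  def createTable(table: String): String =
    s"CREATE TABLE IF NOT EXISTS $table (id STRING, payload STRING) " +
      "PARTITIONED BY (dt STRING) STORED AS ORC"

  // Append rows from a source table into one partition.
  def insertInto(table: String, dt: String, source: String): String =
    s"INSERT INTO TABLE $table PARTITION (dt='$dt') " +
      s"SELECT id, payload FROM $source"

  // Merge the small ORC files of one partition into larger ones.
  def concatenate(table: String, dt: String): String =
    s"ALTER TABLE $table PARTITION (dt='$dt') CONCATENATE"

  // `runSql` is an assumption: pass e.g. `s => hiveContext.sql(s)`.
  def loadAndCompact(
      runSql: String => Unit,
      table: String,
      dt: String,
      source: String): Unit = {
    runSql(createTable(table))
    runSql(insertInto(table, dt, source))
    runSql(concatenate(table, dt))
  }
}
```

Passing `println` as `runSql` prints the three statements without touching a cluster, which is a cheap way to review them before scheduling the compaction step.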


The scenario that prompted the question:

  • Spark Streaming jobs receive a lot of small events (about 10 KB on average)

  • Events are stored in HDFS, e.g. for Pig jobs

  • There are a lot of small files in HDFS (several million)
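To get a feel for the scale of the small-files problem described above, here is a rough back-of-the-envelope calculation in Scala. The figure of 3,000,000 files is a hypothetical stand-in for "several million", and the 128 MB block size is an assumption about the cluster configuration. The result is an idealized lower bound: it ignores partition boundaries and ORC compression.

```scala
object SmallFilesEstimate {
  // Minimum number of block-sized files needed to hold the same bytes
  // (ceiling division; ignores partitioning and compression).
  def mergedFileCount(files: Long, avgBytes: Long, blockBytes: Long): Long = {
    val totalBytes = files * avgBytes
    (totalBytes + blockBytes - 1) / blockBytes
  }

  def main(args: Array[String]): Unit = {
    val files = 3000000L                // hypothetical: "several million"
    val avgBytes = 10L * 1024           // about 10 KB per event file
    val blockBytes = 128L * 1024 * 1024 // assumed HDFS block size
    println(mergedFileCount(files, avgBytes, blockBytes)) // prints 229
  }
}
```

Under these assumptions, roughly 30 GB spread over three million files would fit in a few hundred block-sized files after merging.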





