Hive file format comparison

Apache Hive 0.11+

In this post I'd like to compare the different file formats for Hive, as well as the query execution times depending on file format, compression, and execution engine. As data, I'm using the Uber data set that I also used in my last post. I'm aware that the following query stats are not exhaustive and you may get different results in your environment or with other table formats; you could also try different SerDes for Hive and experiment further with compression. Still, it gives you some idea that both the file format and the execution engine play an important role for Hive's query performance. When choosing a file format, you should also consider data management aspects. For example, if you receive your source files in CSV format, then you will likely process the files in that format at least during the first processing step.

As test queries I used the query to measure the total trip time (query 1) and the query to find all trips ending at San Francisco airport (query 2) from my last post. Here are the results for the file formats I tested:

[Figure: query execution times for query 1 and query 2 per file format]
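For reference, the two queries looked roughly like the following. This is only a sketch: the table name (uber_trips), the column names, and the bounding box values are assumptions, since the actual schema and query details come from the Uber data set in my last post.

    -- Query 1 (sketch): total trip time; uber_trips and trip_duration are assumed names
    SELECT SUM(trip_duration)
    FROM uber_trips;

    -- Query 2 (sketch): all trips ending at San Francisco airport,
    -- approximated here with an illustrative bounding box around SFO
    SELECT *
    FROM uber_trips
    WHERE dropoff_lat BETWEEN 37.60 AND 37.63
      AND dropoff_lon BETWEEN -122.40 AND -122.36;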

Here is some information about the different file formats being used here:

textfile: delimited text file (for example tab-separated fields)

rcfile: Record Columnar File, Hive's internal binary columnar format

orc: columnar storage format (highly compressed, binary, introduced with Hive 0.11)
Link: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html

parquet: columnar storage format (compressed, binary); supported via a plugin since Hive 0.10 and natively in Hive 0.13 and later
Link: http://parquet.incubator.apache.org/

avro: serialization file format from Apache Avro (files contain both schema and data, and tools are available for processing them); supported in Hive since version 0.9
Link: http://avro.apache.org/
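In Hive, the file format is declared per table with the STORED AS clause. A minimal sketch (table and column names are made up for illustration):

    -- plain text file, tab separated
    CREATE TABLE trips_text (id INT, lat DOUBLE, lon DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;

    -- ORC (Hive 0.11+); STORED AS RCFILE and STORED AS PARQUET work the same way
    CREATE TABLE trips_orc (id INT, lat DOUBLE, lon DOUBLE)
      STORED AS ORC;

    -- Avro: the STORED AS AVRO shorthand needs Hive 0.14+;
    -- older versions declare the AvroSerDe explicitly instead
    CREATE TABLE trips_avro (id INT, lat DOUBLE, lon DOUBLE)
      STORED AS AVRO;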

For the table create/write time I measured a "create table as select" (CTAS) into the specific format. As you can see, the resulting size of the table depends a lot on the file format; the columnar ORC format compresses the data very efficiently:

[Figure: table create/write time and resulting table size per file format]
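As a sketch, a CTAS into ORC looked roughly like this (trips is an assumed name for the source table; the orc.compress property is shown only for illustration, since ZLIB is already the default):

    -- copy an existing table into ORC format
    CREATE TABLE trips_orc
      STORED AS ORC
      TBLPROPERTIES ('orc.compress' = 'ZLIB')  -- ZLIB is the default; SNAPPY and NONE are alternatives
      AS SELECT * FROM trips;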

Using Tez as the execution engine (set hive.execution.engine=tez) results in much better performance than MapReduce (set hive.execution.engine=mr). The total time for the two queries is shown in this table:

[Figure: total query time per file format, Tez vs. MapReduce]
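Switching the engine is just a session-level setting, so the same query can be timed under both engines. For example (trips_orc and trip_duration are assumed names):

    SET hive.execution.engine=mr;   -- classic MapReduce
    SELECT SUM(trip_duration) FROM trips_orc;

    SET hive.execution.engine=tez;  -- Tez
    SELECT SUM(trip_duration) FROM trips_orc;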

In MapReduce mode, query time does not seem to depend much on the file format being used:

[Figure: query times per file format with MapReduce]

However, when running the queries on Tez, you'll see a significant difference between file formats like Parquet and ORC (with ORC being about 30% faster):

[Figure: query times per file format with Tez]
