It started with an innocent tweet in response to a blog post on how to optimize top-k queries.
My colleague, Shrikanth, pointed out that Hive does not, in fact, have this optimization.
After a couple of months, I finally got a chance to implement the optimization and that is the topic of this blog.
Here’s an example of a top-k query:
SELECT * FROM T ORDER BY a DESC LIMIT 10
The user is interested in knowing the rows of T with the top 10 values of column a. You can imagine other variants where T is a derived table or a view that could encompass other computations or aggregations.
How does Hive execute such queries?
In simple terms, the above query is executed using the following steps.
- A number of map tasks read parts of table T
- Each map task sorts its portion of the data on a and writes it to disk
- There is a single reducer that reads data from all the mappers and merges these in order of a
- The reducer invokes a limit operator on the merged stream which allows only 10 rows to pass-through
The query is interested in only the top 10 rows yet all rows of T are being sorted, written to disk and transmitted to the reducer, and merged. If T is very large, this can take a very long time.
Every row in the final top 10 must have been part of the top 10 rows out of some mapper. This key observation allows us to restrict each mapper to sending only their top 10 rows. This substantially reduces the I/O cost involved in executing the above query. The way we implemented this solution inside the Qubole Data Service was split between Hive and Hadoop. In Hadoop, we introduced a new parameter “map.sort.limitrecords” which limits the number of records each mapper outputs. Hive creates a map-reduce job corresponding to this query. In Hive, we identified the top-K query pattern and set this parameter in the created job to the appropriate value. The advantage of this approach is that regular map-reduce jobs that want the top-k property can also take advantage of this optimization by setting this parameter.
We demonstrate the performance improvements using a synthetic dataset with 10 million rows that has 1 million unique entries for column v. We compare the performance of two queries:
Q1: SELECT v, COUNT(*) c FROM T GROUP BY v ORDER BY c DESC LIMIT 10
Q2: SELECT * FROM T ORDER BY v LIMIT 10
The first query is interested in an attribute that is repeated most frequently whereas the second query is interested in top-10 rows. Both queries benefit from the optimization, albeit to different degrees. Q1 seems an improvement of about 10% while Q2 is improved by 7.5x! Note that the first query spends a long time in the grouping phase and the effect of the optimization is subdued. The second query benefits considerably from the optimization.
We are helping get a similar optimization in Apache Hive tracked here. This optimization is available as part of the Qubole Data Service. Signup today to get an account and give it a ride! If these sorts of problems interest you, ping us at [email protected]. We’re hiring!