What's new in Impala, August 14th 2015

Henry Robinson / Fri 14 August 2015

Welcome to the first of an ongoing series of posts where we take a look at what's been going on in Impala internals in the last week or so.

Analytic window functions: PERCENT_RANK() and friends

Impala learnt about some new window functions - PERCENT_RANK(), NTILE() and CUME_DIST() - this week in the patch for IMPALA-2081. One of the cool things about these functions is that they're all implemented by rewriting queries in terms of existing window functions. See this method as an example of how PERCENT_RANK() can be rewritten in terms of rank() and count(). The benefit of this approach is that there's no new backend machinery to implement and debug, and any future optimisations made to analytic function evaluation will improve all these frontend functions!

Nested-loop join (briefly!)

One of the key moving parts for nested-types is the ability to perform nested loop joins between two tuple sources. The first implementation of NLJ landed in trunk last week. This also allows Impala to execute more varieties of JOIN, including outer joins without an equality predicate (e.g. SELECT * FROM student s, lunch_items i where s.lunch_money <= i.cost).

We found a bug in the way NLJ handles transferring its memory to other nodes, so that commit is reverted from Impala for now - but it should show up again in the next few days!