What's new in Impala, August 21st 2015

Henry Robinson / Fri 21 August 2015

This week's round-up of new code going into Impala.

Stats:

  • 29 commits in the week ending August 21st.
  • 134 files changed, 7171 insertions(+), 2048 deletions(-)

Nested-loop join for real

After last week's false start, we committed the nested-loop join implementation for real this week. As mentioned in the last update, this is a critical building block for nested-types support. Next steps for NLJ will include using codegen to speed up the inner loop.
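For readers unfamiliar with the technique, here is a minimal Python sketch of a nested-loop join. It is illustrative only and is not Impala's implementation; the function name and the sample data are made up for this example. It shows the inner loop mentioned above: every row on one side is tested against every row on the other, which is why the join condition can be any predicate rather than only an equality.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

L = TypeVar("L")
R = TypeVar("R")


def nested_loop_join(
    left: Iterable[L],
    right: Iterable[R],
    predicate: Callable[[L, R], bool],
) -> Iterator[Tuple[L, R]]:
    """Yield every (left, right) pair for which predicate is true."""
    # Materialise the inner side so it can be rescanned once per outer row.
    right_rows = list(right)
    for left_row in left:
        # This is the inner loop: one predicate evaluation per pair of rows.
        for right_row in right_rows:
            if predicate(left_row, right_row):
                yield (left_row, right_row)


# Hypothetical sample data: (order_id, amount) and (limit,).
orders = [(1, 10), (2, 25)]
limits = [(20,), (30,)]

# A non-equality join condition: amount < limit.
result = list(nested_loop_join(orders, limits, lambda o, l: o[1] < l[0]))
```

Because the predicate runs once per pair of rows, the cost of the inner loop dominates, which is presumably why it is the natural target for codegen.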

CASCADE added to DROP DATABASE

This week also saw the addition of the CASCADE keyword to DROP DATABASE statements. This standard bit of SQL functionality allows you to delete a database and all the tables inside it in one go, for example with DROP DATABASE my_db CASCADE (my_db is a placeholder name).

This patch is particularly great for two reasons. First, it's a useful feature that was missing from Impala for a long time. Second, it's the first patch contributed by the newest intern on the Impala team at Cloudera, Vlad. Welcome, Vlad!

Query generator for Hive!

One of the best quality-testing tools that we have for Impala is the automatic query generator. This tool helps us find bugs or idiosyncrasies in Impala's query execution engine by randomly producing complex SQL queries and running them against both Impala and some gold-standard database (we use PostgreSQL as our default gold standard, given the incredible reputation of its SQL implementation fidelity). If Impala gives different results to the gold standard, it often means there's a bug in Impala, and indeed we've found lots of obscure or otherwise hard-to-discover bugs during long stress runs of the query generator.
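The idea described above can be sketched in a few lines of Python. This is a toy illustration, not the actual query generator: two in-memory SQLite databases stand in for the engine under test and the gold standard, and the table, the query shapes, and all function names are assumptions made up for this example.

```python
import random
import sqlite3


def make_db(rows):
    """Create an in-memory database with one small table of integers."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return conn


def random_query(rng):
    """Build one random aggregate query (far simpler than a real generator)."""
    col = rng.choice(["a", "b"])
    op = rng.choice(["<", "<=", "=", ">=", ">"])
    agg = rng.choice(["COUNT(*)", "SUM(a)", "MIN(b)", "MAX(a + b)"])
    return f"SELECT {agg} FROM t WHERE {col} {op} {rng.randint(0, 9)}"


def find_mismatches(candidate, reference, n_queries, seed=0):
    """Run the same random queries on both engines and collect disagreements."""
    rng = random.Random(seed)  # a fixed seed makes a failing run reproducible
    mismatches = []
    for _ in range(n_queries):
        sql = random_query(rng)
        # Sort rows so that result order alone never counts as a difference.
        got = sorted(candidate.execute(sql).fetchall(), key=repr)
        want = sorted(reference.execute(sql).fetchall(), key=repr)
        if got != want:
            mismatches.append(sql)
    return mismatches


rows = [(i % 10, (i * 7) % 10) for i in range(50)]
engine_under_test = make_db(rows)  # stand-in for the engine being tested
gold_standard = make_db(rows)      # stand-in for the reference database

mismatches = find_mismatches(engine_under_test, gold_standard, 100)
```

Any query that ends up in mismatches is a candidate bug report. In this toy setup both sides are the same engine with the same data, so the list stays empty; the value of the approach comes from pointing the two connections at genuinely different SQL implementations.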

We realised that, of course, more SQL engines could benefit from standing on the shoulders of giants in the same way. So this week saw the first commit in a series to bring the query generator to bear on Apache Hive.

There's a lot more work to do, but this is a terrific start, and the resulting improvements in quality and compatibility between Hive and Impala should help a lot of users out.