What's new in Impala, September 5th, 2015

Henry Robinson / Sat 05 September 2015

This week saw a shift from feature work to bug fixing, as the team was heads-down stabilising Impala for the upcoming 2.3.0 release.

Stats:

  • 23 commits in the week ending September 5th. (commit 24e68)
  • 122 files changed, 7248 insertions(+), 981 deletions(-)

New Parquet tutorial

There's a new tutorial in the Impala documentation about dealing with 'unknown schema'. Impala allows you to flexibly manage Parquet files where you don't a priori know the schema, by inferring the columns and their types from the files themselves. That means you can load the data (into HDFS) before knowing its precise schema! This tutorial shows how.

Lots of bug fixes

As expected at this stage of a release cycle, a lot of bug fixes were committed as attention turned to building a high-quality 2.3.0 release.

That includes fixes for the Impala shell on CentOS 7.0, a fix to ensure that Thrift RPC endpoints can accept TLS v1.1 and v1.2 connections, and a fix for a particularly egregious bug that caused empty strings to 'poison' a hash function over several values, so that it always returned the same value. That last bug could, in some cases, lead to skew when Impala was hashing the output of an operator to load-balance amongst upstream consumers: if every input hashes to the same value, every row is sent to the same consumer and the load is completely unbalanced. That problem has been fixed, but the longer-term plan is to move to a hash function that doesn't have this kind of 'collapsing' behaviour.
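To make the 'collapsing' failure mode concrete, here is a minimal Python sketch. It is not Impala's hash function: toy_value_hash, collapsing_row_hash and safer_row_hash are hypothetical names invented for illustration, and the multiplicative combining step is simply an assumed way to reproduce the symptom, where one empty string forces the whole row hash to a constant.

```python
MASK = 0xFFFFFFFF  # keep hashes within 32 bits


def toy_value_hash(value: str) -> int:
    """Toy per-value hash (hypothetical). The empty string hashes to 0."""
    h = 0
    for byte in value.encode("utf-8"):
        h = (h * 31 + byte) & MASK
    return h


def collapsing_row_hash(row) -> int:
    """Combine value hashes by multiplication (assumed, for illustration).

    A single empty string contributes 0, which zeroes the whole product.
    """
    h = 1
    for value in row:
        h = (h * toy_value_hash(value)) & MASK
    return h


def safer_row_hash(row) -> int:
    """Additive combining step: an empty string still changes the state
    but cannot wipe out what the other values contributed."""
    h = 17
    for value in row:
        h = (h * 31 + toy_value_hash(value) + 1) & MASK
    return h


def partitions_used(hash_fn, rows, num_consumers: int) -> set:
    """Which consumers receive at least one row under hash partitioning."""
    return {hash_fn(row) % num_consumers for row in rows}


# Every row has a distinct first column and an empty second column.
rows = [(str(i), "") for i in range(100)]

collapsed = partitions_used(collapsing_row_hash, rows, 4)
spread = partitions_used(safer_row_hash, rows, 4)
```

With the multiplicative combiner every row lands on a single consumer, while the additive one spreads the same rows across several.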

Explicit column names in WITH clauses

We did manage to sneak in a couple of features though. IMPALA-898 adds explicit column aliases to WITH clauses (i.e. common table expressions), so that you can easily rename columns either in the output, or when you refer to them in the body of the main query statement.
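As a rough illustration of what explicit column names in a WITH clause look like, the sketch below runs a standard-SQL common table expression using Python's built-in sqlite3 module. SQLite is only a stand-in here, not Impala, and the sales table and its columns are hypothetical. The assumption is that Impala accepts the same standard form, WITH name (col1, col2) AS (...).

```python
import sqlite3

# In-memory SQLite database used as a stand-in; this is not Impala.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 5), ("west", 7)],
)

# The column list after the CTE name renames the two output columns of the
# inner SELECT, so the main query can refer to them as area / total_amount.
query = """
WITH totals (area, total_amount) AS (
    SELECT region, SUM(amount) FROM sales GROUP BY region
)
SELECT area, total_amount
FROM totals
WHERE total_amount > 8
ORDER BY area
"""

cursor = conn.execute(query)
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()
```

The renamed columns are visible both in the body of the main query (the WHERE clause filters on total_amount) and in the names of the result columns.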