This week saw a shift from feature work to bug fixing, as the team were heads down on stabilising Impala for the upcoming 2.3.0 release.
- 23 commits in the week ending September 5th. (commit 24e68)
- 122 files changed, 7248 insertions(+), 981 deletions(-)
New Parquet tutorial
There's a new tutorial in the Impala documentation about dealing with 'unknown schema'. Impala lets you manage Parquet files whose schema you don't know in advance, by inferring the columns and their types from the files themselves. That means you can load the data (into HDFS) before knowing its precise schema! The tutorial shows how.
Lots of bug fixes
As expected at this stage of a release cycle, a lot of bug fixes were committed as attention turned to building a high-quality 2.3.0 release.
That includes fixes for the Impala shell on CentOS 7.0, a fix to ensure that Thrift RPC endpoints can accept TLS v1.1 and v1.2 connections, and a fix for a particularly egregious bug that caused empty strings to 'poison' a hash function over several values so that it always returned the same value. That bug could, in some cases, lead to skew when Impala was hashing the output of an operator to load-balance amongst upstream consumers: hashing all inputs to the same value of course means that the load is completely unbalanced. That problem has been fixed, but the longer-term plan is to move to a hash function that doesn't have this kind of 'collapsing' behaviour.
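To make the 'poisoning' effect concrete, here is a minimal sketch in Python. It is a toy, not Impala's actual hash function: `toy_hash_value`, `toy_hash_row` and `partition_counts` are hypothetical names, and the multiplicative combiner is chosen only because it collapses in the same general way once one input hashes to zero.

```python
from collections import Counter


def toy_hash_value(value: str) -> int:
    """Toy per-value hash: the sum of the UTF-8 bytes, so '' hashes to 0."""
    return sum(value.encode("utf-8"))


def toy_hash_row(values, seed=17):
    """Toy multi-value combiner (hypothetical, NOT Impala's function).

    Each value's hash is multiplied into the running hash, so a single
    zero factor (the empty string) forces the result to 0 regardless of
    what the other values are.
    """
    h = seed
    for v in values:
        h = (h * toy_hash_value(v)) & 0xFFFFFFFF
    return h


def partition_counts(rows, num_consumers):
    """Count how many rows each consumer receives under hash partitioning."""
    return Counter(toy_hash_row(row) % num_consumers for row in rows)


# Every row has a distinct first column but an empty second column.
skewed_rows = [(f"user{i}", "") for i in range(1000)]
skewed = partition_counts(skewed_rows, num_consumers=4)

# The same rows with a non-empty second column, for comparison.
healthy_rows = [(f"user{i}", f"city{i % 7}") for i in range(1000)]
healthy = partition_counts(healthy_rows, num_consumers=4)

print("with empty strings:", dict(skewed))
print("without empty strings:", dict(healthy))
```

In the skewed case every row lands on the same consumer, which is the load imbalance described above. A combiner that mixes in something for every value, even an empty one, avoids the collapse.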
Explicit column names in WITH clauses
We did manage to sneak in a couple of features. One of them adds explicit column aliases to WITH clauses (i.e. common table expressions), so that you can easily rename columns either in the output or when you refer to them in the body of the main query statement.
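As a rough illustration of what explicit column aliases in a WITH clause look like, the sketch below runs the standard SQL form of the syntax against an in-memory SQLite database from Python. This is an assumption-laden stand-in, not Impala itself: the `sales` table, the `totals` view name and the aliases `area` and `total_amount` are all made up for the example, and Impala's accepted syntax should be checked against its own documentation.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("west", 20), ("east", 5)],
)

# The column list after the WITH-clause name renames the subquery's
# columns: 'region' becomes 'area', and SUM(amount) becomes 'total_amount'.
query = """
WITH totals (area, total_amount) AS (
    SELECT region, SUM(amount) FROM sales GROUP BY region
)
SELECT area, total_amount
FROM totals
WHERE total_amount > 12
ORDER BY area
"""

cursor = conn.execute(query)
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)
conn.close()
```

The aliases are usable both in the main query's WHERE and ORDER BY clauses and as the column names of the final result, which are the two uses mentioned above.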