Using a simple max(time) to find out what is the latest entry is highly discouraged as it leads to a full table scan:
> explain select max(time) from state_vectors_data4;
...
partitions=22001/22001 files=22913 size=4.51TB
If there's no maintenance or technical problems, we can safely assume that the table is continuously filled with new data and the most recent entry is not older than 6 hours.
For this reason, getting the most recent time and partition becomes much more efficient:
> explain select max(time) t from state_vectors_data4 where hour > unix_timestamp(now()) - 6*3600;
...
partitions=2/22002 files=2 size=784.44MB
To make use of this value in your query, we could use the
WITH clause (common table expression, see
docs
), which does what we want in an elegant way:
> explain with curr as (select max(time) t from state_vectors_data4 where hour > unix_timestamp(now()) - 6*3600) select time,icao24,lat,lon FROM curr,state_vectors_data4 WHERE time = t and hour > unix_timestamp(now()) - 6*3600;
Btw, you don't need to GROUP BY here, because there are no nested records involved.
The query takes around 30s for me
... we should be able to do better, so let's take the naiive approach:
> select max(time) t, max(hour) h from state_vectors_data4 where hour > unix_timestamp(now()) - 6*3600;
...
Fetched 1 row(s) in 0.82s
> select time,icao24,lat,lon FROM curr,state_vectors_data4 WHERE time = 1532591999 and hour = 1532588400;
...
Fetched 5836 row(s) in 1.86s
That's much faster, but two queries. It's faster, because common table expressions involve joins, which tend to be slow.
In the end, the following query has equal performance using
IN and a sub query:
select time,icao24,lat,lon FROM state_vectors_data4 WHERE time in (
select max(time) from state_vectors_data4 where hour > unix_timestamp(now()) - 6*3600
) and hour > unix_timestamp(now()) - 6*3600;
...
Fetched 5836 row(s) in 2.17s
Impala is a bit picky when it comes to query optimizations. We have seen many situations where queries were slow although expected otherwise.
It's always worth using
EXPLAIN to check what's actually happening.