×

Notice

The forum is in read only mode.

Query for recent time

6 years 3 months ago #513 by labeeba
I have the following query which specifies the time and hour values equal to 1532556000

SELECT time,icao24,lat,lon FROM state_vectors_data4 WHERE time>= 1532556000 AND hour>= 1532556000 GROUP BY time,icao24,lat,lon,velocity,heading,vertrate,callsign,onground,alert,spi,squawk,baroaltitude,geoaltitude,lastposupdate,lastcontact ORDER BY time asc;

Is it possible to write a query which retrieves data for the most recent time instead? such as time= MAX(time)?
6 years 3 months ago #514 by fuchs
Replied by fuchs on topic Query for recent time
Using a simple max(time) to find out what is the latest entry is highly discouraged as it leads to a full table scan:
> explain select max(time) from state_vectors_data4;
...
partitions=22001/22001 files=22913 size=4.51TB

If there's no maintenance or technical problems, we can safely assume that the table is continuously filled with new data and the most recent entry is not older than 6 hours.
For this reason, getting the most recent time and partition becomes much more efficient:
> explain select max(time) t from state_vectors_data4 where hour > unix_timestamp(now()) - 6*3600;
...
partitions=2/22002 files=2 size=784.44MB

To make use of this value in your query, we could use the WITH clause (common table expression, see docs ), which does what we want in an elegant way:
> explain with curr as (select max(time) t from state_vectors_data4 where hour > unix_timestamp(now()) - 6*3600) select time,icao24,lat,lon FROM curr,state_vectors_data4 WHERE time = t and hour > unix_timestamp(now()) - 6*3600;

Btw, you don't need to GROUP BY here, because there are no nested records involved.
The query takes around 30s for me :( ... we should be able to do better, so let's take the naiive approach:
> select max(time) t, max(hour) h from state_vectors_data4 where hour > unix_timestamp(now()) - 6*3600;
...
Fetched 1 row(s) in 0.82s

> select time,icao24,lat,lon FROM curr,state_vectors_data4 WHERE time = 1532591999 and hour = 1532588400;
...
Fetched 5836 row(s) in 1.86s

That's much faster, but two queries. It's faster, because common table expressions involve joins, which tend to be slow.

In the end, the following query has equal performance using IN and a sub query:
select time,icao24,lat,lon FROM state_vectors_data4 WHERE time in (
  select max(time) from state_vectors_data4 where hour > unix_timestamp(now()) - 6*3600
) and hour > unix_timestamp(now()) - 6*3600;
...
Fetched 5836 row(s) in 2.17s

Impala is a bit picky when it comes to query optimizations. We have seen many situations where queries were slow although expected otherwise.
It's always worth using EXPLAIN to check what's actually happening.
The following user(s) said Thank You: labeeba
6 years 3 months ago #515 by labeeba
Replied by labeeba on topic Query for recent time
Thank you so much!

This is what my query looks like now:

select time,icao24,lat,lon,GROUP_CONCAT(CAST(serials.item AS STRING),',') FROM state_vectors_data4, state_vectors_data4.serials WHERE time in (
select max(time) from state_vectors_data4,state_vectors_data4.serials where hour > unix_timestamp(now()) - 6*3600
) and hour > unix_timestamp(now()) - 6*3600 GROUP BY time,icao24,lat,lon,velocity,heading,vertrate,callsign,onground,alert,spi,squawk,baroaltitude,geoaltitude,lastposupdate,lastcontact ORDER BY time asc;

I am trying to develop a real-time web app that plots lat and lon values to display areas of the sky covered by atleast 4 receivers (for multilateration purposes).

If there are any additional query optimizations or other helpful resources for developing real-time apps using OpenSky data , please let me know.
Also, if I used the OpenSky REST or Python API instead, is it possible to retrieve the same data as the above query?
Powered by Kunena Forum
This website uses cookies to offer you the best experience of our services. By using this website you agree to our privacy policy!