A Quick Guide To OpenSky's Impala Shell

Besides our public API, we also grant free access to our full dataset over an SQL-like query interface to researchers and partners. If you don't have access to this free premium service yet, but good reasons to get it, you can apply for access by submitting this application form (account+login required). Note that this interface is mainly intended to serve university-affiliated researchers, governmental organisations, or aviation authorities as a data source for aviation-related research or incident investigation. If you are a private or commercial entity, please contact us for a license.

Since processing large queries is expensive in terms of resources and cluster energy consumption, we reserve the right to reject applications which lack sufficient reason to gain access. We will also revoke access rights in case of misuse.

The technology which runs behind this interface is Cloudera Impala. Similar to good old relational databases, Impala allows you to filter, aggregate, and combine data. Impala is backed with an HDFS cluster and is capable of processing the terabytes of Mode S/ADS-B communication data we collected over the years.

This tutorial will show you how to use our Impala shell and will provide some examples on how to write OpenSky-specific queries, how results look like, and which data is available. During this tutorial, I will assume that you are familiar with the Impala documentation and that you have some basic knowledge/experience with relational databases such as MySQL.

Note that there is also a Python interface for the Impala shell available in Xavier Olive's github repository.

1. Connecting to the shell

You can access the Impala shell over SSH with any SSH client. If you are on a Windows machine, use clients such as Putty for example. The address of the server is data.opensky-network.org and the Impala shell is running on port 2230. Use the same credentials to login as on our website. If you are on a Unix-like system (Linux/Mac OSX), you can access the shell by running the following command in a terminal:

ssh -p 2230 -l USERNAME data.opensky-network.org

Don’t forget to replace “USERNAME” with your username. If access to the Impala shell is activated for your account, you’ll find yourself confronted with a prompt like this:

[hadoop-2:21000] >

This means that everything went alright and you are successfully logged in. Note that if you don't have the rights to access the Impala shell, the server will close the connection without any notification.

2. Exploring the database

Let’s start with exploring the database and see which data is available. To get a list of all available tables, run the following command:

[hadoop-2:21000] > show tables;
Query: show tables
+--------------------------+
| name                     |
+--------------------------+
| acas_data4               |
| allcall_replies_data4    |
| identification_data4     |
| operational_status_data4 |
| position_data4           |
| rollcall_replies_data4   |
| sensor_visibility        |
| sensor_visibility_data3  |
| state_vectors            |
| state_vectors_data3      |
| state_vectors_data4      |
| velocity_data4           |
+--------------------------+
Fetched 12 row(s) in 0.04s

If you are familiar with SSR Mode S downlink communications, the table names should ring a bell. I won’t cover every single table in this tutorial as it’s rather intended as a quick guide on how to use this interface and the data. Those of you who are interested in the low-level tables (e.g. in rollcall replies) should know their meanings anyway. For more information and a comprehensible introduction to SSR Mode S, I recommend reading the part on SSR Mode S of Christian Wolff’s radar tutorial. However, depending on what you intend to do, you might not need this knowledge as using the state vectors is pretty straightforward as you'll see in a minute and doesn't require detailed knowledge about the underlying technology.

The first important thing to observe is the ending of the tables. They indicate the version of the batch layer implementation which generated the data. As of this writing, we are running the 4th implementation and you can consider everything else deprecated and buggy and it might be removed in the future.

From my experience, most of you are interested in what we call state vectors. These are basically summaries of all other tables containing most common tracking information. I’ll cover this table as an example for how to explore the database. You’ll be able to apply the same methodology to the other tables as well. Let’s first have a closer look at the table structure:

[hadoop-2:21000] > describe state_vectors_data4;
Query: describe state_vectors_data4
+---------------+------------+-----------------------------+
| name          | type       | comment                     |
+---------------+------------+-----------------------------+
| time          | int        | Inferred from Parquet file. |
| icao24        | string     | Inferred from Parquet file. |
| lat           | double     | Inferred from Parquet file. |
| lon           | double     | Inferred from Parquet file. |
| velocity      | double     | Inferred from Parquet file. |
| heading       | double     | Inferred from Parquet file. |
| vertrate      | double     | Inferred from Parquet file. |
| callsign      | string     | Inferred from Parquet file. |
| onground      | boolean    | Inferred from Parquet file. |
| alert         | boolean    | Inferred from Parquet file. |
| spi           | boolean    | Inferred from Parquet file. |
| squawk        | string     | Inferred from Parquet file. |
| baroaltitude  | double     | Inferred from Parquet file. |
| geoaltitude   | double     | Inferred from Parquet file. |
| lastposupdate | double     | Inferred from Parquet file. |
| lastcontact   | double     | Inferred from Parquet file. |
| serials       | array<int> | Inferred from Parquet file. |
| hour          | int        |                             |
+---------------+------------+-----------------------------+
Fetched 18 row(s) in 0.05s

Units

Units are unified across all tables, so if you find columns with the same content (e.g. timestamps) in other tables, they’ll have the same unit. In general, we are using the metric system in our data as it’s straight-forward and easily understood by non-aviation experts. That means that speed is always in meters per second, distances or lengths (e.g. altitude) are in meters, time in seconds and so on.

The Columns

Let’s go through the columns from top to bottom for a particular (yet random) state vector:

[hadoop-2:21000] > select * from state_vectors_data4 where hour=1480759200 order by rand() limit 1;
Query: select * from state_vectors_data4 where hour=1480759200 order by rand() limit 1
+------------+--------+-------------------+--------------------+-------------------+-------------------+----------+----------+----------+-------+-------+--------+-------------------+-------------------+----------------+----------------+------------+
| time       | icao24 | lat               | lon                | velocity          | heading           | vertrate | callsign | onground | alert | spi   | squawk | baroaltitude      | geoaltitude       | lastposupdate  | lastcontact    | hour       |
+------------+--------+-------------------+--------------------+-------------------+-------------------+----------+----------+----------+-------+-------+--------+-------------------+-------------------+----------------+----------------+------------+
| 1480760792 | a0d724 | 37.89463883739407 | -88.93331113068955 | 190.8504039695975 | 265.8263544365708 | 0        | UPS858   | false    | false | false | 7775   | 9144              | 9342.120000000001 | 1480760704.359 | 1480760709.997 | 1480759200 |
+------------+--------+-------------------+--------------------+-------------------+-------------------+----------+----------+----------+-------+-------+--------+-------------------+-------------------+----------------+----------------+------------+
Fetched 1 row(s) in 1.48s

time

This column contains the unix (aka POSIX or epoch) timestamp for which the state vector was valid. You’ll find one state vector per second for each aircraft which was active within the coverage of OpenSky at that particular second. For more information on how these state vectors are generated, please refer to the API documentation. In the above example, the time is 1480760792 which means that we are looking at a state vector that was valid on Saturday, 03-Dec-16 10:26:32 UTC. Tip: There are online tools available for converting unix timestamps, just use Google.

icao24

This column contains the 24-bit ICAO transponder ID which can be used to track specific airframes over different flights. This ID should never change during a registration period of an airframe, which doesn't change very often. So if you are looking for a particular aircraft, try to find out its 24-bit transponder ID and filter by this column. In our data, it’s represented as a 6 digit hexadecimal number (string). In our case, we are looking at the state of an aircraft using the transponder ID a0d724. If you look it up on databases like airframes.org, you’ll find out that this transponder ID is used by an Airbus A306 owned by UPS. You will find this column in all tables.

lat/lon

These column contain the last known latitude and longitude of the aircraft. Coordinates are stored as decimal WGS84 coordinates. So here is what we know so far: On Saturday, 03-Dec-16 at 10:26:32 UTC, the UPS aircraft with transponder ID a0d724 was at position 37.89463883739407,-88.93331113068955. If you look it up on Google maps, it’s somewhere in Illinois in the US.

velocity

This column contains the speed over ground of the aircraft in meters per second. In our example, the UPS aircraft flew over Illinois at a speed of 190.8504039695975 meters per second.

heading

This column represents the direction of movement (track angle) as the clockwise angle from the geographic north. Just a little side note for the aviation experts: you might want to complain that the term “heading” is not perfectly correct (if you are one of them you know what I mean) and something like “track” or “track angle” might be more correct. You are absolutely right and this might change in future releases, but for legacy reasons it’s called “heading” for now. In our case, the aircraft flew into the direction 265.8263544365708° clockwise from geographic north, or in other words, to the west.

vertrate

This column contains the vertical speed of the aircraft in meters per second. A negative number indicates that the aircraft was descending, a positive number indicates a ascend respectively. In the above example, the UPS aircraft was neither ascending nor descending.

callsign

This column contains the callsign that was broadcast by the aircraft. Most airlines indicate the airline and the flight number in the callsign, but there is no unified system. In our example, the callsign indicates that this state vector belongs to UPS flight 858. By looking up the flightnumber on services like flightaware.com, you’ll find out that this flight goes from Lousville to Phoenix every day.

onground

This flag indicates whether the aircraft is broadcasting surface positions (true) or airborne positions (false). Our UPS aircraft was airborne.

alert/spi

These two flags are special indicators used in ATC. If you need them, you’ll know what they mean.

squawk

This 4-digit octal number is another transponder code which is used by ATC and pilots for identification purposes and indication of emergencies. Usually, ATC assigns squawks to aircraft when they enter their airspace via radio. In the above example, the UPS flight was assigned squawk “7775”. See e.g. Wikipedia for a list of special purpose squawks.

baroaltitude/geoaltitude

These two columns indicate the aircraft’s altitudel. As the names suggest, baroaltitude is the altitude measured by the barometer and depends on factors such as weather, whereas geoaltitude is determined using the GNSS (GPS) sensor. In our case, the aircraft was flying at a geometric altitude (or height) of 9342.12 meters and a barometric altitude of 9144 meters. That makes a difference of almost 200 meters. You are likely to observe similar differences for aircraft in spatial and temporal vicinity. Note that due to its importance in aviation, barometric altitude will almost always be present, while the geometric altitude depends on the equipage of the aircraft.

lastposupdate

This unix timestamp indicates the age of the position. The position of the state vector above was already 87.64 seconds old at the time when the state vector was created (time) and should not be used any longer.

lastcontact

This unix timestamp indicates the time at which OpenSky received the last signal of the aircraft. As long as the aircraft is flying in an airspace which is well-covered by OpenSky’s receivers this timestamp should never be older than 1-2 seconds compared to the state vectors timestamp (time). Apparently, OpenSky’s coverage in Illinois was not too good in December 2016 since the last contact indicates that the aircraft left the covered airspace already 82 seconds ago. OpenSky continues generating state vectors for 300 seconds after the last contact. Depending on your application, you can filter state vectors which are, e.g., older than 15 by adding a WHERE-clause to your query saying “WHERE time-lastcontact<=15”. The relationship between the three timestamps explained so far is time > lastcontact >= lastposupdate.

hour

Since batch layer version 4, the data is processed and partitioned in hourly batches. This timestamp marks the beginning of the hour to which the data belongs.

The Other Tables

There are only a couple of things left to say concerning the other tables. In most of the other tables you will find the columns mintime, maxtime, and msgcount. These columns are used to summarize messages which were received multiple times. As the names suggest, the respective message was received at mintime for the first time and at maxtime for the last time. In total, the respective row in the table summarizes msgcount duplicates that were received between mintime and maxtime.

If you want to learn more about the individual columns of each table, you should refer to the standard describing the respective Mode S downlink formats. Specifically, the values of the SSR Mode S tables acas_data4, allcall_replies_data4, and rollcall_replies_data4 are specified in ICAO’s Annex 10 Volume 4 and the ADS-B tables identification_data4, operational_status_data4, position_data4, and velocity_data4 are defined in RCTA’s DO-260B.

For the sake of completeness, let's go quickly through one example line, say, from the allcall_replies_data4 table:

[hadoop-2:21000] > SELECT * FROM allcall_replies_data4 LIMIT 1;
Query: select * FROM allcall_replies_data4 LIMIT 1
+----------------+----------------+----------------+----------+--------+--------------+--------------+------------+
| rawmsg         | mintime        | maxtime        | msgcount | icao24 | capabilities | interrogator | hour       |
+----------------+----------------+----------------+----------+--------+--------------+--------------+------------+
| 5d400fbab21825 | 1478293200.038 | 1478293636.342 | 36       | 400fba | 5            | 00003a       | 1478293200 |
+----------------+----------------+----------------+----------+--------+--------------+--------------+------------+
Fetched 1 row(s) in 0.32s

This row means that on Friday, 04-Nov-16 between 21:00:00 and 21:07:16 UTC, OpenSky received 36 replies from the aircraft using the transponder ID 400fba to all-call interrogations from an interrogator with the ID 00003a. The aircraft reported a capability of 5, which means according to ICAO Annex 10 Vol. 4, section 3.1.2.5.2.2.1 that the transponder level is 2 or above and the aircraft is airborne.

3. Boosting Query Performance

Alright, you just learned which information you can get from the state_vectors_data4 table. Now it’s time to talk about partitioning. As you might have noticed in the above example query, I added a WHERE-clause for the hour column. This is extremely important if you want your queries to terminate within a reasonable time (seconds) due to the following facts.

Impala is a distributed query engine and does not offer index structures for query optimization. As such, all your queries will make it process data in a sequential manner. As you can imagine, our dataset is quite large and sequential filtering/searching of the full dataset is super expensive. To overcome this limitation and to enable efficient queries, the data is partitioned in hour-batches by the hour-field. As soon as you provide a WHERE-clause for a specific hour or a couple of hours, Impala knows exactly where to look at instead of searching through the whole dataset.

Here is a simple demonstration of the difference. Let’s try to answer the following question: How many state vectors of the UPS aircraft from the previous section (icao24 a0d724) have been received on Saturday, December 3rd between 10:15 and 11:30 UTC. A straightforward but WRONG way to translate this question into an Impala query would be:

  1. Saturday, December 3rd, 10:15 UTC in unix time: 1480760100
  2. Saturday, December 3rd, 11:30 UTC in unix time: 1480764600
  3. The wrong query:
    SELECT COUNT(*) FROM state_vectors_data4 WHERE icao24='a0d724' AND time>=1480760100 AND time<=1480764600;

Producing the result (2973) took about 40 seconds. While this is still a reasonable amount of time, queries are usually more complex and you will rather observe processing times in the order of hours if you don’t make use of partitioning. The RIGHT way would have been:

  1. Saturday, December 3rd, 10:15 UTC in unix time: 1480760100
  2. The “hour” in which the interval starts: 1480760100-(1480760100%3600) = 1480759200
  3. Saturday, December 3rd, 11:30 UTC in unix time: 1480764600
  4. The “hour” in which the interval ends: 1480764600-(1480764600%3600) = 1480762800
  5. The right query:
    SELECT COUNT(*) FROM state_vectors_data4 WHERE icao24='a0d724' AND time>=1480760100 AND time<=1480764600 AND hour>=1480759200 AND hour<=1480762800;

This time it only took 1.17 seconds to produce the result. That’s quite a difference, isn’t it? Before you run any query you should always check how many partitions will be considered by Impala using the EXPLAIN directive. For the bad example above, it says:

[hadoop-2:21000] > EXPLAIN SELECT COUNT(*) FROM state_vectors_data4 WHERE icao24='a0d724' AND time>=1480760100 AND time<=1480764600;
Query: explain SELECT COUNT(*) FROM state_vectors_data4 WHERE icao24='a0d724' AND time>=1480760100 AND time<=1480764600
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=58.00MB VCores=1                           |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| opensky.state_vectors_data4                                                        |
|                                                                                    |
| 03:AGGREGATE [FINALIZE]                                                            |
| |  output: count:merge(*)                                                          |
| |                                                                                  |
| 02:EXCHANGE [UNPARTITIONED]                                                        |
| |                                                                                  |
| 01:AGGREGATE                                                                       |
| |  output: count(*)                                                                |
| |                                                                                  |
| 00:SCAN HDFS [opensky.state_vectors_data4]                                         |
|    partitions=1075/1075 files=8600 size=114.13GB                                   |
|    predicates: time >= 1480760100, time <= 1480764600, icao24 = 'a0d724'           |
+------------------------------------------------------------------------------------+
Fetched 15 row(s) in 0.10s

As you can see, the query scans all 1075 partitions. Where the good query only scans 2 partitions:

[hadoop-2:21000] > EXPLAIN SELECT COUNT(*) FROM state_vectors_data4 WHERE icao24='a0d724' AND time>=1480760100 AND time<=1480764600 AND hour>=1480759200 AND hour<=1480762800;
Query: EXPLAIN SELECT COUNT(*) FROM state_vectors_data4 WHERE icao24='a0d724' AND time>=1480760100 AND time<=1480764600 AND hour>=1480759200 AND hour<=1480762800
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=74.00MB VCores=1                           |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| opensky.state_vectors_data4                                                        |
|                                                                                    | | 03:AGGREGATE [FINALIZE]                                                            | | |  output: count:merge(*)                                                          | | |                                                                                  | | 02:EXCHANGE [UNPARTITIONED]                                                        | | |                                                                                  | | 01:AGGREGATE                                                                       | | |  output: count(*)                                                                | | |                                                                                  | | 00:SCAN HDFS [opensky.state_vectors_data4]                                         | |    partitions=2/1075 files=16 size=360.58MB                                        | |    predicates: time >= 1480760100, time <= 1480764600, icao24 = 'a0d724'           | +------------------------------------------------------------------------------------+ Fetched 15 row(s) in 0.03s

 

Important: To summarize this section, you should ALWAYS have a WHERE-clause for the hour-column in your query. Note that you can also define a range of hours.

4. Storing Results & Tools

Since you are accessing the Impala shell via SSH, the result set can only be displayed in your terminal as there is no programming language binding for that. However, oftentimes you want to retrieve the resultset for postprocessing or generating plots. One way to do that, is piping the whole output to a file and then transform it, e.g. into a CSV file. Dumping everything to a file can be done e.g. using the Linux command script:

script -f -c "ssh -p 2230 -l USERNAME data.opensky-network.org" log.txt

This will produce a file called log.txt which contains all the output you see on your terminal during the Impala session. To convert it into a CSV, you could run something like this in your terminal (if you're using bash):

cat log.txt | grep "^|.*" | sed -e 's/\s*|\s*/,/g' -e 's/^,\|,$//g' -e 's/NULL//g' | awk '!seen[$0]++' >> log.csv

Other tools

  • ONERA's Xavier Olive wrote a Python script for extracting trajectories from the Impala database. It is available in this github repository.

5. Useful Query Examples

  • Get 1 position (the latest one) per aircraft every minute:
    SELECT * FROM state_vectors_data4 v JOIN (SELECT QUOTIENT(time, 60) AS minute, MAX(time) AS recent, icao24 FROM state_vectors_data4 WHERE hour=1480762800 GROUP BY icao24, minute) AS m ON v.icao24=m.icao24 AND v.time=m.recent WHERE v.hour=1480762800;
  • Get the last position of each aircraft seen by the receiver with the serial 1344390019 during the hour 1480762800:
    SELECT * FROM state_vectors_data4 v JOIN (SELECT MAX(time) AS recent, count(*) as cnt, icao24 FROM state_vectors_data4, state_vectors_data4.serials s WHERE hour=1480762800 AND s.ITEM=1344390019 GROUP BY icao24) AS m ON v.icao24=m.icao24 AND v.time=m.recent WHERE cnt>30 AND v.hour=1480762800;

    Note: the filter "AND cnt>30" is necessary because the data provided by the receivers is noisy and in some cases receivers get associated with state vectors based on noise. We do our best to prevent that but it still happens. Filters on the count are just a band-aid and do not work 100%, but they perform quite well. You might also want to apply filters on latitude and longitude to further reduce noise.

  • Count aircraft which arrived or departed or crossed Frankfurt airport during a certain hour:
    SELECT COUNT(DISTINCT icao24) FROM state_vectors_data4 WHERE lat<=50.07 AND lat>=49.98 AND lon<=8.62 AND lon>=8.48 AND hour=1493892000;

6. Further Reading

7. Conclusion

Now you are good to go! Have fun exploring our dataset and let us know if you find interesting or peculiar things. If you have any feedback or questions regarding this guide, just contact me (schaefer[at]opensky-network.org).

8. Credits

We would like to thank the following people for their feedback, corrigenda, and otherwise useful comments that helped improve this guide:

  • Thomas Quilling for his note on the ICAO 24-bit ID's dependency on the registration of an airframe.
  • EUROCONTROL's Enrico Spinielli for his useful improvement of the bash command for storing the results.
  • ONERA's Xavier Olive for his beautiful Python tool for extracting trajectories from Impala