Using Distinct

The DISTINCT keyword has a bunch of interesting use-cases in ANSI SQL and offers some additional functionality when used in PostgreSQL. Let's dive in and take a look with an example (you can find the full source code to reproduce this example on GitHub here)

The Setup

We've got several traffic cameras setup near roads and have been capturing license plates for cars driving by. The captures are stored in the table captures.

select *
from captures;
camera_id|capture_ts         |license|
---------|-------------------|-------|
        1|2018-12-01 04:05:06|YLE395 |
        1|2018-12-03 12:17:35|374HQR |
        2|2018-12-11 16:01:32|1NC1QD |
        3|2018-12-02 18:55:00|CC73IQ |
        1|2018-12-10 12:20:10|VV59JO |
        2|2018-12-07 07:12:06|AA27IJ |
        2|2018-12-07 03:19:40|YY17HC |
        1|2018-12-04 22:10:15|NR422L |
        1|2018-12-02 08:22:13|YLE395 |
        3|2018-12-01 17:01:48|1NC1QD |

10 rows

SELECT DISTINCT with one column

We want to obtain a list of all the unique license plates that we've captured. We can use SELECT DISTINCT here.

select distinct license
from captures;
license|
-------|
CC73IQ |
VV59JO |
AA27IJ |
NR422L |
YLE395 |
374HQR |
1NC1QD |
YY17HC |

8 rows

The duplicate entries for YLE395 and 1NC1QD have been removed.

SELECT DISTINCT with multiple columns

Notice that camera 1 caught YLE395 twice. Once on 1st December, and again on the 2nd December. SELECT DISTINCT can be used to remove duplicate rows from the output where multiple columns match up. So we can obtain a list of all the unique combinations of camera and license plate as follows:

select distinct camera_id, license
from captures;
camera_id|license|
---------|-------|
        3|CC73IQ |
        1|374HQR |
        1|YLE395 |
        3|1NC1QD |
        2|1NC1QD |
        2|YY17HC |
        2|AA27IJ |
        1|NR422L |
        1|VV59JO |

9 rows

We now have 9 rows instead of 10. The important thing to note is that SELECT DISTINCT removes duplicate rows from what you've selected, so if you specify more than one column in your SELECT clause, all those columns have to match for the row to be considered a duplicate. Here there was only one row that had the same camera_id and license. This is why you only include DISTINCT once in your SELECT clause — you don't add it for each column.

Using DISTINCT inside aggregate functions

Let's say you want to get a count of the number of unique license plates you've collected. You can use the COUNT(expression) function which returns the number of rows where expression is not NULL and combine it with DISTINCT.

select count(distinct license)
from captures;
count|
-----|
    8|

1 row

You can use DISTINCT inside all the other aggregate functions like SUM, AVG, etc though more commonly you'll just use it with COUNT.

PostgreSQL: DISTINCT ON

PostgreSQL supports some additionally functionality in the form of DISTINCT ON which gives you the ability to keep the first row in a set of rows. To demonstrate why this is useful let's say we're trying to find for each camera, the most recent license plate it captured. We can do this with:

select distinct on (camera_id) camera_id, capture_ts, license
from captures
order by camera_id, capture_ts desc;
camera_id|capture_ts         |license|
---------|-------------------|-------|
        1|2018-12-10 12:20:10|VV59JO |
        2|2018-12-11 16:01:32|1NC1QD |
        3|2018-12-02 18:55:00|CC73IQ |

3 rows

Let's break down how this works. First, let's remove only the DISTINCT ON clause:

select camera_id, capture_ts, license
from captures
order by camera_id, capture_ts desc;
camera_id|capture_ts         |license|
---------|-------------------|-------|
        1|2018-12-10 12:20:10|VV59JO | -- this row is kept
        1|2018-12-04 22:10:15|NR422L |
        1|2018-12-03 12:17:35|374HQR |
        1|2018-12-02 08:22:13|YLE395 |
        1|2018-12-01 04:05:06|YLE395 |
        2|2018-12-11 16:01:32|1NC1QD | -- this row is kept
        2|2018-12-07 07:12:06|AA27IJ |
        2|2018-12-07 03:19:40|YY17HC |
        3|2018-12-02 18:55:00|CC73IQ | -- this row is kept
        3|2018-12-01 17:01:48|1NC1QD |

10 rows

What adding DISTINCT ON (camera_id) does is keep that first encountered row for each unique camera_id as you can see in the results above. And because of the ORDER BY clause, the kept rows contain the most recent license captured for each camera. So ORDER BY must be used and additionally the DISTINCT ON expression must match the leftmost ORDER BY expression (camera_id in our case). Imagine if you didn't order by camera_id first - the output row ordering would be unpredictable and so when DISTINCT ON selects the first row for each camera you'd always get different results.

DISTINCT ON is a powerful tool. Without DISTINCT ON we would have to use either a more complex subquery or window function to achieve the same thing.

As with all my blog posts, you can find the full source code on GitHub here.