Search and destroy duplicate rows in PostgreSQL

I was working on a project where we needed to aggregate information on employees from 10 different tables and make the resulting table clear (no duplicate rows), containing full information on people working in the big company.

While making this I understood that the emergence of duplicates (or duplicate rows) is inevitable when you work with a large amount of data aggregating several tables into one. Fortunately PostgreSQL has some features that are extremely useful while working with detection and elimination of duplicates.

I want to put your attention on these features and help you to never have problems with duplicates.

Duplicate or Duplicate row is a row in a table looking exactly or almost exactly like some another row (original row) in this table.
So we can deal with absolutely identical rows and almost identical rows. For example theirs ids can differ but all other properties are exactly the same.

So, what can you do with the duplicates?
For absolutely identical rows:

Find them
Delete them

For almost identical rows (identical except for one or more properties):

Combine information from duplicate rows into one row
Select one of the rows according to some criteria and delete the remaining ones.

That is what my article is about.

1) How to find duplicates?

Imagine you have a table containing some data on employees of a company. For example, in this table we are dealing with personal data about employees including their first name, last name, position, department and date of the beginning of a contract in these department on these position.

+----+-----------+-----------+------------+---------------+-------------+
| id | firstname | lastname  | startdate  | position      | department  |
+----+-----------+-----------+------------+---------------+-------------+
| 1  | Olivier   | Le Blanc  | 2010-03-01 | PDG           | RTM         |
| 2  | Maria     | Green     | 2016-06-01 | Intern        | STP/RMP     |
| 3  | Maria     | Green     | 2016-11-01 | RH            | STP/RMP     |
| 5  | Maria     | Green     | 2017-07-07 | DRH           | STP/RMP     |
| 4  | Paul      | Jones     | 2017-01-01 | Developer     | RTM/FMP     |
| 6  | Paul      | Jones     | 2017-06-01 | Project Chief | RTM/BSO     |
+----+-----------+-----------+------------+---------------+-------------+

In order to find duplicates we face two problems:

Count the number of rows in each group.
Find duplicate rows and theirs ids

Here is the fastest way to split rows into categories and to display those that have more than one row in it.

SELECT
  firstname,
  lastname,
  count(*)
FROM people
GROUP BY
  firstname,
  lastname
HAVING count(*) > 1;

+-----------+-----------+-------+
| firstname | lastname  | count |
+-----------+-----------+-------+
| Maria     | Green     |   3   |
| Paul      | Jones     |   2   |
+-----------+-----------+-------+

Count(*) counts the number of rows in each group.
In GROUP BY we can add the criterias (properties) by which we are looking for duplicates.
The result is a table (firstname, lastname, count) containing the properties according which the groups were defined and the number of rows per group.

Now we want to display duplicate rows with all information.

SELECT * FROM
  (SELECT *, count(*)
  OVER
    (PARTITION BY
      firstname,
      lastname
    ) AS count
  FROM people) tableWithCount
  WHERE tableWithCount.count > 1;

+----+-----------+----------+--------------+---------------+------------+---------+
| id | firstname | lastname |  startdate   | position      | department |  count  |
+----+-----------+----------+--------------+---------------+------------+---------+
| 2  | Maria     | Green    |  2016-06-01  | Intern        | STP/RMP    |    3    |
| 3  | Maria     | Green    |  2016-11-01  | RH            | STP/RMP    |    3    |
| 5  | Maria     | Green    |  2017-07-07  | DRH           | STP/RMP    |    3    |
| 4  | Paul      | Jones    |  2017-01-01  | Developer     | RTM/FMP    |    2    |
| 6  | Paul      | Jones    |  2017-06-01  | Project Chief | RTM/BSO    |    2    |
+----+-----------+----------+--------------+---------------+------------+---------+

PARTITION BY divides into groups and disposes all rows that are presented one after another.
Using PARTITION BY and ‘count > 1’ we can extract rows having duplicates.
The result is a table with columns (id, firstname, lastname, startdate, position, department, count) where we see all the duplicate rows including the original row.

By the way, through the PARTITION BY it is possible to simplify a whole class of tasks of analytics and billing. Instead of count(*) we can use any function like MEAN, MAX, MIN, SUM… and calculate a value per group. Mean salary is a good example.

2) How to delete duplicates?

The next question that inevitably arises: how to get rid of duplicates?
Here is the most efficient and fastest way to select data without unnecessary duplicates:

For absolutely identical rows:

SELECT DISTINCT * FROM people;

For almost identical rows:

SELECT DISTINCT ON (firstname, lastname) * FROM people

In the case of almost identical rows we need to list all properties on the basis of which we are looking for duplicates.

Thus, if we want to remove duplicated data from a table, we can use the following method :

DELETE FROM people WHERE people.id NOT IN 
(SELECT id FROM (
    SELECT DISTINCT ON (firstname, lastname) *
  FROM people));

For those who have read this article up to this point, here is a very cool tip of PostgreSQL to keep your code clean and readable.

WITH unique AS
    (SELECT DISTINCT ON (firstname, lastname) * FROM people)
DELETE FROM people WHERE people.id NOT IN (SELECT id FROM unique);

A very useful thing for complex queries where without named subqueries you can break your entire brain, conjuring with joins and brackets of subqueries. This incredibly useful feature is called Common Table Expression. By the way, there is a possibility to use multiple subqueries and one subquery can be based on another subquery. You can learn more here.

WITH some_name AS
 (SELECT DISTINCT ON (firstname, lastname) * FROM people),
some_another_name AS (SELECT id, position, department FROM some_name)
SELECT * FROM some_another_name WHERE ... ;

3) How to combine duplicate rows in one single row

Now we come to something more interesting. We want to make sure that each category has only one row but we don’t want to lose any information. The best way to do this is to remove duplicates while merging their records into one row. For example, we want to have only one row per person, but for which both position values and department values are written into one cell in the following way ‘value 1 / value 2 / …‘. This is easily accomplished by using the function of concatenation ‘string_agg’.

SELECT
  firstname,
  lastname,
  string_agg(position, ' / ') AS positions,
  string_agg(department, ' / ') AS departments
FROM people
GROUP BY
  firstname,
  lastname;

+-----------+----------+---------------------------+-----------------------------+
| firstname | lastname | positions                 | departments                 |
+-----------+----------+---------------------------+-----------------------------+
| Maria     | Green    | Intern / RH / DRH         | STP/RMP / STP/RMP / STP/RMP |
| Olivier   | Le Blanc | PDG                       | RTM                         |
| Paul      | Jones    | Developer / Project chief | RTM/FMP / RTM/BSO           |
+-----------+----------+---------------------------+-----------------------------+

GROUP BY separates data on categories.
string_agg() aggregates information from duplicate rows
No matter how many duplicates we have, in the end we’ll have just three rows with combined information.

4) How to delete unwanted duplicates and save exactly what you want

Now let’s imagine that for every employee there are two properties indicating the start date and the end date of a contract.
If some person changed several positions in the company, there are few corresponding lines in the table. For each employee we need to find a row corresponding to the last contract, not taking into account the previous contracts. That is, in fact, find a contract with the latest start date.

You can do this as follows:

SELECT id, firstname, lastname, startdate, position FROM
  (SELECT id, firstname, lastname, startdate, position,
     ROW_NUMBER() OVER 
(PARTITION BY (firstname, lastname) ORDER BY startdate DESC) rn
   FROM people
  ) tmp WHERE rn = 1;

+----+------------+----------+--------------+---------------+
| id | firstname  | lastname |  startdate   | position      |
+----+------------+----------+--------------+---------------+
| 5  | Maria      | Green    |  2017-07-07  | DRH           |
| 1  | Olivier    | Le Blanc |  2010-03-01  | PDG           |
| 6  | Paul       | Jones    |  2017-06-01  | Project Chief |
+----+------------+----------+--------------+---------------+

PARTITION BY divides into groups and ORDER BY sorts them by descending order.
ROW_NUMBER() assigns an integer number to every row in each category.
To have rows with the latest date we simply choose those with row number equals to 1.
Notice that we need to have some name for a query in brасkets. It’s better to use a common table expression WITH … AS