Four PostgreSQL Tips

21 Feb 2016

PostgreSQL is an open source object-relational database used to power many production web applications. While many web applications interact with relational databases through Object Relational Mappers (ORM), direct SQL queries via a command line interface or graphical client are still common. When writing these queries, these four tips may come in handy.

All examples will assume the presence of a simple users table:

    Column     |            Type
---------------+-----------------------------
 id            | integer
 first_name    | character varying(255)
 last_name     | character varying(255)
 email         | character varying(255)
 created_at    | timestamp without time zone
 updated_at    | timestamp without time zone
 password      | character varying
 registered    | boolean
 registered_at | timestamp without time zone
Indexes:
    "users_pkey" PRIMARY KEY, btree (id)

1. Finding Duplicate Rows

A common mechanism for defending against duplicate rows in database tables are unique indexes. However, at the time of table creation, a unique index or two may have been forgotten. Duplicates in a table must be removed before a unique index may be added.

A great way to detect duplicates in PostgreSQL is by using window functions. Window functions are similar to aggregates; but, instead of grouping rows for the response, it maintains each row’s individuality.

Desired query: Find all duplicate users with the same first_name, last_name, and email, returning duplicate ids only (do not return the oldest id).

SELECT id from (
  SELECT id,
  ROW_NUMBER() OVER(
    PARTITION BY first_name,
                 last_name,
                 email
              ORDER BY id
    ) AS user_row_number
  FROM users
) duplicates
WHERE
duplicates.user_row_number > 1

This query will identify all the rows of the users table which share the same defined columns and return the primary key (id) of rows after the first via the duplicates.user_row_number > 1 condition. The result of this query can then be fed into a DELETE query to remove the duplicates.

The ROW_NUMBER() built-in function returns an incremented value assigned to each row that matches the criteria in the provided OVER clause. This OVER clause is followed by the window function: PARTITION, which divides matching rows into groups or partitions using the columns or functions which follow it. Learn more about window functions at the official PostgreSQL documentation page.

2. `interval`

Ever had to scope the results of a database query to a certain time range? So has every developer ever. Luckily, PostgreSQL provides a very readable convenience function: interval.

If a rolling count of user sign ups in the last 24 hours is desired, the following query makes it possible:

SELECT *
FROM users
WHERE created_at >= now() - interval '1 day'

The interval function accepts a variety of input types. Days, Hours, Minutes, Weeks, and more options are available.

interval subtraction can be used on any timestamp:

SELECT *
FROM users
WHERE created_at >=
  '2016-02-19 12:00:00'::timestamp - interval '3 weeks'

The :: notation casts the string 2016-02-19 12:00:00 to a timestamp, a requirement when manipulating time ranges with interval.

3. Remote Query Execution

A very powerful feature of PostreSQL, and more specifically the psql command line interface, is the ability to run remote queries. With the --command or -c flag, a query may be passed to any PostgreSQL server and executed.

A unique use for this feature is the ability to transfer a query’s results directly from one environment to another.

An extensive analysis of the users table for all users that signed up in January is requested. This analysis has the potential to be very taxing on a production database, resulting in adverse effects on end users. Additionally, no follower or read only database exists to alleviate this problem. Production data must be transfered to a local environment so that the analysis can run side effect free.

To accomplish this, the psql client has the ability to execute a query and return its result to stdout. Conversely, file contents or query results from stdin may also be read within the psql client.

Utilizing both aspects of this behaviour could look something like:

psql <REMOTE POSTGRESQL URL> \
-c "copy (
      SELECT *
      FROM users
      WHERE date_trunc('month', created_at) = '2016-01-01'
    ) to stdout" \
| psql <LOCAL POSTGRESQL DATABASE> \
-c "copy users from stdin"

As an alternative to pg_dump, this pattern is a quick solution for loading specific data from one database to another.

The copy command comes in two variants: copy, which will be run by the PostgreSQL backend (user “postgres”) and \copy which runs as the current user. Depending on permissions and what copy is attempting to do, one may be more appropriate than the other.

4. Referencing Columns by `SELECT` Position

If one were to issue a SELECT query with a GROUP BY or ORDER BY clause, the columns in these clauses can be referenced by their position in the SELECT section.

Desired query: Retrieve a count of all users which share the same first_name and were created_at on the same hour of the same day then order by the hour of creation and then first_name.

Without numeric referencing

SELECT first_name, date_trunc('hour', created_at), count(*)
FROM users
GROUP BY first_name, date_trunc('hour', created_at)
ORDER BY date_trunc('hour', created_at), first_name;

This query will satisfy the request, returning an ordered list users data grouped by their first_name and hour of the created_at.

With numeric referencing

Starting with 1, each value in the SELECT clause can be referenced by its numerical position. This can help reduce typos and copy pasting issues:

SELECT first_name, date_trunc('hour', created_at), count(*)
FROM users
GROUP BY 1, 2
ORDER BY 2, 1;

As queries become more complicated with references or additional aggregate functions, numerical referencing becomes even more valuable.

Use with Caution

While handy, referencing columns by their position is not always appropriate. Adding columns to the SELECT statement can invalidate previous referencing and cause unwanted behaviour. Additionally, with a significantly large number of columns in a SELECT statement, readability can suffer when referring to them solely by numbers.

As with anything, the use of numerical referencing should be deliberate and thoughtful.

A Year of Commits