Four PostgreSQL Tips
21 Feb 2016PostgreSQL is an open source object-relational database used to power many production web applications. While many web applications interact with relational databases through Object Relational Mappers (ORM), direct SQL queries via a command line interface or graphical client are still common. When writing these queries, these four tips may come in handy.
All examples will assume the presence of a simple users table:
Column | Type
---------------+-----------------------------
id | integer
first_name | character varying(255)
last_name | character varying(255)
email | character varying(255)
created_at | timestamp without time zone
updated_at | timestamp without time zone
password | character varying
registered | boolean
registered_at | timestamp without time zone
Indexes:
"users_pkey" PRIMARY KEY, btree (id)
1. Finding Duplicate Rows
A common mechanism for defending against duplicate rows in database tables are unique indexes. However, at the time of table creation, a unique index or two may have been forgotten. Duplicates in a table must be removed before a unique index may be added.
A great way to detect duplicates in PostgreSQL is by using window functions. Window functions are similar to aggregates; but, instead of grouping rows for the response, it maintains each row’s individuality.
Desired query: Find all duplicate users
with the same first_name
, last_name
, and email
, returning duplicate ids
only (do not return the oldest id
).
SELECT id from (
SELECT id,
ROW_NUMBER() OVER(
PARTITION BY first_name,
last_name,
email
ORDER BY id
) AS user_row_number
FROM users
) duplicates
WHERE
duplicates.user_row_number > 1
This query will identify all the rows of the users
table which share the same defined columns and return the primary key (id
) of rows after the first via the duplicates.user_row_number > 1
condition. The result of this query can then be fed into a DELETE
query to remove the duplicates.
The ROW_NUMBER()
built-in function returns an incremented value assigned to each row that matches the criteria in the provided OVER
clause. This OVER
clause is followed by the window function: PARTITION
, which divides matching rows into groups or partitions using the columns or functions which follow it. Learn more about window functions at the official PostgreSQL documentation page.
2. interval
Ever had to scope the results of a database query to a certain time range? So has every developer ever. Luckily, PostgreSQL provides a very readable convenience function: interval
.
If a rolling count of user
sign ups in the last 24 hours is desired, the following query makes it possible:
SELECT *
FROM users
WHERE created_at >= now() - interval '1 day'
The interval
function accepts a variety of input types. Days, Hours, Minutes, Weeks, and more options are available.
interval
subtraction can be used on any timestamp
:
SELECT *
FROM users
WHERE created_at >=
'2016-02-19 12:00:00'::timestamp - interval '3 weeks'
The ::
notation casts the string 2016-02-19 12:00:00
to a timestamp
, a requirement when manipulating time ranges with interval
.
3. Remote Query Execution
A very powerful feature of PostreSQL, and more specifically the psql
command line interface, is the ability to run remote queries. With the --command
or -c
flag, a query may be passed to any PostgreSQL server and executed.
A unique use for this feature is the ability to transfer a query’s results directly from one environment to another.
An extensive analysis of the users
table for all users
that signed up in January is requested. This analysis has the potential to be very taxing on a production database, resulting in adverse effects on end users. Additionally, no follower or read only database exists to alleviate this problem. Production data must be transfered to a local environment so that the analysis can run side effect free.
To accomplish this, the psql
client has the ability to execute a query and return its result to stdout
. Conversely, file contents or query results from stdin
may also be read within the psql
client.
Utilizing both aspects of this behaviour could look something like:
psql <REMOTE POSTGRESQL URL> \
-c "copy (
SELECT *
FROM users
WHERE date_trunc('month', created_at) = '2016-01-01'
) to stdout" \
| psql <LOCAL POSTGRESQL DATABASE> \
-c "copy users from stdin"
As an alternative to pg_dump
, this pattern is a quick solution for loading specific data from one database to another.
The copy
command comes in two variants: copy
, which will be run by the PostgreSQL backend (user “postgres”) and \copy
which runs as the current user. Depending on permissions and what copy
is attempting to do, one may be more appropriate than the other.
4. Referencing Columns by SELECT
Position
If one were to issue a SELECT
query with a GROUP BY
or ORDER BY
clause, the columns in these clauses can be referenced by their position in the SELECT
section.
Desired query: Retrieve a count of all users which share the same first_name
and were created_at
on the same hour of the same day then order by the hour of creation and then first_name
.
Without numeric referencing
SELECT first_name, date_trunc('hour', created_at), count(*)
FROM users
GROUP BY first_name, date_trunc('hour', created_at)
ORDER BY date_trunc('hour', created_at), first_name;
This query will satisfy the request, returning an ordered list users
data grouped by their first_name
and hour of the created_at
.
With numeric referencing
Starting with 1, each value in the SELECT
clause can be referenced by its numerical position. This can help reduce typos and copy pasting issues:
SELECT first_name, date_trunc('hour', created_at), count(*)
FROM users
GROUP BY 1, 2
ORDER BY 2, 1;
As queries become more complicated with references or additional aggregate functions, numerical referencing becomes even more valuable.
Use with Caution
While handy, referencing columns by their position is not always appropriate. Adding columns to the SELECT
statement can invalidate previous referencing and cause unwanted behaviour. Additionally, with a significantly large number of columns in a SELECT
statement, readability can suffer when referring to them solely by numbers.
As with anything, the use of numerical referencing should be deliberate and thoughtful.