Four PostgreSQL Tips21 Feb 2016
PostgreSQL is an open source object-relational database used to power many production web applications. While many web applications interact with relational databases through Object Relational Mappers (ORM), direct SQL queries via a command line interface or graphical client are still common. When writing these queries, these four tips may come in handy.
All examples will assume the presence of a simple users table:
Column | Type ---------------+----------------------------- id | integer first_name | character varying(255) last_name | character varying(255) email | character varying(255) created_at | timestamp without time zone updated_at | timestamp without time zone password | character varying registered | boolean registered_at | timestamp without time zone Indexes: "users_pkey" PRIMARY KEY, btree (id)
1. Finding Duplicate Rows
A common mechanism for defending against duplicate rows in database tables are unique indexes. However, at the time of table creation, a unique index or two may have been forgotten. Duplicates in a table must be removed before a unique index may be added.
A great way to detect duplicates in PostgreSQL is by using window functions. Window functions are similar to aggregates; but, instead of grouping rows for the response, it maintains each row’s individuality.
Desired query: Find all duplicate
users with the same
ids only (do not return the oldest
SELECT id from ( SELECT id, ROW_NUMBER() OVER( PARTITION BY first_name, last_name, email ORDER BY id ) AS user_row_number FROM users ) duplicates WHERE duplicates.user_row_number > 1
This query will identify all the rows of the
users table which share the same defined columns and return the primary key (
id) of rows after the first via the
duplicates.user_row_number > 1 condition. The result of this query can then be fed into a
DELETE query to remove the duplicates.
ROW_NUMBER() built-in function returns an incremented value assigned to each row that matches the criteria in the provided
OVER clause. This
OVER clause is followed by the window function:
PARTITION, which divides matching rows into groups or partitions using the columns or functions which follow it. Learn more about window functions at the official PostgreSQL documentation page.
Ever had to scope the results of a database query to a certain time range? So has every developer ever. Luckily, PostgreSQL provides a very readable convenience function:
If a rolling count of
user sign ups in the last 24 hours is desired, the following query makes it possible:
SELECT * FROM users WHERE created_at >= now() - interval '1 day'
interval function accepts a variety of input types. Days, Hours, Minutes, Weeks, and more options are available.
interval subtraction can be used on any
SELECT * FROM users WHERE created_at >= '2016-02-19 12:00:00'::timestamp - interval '3 weeks'
:: notation casts the string
2016-02-19 12:00:00 to a
timestamp, a requirement when manipulating time ranges with
3. Remote Query Execution
A very powerful feature of PostreSQL, and more specifically the
psql command line interface, is the ability to run remote queries. With the
-c flag, a query may be passed to any PostgreSQL server and executed.
A unique use for this feature is the ability to transfer a query’s results directly from one environment to another.
An extensive analysis of the
users table for all
users that signed up in January is requested. This analysis has the potential to be very taxing on a production database, resulting in adverse effects on end users. Additionally, no follower or read only database exists to alleviate this problem. Production data must be transfered to a local environment so that the analysis can run side effect free.
To accomplish this, the
psql client has the ability to execute a query and return its result to
stdout. Conversely, file contents or query results from
stdin may also be read within the
Utilizing both aspects of this behaviour could look something like:
psql <REMOTE POSTGRESQL URL> \ -c "copy ( SELECT * FROM users WHERE date_trunc('month', created_at) = '2016-01-01' ) to stdout" \ | psql <LOCAL POSTGRESQL DATABASE> \ -c "copy users from stdin"
As an alternative to
pg_dump, this pattern is a quick solution for loading specific data from one database to another.
copy command comes in two variants:
copy, which will be run by the PostgreSQL backend (user “postgres”) and
\copy which runs as the current user. Depending on permissions and what
copy is attempting to do, one may be more appropriate than the other.
4. Referencing Columns by
If one were to issue a
SELECT query with a
GROUP BY or
ORDER BY clause, the columns in these clauses can be referenced by their position in the
Desired query: Retrieve a count of all users which share the same
first_name and were
created_at on the same hour of the same day then order by the hour of creation and then
Without numeric referencing
SELECT first_name, date_trunc('hour', created_at), count(*) FROM users GROUP BY first_name, date_trunc('hour', created_at) ORDER BY date_trunc('hour', created_at), first_name;
This query will satisfy the request, returning an ordered list
users data grouped by their
first_name and hour of the
With numeric referencing
Starting with 1, each value in the
SELECT clause can be referenced by its numerical position. This can help reduce typos and copy pasting issues:
SELECT first_name, date_trunc('hour', created_at), count(*) FROM users GROUP BY 1, 2 ORDER BY 2, 1;
As queries become more complicated with references or additional aggregate functions, numerical referencing becomes even more valuable.
Use with Caution
While handy, referencing columns by their position is not always appropriate. Adding columns to the
SELECT statement can invalidate previous referencing and cause unwanted behaviour. Additionally, with a significantly large number of columns in a
SELECT statement, readability can suffer when referring to them solely by numbers.
As with anything, the use of numerical referencing should be deliberate and thoughtful.