Postgres Database Design Notes

I love projects where I’m learning something new, don’t you? When I’m on a project, I want to be learning new things on a regular basis, otherwise I grow restless. Lately, I’ve been on a pretty big project for a client, designing a backend system (GraphQL + Node.js + Postgres). I can’t go into too much detail, but we’ve leveraged Postgres as our main data store, and in the process we’ve had to level up our Postgres skills. This post is a collection of random notes about PG that I didn’t know a few months ago.


String types (varchar versus text)

In other databases, you have to declare varchar(255) if you want a variable-length string column with a max of 255 characters. You always have to be careful not to exceed this restriction, otherwise the database will alert you in no uncertain terms (aka, it will blow up). With PG, you can forget all that trash, because under the hood, varchar (without a length modifier) and text are stored the same way. Yeah, I know, crazy.

Tip: There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead. – PG docs

Lesson Learned: Use TEXT, unless you have a specific need to limit the number of characters.

CREATE TABLE people (
    first_name        TEXT NOT NULL,
    last_name         TEXT NOT NULL
);
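
If you ever do need a hard limit, one option (a sketch, with made-up table and column names) is a CHECK constraint on a TEXT column; unlike varchar(n), the limit can later be raised or dropped with a cheap constraint change instead of altering the column type:

```sql
CREATE TABLE usernames (
    -- enforce a max length without giving up TEXT
    name TEXT NOT NULL CHECK (char_length(name) <= 50)
);
```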

ENUMs

The debate rages on: “To use ENUMs, or NOT to use ENUMs”. When you use ENUMs, data integrity is enforced, and that is a good thing. Think about allowing a free-form TEXT field like colors. You might get values like this in your database:

Grey
GRAY
gray
grey

If your solution is “the app will lower/upper case the value”, then you are essentially making your app the only consumer of the database. What happens when another process starts using your database? I view it as a very limited database design mindset to see the database as nothing more than the backend to one type of frontend. You should design your database as a database that exists to serve and protect the data it houses. Too many times I see developers treating their database as a dumb data store, holding logic in the app that could easily reside in the database. When you go down this route, you may be playing to your strengths (app development), but you’re losing out on very powerful tools that reside in the database engine. It also requires you to implement logic in multiple places rather than depend on an intelligent data store.

These are functions your database can easily handle without breaking a sweat:

  • Transforms: lowercase / uppercase values on insert.
  • Atomic Updates: automatically update a timestamp e.g. updated_at.
  • Constraints: For example, if you only want to allow three addresses for a person.
  • Defaults: create default values in the database and allow them to be overridden by your app.
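
As a sketch of a few of these (table and trigger names are made up for illustration): a trigger that lowercases a value on the way in and keeps updated_at current, plus a database-side default:

```sql
CREATE TABLE colors (
    name       TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),  -- Default: set in the database
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Transform + atomic update: normalize the value and bump the timestamp
CREATE OR REPLACE FUNCTION colors_before_write()
RETURNS trigger AS $$
BEGIN
  NEW.name := lower(NEW.name);   -- no more Grey/GRAY/gray/grey
  NEW.updated_at := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER colors_before_write_trigger
  BEFORE INSERT OR UPDATE ON colors
  FOR EACH ROW EXECUTE PROCEDURE colors_before_write();
```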

OK, that was a rabbit trail. Back to the main subject: ENUMs.

Let’s look at a couple of notes about ENUM types. First, we’ll start with this ENUM type.

CREATE TYPE phone_type AS ENUM ('MOBILE', 'WORK', 'HOME');

  1. You can add values to an ENUM type, but not remove them. Add another enum value with ALTER TYPE phone_type ADD VALUE IF NOT EXISTS 'OTHER'. Just know that OTHER is now permanently in that type. It can’t be easily removed.

  2. Altering an ENUM type must be done outside of a transaction (PostgreSQL 12 relaxed this: ALTER TYPE ... ADD VALUE can now run inside a transaction block, but the new value can’t be used until the transaction commits). Nuff said.
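
    Both points in one sketch, using the phone_type type from above:

    ```sql
    -- Adding a value is easy (and idempotent with IF NOT EXISTS)...
    ALTER TYPE phone_type ADD VALUE IF NOT EXISTS 'OTHER';

    -- ...but there is no ALTER TYPE ... DROP VALUE. Removing 'OTHER'
    -- means creating a replacement type, migrating every column that
    -- uses the old one, and dropping it.
    ```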

  3. Get a list of values in an ENUM as rows in a query

    SELECT unnest(enum_range(NULL::phone_type));
  4. Use ENUMs to order by. Since ENUMs are really just integers under the hood, you can order by them. This means the order in which the values are declared is important (from least to greatest). So now you can order by ENUM like

    SELECT * FROM phones ORDER BY phone_type;

    or

    SELECT * FROM movies WHERE rating <= 'PG-13';
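
    For that rating comparison to work, the ENUM has to be declared from least to greatest. A hypothetical declaration (the movies table and its rating column aren’t defined elsewhere in this post):

    ```sql
    -- Declaration order is sort order, so rating <= 'PG-13'
    -- matches G, PG, and PG-13
    CREATE TYPE movie_rating AS ENUM ('G', 'PG', 'PG-13', 'R', 'NC-17');
    ```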

Setting an is_default field

It's a trap

This is part of learning what you really want from your database. We had a many-to-many table, let’s call it people_addresses for example. We wanted to know the first entry and then set is_default = true for that record. Sounds simple, right?

  1. The first attempt at solving this was to have the application check for any record with a person’s ID in the people_addresses table; if no record was found, we set is_default = true when inserting that relation. It worked, but in reality it was subject to race conditions from other applications. It would work well enough for low-volume sites.

  2. One of our talented engineers decided to use a table trigger and automagically set the is_default flag. This was the trigger that seemed to work during testing:

        CREATE OR REPLACE FUNCTION people_address_is_default()
        RETURNS trigger AS $$
        DECLARE
          is_default boolean;
        BEGIN
          is_default := (SELECT NOT EXISTS(SELECT person FROM people_addresses WHERE person = NEW.person));
          NEW.is_default := is_default;
          RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        -- Wired up with a BEFORE INSERT trigger:
        CREATE TRIGGER people_address_is_default_trigger
          BEFORE INSERT ON people_addresses
          FOR EACH ROW EXECUTE PROCEDURE people_address_is_default();

    You’ll notice that the trigger function checks whether another record already exists for the person and sets the is_default value automatically. Pretty slick. The only problem we saw was another race condition (see a pattern?): when the client called the same endpoint concurrently with different values, each insert ran in its own transaction, neither saw the other’s row, and therefore is_default was set to true on both.

  3. Stepping back and looking at the problem, we really didn’t care about an is_default flag; we mostly cared about detecting the first insert. This can be done with a timestamp, so instead we decided to use a created_at field with a default value of now(). The only drawback is that we now have to derive the is_default value by looking at the other records and determining whether this one has the oldest timestamp.

    Postgres timestamp precision? During the discussions about using a timestamp to determine the first row inserted, we wondered, “What precision does PG use for storing timestamps? Milliseconds? Nanoseconds?”. The correct answer is microseconds. Just for reference, 1 millisecond (ms) = 1000 microseconds (μs). We should be pretty safe using microseconds to determine which record was inserted first.

    From their own documentation on Date/Time Types

    Note: When timestamp values are stored as eight-byte integers (currently the default), microsecond precision is available over the full range of values.
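
    With created_at in place, is_default can be derived in a query. A sketch (the person and created_at column names are carried over from the trigger example above):

    ```sql
    -- One row per person: the earliest insert is the default address
    SELECT DISTINCT ON (person) *
    FROM people_addresses
    ORDER BY person, created_at;

    -- Or compute the flag inline for every row with a window function
    SELECT *,
           created_at = min(created_at) OVER (PARTITION BY person) AS is_default
    FROM people_addresses;
    ```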