I love projects where I’m learning something new, don’t you? When I get in a project, I want to be learning new things on a regular basis, otherwise I grow restless. Lately, I’ve been on a pretty big project for a client designing a backend system (GraphQL + Node.js + Postgres). I can’t go into too much detail, but we’ve leveraged Postgres as our main data store and in the process we’ve had to level up on our Postgres skills. This post consists of random notes about PG that I didn’t know a few months ago.
In other databases, you have to declare
varchar(255), if you want a variable string column that has a max of 255 characters. You always have to be careful that you don’t exceed this restriction, otherwise the database will alert you in no uncertain terms (aka, it will blow up). With PG, you can forget all that trash, because under the hood, varchar and text are the same. Yeah, I know, crazy.
Tip: There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead. – PG docs
Lesson Learned: Use
TEXT, unless you have a specific need to limit the characters
CREATE TABLE people ( first_name TEXT NOT NULL, last_name TEXT NOT NULL );
The debate rages on, “To use ENUMs, or NOT use ENUMs”. When you use ENUMs, data integrity is enforced, this is a good thing. Think about allowing a free form
TEXT field like colors. You might get values like this in your database.
Grey GRAY gray grey
If your solution is “the app will lower/upper case the value”, then you are essentially making your app the only consumer of the database. What happens when another process starts using your database? I view this as a very limited database design mindset when you only view the database as being the backend to one type of frontend. You should design your database as a database that exists to serve and protect the data it houses. Too many times I see developers treating their database as dumb data stores, and hold much of the logic in the app that could easily reside in the database. When you go down this route, you may be leveraging your strengths (app development), but losing out on very powerful tools that reside in the database engine. It also requires you to implement logic in multiple places rather than just depend on an intelligent data store.
These are functions your database can easily handle without breaking a sweat
- Transforms: lowercase / uppercase values on insert ✅
- Atomic Updates: automatically update a timestamp e.g.
- Constraints: For example, if you only want to allow three addresses for a person ✅
- Defaults: Values, create them in the database and allow them to be overriden by your app ✅
OK, that was a rabbit trail, back to the main subject ENUMs
Let’s look at a couple of notes about ENUM types. First, we’ll start with this ENUM type.
CREATE TYPE phone_type as ENUM ('MOBILE', 'WORK', 'HOME')
You can add to an ENUM type, but not remove. Add another enum value with
ALTER TYPE item_status ADD VALUE IF NOT EXISTS 'OTHER'. Just know that
OTHERis now permenantly in that type. It cann’t be easily removed.
Altering an ENUM type must be done outside of a transaction. Nuff said.
Get a list of values in an ENUM as rows in a query
Use ENUMs to order by. Since ENUMs are really just integers under the hood, you and order by them. This means the order they are declared is importent (from least to greatest). So now you can order by ENUM like
SELECT * FROM phones ORDER BY phone_type
SELECT * FROM movies WHERE rating <= PG-13
This is part of learning what you really want from your database. We had a many to many table, let’s call it
people_addresses for example. We wanted to know the first entry and then set the
is_default = true for that record. Sounds simple right?
The first attempt solving this solution was to have the application do a check for any record for a peron’s ID in the
people_addressestable, if no record was found, then we set
is_default = truewhen inserting that relation. It worked, but in reality, was subject to race conditions from other applications. It would work fine enough for low volume sites.
One of our talented engineers decided to use a table trigger and automagically set the
is_defaulttag. So this was the trigger that seemed to work during testing
CREATE OR REPLACE FUNCTION people_address_is_default() RETURNS trigger AS $$ DECLARE is_default boolean; BEGIN is_default := (SELECT NOT EXISTS(SELECT person FROM people_addresses WHERE person = NEW.person)); NEW.is_default = is_default; RETURN NEW; END; $$ language 'plpgsql';
You’ll notice that the table trigger will check if another record exists and automatically set the
is_defaultvalue. Pretty slick. The only problem we saw was another race condition (see a pattern). When the client would call the same endpoint with different values. Each insert ran in it’s own transaction and therefor the
is_defaultwas set to true.
Stepping back and looking at the problem, we really didn’t care about an
is_default, we mostly cared about about detecting the first insert. This can be done with a timestamp. So instead we decided to use a
created_atfield with a default value of
now(). The only drawback now is we have to derive the
is_defaultvalue by looking at other records and determining if this is the oldest timestamp.
Postgres timestamp precision? During the discussions about using a timestamp for determining the first row inserted, we wondered, “What precision does PG use for storing timestamps? Milliseconds? Nanoseconds?”. The correct answer is a microsecond. Just for reference, 1 millisecond (ms) = 1000 microseconds (μs). We should be pretty safe using microseconds to determine which record was inserted first.
From their own documentation on Date/Time Types
Note: When timestamp values are stored as eight-byte integers (currently the default), microsecond precision is available over the full range of values.
UNION ALL to be a powerful way to combine multiple tables together that had similar types of content into a single view. For argument sake, we didn’t go the route of complete normalization because of some difficultly setting up the data. So we would create a view that looked like this (a poor example, but simple none the less).
CREATE VIEW people ( id first_name last_name date_of_birth ) AS SELECT id, first_name, last_name, date_of_birth FROM adults UNION All SELECT id, first_name, last_name, date_of_birth FROM children UNION All SELECT id, first_name, last_name, date_of_birth FROM teachers
This worked well to have a unified table for the app to connect and get the data it needed, but has some drawbacks.
- This view is not going to be performant since we just pull several tables together. It would be hard to maintain a good index on this type of structure.
- When this view gets large, it’s a pain in the butt to added/remote fields since you need to recreate the entire view again. It’s error prone, and easy to accidentally remove a field.
- It’s a short cut to good database design (something that can suffer when moving quickly)
This pattern works well as you are discovering your data requirements, but should moved to more solid structures as your understanding of how data interacts developes.
I’m going to breifly touch on
IDs, but this subject deserves it’s own blog post (look for it soon).
- Most concrete table types should have an
idfield. It’s just easier to lookup, select, join, and delete.
- NEVER use auto-increment
- Did you read that correctly? NEVER use auto-increment. It works fine when you have a small number of tables, and you’re the only one developing on a database. But as soon as you need to create pre-existing seed data; dump and import; or create unique ids across shards; auto-increment will become the bane of your existence. I’m a big fan of UUIDs, a big fan.
- Naming your
IDs. I think it’s too meta to name your id’s like
person_id. I worked on project once where all the tables had fields named
account.account_created_at. Incorporating the table name in the field is super lame.
That’s all for now. Look for new blog posts soon that will dive into performance and that promised field naming post about IDs.