Smokes your problems, coughs fresh air.

Finding the right UUID generation algorithm for

Early during the development of the PostgreSQL backend for—that is our amazing managed MQTT hosting service!—, I decided on using UUIDs rather than auto-incrementing integers wherever I needed or wanted surrogate keys. I am one of those people who prefers the use of natural keys where their use is … natural, but I certainly have nothing _against_ surrogate keys, only to their overuse, which usually results from an over-reliance on ORMs (which I do have something against). There are a couple of advantages to using UUIDs over auto-incrementing integers (available in PostgreSQL via sequences):

  1. It doesn’t matter in which application layer the UUID is generated, as long as the resulting UUID is compatible with RFC 4122.
  2. Using UUIDs with a random component avoids some types of lock contention issues where multiple SQL sessions simultaneously want to get or insert the next integer in sequence.
  3. When UUIDs are random enough, they can be used as an extra layer of security (through obscurity) or one-off security tokens (such as for account validation).

PostgreSQL has a native UUID type, allowing me to store UUIDs in efficient binary form, while making it very easy to go back and forth to a string representation. Postgres’ native UUID type has a canonical text representation in which the the UUID’s 128 bits are represented as 32 hexadecimal digits punctuated by dashes. Here’s an example:

flashmq=# select gen_random_uuid()::text;
(1 row)

When casting a text UUID, PostgreSQL is quite forgiving about the upper versus lower case and the placement (or complete absence) of the hyphens, and even allows curly braces to surround the UUID. (It is important to remember that types in PostgreSQL do have a canonical text representation, though not for the purpose of this present article.)

The UUID format defined in RFC 4122 is the so-called OSF DCE (or “Leach–Salz”) variant. Within this variant, there are a number of subtypes (henceforth called “versions”, to be consistent with the backward-compatible language in the RFC).

  1. UUID version 1 combines a device’s 48-bit MAC address combined with a 60-bit timestamp.
  2. UUID version 2 is a bit of rarely-used permutation of UUIDv1.
  3. UUID version 3 is an MD5 hash of a namespace identifier and a name within that namespace. This being a hashing algorithm means that the same namespace + name combination should always yield the same UUID. The namespace identifier is itself a UUID, and a list of predefined namespace UUIDs are included in Appendix C of RFC 4122.
  4. UUID version 4 is almost fully random, except for the first 6 to 7 bits which indicate it being a version 4 UUID and which variant thereof.
  5. UUID version 5 is like UUIDv3 except that the hashing algorithm used is SHA-1 instead of MD5.

Natively, PostgreSQL (as of version 15) offers one UUID generation function:

  • gen_random_uuid() can be used to generate version 4 UUIDs.

In addition, the Postgres standard distribution is equipped with the uuid-ossp contrib module, which sports a couple more UUID generation functions:

  • uuid_generate_v1() generates—you guessed it—a version 1 UUID, built from a timestamp and the computer’s MAC address.
  • uuid_generate_v1mc() also generates a UUIDv1, but with a random multicast MAC address.
  • uuid_generate_v3(namespace uuid, name text) “[g]enerates a version 3 UUID in the given namespace using the specified input name.” The hashing algorithm used is MD5.
  • uuid_generate_v4() seems to me to behave the same as the native gen_random_uuid().
  • uuid_generate_v5(namespace uuid, name text) is like uuid_generate_v3(), but does the hashing using SHA-1 instead of MD5, to produce a UUIDv5 instead of a UUIDv3.

From the beginning, I had “simply” opted for version 4 UUIDs, generated using gen_random_uuid() in each surrogate key column’s default expression. In so doing, I also stumbled upon another advantage of UUIDs over good old integer-based surrogate keys:

  1. UUIDs are much easier to create and relate predictable seed or test data in various types of SQL scripts. You can simply generate them and copy-paste them to your script to override the surrogate key column’s default. And then reuse them in foreign keys, etc.

Soon after deciding on using UUIDs as surrogate keys, I learned about a potential disadvantage too: random (version 4) UUIDs—the ones that I was using—cause index fragmentation; subsequently generated values (and thus subsequently inserted rows) may or may not—it is random, after all!—need to be lexically sorted before or after the previously generated UUID.

This issue cannot be simply solved by switching to UUIDv1 (or the uuid_generate_v1mc() variant of the UUIDv1 generation algorithm, from the aforementioned uuid-ossp contrib module). The reason not is that the UUIDv1 timestamp is encoded in such a way that the version 1 UUIDs cannot be sorted sensibly by there byte content. Of course, PostgreSQL being PostgreSQL, one could index on an expression using a function (like uuid_v1_timestamp() from pg-uuid-ext) that extracts the timestamp portion of the v. 1 UUID in such a way that it is sortable. But, so I learned: there are alternatives to UUID version 1 through 5—a couple too many even.

The blog post which introduced me to the UUID sorting problem was written by Tomas Vondra, who also created a Postgres extension with two sequential UUID generation functions: one which relies on sequences, which I wanted to avoid, and another one that is partially time-based: uuid_time_nextval().

The choice would have been easier if Tomas’ sequential-uuids extension had been the only game in town. Instead, there is, for example, also the semi-formal, cummunity-driven ulid spec, with “ulid” standing for “Universally Unique Lexicographically Sortable Identifier”. (I hate that they write their acronym in lowercase in their branding, by the way, which is why, going forward, I will stubbornly yell ULID in all uppercase. 😤) Like Tomas Vondra’s sequential UUIDs:

  • ULIDs are binary compatible with UUIDs; and
  • ULIDs start with a UNIX epoch-based timestamp encoded as a 48-bit integer, sortable both on the byte level and in their text representation.

There are a couple of differences too:

  • The canonical text representation of ULIDs never contains punctuation characters, which supposedly makes them more “URL safe”. (Though I fail to see how hyphens would make a UUID URL unsafe; perhaps the curly braces could, somehow, in some circumstances. 🤷)
  • The use of Crockford’s base32 instead of base16 hexadecimal digits requires only 26 characters for the canonical text representation of a ULID rather than the 36 characters required for UUIDs.

The ULID spec contains a list of ULID implementations in a variety of languages, and the project provides a reference JavaScript implementation. The list doesn’t point to a PostgreSQL implementation, but there are, in fact a number:

  • Scoville’s pgsql-ulid project contains functions to convert ULIDs to UUIDs and visa versa—ulid_to_uuid() and uuid_to_ulid(), respectively—plus a function—parse_ulid() to convert a canonical ULID to a bytea.

    Function signature Return type
    ulid_to_uuid(text) uuid
    uuid_to_ulid(uuid) text
    parse_ulid(text) bytea
  • Geckoboard’s pgulid project provides a generate_ulid() function that returns a text ULID.

    Function signature Return type
    generate_ulid() text
  • Vadosware’s pg_idkit Postgres extension offers a ULID generation (idkit_ulid_generate()) function among a whole bunch of other UUID-ish generator functions.

    Function signature Return type
    idkit_ulid_generate() ?

    I find pg_idkit‘s dependency on Rust a bit much, personally, if I’m just going to use one function, for which I really don’t see the advantage of introducing a complete language runtime + development tooling to compile what could have just been a plain C function.

  • I adore Rust, but dislike extraneous dependencies. For that same reason, I’m reluctant to use spa5k’s uids-postgres extension, which provides 3 ULID generation functions:

    Function signature Return type
    generate_ulid() text
    generate_ulid_bytes() bytea
    generate_ulid_from_string() text

Some of the aforementioned extension contain support for more sequential UUID types than just ULIDs. Some (like spa5k’s uids-postgres) also support KSUIDs, for example. Then, there a couple more extensions, that don’t support ULIDs, but do support, other sequential UUIDs; Modular Finance’s pg-xid, for example, specifically supports another contending sequential UUID spec: XID.

😵 🥴

Overwhelmed by the choices, I let the matter rest for some months, until I encountered the pg_uuidv7 extension, through which I learned about 3 proposed successors to UUID v. 1 through 5:

  1. UUID version 6 very much resembles UUIDv1, but with the timestamp bytes encoded in sortable order. The time stamp is still based off of a Gregorian timestamp source.
  2. UUID version 7 simplifies UUIDv1 by using a UNIX timestamp instead of Gregorian-based timestamp. The least significant 62 bits may be used either for sub-second precision, some sort of sequence counter or filled with pseudo-random data. 👑
  3. UUID version 8 seems to me to be an extra flexible variant, with freedom of which type of timestamp to use, freedom of the precision of that timestamp, and freedom of how to fill the rest of available (less significant) bits after the timestamp.

Knowing that a likely successors to UUIDv1 through v5 are on the horizon, the choice for UUIDv7 seems to me to be the most obvious choice. And pg_uuidv7 seems to me the most straight-forward implementation thereof, because:

  • pg_uuidv7 only supports UUIDv7, which is the only UUID format I need for
  • pg_uuidv7 is implemented in C and thus introduces no additional dependencies.
  • The naming of uuid_generate_v7() is consistent with uuid_generate_v1() through uuid_generate_v5 from the standard uuid-ossp contrib module, which will make it easy for me to swap to that contrib module if/when that function is added there. (Should I do a PR myself based on the uuid_generate_v7() implementation by @fboulnois?)

Before I go on to the full comparison table below, let me highlight a pure PL/pgSQL implementation of UUIDv6 and UUIDv7 as well: this gist by @fabiolimace.

Overview of sequential UUID generation projects for PostgreSQL
Project Ext. Lang. Sequential UUID generation algorithm

UUIDv7 Other
sequential-uuids C 🦆
uids-postgres Rust
pgulid PL/pgSQL
pg-xid PL/pgSQL
pg_idkit Rust 🦆
pg_uuidv7 C’s gist PL/pgSQL

In case you’re wondering, the enemy (Microsoft SQL Server) ships with the non-standard NEWSEQUENTIALID() function.

Semantic version parsing and comparison within PostgreSQL: pg_text_semver

I needed to compare semantic versions within Postgres. I tried the existing semver extension, but when I found that it was a bit problematic to adjust to be fully compliant with the semantic versioning spec, I decided to write a simple domain-based alternative myself: pg_text_semver.

(Version parsing was part of my effort for more robust pg_restore of FlashMQ extension data. I might write more about that in the future.)

pg_extra_time: some data-time functions, operators and casts that I’m surprise are not already part of Postgres

My newest PostgreSQL extension—pg_extra_time—contains the date-time functions, operators and casts that I was surprised to not find in the standard library. Maybe one day…

Again, this extension is a shoot-off from the PostgreSQL work I’m doing for the backend of the high-performant MQTT hosting service, that is due to launch later this quarter.

Like most of the PostgreSQL extensions, pg_extra_time is also available via PGXN—The PostgreSQL Extension Network:

pg_html5_email_address: Postgres email validation in sync. with the HTML5 spec.

Yesterday, I released a tiny PostgreSQL extension that I use to validate email address in the backend of pg_html5_email_address validates email addresses in the same way that modern web browsers do for <input type="email"> elements.

The advantage of having email validation in your data backend to be in sync with the HTML5 spec is twofold:

  1. The HTML5 spec has a more reasonable definition of what constitutes a valid email address than RFC 5322.
  2. It’s convenient to have you HTML5 frontend and PostgreSQL backend follow the same validation rules. Since the backend of our MQTT hosting service is based on PostgREST and the frontend on Vue.js, we find it important that both PostgreSQL and HTML5 + JS have the same ideas about validation.

pg_html5_email_address is available on PGXN:

pg_safer_settings: dealing with PostgreSQL settings more … safely

Yet another PostgreSQL extension that emerged from the FlashMQ PostgreSQL backend is: pg_safer_settings. pg_safer_settings offers some functions and a mechanisms to make working with settings in PostgreSQL a bit safer. Its defining features are:

  1. easy access to database and role-level settings, bypassing session-local and transaction-local settings; and
  2. a type-safe configuration table pattern that allows for IMMUTABLE access to configuration values, and fine-grained access control.

pg_role_fkey_trigger_functions: establishing and maintaining foreign key-like relationships to database ROLEs

Another release from the long list of PostgreSQL extensions that I’m splitting off from the database schemas that make up our rock-solid MQTT hosting platform issss: pg_role_fkey_trigger_functions. That extension name is quite the mouthful. From its readme (again courtesy of pg_readme):

The pg_role_fkey_trigger_functions PostgreSQL extension offers a bunch of trigger functions to help establish and/or maintain referential integrity for columns that reference PostgreSQL ROLE NAMEs.

Many extensions are yet to come, because I’ve found that dependencies in themselves aren’t bad, as long as they are used side by side, and not needlessly stacked on top of each other. That they are no longer 100% internal to the migrations belonging to the app doesn’t worry my, because:

  1. I still maintain them myself anyway, and
  2. the migrations can (and will) use specific extension versions, so that, in fact, no control and cohesion is lost, even while some decoupling and modularity is being won.

pg_rowalesce: COALESCE(), but for rows

Another part of my spin-off PostgreSQL extension series is pg_rowalesce. From its (generated with pg_readme):

The pg_rowalesce PostgreSQL extensions its defining feature is the rowalesce() function. rowalesce() is like coalesce(), but for rows and other composite types. From its arbitrary number of argument rows, for each field/column, rowalesce() takes the value from the first row for which that particular field/column has a not null value.

pg_mockable: inlineable mocking of PostgreSQL functions, like now(), or then()

pg_mockable is one of the spin-off PostgreSQL extensions that I’ve created for the backend of our high-performance MQTT hosting platform: FlashMQ. It’s basic premise is that you (through one of the helper function that the extension provides) create a mockable wrapper function around a function that you want to mock. That wrapper function will be a one-statement SQL function, so that in many cases it will be inlineable.

Then, when during testing you need to mock (for instance now() and its entire family), you can call the mockable.mock() function, which will replace that wrapper function to return the second argument fed to mockable.mock(), until mock.unmock() is called to restore the original wrapper function.

« Older posts

© 2023 BigSmoke

Theme by Anders NorenUp ↑