Finding the right UUID generation algorithm for FlashMQ.com

Early during the development of the PostgreSQL backend for FlashMQ.com—that is our amazing managed MQTT hosting service!—, I decided on using UUIDs rather than auto-incrementing integers wherever I needed or wanted surrogate keys. I am one of those people who prefers the use of natural keys where their use is … natural, but I certainly have nothing _against_ surrogate keys, only to their overuse, which usually results from an over-reliance on ORMs (which I do have something against). There are a couple of advantages to using UUIDs over auto-incrementing integers (available in PostgreSQL via sequences):

It doesn’t matter in which application layer the UUID is generated, as long as the resulting UUID is compatible with RFC 4122.
Using UUIDs with a random component avoids some types of lock contention issues where multiple SQL sessions simultaneously want to get or insert the next integer in sequence.
When UUIDs are random enough, they can be used as an extra layer of security (through obscurity) or one-off security tokens (such as for account validation).

PostgreSQL has a native UUID type, allowing me to store UUIDs in efficient binary form, while making it very easy to go back and forth to a string representation. Postgres’ native UUID type has a canonical text representation in which the the UUID’s 128 bits are represented as 32 hexadecimal digits punctuated by dashes. Here’s an example:

flashmq=# select gen_random_uuid()::text;
           gen_random_uuid
--------------------------------------
 61a1d6e3-9338-4716-8ee9-2e4574527757
(1 row)

When casting a text UUID, PostgreSQL is quite forgiving about the upper versus lower case and the placement (or complete absence) of the hyphens, and even allows curly braces to surround the UUID. (It is important to remember that types in PostgreSQL do have a canonical text representation, though not for the purpose of this present article.)

The UUID format defined in RFC 4122 is the so-called OSF DCE (or “Leach–Salz”) variant. Within this variant, there are a number of subtypes (henceforth called “versions”, to be consistent with the backward-compatible language in the RFC).

UUID version 1 combines a device’s 48-bit MAC address combined with a 60-bit timestamp.
UUID version 2 is a bit of rarely-used permutation of UUIDv1.
UUID version 3 is an MD5 hash of a namespace identifier and a name within that namespace. This being a hashing algorithm means that the same namespace + name combination should always yield the same UUID. The namespace identifier is itself a UUID, and a list of predefined namespace UUIDs are included in Appendix C of RFC 4122.
UUID version 4 is almost fully random, except for the first 6 to 7 bits which indicate it being a version 4 UUID and which variant thereof.
UUID version 5 is like UUIDv3 except that the hashing algorithm used is SHA-1 instead of MD5.

Natively, PostgreSQL (as of version 15) offers one UUID generation function:

gen_random_uuid() can be used to generate version 4 UUIDs.

In addition, the Postgres standard distribution is equipped with the uuid-ossp contrib module, which sports a couple more UUID generation functions:

uuid_generate_v1() generates—you guessed it—a version 1 UUID, built from a timestamp and the computer’s MAC address.
uuid_generate_v1mc() also generates a UUIDv1, but with a random multicast MAC address.
uuid_generate_v3(namespace uuid, name text) “[g]enerates a version 3 UUID in the given namespace using the specified input name.” The hashing algorithm used is MD5.
uuid_generate_v4() seems to me to behave the same as the native gen_random_uuid().
uuid_generate_v5(namespace uuid, name text) is like uuid_generate_v3(), but does the hashing using SHA-1 instead of MD5, to produce a UUIDv5 instead of a UUIDv3.

From the beginning, I had “simply” opted for version 4 UUIDs, generated using gen_random_uuid() in each surrogate key column’s default expression. In so doing, I also stumbled upon another advantage of UUIDs over good old integer-based surrogate keys:

UUIDs are much easier to create and relate predictable seed or test data in various types of SQL scripts. You can simply generate them and copy-paste them to your script to override the surrogate key column’s default. And then reuse them in foreign keys, etc.

Soon after deciding on using UUIDs as surrogate keys, I learned about a potential disadvantage too: random (version 4) UUIDs—the ones that I was using—cause index fragmentation; subsequently generated values (and thus subsequently inserted rows) may or may not—it is random, after all!—need to be lexically sorted before or after the previously generated UUID.

This issue cannot be simply solved by switching to UUIDv1 (or the uuid_generate_v1mc() variant of the UUIDv1 generation algorithm, from the aforementioned uuid-ossp contrib module). The reason not is that the UUIDv1 timestamp is encoded in such a way that the version 1 UUIDs cannot be sorted sensibly by there byte content. Of course, PostgreSQL being PostgreSQL, one could index on an expression using a function (like uuid_v1_timestamp() from pg-uuid-ext) that extracts the timestamp portion of the v. 1 UUID in such a way that it is sortable. But, so I learned: there are alternatives to UUID version 1 through 5—a couple too many even.

The blog post which introduced me to the UUID sorting problem was written by Tomas Vondra, who also created a Postgres extension with two sequential UUID generation functions: one which relies on sequences, which I wanted to avoid, and another one that is partially time-based: uuid_time_nextval().

The choice would have been easier if Tomas’ sequential-uuids extension had been the only game in town. Instead, there is, for example, also the semi-formal, cummunity-driven ulid spec, with “ulid” standing for “Universally Unique Lexicographically Sortable Identifier”. (I hate that they write their acronym in lowercase in their branding, by the way, which is why, going forward, I will stubbornly yell ULID in all uppercase. 😤) Like Tomas Vondra’s sequential UUIDs:

ULIDs are binary compatible with UUIDs; and
ULIDs start with a UNIX epoch-based timestamp encoded as a 48-bit integer, sortable both on the byte level and in their text representation.

There are a couple of differences too:

The canonical text representation of ULIDs never contains punctuation characters, which supposedly makes them more “URL safe”. (Though I fail to see how hyphens would make a UUID URL unsafe; perhaps the curly braces could, somehow, in some circumstances. 🤷)
The use of Crockford’s base32 instead of base16 hexadecimal digits requires only 26 characters for the canonical text representation of a ULID rather than the 36 characters required for UUIDs.

The ULID spec contains a list of ULID implementations in a variety of languages, and the project provides a reference JavaScript implementation. The list doesn’t point to a PostgreSQL implementation, but there are, in fact a number:

Scoville’s pgsql-ulid project contains functions to convert ULIDs to UUIDs and visa versa—ulid_to_uuid() and uuid_to_ulid(), respectively—plus a function—parse_ulid() to convert a canonical ULID to a bytea.

Function signature Return type

ulid_to_uuid(text) uuid

uuid_to_ulid(uuid) text

parse_ulid(text) bytea
Geckoboard’s pgulid project provides a generate_ulid() function that returns a text ULID.

Function signature Return type

generate_ulid() text
Vadosware’s pg_idkit Postgres extension offers a ULID generation (idkit_ulid_generate()) function among a whole bunch of other UUID-ish generator functions.

Function signature Return type

idkit_ulid_generate() ?

I find pg_idkit‘s dependency on Rust a bit much, personally, if I’m just going to use one function, for which I really don’t see the advantage of introducing a complete language runtime + development tooling to compile what could have just been a plain C function.
I adore Rust, but dislike extraneous dependencies. For that same reason, I’m reluctant to use spa5k’s uids-postgres extension, which provides 3 ULID generation functions:

Function signature Return type

generate_ulid() text

generate_ulid_bytes() bytea

generate_ulid_from_string() text

Function signature	Return type
`ulid_to_uuid(text)`	`uuid`
`uuid_to_ulid(uuid)`	`text`
`parse_ulid(text)`	`bytea`

Function signature	Return type
`generate_ulid()`	`text`

Function signature	Return type
`idkit_ulid_generate()`	?

Function signature	Return type
`generate_ulid()`	text
`generate_ulid_bytes()`	bytea
`generate_ulid_from_string()`	text

Some of the aforementioned extension contain support for more sequential UUID types than just ULIDs. Some (like spa5k’s uids-postgres) also support KSUIDs, for example. Then, there a couple more extensions, that don’t support ULIDs, but do support, other sequential UUIDs; Modular Finance’s pg-xid, for example, specifically supports another contending sequential UUID spec: XID.

😵 🥴

Overwhelmed by the choices, I let the matter rest for some months, until I encountered the pg_uuidv7 extension, through which I learned about 3 proposed successors to UUID v. 1 through 5:

UUID version 6 very much resembles UUIDv1, but with the timestamp bytes encoded in sortable order. The time stamp is still based off of a Gregorian timestamp source.
UUID version 7 simplifies UUIDv1 by using a UNIX timestamp instead of Gregorian-based timestamp. The least significant 62 bits may be used either for sub-second precision, some sort of sequence counter or filled with pseudo-random data. 👑
UUID version 8 seems to me to be an extra flexible variant, with freedom of which type of timestamp to use, freedom of the precision of that timestamp, and freedom of how to fill the rest of available (less significant) bits after the timestamp.

Knowing that a likely successors to UUIDv1 through v5 are on the horizon, the choice for UUIDv7 seems to me to be the most obvious choice. And pg_uuidv7 seems to me the most straight-forward implementation thereof, because:

pg_uuidv7 only supports UUIDv7, which is the only UUID format I need for FlashMQ.com.
pg_uuidv7 is implemented in C and thus introduces no additional dependencies.
The naming of uuid_generate_v7() is consistent with uuid_generate_v1() through uuid_generate_v5 from the standard uuid-ossp contrib module, which will make it easy for me to swap to that contrib module if/when that function is added there. (Should I do a PR myself based on the uuid_generate_v7() implementation by @fboulnois?)

Before I go on to the full comparison table below, let me highlight a pure PL/pgSQL implementation of UUIDv6 and UUIDv7 as well: this gist by @fabiolimace.

Overview of sequential UUID generation projects for PostgreSQL
Project	Ext.	Lang.	Sequential UUID generation algorithm
Project	Ext.	Lang.	ULID	KSUID	XID	UUIDv6	UUIDv7	Other
`sequential-uuids`	✅	C						🦆
`uids-postgres`	✅	Rust	✅	✅
`pgulid`		PL/pgSQL	✅
`pg-xid`		PL/pgSQL			✅
`pg_idkit`	✅	Rust	✅	✅	✅	✅	✅	🦆
`pg_uuidv7`	✅	C					✅
https://gist.github.com/fabiolimace’s gist		PL/pgSQL				✅	✅

In case you’re wondering, the enemy (Microsoft SQL Server) ships with the non-standard NEWSEQUENTIALID() function.

Finding the right UUID generation algorithm for FlashMQ.com

Categories

Tags/Keywords

Recent Posts

Recent Comments