Smokes your problems, coughs fresh air.

Tag: C

Finding the right UUID generation algorithm for FlashMQ.com

Early during the development of the PostgreSQL backend for FlashMQ.com—that is our amazing managed MQTT hosting service!—, I decided on using UUIDs rather than auto-incrementing integers wherever I needed or wanted surrogate keys. I am one of those people who prefers the use of natural keys where their use is … natural, but I certainly have nothing _against_ surrogate keys, only to their overuse, which usually results from an over-reliance on ORMs (which I do have something against). There are a couple of advantages to using UUIDs over auto-incrementing integers (available in PostgreSQL via sequences):

  1. It doesn’t matter in which application layer the UUID is generated, as long as the resulting UUID is compatible with RFC 4122.
  2. Using UUIDs with a random component avoids some types of lock contention issues where multiple SQL sessions simultaneously want to get or insert the next integer in sequence.
  3. When UUIDs are random enough, they can be used as an extra layer of security (through obscurity) or one-off security tokens (such as for account validation).

PostgreSQL has a native UUID type, allowing me to store UUIDs in efficient binary form, while making it very easy to go back and forth to a string representation. Postgres’ native UUID type has a canonical text representation in which the the UUID’s 128 bits are represented as 32 hexadecimal digits punctuated by dashes. Here’s an example:

flashmq=# select gen_random_uuid()::text;
           gen_random_uuid
--------------------------------------
 61a1d6e3-9338-4716-8ee9-2e4574527757
(1 row)

When casting a text UUID, PostgreSQL is quite forgiving about the upper versus lower case and the placement (or complete absence) of the hyphens, and even allows curly braces to surround the UUID. (It is important to remember that types in PostgreSQL do have a canonical text representation, though not for the purpose of this present article.)

The UUID format defined in RFC 4122 is the so-called OSF DCE (or “Leach–Salz”) variant. Within this variant, there are a number of subtypes (henceforth called “versions”, to be consistent with the backward-compatible language in the RFC).

  1. UUID version 1 combines a device’s 48-bit MAC address combined with a 60-bit timestamp.
  2. UUID version 2 is a bit of rarely-used permutation of UUIDv1.
  3. UUID version 3 is an MD5 hash of a namespace identifier and a name within that namespace. This being a hashing algorithm means that the same namespace + name combination should always yield the same UUID. The namespace identifier is itself a UUID, and a list of predefined namespace UUIDs are included in Appendix C of RFC 4122.
  4. UUID version 4 is almost fully random, except for the first 6 to 7 bits which indicate it being a version 4 UUID and which variant thereof.
  5. UUID version 5 is like UUIDv3 except that the hashing algorithm used is SHA-1 instead of MD5.

Natively, PostgreSQL (as of version 15) offers one UUID generation function:

  • gen_random_uuid() can be used to generate version 4 UUIDs.

In addition, the Postgres standard distribution is equipped with the uuid-ossp contrib module, which sports a couple more UUID generation functions:

  • uuid_generate_v1() generates—you guessed it—a version 1 UUID, built from a timestamp and the computer’s MAC address.
  • uuid_generate_v1mc() also generates a UUIDv1, but with a random multicast MAC address.
  • uuid_generate_v3(namespace uuid, name text) “[g]enerates a version 3 UUID in the given namespace using the specified input name.” The hashing algorithm used is MD5.
  • uuid_generate_v4() seems to me to behave the same as the native gen_random_uuid().
  • uuid_generate_v5(namespace uuid, name text) is like uuid_generate_v3(), but does the hashing using SHA-1 instead of MD5, to produce a UUIDv5 instead of a UUIDv3.

From the beginning, I had “simply” opted for version 4 UUIDs, generated using gen_random_uuid() in each surrogate key column’s default expression. In so doing, I also stumbled upon another advantage of UUIDs over good old integer-based surrogate keys:

  1. UUIDs are much easier to create and relate predictable seed or test data in various types of SQL scripts. You can simply generate them and copy-paste them to your script to override the surrogate key column’s default. And then reuse them in foreign keys, etc.

Soon after deciding on using UUIDs as surrogate keys, I learned about a potential disadvantage too: random (version 4) UUIDs—the ones that I was using—cause index fragmentation; subsequently generated values (and thus subsequently inserted rows) may or may not—it is random, after all!—need to be lexically sorted before or after the previously generated UUID.

This issue cannot be simply solved by switching to UUIDv1 (or the uuid_generate_v1mc() variant of the UUIDv1 generation algorithm, from the aforementioned uuid-ossp contrib module). The reason not is that the UUIDv1 timestamp is encoded in such a way that the version 1 UUIDs cannot be sorted sensibly by there byte content. Of course, PostgreSQL being PostgreSQL, one could index on an expression using a function (like uuid_v1_timestamp() from pg-uuid-ext) that extracts the timestamp portion of the v. 1 UUID in such a way that it is sortable. But, so I learned: there are alternatives to UUID version 1 through 5—a couple too many even.

The blog post which introduced me to the UUID sorting problem was written by Tomas Vondra, who also created a Postgres extension with two sequential UUID generation functions: one which relies on sequences, which I wanted to avoid, and another one that is partially time-based: uuid_time_nextval().

The choice would have been easier if Tomas’ sequential-uuids extension had been the only game in town. Instead, there is, for example, also the semi-formal, cummunity-driven ulid spec, with “ulid” standing for “Universally Unique Lexicographically Sortable Identifier”. (I hate that they write their acronym in lowercase in their branding, by the way, which is why, going forward, I will stubbornly yell ULID in all uppercase. 😤) Like Tomas Vondra’s sequential UUIDs:

  • ULIDs are binary compatible with UUIDs; and
  • ULIDs start with a UNIX epoch-based timestamp encoded as a 48-bit integer, sortable both on the byte level and in their text representation.

There are a couple of differences too:

  • The canonical text representation of ULIDs never contains punctuation characters, which supposedly makes them more “URL safe”. (Though I fail to see how hyphens would make a UUID URL unsafe; perhaps the curly braces could, somehow, in some circumstances. 🤷)
  • The use of Crockford’s base32 instead of base16 hexadecimal digits requires only 26 characters for the canonical text representation of a ULID rather than the 36 characters required for UUIDs.

The ULID spec contains a list of ULID implementations in a variety of languages, and the project provides a reference JavaScript implementation. The list doesn’t point to a PostgreSQL implementation, but there are, in fact a number:

  • Scoville’s pgsql-ulid project contains functions to convert ULIDs to UUIDs and visa versa—ulid_to_uuid() and uuid_to_ulid(), respectively—plus a function—parse_ulid() to convert a canonical ULID to a bytea.

    Function signature Return type
    ulid_to_uuid(text) uuid
    uuid_to_ulid(uuid) text
    parse_ulid(text) bytea
  • Geckoboard’s pgulid project provides a generate_ulid() function that returns a text ULID.

    Function signature Return type
    generate_ulid() text
  • Vadosware’s pg_idkit Postgres extension offers a ULID generation (idkit_ulid_generate()) function among a whole bunch of other UUID-ish generator functions.

    Function signature Return type
    idkit_ulid_generate() ?

    I find pg_idkit‘s dependency on Rust a bit much, personally, if I’m just going to use one function, for which I really don’t see the advantage of introducing a complete language runtime + development tooling to compile what could have just been a plain C function.

  • I adore Rust, but dislike extraneous dependencies. For that same reason, I’m reluctant to use spa5k’s uids-postgres extension, which provides 3 ULID generation functions:

    Function signature Return type
    generate_ulid() text
    generate_ulid_bytes() bytea
    generate_ulid_from_string() text

Some of the aforementioned extension contain support for more sequential UUID types than just ULIDs. Some (like spa5k’s uids-postgres) also support KSUIDs, for example. Then, there a couple more extensions, that don’t support ULIDs, but do support, other sequential UUIDs; Modular Finance’s pg-xid, for example, specifically supports another contending sequential UUID spec: XID.

😵 🥴

Overwhelmed by the choices, I let the matter rest for some months, until I encountered the pg_uuidv7 extension, through which I learned about 3 proposed successors to UUID v. 1 through 5:

  1. UUID version 6 very much resembles UUIDv1, but with the timestamp bytes encoded in sortable order. The time stamp is still based off of a Gregorian timestamp source.
  2. UUID version 7 simplifies UUIDv1 by using a UNIX timestamp instead of Gregorian-based timestamp. The least significant 62 bits may be used either for sub-second precision, some sort of sequence counter or filled with pseudo-random data. 👑
  3. UUID version 8 seems to me to be an extra flexible variant, with freedom of which type of timestamp to use, freedom of the precision of that timestamp, and freedom of how to fill the rest of available (less significant) bits after the timestamp.

Knowing that a likely successors to UUIDv1 through v5 are on the horizon, the choice for UUIDv7 seems to me to be the most obvious choice. And pg_uuidv7 seems to me the most straight-forward implementation thereof, because:

  • pg_uuidv7 only supports UUIDv7, which is the only UUID format I need for FlashMQ.com.
  • pg_uuidv7 is implemented in C and thus introduces no additional dependencies.
  • The naming of uuid_generate_v7() is consistent with uuid_generate_v1() through uuid_generate_v5 from the standard uuid-ossp contrib module, which will make it easy for me to swap to that contrib module if/when that function is added there. (Should I do a PR myself based on the uuid_generate_v7() implementation by @fboulnois?)

Before I go on to the full comparison table below, let me highlight a pure PL/pgSQL implementation of UUIDv6 and UUIDv7 as well: this gist by @fabiolimace.

Overview of sequential UUID generation projects for PostgreSQL
Project Ext. Lang. Sequential UUID generation algorithm
ULID KSUID XID UUIDv6

UUIDv7 Other
sequential-uuids C 🦆
uids-postgres Rust
pgulid PL/pgSQL
pg-xid PL/pgSQL
pg_idkit Rust 🦆
pg_uuidv7 C
https://gist.github.com/fabiolimace’s gist PL/pgSQL

In case you’re wondering, the enemy (Microsoft SQL Server) ships with the non-standard NEWSEQUENTIALID() function.

Rapidly firing myself in the foot with C pointers

Now that I am dedicated to becoming a somewhat decent C programmer, I need to master pointers. Last week, I leveled up in my pointer usage by debugging a particularly nasty segfault. It took the help of gdb (the GNU Project Debugger) for me to notice that my segfault was accompanied by very weird values for some increment counters, while the pointer involved was a char* pointer, not a pointer to an int.

GDB

First, some notes on the GNU Project Debugger: it’s excellent! And … it’s easy to use. I have no idea why looong ago, when as a budding programmer I was trying to use it, I had so much trouble using it that it stuck into my memory as a very tough tool to use. Well, this is not the same brain anymore, so time to get rid of all these printf() statements everywhere (that I wouldn’t dare use in a programming language that I do have some fluency in, mind you!) [lines of shame: L45, L100, L101, L119 ].

With the help of gdb xjot-to-xml (and then, from within GDB, run < my-test-file.xjot), I noticed that some of the ints I used to track byte, line and column positiion had ridiculously high values for the input line, especially since I was pretty sure that my program crashed already on the first character.

In GDB, such things are easy to find out: you can very simply set a breakpoint anywhere:

break 109
run < tests/element-with-inline-content.xjot
Starting program: /home/bigsmoke/git/xjot/xjot-to-xml < tests/element-with-inline-content.xml
a
b

Breakpoint 1, _xjot_to_xml_with_buf (in_fd=537542260, out_fd=1852140901, buf=0x6c652d746f6f723c, buf_size=1024)
    at xjot_to_xml.c:109
109                 byte = ((char*)buf)[0];

From there, after hitting the breakpoint, I can check the content of the variable n that holds the number of bytes read by read() into buf.

print n
$1 = 130

So, the read() function had read 130 bytes into the buffer. Which makes sense, because element-with-inline-content.xjot was 128 characters, and the buffer, at 1024 bytes, is more than sufficient to hold it all.

But, then line_no and col_no variables:

(gdb) print line_no
$2 = 1702129263
(gdb) print col_no
$4 = 1092645999

It took me a while to realize that this must have been due to a buffer overrun. Finally, I noticed that I was feeding the address of the buf pointer to read() instead of the value of the pointer.

(I only just managed to fix it before Wiebe, out of curiosity to my C learning project, glanced at my code and immediately spotted the bug.)

The value of pointers

C is a typed language, but that doesn't mean that you cannot still very easily shoot yourself in the foot with types, and, this being C, it means that it's easiest to shoot yourself in the foot with the difference between pointers and non-pointers.

I initialized my buffer as a plain char array of size XJOT_TO_XML_BUFFER_SIZE_IN_BYTES. Then, the address of that array is passed to the _xjot_to_xml_with_buf() function. This function expects a buf parameter of type void*. (void* pointers can point to addresses of any type; I picked this “type”, because read() wants its buffer argument to be of that type.)

What went wrong is that I then took the address of void* buf, which is already a pointer. That is to say: the value of buf is the address of buffer which I passed to _xjot_to_xml_with_buf() from xjot_to_xml().

When I then took the address of the void* buf variable itself, and passed it to read(), read() started overwriting the memory in the stack starting at that address, thus garbling the values of line_no and col_no in the process.

The take-home message is: pointers are extremely useful, once you develop an intuition of what they're pointing at. Until that time, you must keep shooting yourself in the foot, because errors are, as Huberman says, the thing that triggers neuroplasticity.

WW challenge 1: learning better C by working on XJot

Since the beginning of this month (October 2021), I become officially jobless, after 6 years at YTEC. That’s not so much of a problem for a software developer in 2021—especially one in the Dutch IT industry, where there has been an enormous shortage of skilled developers for years. However… I don’t (yet) want a new job as a software developer, because: in the programming food pyramid, I’m a mere scripter. That is, the language in which I’m most proficient are all very high-level languages: Python, PHP, XSLT, Bash, JavaScript, Ruby, C# (in order of decreasing (recency of) experience. I have never mastered a so-called systems language: C, C++, Rust.

Now, because of the circumstances in which YTEC and I decided to terminate the contract, I am entitled to government support until I find a new job. During my last job, I’ve learned a lot, but almost all I learned made me even more high-level and business-leaning than I already was. I’ve turned into some sort of automation & integration consultant / business analyst / project manager. And, yes, I’ve sharpened some of my code organization, automated testing skills as well. Plus I now know how to do proper monitoring. All very nice and dandy. But, what I’m still missing are ① hardcore, more low-level technical skills, as well as ② front-end, more UX-oriented skills. Only with some of those skills under my belt will I be able to do the jobs I really want to do—jobs that involve ① code that has to actually work efficient on the hardware level—i.e., code that doesn’t eat up all the hardware resources and suck your battery (or your “green” power net) empty; and I’m interested in making (website) application actually pleasant (and accessible!) to use.

If I start applying for something resembling my previous job, I will surely find something soon enough. However, first I am to capture these new skills, if I also wish to continue to grow my skill-level, my self-respect, as well as my earning (and thus tax paying) potential. Hence, the WW challenge. WW is the abbreviation for Wet Werkeloosheid, a type of welfare that Dutchies such as myself get when they temporarily are without job, provided that you were either ⓐ fired or ⓑ went away in “mutual understanding” (as I was). If ⓒ you simply quit, you don’t have the right to WW (but there are other fallbacks in the welfare state of the NL).

Anyways, every month, I have to provide the UWV—the governmental organization overseeing people in the WW—with evidence of my job-seeking activities. Since I decided that I want to deepen my knowledge instead of jumping right back into the pool of IT minions, I will set myself challenges that require the new skills I desire.

My first goal is to become more comfortable with the C programming language. I have some experience with C, but my skill level is rudimentary at best. My most recent attempt to become more fluent in C was that I participated in the 2020 Advent of Code challenge. I didn’t finish that attempt, because, really, I’m a bread programmer. Meaning: before or after a 8+-hour day at the office, I have very little desire to spend my free time doing even more programming. To stay sane (and steer clear of burnout), that time is much-needed for non-digital activities in social space, in nature, and by inhabiting my physical body.

So now I do have time and energy to learn some new skills that really require sustained focus over longer periods of time to acquire. On the systems programming language level, besides C, I’m also interested in familiarizing myself with C++ and Rust. And it is a well-known facts that C knowledge transfers very well to these two languages.

Okay, instead of doing silly Christmas puzzles, this time I’ve resumed my practice by writing an actual program in C that is useful to me: xjot-to-xml. It will be a utility to convert XJot documents to regular XML. XJot is an abbreviated notation for XML, to make it more pleasant to author XML documents—especially articles and books.

Using ROT13 to circumvent Gmail’s .exe filter

Wiebe reminded me by mail to send him a Windows executable. I use Gmail. Gmail doesn’t allow me to send archives containing .exe files. Something to do with viruses on Windows. Wiebe doesn’t use Windows. He uses Wine. I just want to send the damn executable.

Last time I wanted to do this, I emerged net-mail/email. But that just wasn’t very cool. A while ago someone also suggested using a password-protected zip file, but I find that even less cool because I hate zip files. (I usually prefer bzipped tarballs.)

ROT13 is a variation of ROT3, the Ceasar Cipher, which is as old as it is insecure. For computer use ROT13 is even cooler, because you can decrypt just as you encrypt. Decrypt and encrypt are the same. Let’s grab the simplestshortest C implementation we can find, rot13-shortest.c:

main(a){while(a=~getchar())putchar(~a-1/(~(a|32)/13*2-11)*13);}

Compile it and use it as a filter to encode a file:

rot13-shortest < suspicious.tar.bz2 > suspicious.tar.bz2.r13

To decode (to “decrypt” is too grand a verb), just swap the input and output files:

rot13-shortest < suspicious.tar.bz2.r13 > suspicious.tar.bz2

Now, it’s time for me to hand in my Geek Card. Firstly, I confused ROT13 with ROT3 (even spending some time trying to find a ROT3 implementation). Secondly, I had never used ROT13 before, not even to by-pass a forum filter or to make a joke which is funnier if you wear a pen-protector. EBG13 wbxrf whfg nera’g shaal.

© 2023 BigSmoke

Theme by Anders NorenUp ↑