PostgresQL – evardsson.com: stuff that w0rks

Seeing the post from Aaron the other day on generating UUIDs in PHP got me to thinking about UUIDs in general.

The Why:

So why would we want to use UUIDs? The simple answer is that they are (almost) guaranteed to be a “Universally Unique” ID. So what? Every database table has its own row ids, why do they need to be unique from other database table row ids? Here’s an example:

Jo has three stores X, Y and Z. Each store has a local customer database. That database contains the following table structures:

persons:
  id
  name

phones:
  id
  person_id
  phone_number

(I didn’t say the tables were all that useful, but they work for the sake of discussion.)

Jo looks at the tables from all three stores and decides that it might be a good idea to combine them into one master customer database to share with all the stores. Here is where the trouble starts. When we compare the three databases we find that we cannot simply dump the data from any one of the databases into any of the others.

Store X

Store Y

Store Z

persons
id	name
1	Sam Smith
2	Terry Tyler
phones
id	person_id	phone_number
1	1	(555)555-1212
2	2	(555)555-1234
3	1	(555)555-4321

persons
id	name
1	Perry Paxson
2	Jeremiah Jones
phones
id	person_id	phone_number
1	1	(555)555-2121
2	2	(555)555-4412

persons
id	name
1	Edie Ellerts
2	Francis Fenimore
phones
id	person_id	phone_number
1	1	(555)555-3434
2	2	(555)555-7214
3	2	(555)555-9951

While this makes it simple in terms of looking up a customer’s phone number(s) on a single store, SELECT * FROM persons, phones WHERE phones.person_id = persons.person_id AND persons.name = 'Sam Smith', it leaves us in a tight spot where it comes to combining the databases. We either have to pull all the data programmatically and retain the references from phones to persons and re-insert the whole mess into a new set of tables, or we need to add another field to define which store database the record comes from and change the primary key from the single value (id) to a multiple-field key (id, store_name). This means, of course, that everywhere we reference these values in our code we need to change the SQL. Not a happy solution.

So what if we change the id field type from an integer to a UUID (which is a char(36))? Now our tables would look like:

Store X

persons
id	name
7d20cac1-d558-4a5c-98c0-ee332b554b29	Sam Smith
5fae5033-36e0-42f8-bb50-c00449c6cab0	Terry Tyler
phones
id	person_id	phone_number
82eec34b-897b-40af-9674-fa08e5537d12	7d20cac1-d558-4a5c-98c0-ee332b554b29	(555)555-1212
08400e86-9bd0-4713-bf6e-a5b8cd3e7e22	5fae5033-36e0-42f8-bb50-c00449c6cab0	(555)555-1234
ab64d8d7-0511-4b8e-ab57-9932c3c96b31	7d20cac1-d558-4a5c-98c0-ee332b554b29	(555)555-4321

Store Y

persons
id	name
9ca534e4-2dbd-4dc3-a7b3-98ab8b89fb70	Perry Paxson
4cc10585-b864-45bc-9f20-f9761af82e06	Jeremiah Jones
phones
id	person_id	phone_number
85302b40-c12b-4a05-8dbb-500f6f5cb26f	9ca534e4-2dbd-4dc3-a7b3-98ab8b89fb70	(555)555-2121
106a9524-800c-4e5a-ad4a-6ac9f601b370	4cc10585-b864-45bc-9f20-f9761af82e06	(555)555-4412

Store Z

persons
id	name
4da59fb9-ad00-4a5b-aa00-1bb503fe9c8f	Edie Ellerts
8a39f12c-94c4-4b70-ac70-3cff77be8b01	Francis Fenimore
phones
id	person_id	phone_number
47d0c49e-02df-4366-853b-a0502edb1de6	4da59fb9-ad00-4a5b-aa00-1bb503fe9c8f	(555)555-3434
cb868eca-822f-416a-ac7f-0df12504705f	8a39f12c-94c4-4b70-ac70-3cff77be8b01	(555)555-7214
e70bea84-66e8-407f-bb6c-b164864d22b1	8a39f12c-94c4-4b70-ac70-3cff77be8b01	(555)555-9951

We now have referential data that we can safely combine into a single database, and can even dump back into all the store local database copies.

The How:

When I need to generate my own UUIDs I certainly use the exact same code as Aaron, even cribbed from the exact same source. Of course Python has a UUID module which means you can create a version 1, 3, 4 or 5 UUID quickly and easily. I prefer the version 4 UUIDs myself, since version 1 contains identifiable data about the source, and versions 3 and 5 require other information to generate. (In case you are wondering, there is no version 2).

This is simple enough, but I actually prefer things a little simpler, and when I am using PostgreSQL they do get simpler as I can let the database do the UUID generation. (I am aware of the uuid() function that is built in to MySQL 5, however, that uses version 1 UUIDs which contain system identifiable data. If that is okay with you then by all means feel free to use it.) While there are at least a couple PostgreSQL plugins I have seen for generating UUIDs I actually make use of the CREATE FUNCTION capabilities of PostgreSQL to roll my own. (NOTE: To use this function you must have the PostgreSQL cryptographic libraries installed.)

So first I need to see what I am trying to generate.

From the RFC:

4.4. Algorithms for Creating a UUID from Truly Random or Pseudo-Random Numbers

The version 4 UUID is meant for generating UUIDs from truly-random or
pseudo-random numbers.

The algorithm is as follows:

Set the two most significant bits (bits 6 and 7) of the clock_seq_hi_and_reserved to zero and one, respectively.

Set the four most significant bits (bits 12 through 15) of the time_hi_and_version field to the 4-bit version number from Section 4.1.3.

Set all the other bits to randomly (or pseudo-randomly) chosen values.

The format:
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

[For more in-depth analysis of how a UUID is constructed refer to http://tools.ietf.org/html/rfc4122]

Creating a version 4 UUID requires a bunch of random hexadecimal data, with one hexadecimal character (x) (equivalent to a nibble – 4 bits for those of you who don’t recall) set to the value of 4 (binary 0100 from Section 4.1.3 of the RFC) and one hexadecimal character (x) set to one of 4 values: 8, 9, a or b; which is what the first nibble of a byte is when the two most significant bits are 10. For proof of this see below:

3	2	1	0	Hex	Dec
Bit
1	0	0	0	8	8
1	0	0	1	9	9
1	0	1	0	a	10
1	0	1	1	b	11

Now let’s look at what means we have for generating random data in hexadecimal format. PostgreSQL includes a random() function which generates a double precision number between 0 and 1. We also have a to_hex() function that will convert integer or binary string data to a hexadecimal number. This is a start, but we really don’t want to be calling random() more often than we need to, so is there a way to generate a bunch of hexadecimal data at one shot? Actually, there is, if we just add a sha1 function. So let’s start there:

CREATE OR REPLACE FUNCTION sha1(text)
  RETURNS character AS
$BODY$
BEGIN
RETURN ENCODE(DIGEST($1, 'sha1'), 'hex');
END;
$BODY$
  LANGUAGE 'plpgsql';

Now we have a quick way to generate lots of hexadecimal characters, all we need now is to randomize, and format according to what we have stated above.

CREATE OR REPLACE FUNCTION uuid()
  RETURNS character AS
$BODY$
SELECT SUBSTRING(x.my_rand FROM 1 FOR 8)||'-'||SUBSTRING(x.my_rand FROM 9 FOR 4)||'-4'||SUBSTRING(x.my_rand FROM 13 FOR 3)||'-'||x.clock_1||SUBSTRING(x.my_rand FROM 16 FOR 3)||'-'||SUBSTRING(x.my_rand FROM 19 FOR 12)
FROM  
(SELECT sha1(''||now()||random()) as my_rand, to_hex(8+(3*random())::int) as clock_1) as x;$BODY$
  LANGUAGE 'sql';

Notice that we are using now() to grab the current timestamp (including milliseconds), and concatenating that with the output of random to feed to sha1() to generate a hex digest. We are also generating a value of either 8, 9, a, or b for the first nibble in the clock_seq_hi field (the character highlighted in blue) by adding the value of 3 * random() cast as an integer to 8 and then returning that as a hex value. We then use SUBSTRING to pull the number of characters we need for each chunk and keep on moving left to right in the hash to grab the next values and concatenate them with dashes where needed and our version number.

For this and other PostgreSQL functions I use regularly you can download this SQL file.

evardsson.com: stuff that w0rks

Category: PostgresQL

New Project: SPDO

UUID Why and How

Why PostgreSQL