Category: PostgresQL

The Postgres RDBMS

Development

New Project: SPDO

I have just posted the (embarrassingly empty) page for my new pet project: SPDO (Sjan’s PDO) – in two flavors: PHP and Python.

There are only about a thousand PDOs out there, and perhaps a dozen or so of them are really functional and (in a few cases) very polished pieces of work. So why am I messing with writing my own? A couple of reasons:

  1. I like to have coding signatures that are consistent, whether I am working with PostgreSQL, MySQL or (heaven help us all) SQLite.
  2. I like to have coding signatures that are (reasonably) consistent across different languages – in this case Python and PHP.
  3. I wanted to take advantage of Prepared Statements where I could, even though the PHP implementations of those are pretty weak (especially in the case of MySQL).

Currently implemented in

  • Python:
    • PostgreSQL (with prepared statements)
    • SQLite (no prepared statements).
  • PHP
    • PostgreSQL (with prepared statements)
    • MySQL (with prepared statements)
    • SQLite (no prepared statements)

Here’s an example of how they work (in most simplistic terms):

#in Python
from pyDB import *
db = pyDB('mysite')
newid = db.insert('INSERT INTO test (name, value) VALUES (?,?)',['foo','bar'])
update_count = db.update('UPDATE test SET value=? WHERE id=?',['baz',newid])
results = db.select('SELECT * FROM test')
for row in results:
    for i in row:
        print "\t", i,"\t", row[i]
delete_count = db.delete('DELETE FROM test WHERE id=?',[newid])
//in PHP
require_once('phpdb.php');
$db = new phpDB('test');
$newid = $db->insert('INSERT INTO test (name, value) VALUES (?,?)',array('foo','bar'));
$update_count = $db->update('UPDATE test SET value=? WHERE id=?',array('baz',newid));
$results = db->select('SELECT * FROM test');
foreach($results as $row)
{
    foreach ($row as $key=>$val)
    {
        print "\t$key\t$val";
    }
}
$delete_count = $db->delete('DELETE FROM test WHERE id=?',array($newid));

The page with links to the code is in the list up top, and everything is MIT license. Enjoy.

Database

UUID Why and How

Seeing the post from Aaron the other day on generating UUIDs in PHP got me to thinking about UUIDs in general.

The Why:

So why would we want to use UUIDs? The simple answer is that they are (almost) guaranteed to be a “Universally Unique” ID. So what? Every database table has its own row ids, why do they need to be unique from other database table row ids? Here’s an example:

Jo has three stores X, Y and Z. Each store has a local customer database. That database contains the following table structures:

persons:
  id
  name

phones:
  id
  person_id
  phone_number

(I didn’t say the tables were all that useful, but they work for the sake of discussion.)

Jo looks at the tables from all three stores and decides that it might be a good idea to combine them into one master customer database to share with all the stores. Here is where the trouble starts. When we compare the three databases we find that we cannot simply dump the data from any one of the databases into any of the others.

Store X Store Y Store Z
persons
id name
1 Sam Smith
2 Terry Tyler
phones
id person_id phone_number
1 1 (555)555-1212
2 2 (555)555-1234
3 1 (555)555-4321
persons
id name
1 Perry Paxson
2 Jeremiah Jones
phones
id person_id phone_number
1 1 (555)555-2121
2 2 (555)555-4412
persons
id name
1 Edie Ellerts
2 Francis Fenimore
phones
id person_id phone_number
1 1 (555)555-3434
2 2 (555)555-7214
3 2 (555)555-9951

While this makes it simple in terms of looking up a customer’s phone number(s) on a single store, SELECT * FROM persons, phones WHERE phones.person_id = persons.person_id AND persons.name = 'Sam Smith', it leaves us in a tight spot where it comes to combining the databases. We either have to pull all the data programmatically and retain the references from phones to persons and re-insert the whole mess into a new set of tables, or we need to add another field to define which store database the record comes from and change the primary key from the single value (id) to a multiple-field key (id, store_name). This means, of course, that everywhere we reference these values in our code we need to change the SQL. Not a happy solution.

So what if we change the id field type from an integer to a UUID (which is a char(36))? Now our tables would look like:

Store X
persons
id name
7d20cac1-d558-4a5c-98c0-ee332b554b29 Sam Smith
5fae5033-36e0-42f8-bb50-c00449c6cab0 Terry Tyler
phones
id person_id phone_number
82eec34b-897b-40af-9674-fa08e5537d12 7d20cac1-d558-4a5c-98c0-ee332b554b29 (555)555-1212
08400e86-9bd0-4713-bf6e-a5b8cd3e7e22 5fae5033-36e0-42f8-bb50-c00449c6cab0 (555)555-1234
ab64d8d7-0511-4b8e-ab57-9932c3c96b31 7d20cac1-d558-4a5c-98c0-ee332b554b29 (555)555-4321
Store Y
persons
id name
9ca534e4-2dbd-4dc3-a7b3-98ab8b89fb70 Perry Paxson
4cc10585-b864-45bc-9f20-f9761af82e06 Jeremiah Jones
phones
id person_id phone_number
85302b40-c12b-4a05-8dbb-500f6f5cb26f 9ca534e4-2dbd-4dc3-a7b3-98ab8b89fb70 (555)555-2121
106a9524-800c-4e5a-ad4a-6ac9f601b370 4cc10585-b864-45bc-9f20-f9761af82e06 (555)555-4412
Store Z
persons
id name
4da59fb9-ad00-4a5b-aa00-1bb503fe9c8f Edie Ellerts
8a39f12c-94c4-4b70-ac70-3cff77be8b01 Francis Fenimore
phones
id person_id phone_number
47d0c49e-02df-4366-853b-a0502edb1de6 4da59fb9-ad00-4a5b-aa00-1bb503fe9c8f (555)555-3434
cb868eca-822f-416a-ac7f-0df12504705f 8a39f12c-94c4-4b70-ac70-3cff77be8b01 (555)555-7214
e70bea84-66e8-407f-bb6c-b164864d22b1 8a39f12c-94c4-4b70-ac70-3cff77be8b01 (555)555-9951

We now have referential data that we can safely combine into a single database, and can even dump back into all the store local database copies.

The How:

When I need to generate my own UUIDs I certainly use the exact same code as Aaron, even cribbed from the exact same source. Of course Python has a UUID module which means you can create a version 1, 3, 4 or 5 UUID quickly and easily. I prefer the version 4 UUIDs myself, since version 1 contains identifiable data about the source, and versions 3 and 5 require other information to generate. (In case you are wondering, there is no version 2).

This is simple enough, but I actually prefer things a little simpler, and when I am using PostgreSQL they do get simpler as I can let the database do the UUID generation. (I am aware of the uuid() function that is built in to MySQL 5, however, that uses version 1 UUIDs which contain system identifiable data. If that is okay with you then by all means feel free to use it.) While there are at least a couple PostgreSQL plugins I have seen for generating UUIDs I actually make use of the CREATE FUNCTION capabilities of PostgreSQL to roll my own. (NOTE: To use this function you must have the PostgreSQL cryptographic libraries installed.)

So first I need to see what I am trying to generate.

From the RFC:

4.4. Algorithms for Creating a UUID from Truly Random or Pseudo-Random Numbers

The version 4 UUID is meant for generating UUIDs from truly-random or
pseudo-random numbers.

The algorithm is as follows:

  • Set the two most significant bits (bits 6 and 7) of the clock_seq_hi_and_reserved to zero and one, respectively.
  • Set the four most significant bits (bits 12 through 15) of the time_hi_and_version field to the 4-bit version number from Section 4.1.3.
  • Set all the other bits to randomly (or pseudo-randomly) chosen values.

The format:
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

[For more in-depth analysis of how a UUID is constructed refer to http://tools.ietf.org/html/rfc4122]

Creating a version 4 UUID requires a bunch of random hexadecimal data, with one hexadecimal character (x) (equivalent to a nibble – 4 bits for those of you who don’t recall) set to the value of 4 (binary 0100 from Section 4.1.3 of the RFC) and one hexadecimal character (x) set to one of 4 values: 8, 9, a or b; which is what the first nibble of a byte is when the two most significant bits are 10. For proof of this see below:

Bit
3 2 1 0 Hex Dec
1 0 0 0 8 8
1 0 0 1 9 9
1 0 1 0 a 10
1 0 1 1 b 11

Now let’s look at what means we have for generating random data in hexadecimal format. PostgreSQL includes a random() function which generates a double precision number between 0 and 1. We also have a to_hex() function that will convert integer or binary string data to a hexadecimal number. This is a start, but we really don’t want to be calling random() more often than we need to, so is there a way to generate a bunch of hexadecimal data at one shot? Actually, there is, if we just add a sha1 function. So let’s start there:

CREATE OR REPLACE FUNCTION sha1(text)
  RETURNS character AS
$BODY$
BEGIN
RETURN ENCODE(DIGEST($1, 'sha1'), 'hex');
END;
$BODY$
  LANGUAGE 'plpgsql';

Now we have a quick way to generate lots of hexadecimal characters, all we need now is to randomize, and format according to what we have stated above.

CREATE OR REPLACE FUNCTION uuid()
  RETURNS character AS
$BODY$
SELECT SUBSTRING(x.my_rand FROM 1 FOR 8)||'-'||SUBSTRING(x.my_rand FROM 9 FOR 4)||'-4'||SUBSTRING(x.my_rand FROM 13 FOR 3)||'-'||x.clock_1||SUBSTRING(x.my_rand FROM 16 FOR 3)||'-'||SUBSTRING(x.my_rand FROM 19 FOR 12)
FROM  
(SELECT sha1(''||now()||random()) as my_rand, to_hex(8+(3*random())::int) as clock_1) as x;$BODY$
  LANGUAGE 'sql';

Notice that we are using now() to grab the current timestamp (including milliseconds), and concatenating that with the output of random to feed to sha1() to generate a hex digest. We are also generating a value of either 8, 9, a, or b for the first nibble in the clock_seq_hi field (the character highlighted in blue) by adding the value of 3 * random() cast as an integer to 8 and then returning that as a hex value. We then use SUBSTRING to pull the number of characters we need for each chunk and keep on moving left to right in the hash to grab the next values and concatenate them with dashes where needed and our version number.

For this and other PostgreSQL functions I use regularly you can download this SQL file.

Database

Why PostgreSQL

This article about answers the five most common excuses people give for not trying PostgreSQL, or for sticking with their proprietary RDBMS (such as Oracle or SQL Server).