SPDO 1.0b Release Candidate 1

March 8th, 2009 by Sjan Evardsson

SPDO version 1.0 beta Release Candidate 1 is out the door. This version adds the following:

Three custom Exception classes: SPDOError, ConnectionFailedError and ForeignKeyViolationError. Which are used (in order) when the program encounters an unknown sql error, a failure to connect to the database, and a violation of a foreign key constraint. Which means, of course, that foreign key constraints are now part of SPDO.

How this was accomplished is through the use of the new DbBuilder class. The DbBuilder class takes a multi-dimensional dict (Python) or array (PHP) of table and field data to build the tables and create triggers in MySQL and SQLite and add foreign keys in PostgreSQL. As an example consider the following:

In Python:

structure = {
    'authors':{
        'id':['primary_auto'],
        'first_name':['varchar(50)'],
        'last_name':['varchar(50)', 'NONULL', 'INDEX']
        },
    'books':{
        'id':['primary_auto'],
        'author_id':['int', "FK_CASCADE('authors','id')", 'INDEX'],
        'title':['varchar(100)','NONULL', 'INDEX']
        }
    }
dbb = DbBuilder(structure)
dbb.create()

In PHP:

$structure = array(
        'authors'=>array(
        'id'=>array('primary_auto'),
        'first_name'=>array('varchar(50)'),
        'last_name'=>array('varchar(50)', 'NONULL', 'INDEX')
    ),
    'books'=>array(
        'id'=>array('primary_auto'),
        'author_id'=>array('int', "FK_CASCADE('authors','id')", 'INDEX'),
        'title'=>array('varchar(100)','NONULL', 'INDEX')
    )
);
$dbb = DbBuilder($structure)
$dbb->create()

This results in the following queries to be executed:

In PostgreSQL:

CREATE TABLE authors (
    first_name VARCHAR(50) ,
    last_name VARCHAR(50) NOT NULL,
    id SERIAL PRIMARY KEY 
 );
CREATE TABLE books (
    author_id INTEGER ,
    id SERIAL PRIMARY KEY ,
    title VARCHAR(100) NOT NULL
);
CREATE INDEX authors_last_name_idx ON authors(last_name_id);
CREATE INDEX books_author_id_idx ON books(author_id);
CREATE INDEX books_title_idx ON books(title);
ALTER TABLE books ADD FOREIGN KEY (author_id) REFERENCES authors(id) ON DELETE CASCADE;

In MySQL:

CREATE TABLE authors (
    id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50) NOT NULL
);
CREATE TABLE books (
    id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
    author_id INTEGER NOT NULL,
    title VARCHAR(100) NOT NULL
);
CREATE TABLE fk_error_msg (
    error_msg VARCHAR(100) NOT NULL PRIMARY KEY
);
INSERT INTO fk_error_msg VALUES ('Foreign Key Constraint Violated!');
ALTER TABLE authors ADD INDEX(last_name);
ALTER TABLE books ADD INDEX(author_id);
ALTER TABLE books ADD INDEX(title);
CREATE TRIGGER fki_books_author_id
    BEFORE INSERT ON books
    FOR EACH ROW BEGIN
        IF
            0 = (SELECT COUNT(*) FROM authors WHERE id=new.author_id)
        THEN
            INSERT fk_error_msg VALUES ('Foreign Key Constraint Violated!');
        END IF;
    END;
CREATE TRIGGER fku_books_author_id
    BEFORE UPDATE ON books
    FOR EACH ROW BEGIN 
        IF
            0 = ( SELECT COUNT(*) FROM authors WHERE id = new.author_id )
        THEN
            INSERT INTO fk_error_msg VALUES ('Foreign Key Constraint Violated!');
        END IF ;
    END;
CREATE TRIGGER fkdc_books_author_id
    BEFORE DELETE ON authors
    FOR EACH ROW BEGIN
        DELETE FROM books WHERE author_id=old.id;
    END;

In SQLite:

CREATE TABLE authors (
    id INTEGER NOT NULL PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50) NOT NULL
);
CREATE TABLE books (
    id INTEGER NOT NULL PRIMARY KEY,
    author_id INTEGER NOT NULL,
    title VARCHAR(100) NOT NULL
);
CREATE INDEX authors_last_name_idx ON authors(last_name);
CREATE INDEX books_author_id_idx ON books(author_id);
CREATE INDEX books_title_idx ON books(title);
CREATE TRIGGER fki_books_author_id
    BEFORE INSERT ON [books]
        FOR EACH ROW BEGIN
            SELECT RAISE(ROLLBACK, 'insert on table "books" violates foreign key constraint "fki_books_author_id"')
            WHERE NEW.author_id IS NOT NULL AND (SELECT id FROM authors WHERE id = NEW.author_id) IS NULL;
        END;
CREATE TRIGGER fku_books_author_id
    BEFORE UPDATE ON [books]
        FOR EACH ROW BEGIN
            SELECT RAISE(ROLLBACK, 'update on table "books" violates foreign key constraint "fku_books_author_id"')
            WHERE NEW.author_id IS NOT NULL AND (SELECT id FROM authors WHERE id = NEW.author_id) IS NULL;
        END;
CREATE TRIGGER fkdc_books_author_id
    BEFORE DELETE ON authors
        FOR EACH ROW BEGIN
            DELETE FROM books WHERE books.author_id = OLD.id;
        END;

Be sure to check out the wiki, and the source code is all readable there, or download and enjoy. There are still a few edits that need to happen in the comments (like versions need to be updated, and changelogs added and there are a few typos probably still lurking about.)

Foreign Key constraints in SPDO

February 25th, 2009 by Sjan Evardsson

While I have long advocated for doing referential integrity in the application space I have also always made use of things like foreign key constraints in PostgreSQL. I know that MySQL has methods for enforcing foreign key constraints in InnoDB tables, but I typically don’t use InnoDB. And until recently I never thought about using SQLite for anything really serious (you can take this to mean I am thinking about using SQLite for something serious). So I began to look into triggers for SQLite to mimic foreign key constraints. This got me thinking about using triggers in MySQL to do the same thing for MyISAM tables. A quick search turned up the answer. I fully expected foreign key constraints in MySQL 6. Of course this thinking comes from the phrase on MySQL documentation for 4.1, 5.0, 5.1, and (sadly) now 6.0:

For storage engines other than InnoDB, MySQL Server parses the FOREIGN KEY syntax in CREATE TABLE statements, but does not use or store it. In the future, the implementation will be extended to store this information in the table specification file so that it may be retrieved by mysqldump and ODBC. At a later stage, foreign key constraints will be implemented for MyISAM tables as well.

Wishful thinking aside, I did get a request to add foreign key constraints to the SPDO implementation, so I think I will get started on adding the following:

  • A SPDOException class to throw back useful information (such as foreign key constraint violation)
  • A method to create tables
  • A method to add foreign key constraints to tables created by SPDO

While I don’t want to get too convoluted with the structure of SPDO, I do want a well-encapsulated and easy-to-use PDO. In that vein the table creation and addition of foreign key constraints may come about via a Table class that provides a few simple methods to do all these things. I tend to determine my data structure before I begin development and build all my tables first, however, if I am careful about how the Table class is constructed I should be able to build all the tables in whatever RDBMS I happen to be using, with the proper triggers and constraints. entirely programmatically.

Portable code

February 1st, 2009 by Sjan Evardsson

When developing applications against a single RDBMS there is no need to worry about portability. However, when you are developing with the idea that the application may be deployed against any of a range of RDBMSes then you need to start thinking about how you formulate your queries, table structures, etc. While it is common practice to have separate classes for working with different databases, there are some things you can do to make that even easier.

ENUMs are EVIL

Developers who spend a lot of time developing against MySQL get into the habit of creating ENUM fields. (I will admit a certain amount of guilt here, too). This is not only not portable, but it doesn’t work quite the way you would expect. Even though the values in the ENUM are meant to be the only values you can insert into that field, MySQL does not enforce this quite the way you might think it should.

As an example:

CREATE TABLE enumevil (
    id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
    evil ENUM('foo','bar','baz') NOT NULL DEFAULT 'foo'
);
Query OK, 0 rows affected (0.01 sec)
INSERT INTO enumevil (evil) VALUES ('bip');
Query OK, 1 row affected, 1 warning (0.02 sec)
SELECT * FROM enumevil;
+----+------+
| id | evil |
+----+------+
|  1 |      | 
+----+------+
1 row IN SET (0.00 sec)

Rather than spitting back an error, it quietly sets the value of the ENUM field to the empty string. That is not in the list of allowed values nor is it the default. So you are left with checking the values in code before inserting (which you should do anyway – see the next section on referential integrity.)

Instead of using an ENUM field consider this:

CREATE TABLE noevil_enum_choice (
    choice VARCHAR(3) NOT NULL PRIMARY KEY
);
INSERT INTO noevil_enum_choice VALUES ('foo');
INSERT INTO noevil_enum_choice VALUES ('bar');
INSERT INTO noevil_enum_choice VALUES ('baz');
CREATE TABLE noevil (
    id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
    choice VARCHAR(3) NOT NULL REFERENCES noevil_enum_choice.choice
);

Although MySQL does not enforce referential integrity, this is now something can be easily converted to PostgreSQL as:

CREATE TABLE noevil_enum_choice (
    choice VARCHAR(3) NOT NULL PRIMARY KEY
);
INSERT INTO noevil_enum_choice VALUES ('foo');
INSERT INTO noevil_enum_choice VALUES ('bar');
INSERT INTO noevil_enum_choice VALUES ('baz');
CREATE TABLE noevil (
    id SERIAL PRIMARY KEY,
    choice VARCHAR(3) NOT NULL REFERENCES noevil_enum_choice(choice)
);

And PostgreSQL does enforce referential integrity. This also makes it easier to expand your choice list by just inserting a new row.

Referential Integrity

On the flip-side of the coin, developers who spend a good deal of time developing against PostgreSQL come to rely on the referential integrity built in to that RDBMS. While there is nothing wrong with that, your code should enforce this as well, if you want to be able to move from one RDBMS to another. In the example above we could rely on PostgreSQL’s REFERENCES statement to kick back an error whenever a bad value (like ‘bip’) is inserted into the choice field of the noevil table. However, as soon as you move your application to MySQL or sqlite it will happily insert anything you like into that field (with MySQL truncating it to three characters).

This is why it is important for your applications to take into consideration their own referential integrity. Here’s some python to illustrate:

import string
from spdo import *
 
class noevil(object):
    __init__(self, choice):
        self.db = SPDO('test')
        self.choices = []
        mydict = self.db.select("SELECT * FROM noevil_enum_choice")
        for c in mydict:
            self.choices.append(c['choice'])
        if choice not in choices:
            ex[] = ["Invalid value for choice: ", choice, " Valid options are: "]
            for c in self.choices:
                ex.append[c]
                ex.append[" "]
            raise Exception(string.join(ex, ''))
        # continue with normal processing here

This is by no means the entirety of the topic, or even more than merest tip of the iceberg, but it is a good place to start.

UUID Why and How

January 15th, 2009 by Sjan Evardsson

Seeing the post from Aaron the other day on generating UUIDs in PHP got me to thinking about UUIDs in general.

The Why:

So why would we want to use UUIDs? The simple answer is that they are (almost) guaranteed to be a “Universally Unique” ID. So what? Every database table has its own row ids, why do they need to be unique from other database table row ids? Here’s an example:

Jo has three stores X, Y and Z. Each store has a local customer database. That database contains the following table structures:

persons:
  id
  name

phones:
  id
  person_id
  phone_number

(I didn’t say the tables were all that useful, but they work for the sake of discussion.)

Jo looks at the tables from all three stores and decides that it might be a good idea to combine them into one master customer database to share with all the stores. Here is where the trouble starts. When we compare the three databases we find that we cannot simply dump the data from any one of the databases into any of the others.

Store X Store Y Store Z
persons
id name
1 Sam Smith
2 Terry Tyler
phones
id person_id phone_number
1 1 (555)555-1212
2 2 (555)555-1234
3 1 (555)555-4321
persons
id name
1 Perry Paxson
2 Jeremiah Jones
phones
id person_id phone_number
1 1 (555)555-2121
2 2 (555)555-4412
persons
id name
1 Edie Ellerts
2 Francis Fenimore
phones
id person_id phone_number
1 1 (555)555-3434
2 2 (555)555-7214
3 2 (555)555-9951

While this makes it simple in terms of looking up a customer’s phone number(s) on a single store, SELECT * FROM persons, phones WHERE phones.person_id = persons.person_id AND persons.name = 'Sam Smith', it leaves us in a tight spot where it comes to combining the databases. We either have to pull all the data programmatically and retain the references from phones to persons and re-insert the whole mess into a new set of tables, or we need to add another field to define which store database the record comes from and change the primary key from the single value (id) to a multiple-field key (id, store_name). This means, of course, that everywhere we reference these values in our code we need to change the SQL. Not a happy solution.

So what if we change the id field type from an integer to a UUID (which is a char(36))? Now our tables would look like:

Store X
persons
id name
7d20cac1-d558-4a5c-98c0-ee332b554b29 Sam Smith
5fae5033-36e0-42f8-bb50-c00449c6cab0 Terry Tyler
phones
id person_id phone_number
82eec34b-897b-40af-9674-fa08e5537d12 7d20cac1-d558-4a5c-98c0-ee332b554b29 (555)555-1212
08400e86-9bd0-4713-bf6e-a5b8cd3e7e22 5fae5033-36e0-42f8-bb50-c00449c6cab0 (555)555-1234
ab64d8d7-0511-4b8e-ab57-9932c3c96b31 7d20cac1-d558-4a5c-98c0-ee332b554b29 (555)555-4321
Store Y
persons
id name
9ca534e4-2dbd-4dc3-a7b3-98ab8b89fb70 Perry Paxson
4cc10585-b864-45bc-9f20-f9761af82e06 Jeremiah Jones
phones
id person_id phone_number
85302b40-c12b-4a05-8dbb-500f6f5cb26f 9ca534e4-2dbd-4dc3-a7b3-98ab8b89fb70 (555)555-2121
106a9524-800c-4e5a-ad4a-6ac9f601b370 4cc10585-b864-45bc-9f20-f9761af82e06 (555)555-4412
Store Z
persons
id name
4da59fb9-ad00-4a5b-aa00-1bb503fe9c8f Edie Ellerts
8a39f12c-94c4-4b70-ac70-3cff77be8b01 Francis Fenimore
phones
id person_id phone_number
47d0c49e-02df-4366-853b-a0502edb1de6 4da59fb9-ad00-4a5b-aa00-1bb503fe9c8f (555)555-3434
cb868eca-822f-416a-ac7f-0df12504705f 8a39f12c-94c4-4b70-ac70-3cff77be8b01 (555)555-7214
e70bea84-66e8-407f-bb6c-b164864d22b1 8a39f12c-94c4-4b70-ac70-3cff77be8b01 (555)555-9951

We now have referential data that we can safely combine into a single database, and can even dump back into all the store local database copies.

The How:

When I need to generate my own UUIDs I certainly use the exact same code as Aaron, even cribbed from the exact same source. Of course Python has a UUID module which means you can create a version 1, 3, 4 or 5 UUID quickly and easily. I prefer the version 4 UUIDs myself, since version 1 contains identifiable data about the source, and versions 3 and 5 require other information to generate. (In case you are wondering, there is no version 2).

This is simple enough, but I actually prefer things a little simpler, and when I am using PostgreSQL they do get simpler as I can let the database do the UUID generation. (I am aware of the uuid() function that is built in to MySQL 5, however, that uses version 1 UUIDs which contain system identifiable data. If that is okay with you then by all means feel free to use it.) While there are at least a couple PostgreSQL plugins I have seen for generating UUIDs I actually make use of the CREATE FUNCTION capabilities of PostgreSQL to roll my own. (NOTE: To use this function you must have the PostgreSQL cryptographic libraries installed.)

So first I need to see what I am trying to generate.

From the RFC:

4.4. Algorithms for Creating a UUID from Truly Random or Pseudo-Random Numbers

The version 4 UUID is meant for generating UUIDs from truly-random or
pseudo-random numbers.

The algorithm is as follows:

  • Set the two most significant bits (bits 6 and 7) of the clock_seq_hi_and_reserved to zero and one, respectively.
  • Set the four most significant bits (bits 12 through 15) of the time_hi_and_version field to the 4-bit version number from Section 4.1.3.
  • Set all the other bits to randomly (or pseudo-randomly) chosen values.

The format:
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

[For more in-depth analysis of how a UUID is constructed refer to http://tools.ietf.org/html/rfc4122]

Creating a version 4 UUID requires a bunch of random hexadecimal data, with one hexadecimal character (x) (equivalent to a nibble – 4 bits for those of you who don’t recall) set to the value of 4 (binary 0100 from Section 4.1.3 of the RFC) and one hexadecimal character (x) set to one of 4 values: 8, 9, a or b; which is what the first nibble of a byte is when the two most significant bits are 10. For proof of this see below:

Bit
3 2 1 0 Hex Dec
1 0 0 0 8 8
1 0 0 1 9 9
1 0 1 0 a 10
1 0 1 1 b 11

Now let’s look at what means we have for generating random data in hexadecimal format. PostgreSQL includes a random() function which generates a double precision number between 0 and 1. We also have a to_hex() function that will convert integer or binary string data to a hexadecimal number. This is a start, but we really don’t want to be calling random() more often than we need to, so is there a way to generate a bunch of hexadecimal data at one shot? Actually, there is, if we just add a sha1 function. So let’s start there:

CREATE OR REPLACE FUNCTION sha1(text)
  RETURNS character AS
$BODY$
BEGIN
RETURN ENCODE(DIGEST($1, 'sha1'), 'hex');
END;
$BODY$
  LANGUAGE 'plpgsql';

Now we have a quick way to generate lots of hexadecimal characters, all we need now is to randomize, and format according to what we have stated above.

CREATE OR REPLACE FUNCTION uuid()
  RETURNS character AS
$BODY$
SELECT SUBSTRING(x.my_rand FROM 1 FOR 8)||'-'||SUBSTRING(x.my_rand FROM 9 FOR 4)||'-4'||SUBSTRING(x.my_rand FROM 13 FOR 3)||'-'||x.clock_1||SUBSTRING(x.my_rand FROM 16 FOR 3)||'-'||SUBSTRING(x.my_rand FROM 19 FOR 12)
FROM  
(SELECT sha1(''||now()||random()) as my_rand, to_hex(8+(3*random())::int) as clock_1) as x;$BODY$
  LANGUAGE 'sql';

Notice that we are using now() to grab the current timestamp (including milliseconds), and concatenating that with the output of random to feed to sha1() to generate a hex digest. We are also generating a value of either 8, 9, a, or b for the first nibble in the clock_seq_hi field (the character highlighted in blue) by adding the value of 3 * random() cast as an integer to 8 and then returning that as a hex value. We then use SUBSTRING to pull the number of characters we need for each chunk and keep on moving left to right in the hash to grab the next values and concatenate them with dashes where needed and our version number.

For this and other PostgreSQL functions I use regularly you can download this SQL file.

Drizzle – Lightweight DB Based on MySQL

July 23rd, 2008 by Sjan Evardsson

Brain Acker, director of architecture for MySQL has opened the door to Drizzle, a light-weight, high-concurrency database server based on MySQL, targeted at web applications.

The architectural ideas as described in the FAQ:

A micro-kernel that we then extend to add what we need (all additions come through interfaces that can be compiled/loaded in as needed). The target for the project is web infrastructure backend and cloud components.

The FAQ goes on explain the differences between Drizzle and MySQL which include:

No modes, views, triggers, prepared statements, stored procedures, query cache, data conversion inserts, ACL. Fewer data types.  Less engines, less code. Assume the primary engine is transactional.

Michael Widenius, founder and original MySQL developer, explains more about Drizzle in this blog post, including the most interesting piece (to me) – that Drizzle will always contain the most up-to-date InnoDB code, meaning you don’t need to wait around for MySQL 6 or download the plugins from Oracle each year to get the latest and greatest.

While Drizzle is still in development you can check out the code and try it out. Or maybe even get involved and help out. More information on how to do both is available on the wiki.

Really BIG Databases

February 15th, 2007 by Sjan Evardsson

Found this article while digging this evening.

It seems that topping this ‘Top 10′ list for large databases is The World Data Center for Climate (WDCC), operated by the Max Planck Institute for Meteorology and the German Climate Computing Center.

The WDCC has 220 terabytes of web-accessible data, plus another 6 petabytes of additional data. (Three times the total information in all the US academic research libraries combined.) I would hate to see how that ran on anything slower than their 35 million euro supercomputer!

I should also mention that in the comments section several even larger databases are mentioned in the traditional, post-top-10-anything-countdown, hey-what-about-so-and-so fest.

A simple intro to database normalization

August 30th, 2006 by Sjan Evardsson

I found a very clear, well-written introductory example to database normalization on devshed. Although it is in the MySQL portion of the site, it applies equally well across the board to other RDBMSs.

To get more details on normalization, the normal forms, and general good database development in general, check out Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design, Second Edition by Michael J. Hernandez. Without a doubt the most useful db development book I’ve ever laid my hands on.

Why PostgreSQL

March 18th, 2006 by Sjan Evardsson

This article about answers the five most common excuses people give for not trying PostgreSQL, or for sticking with their proprietary RDBMS (such as Oracle or SQL Server).