Category: Best Practices

There are good ways, and then there are better ones. Can we find the best?

Best Practices

How intranet software goes to hell

We have all seen it, many of us have tried to clean it up, and a few of us may even have been responsible for some of the worst-written, undocumented, buggy, spaghetti-like code ever – “internal use only” apps. These are apps that are meant to simplify the jobs of your co-workers, automate repetitive tasks and help manage the company’s business. So how do they end up so terrible? You’ve got the hottest, leanest, cleanest code on the public-facing side, so you obviously have the talent in-house to make good software. (And let’s be honest, all of us think our own software is the best, because if we didn’t we would die of shame whenever anyone asked where we worked.)

So how does the software we build for ourselves go so wrong? Well, in my observations through many jobs over many years, I have come up with a formula for really lousy internal software.

Step one: Start small
By start small I don’t mean start with a single database with 4 or 5 tables and a couple views and a few report generation scripts. I mean start really, really small. Like “put a page on the intranet that lists all our vendors and their current status.”

By starting with such a small task it is easier to forgo any sort of documentation, architecture planning or requirements specifications. It’s also easier to convince yourself that this is unimportant. After all, this is merely a convenience for your fellow workers and not an integral part of the revenue stream. This is the first step on the road to ruin.

Step two: Occasionally add a feature, not too much at once
It is important at this early stage in the gestation of your beastly code that you keep feature adds at least as small as the original task. By not having anything “worthy” of architecture or specification you can guarantee the continued growth of your new monster. These should be things like “Can we also show the vendor’s contact info on that list?” followed a month later by “Can we filter the list to only show active vendors?” These changes should not only be small, but should be spaced far enough apart that the developer involved has forgotten about the changes that came before, or at least how many there were.

Step three: Repeat steps one and two, several times
Now that you have a minor little thing here, it is time to add some more. This time, let’s do the same thing, but for, say, clients. Because you already have the basics, it is a perfect time for some copy-and-paste development. Change the query, but don’t bother with changing variable names or anything. After all, you already know it works; just use it as is with some text label changes on the output. Easy-peasy, and it took you about five minutes. At this rate, you could just as quickly add the same sort of thing for employees. And any other sort of list that comes up.

Step four: Time for a big change
Now it becomes time to turn all your “unrelated” (although code-copied) little, unimportant, non-revenue-stream items into one full-fledged app. Since you are already convinced that none of this is very important, and most of it is already built and functioning, it is easy to convince yourself that turning this into one contact management app is a small enough task to not need architecture, requirements or even any real documentation. This is generally where the real shape of the beast starts to take form. Now your query and display scripts will need to be able to insert, update and delete, and your one display will need to be diversified into display, edit forms, and perhaps a login page to ensure the person using the forms has permission to edit or delete.

If you really want to do it up right, instead of turning it into the obvious (in this case a contact management app) turn it into something close, but not quite the same. Say, an inventory and order management app. Hey, we already have the client and vendor info, we’re more than halfway there, right?

Step five: The final chapter
The last step is perhaps the easiest. Once the monstrosity is running on your intranet and working (however badly), ignore it. Requests for bug fixes go to the bottom of the queue as it is, after all, not part of your revenue stream. Developer time is better spent on your customer-facing apps and there really is no need to make it work completely, because “we got along fine without it before it was built.”

Of course, during this waiting period the app that horror built becomes a routine part of the workflow of those who use it regularly and they pretty well can’t do their job without it any more.

How to avoid it altogether, in one simple step:
In my experience, the simplest way to avoid these kinds of nightmare creatures of code is to require a full architecture, specification and documentation cycle for even the simplest little things. You are likely to find that even though you were only asked for a vendor list, what your co-workers really need is far beyond that. Of course, you will only bother treating it like any other development cycle if you can see the project as an important part of your business, one with an impact on the revenue stream. If it seems too small to bother treating it like a full project, then either the requester has failed to make its importance clear, or it really is something that should not even be taken on.

Best Practices

Daylight Saving Time Headaches

I have never been particularly fond of the concept of Daylight Saving Time (cutting a strip off one end of a blanket and sewing it to the other end does not make a longer blanket). This time around, though, I ran into an issue involving the perfect combination of a monthly cron job, a server set to local time, and the switch from Daylight Saving to Standard Time on the first of the month.

At precisely 1:14 am on the first day of the month the cron job ran, as it does the first day of every month, and picked a raffle winner for one of our client’s monthly contests. At 2:00 am the time on the server rolled back to 1:00 am in accordance with the switch to Standard Time for the US. Fourteen minutes later the job ran again, and picked another winner.

Whoops. Now our system has awarded two people a single prize. Telling the second winner that they don’t actually get the prize they didn’t really win would not score us any points with the client, as their customer would be upset. Likewise, charging the client for the second prize is a non-starter, as it is, in fact, our fault. When I inherited these systems I looked through all the cron jobs to get a feel for what the system does and when. What didn’t occur to me, however, was that jobs scheduled at the wrong time of day could fall victim to Daylight Saving/Standard Time change-overs.

Any daily job scheduled in the hour the clocks skip (2:00 am to 3:00 am in the US) will fail to run once a year when clocks jump ahead (Standard -> Daylight Saving), and any daily job scheduled between 1:00 am and 2:00 am will run twice once a year when clocks fall back from 2:00 am to 1:00 am (Daylight Saving -> Standard Time). Weekly jobs scheduled in those windows on Sundays will likewise misbehave, while monthly jobs, regardless of day of the month, have a small chance of hitting one of these issues. In this case, the job runs on the 1st, which happened to be the first Sunday in November, and bang: error.

Needless to say, we modified all the cron jobs to ensure that none of them start between 1:00 am and 2:00 am.
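
Beyond rescheduling, it is worth making this kind of job idempotent, so that even if cron does fire it twice the second run is a no-op. Here is a minimal sketch in Python; the state-file path and the pick_winner stub are hypothetical, not lifted from our actual raffle job.

#!/usr/bin/env python
# Guard for a monthly cron job: refuse to run twice in the same month,
# e.g. when the clocks fall back and cron re-enters the 1:00-2:00 am hour.
import sys
import datetime

STATE_FILE = "/var/tmp/raffle.lastrun"  # hypothetical path

def already_ran_this_month():
    try:
        with open(STATE_FILE) as f:
            return f.read().strip() == datetime.date.today().strftime("%Y-%m")
    except IOError:
        return False

def record_run():
    with open(STATE_FILE, "w") as f:
        f.write(datetime.date.today().strftime("%Y-%m"))

def pick_winner():
    print("winner picked")  # the real raffle logic would go here

if __name__ == "__main__":
    if already_ran_this_month():
        sys.exit(0)  # second invocation in the same month: do nothing
    pick_winner()
    record_run()

Running the box (or at least cron) on UTC avoids the problem entirely, but a guard like this costs nothing and protects against other kinds of accidental double-runs as well.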

Best Practices

Data Scrubbing: Is there a right way?

An article yesterday from ars technica got me wondering. In my former position we often “scrubbed” databases for sample data from which to work. And certainly one can see the value in working with data with personally identifiable information removed for the purposes of business or health-care informatics, service level determinations, quality of service surveys, and so on. Yet, according to a study at Carnegie Mellon University:

87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}.

This seems to be the balance point: 3 pieces of non-anonymized data are enough to identify the majority of the population. (Think, “Three Strikes! You’re out!”) So what do we do when we need solid, anonymous data from which to work?

Taking the example of the health records from the article linked above, I would think that the following steps would be enough to scrub the data to the point where “reidentification” is no longer feasible. Since these are medical records we can safely (I feel) assume that randomizing the gender is a non-starter. (“Wow, according to this data 14% of all men went to the Emergency Room with pregnancy-related complications!”)

And since this data is taken from an area where the zip codes are known, we are already at two strikes. So why did they not randomize the dates of birth? It would be difficult to do in the case of infants beyond a few days or weeks, since many of their health issues are related to their age in months. But for anyone over the age of 8 it should be simple enough to randomize the month and day of birth, and allow a set of ranges for randomizing the year of birth. If we assume a 20% range up or down, we gain a latitude of possible years of birth that increases the older the patient actually is. Another possibility is to give everyone the same date of birth, differing only in the year (Jan 1, xxxx).
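
Here is a rough sketch of that date-of-birth fuzzing in Python. The function name is mine, not from any particular scrubbing tool, and the cutoff at age 8 and the 20% spread are just the rules of thumb described above.

import random
import datetime

def scrub_dob(dob, today=None):
    """Randomize a date of birth: pick a random month and day, and draw the
    year from a range of +/- 20% of the patient's age."""
    today = today or datetime.date.today()
    age = today.year - dob.year
    if age <= 8:
        # Infants and young children: age in months matters clinically,
        # so leave these alone (or handle them separately).
        return dob
    spread = max(1, int(age * 0.2))
    year = dob.year + random.randint(-spread, spread)
    month = random.randint(1, 12)
    day = random.randint(1, 28)  # cap at 28 so February is always valid
    return datetime.date(year, month, day)

# Example: a 47-year-old's year of birth can shift by up to +/- 9 years.
print(scrub_dob(datetime.date(1960, 7, 4), today=datetime.date(2007, 11, 1)))

The “everyone is born on Jan 1” alternative mentioned above is the same idea with only the year fuzzed.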

This of course means that any reporting done on age is meaningless, but it also means that the data can more safely be widely distributed. In cases where exact age and gender are required for study it would be better to merge data from many different areas, covering as many cities, counties, states and regions as possible. In this case we would still need to weigh the risks, as all three pieces of data would still be available, although at a much higher level of trial and error. In the case mentioned by ars technica the study covered seven zip codes. Perhaps spreading the information over a few hundred would make it much less worth the effort to sort through them all to try to identify individuals, and even then one would expect multiple possible hits.

The need for real data for statistical analysis and study is not going to go away. When you are considering scrubbing data to release a “sanitized” version, it would be good to keep the mantra “Three Strikes! You’re out!” in mind. When it comes to data for testing software operation, however, I still think the better method is complete randomization: totally bogus data that has the look and feel of “real” data. (Which is no doubt why all the bogus users in my test dbs live in different cities, in different states, and at addresses ranging from 123 to 999 on Any Street!)
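
For that kind of bogus-but-plausible test data, a throwaway generator is usually all it takes. A minimal sketch (every name, city and street below is invented purely for illustration):

import random

FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave", "Erin"]
LAST_NAMES = ["Smith", "Jones", "Garcia", "Chen", "Patel"]
CITIES = [("Springfield", "IL"), ("Portland", "OR"), ("Austin", "TX")]

def bogus_user():
    """Return a completely fictional user that still looks like real data."""
    city, state = random.choice(CITIES)
    return {
        "name": "%s %s" % (random.choice(FIRST_NAMES), random.choice(LAST_NAMES)),
        "address": "%d Any Street" % random.randint(123, 999),
        "city": city,
        "state": state,
        "dob": "19%02d-%02d-%02d" % (random.randint(40, 99),
                                     random.randint(1, 12),
                                     random.randint(1, 28)),
    }

if __name__ == "__main__":
    for _ in range(5):
        print(bogus_user())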

Best Practices

Keeping Gentoo Fresh

I had a conversation with a friend about Linux distros earlier today, and I was asked why I choose to run Gentoo on my web server. He told me that Gentoo was too hard to maintain on a server, and that when it came time to upgrade something (like Apache or PHP) due to security patches it took too long, and too often failed. I was confused by this, so I asked for clarification. What he described was the pain of updating anything on a “stale” Gentoo machine.

Unlike so many of the other popular distros, Gentoo does not, by default, use pre-compiled packages. So unlike rpm -i or apt-get install, running emerge on Gentoo pulls in the package you are installing, and any missing dependencies, as source code and compiles it all. When you think about adding packages like, say, Lynx, the process takes only a few minutes on a moderately decent machine. (Mine is a PII 966 and Lynx took about 4 minutes start to finish.)

When you talk about upgrading something like Apache, however, the length of time it takes depends not only on the speed of the machine, but also on how many of its dependencies are out of date. In fact, if you fail to update regularly you can run into an issue where not only are most of your packages out of date, but your system profile is out of date too, and you need to do some serious wrenching to get the whole thing working again. In the times that this has happened to me (twice) I was able to get the system up-to-date once, and just gave up and reinstalled a newer version the second time. (These were both rarely used VMs, not production boxes.) Updating the profile on a “fresh” Gentoo, by contrast, is (in my experience) a painless procedure of rm /etc/make.profile && ln -sf /usr/portage/profiles/profile_name /etc/make.profile && emerge -uND world (-uND: update, newuse, deep – update, take into account new USE settings from the profile and make.conf, and include deep dependencies).

So how do I avoid the “stale” Gentoo syndrome? I take a three-step approach.

  1. A daily cron job runs emerge -puvD world (-puvD: pretend, update, verbose, deep – just tells me what would be emerged in an update, verbosely, including deep dependencies) and emails me the output. This lets me see each morning which packages have updates available. (A minimal scripted version of this job is sketched after this list.)
  2. Every day that I have the time for it I log into the machine and run emerge -uD world and follow it up with etc-update (if needed) and revdep-rebuild if any libraries were included in the updates. (I save building new kernels for Sundays, and that doesn’t happen all that often, but I do like to always run the latest.)
  3. I check the messages from emerge to see if there are any special configuration changes that need to happen post-install that cannot be handled by etc-update. For instance, changing configurations in /etc/conf.d/packagename, new profiles or anything of that sort.
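
For the curious, the daily report in step 1 can be a one-line crontab entry calling a small script; here is a rough Python version of such a script. The e-mail addresses are placeholders, and it assumes emerge is in the PATH and a local MTA is listening on localhost.

#!/usr/bin/env python
# Daily "what needs updating?" report (step 1 above): run a pretend update
# and mail the output to the admin.
import smtplib
import subprocess
from email.mime.text import MIMEText

try:
    output = subprocess.check_output(["emerge", "-puvD", "world"],
                                     stderr=subprocess.STDOUT)
except subprocess.CalledProcessError as err:
    output = err.output  # mail the output even if emerge exits non-zero

msg = MIMEText(output.decode("utf-8", "replace"))
msg["Subject"] = "Pending portage updates"
msg["From"] = "cron@example.com"   # placeholder address
msg["To"] = "admin@example.com"    # placeholder address

server = smtplib.SMTP("localhost")
server.sendmail(msg["From"], [msg["To"]], msg.as_string())
server.quit()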

Ok, so I like to keep my system on the latest and keep a shiny new everything on it. How does that compare with something like, say, Debian? In Debian (and Debian-based distros) you can update packages to a certain point, after which the packages for that version of Debian are no longer supported or updated. So you need to upgrade your release, and your kernel, which you do with apt-get dist-upgrade. Seems easy enough. And how does Gentoo handle version upgrades? It doesn’t need to. If you keep your system up-to-date in the way I described, your system will match whatever the latest Gentoo release has. In fact, I built my web server using Gentoo 2006.0 and have been keeping it up-to-date since then. (Gentoo seems to have stopped doing the biannual releases, btw – they are now releasing updated minimal install CDs nearly weekly for each architecture.)

Best Practices

Portable code

When developing applications against a single RDBMS there is no need to worry about portability. However, when you are developing with the idea that the application may be deployed against any of a range of RDBMSes then you need to start thinking about how you formulate your queries, table structures, etc. While it is common practice to have separate classes for working with different databases, there are some things you can do to make that even easier.

ENUMs are EVIL

Developers who spend a lot of time developing against MySQL get into the habit of creating ENUM fields. (I will admit a certain amount of guilt here, too.) This is not only non-portable, it also doesn’t work quite the way you would expect. Even though the values in the ENUM are meant to be the only values you can insert into that field, MySQL does not enforce this the way you might think it should.

As an example:

CREATE TABLE enumevil (
    id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
    evil ENUM('foo','bar','baz') NOT NULL DEFAULT 'foo'
);
Query OK, 0 rows affected (0.01 sec)
INSERT INTO enumevil (evil) VALUES ('bip');
Query OK, 1 row affected, 1 warning (0.02 sec)
SELECT * FROM enumevil;
+----+------+
| id | evil |
+----+------+
|  1 |      | 
+----+------+
1 row in set (0.00 sec)

Rather than spitting back an error, MySQL (at least in its default, non-strict SQL mode) quietly sets the value of the ENUM field to the empty string – a value that is neither in the list of allowed values nor the default. So you are left with checking the values in code before inserting (which you should do anyway – see the next section on referential integrity).

Instead of using an ENUM field consider this:

CREATE TABLE noevil_enum_choice (
    choice VARCHAR(3) NOT NULL PRIMARY KEY
);
INSERT INTO noevil_enum_choice VALUES ('foo');
INSERT INTO noevil_enum_choice VALUES ('bar');
INSERT INTO noevil_enum_choice VALUES ('baz');
CREATE TABLE noevil (
    id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
    choice VARCHAR(3) NOT NULL REFERENCES noevil_enum_choice (choice)
);

Although MySQL does not enforce referential integrity here, this is now something that can be easily converted to PostgreSQL as:

CREATE TABLE noevil_enum_choice (
    choice VARCHAR(3) NOT NULL PRIMARY KEY
);
INSERT INTO noevil_enum_choice VALUES ('foo');
INSERT INTO noevil_enum_choice VALUES ('bar');
INSERT INTO noevil_enum_choice VALUES ('baz');
CREATE TABLE noevil (
    id SERIAL PRIMARY KEY,
    choice VARCHAR(3) NOT NULL REFERENCES noevil_enum_choice(choice)
);

And PostgreSQL does enforce referential integrity. This also makes it easier to expand your choice list by just inserting a new row.

Referential Integrity

On the flip side of the coin, developers who spend a good deal of time developing against PostgreSQL come to rely on the referential integrity built into that RDBMS. While there is nothing wrong with that, your code should enforce this as well if you want to be able to move from one RDBMS to another. In the example above we could rely on PostgreSQL’s REFERENCES constraint to kick back an error whenever a bad value (like ‘bip’) is inserted into the choice field of the noevil table. However, as soon as you move your application to MySQL or SQLite it will happily insert anything you like into that field (with MySQL truncating it to three characters).

This is why it is important for your applications to take their own referential integrity into consideration. Here’s some Python to illustrate:

from spdo import *

class noevil(object):
    def __init__(self, choice):
        self.db = SPDO('test')
        # Build the list of allowed values from the lookup table.
        self.choices = []
        for row in self.db.select("SELECT * FROM noevil_enum_choice"):
            self.choices.append(row['choice'])
        if choice not in self.choices:
            raise Exception("Invalid value for choice: %s. Valid options are: %s"
                            % (choice, ' '.join(self.choices)))
        # continue with normal processing here

This is by no means the entirety of the topic, or even more than the merest tip of the iceberg, but it is a good place to start.

Best Practices

Disclosure of Website Vulnerabilities Illegal?

A discussion earlier today brought up the question. It seems that Eric McCarty, a student in Dr. Pascal Meunier’s CS390 – Secure Computing course at Purdue University, discovered, and reported, a flaw in the Physics department website. When that site was hacked two months later (most likely through a different flaw, since the one reported by McCarty was patched), law enforcement came looking for Mr. McCarty. In this particular case McCarty came forward, and was eventually cleared. However, it did change how Dr. Meunier teaches his class. He no longer recommends disclosure, but instead recommends eliminating all evidence of the discovery from one’s computer and saying nothing.

I see this as a particularly disturbing direction in which to move.


Best Practices

Surf carefully

Although it has been said many, many times, be careful how you surf. Make sure your machine is patched, you have anti-virus and spyware blockers, blah blah blah.

Well, if a picture is worth a thousand words, then maybe this video will shed some light on the subject (sorry – it is an ad for McAfee, which I neither use nor recommend – just my personal preference).

Best Practices

A simple intro to database normalization

I found a very clear, well-written introductory example of database normalization on devshed. Although it is in the MySQL portion of the site, it applies equally well to other RDBMSs.

To get more details on normalization, the normal forms, and good database design in general, check out Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design, Second Edition by Michael J. Hernandez. Without a doubt the most useful db development book I’ve ever laid my hands on.

Best Practices

10 Windows Password Myths

I saw this article (Ten Windows Password Myths) over at Security Focus and thought it was worth sharing. And something I didn’t know about Windows (2000/XP) passwords:

If your password is 15 characters or longer, Windows stores the constant AAD3B435B51404EEAAD3B435B51404EE as your LM hash, which is equivalent to a null password.

The nice thing about that is that it then becomes impossible for your password to be cracked from the LM hash in the SAM database, since we know it isn’t actually null.

Perhaps the most interesting point the article makes (and though it has been made many times before, it bears repeating) is that despite every other security measure we put in place, without strong passwords our defenses are easily overcome.