I started my career on the database end of the technological landscape, as a consulting database reporting specialist, then later as a consulting DBA for Sybase. It was there that I discovered that the real fun was in system administration, but I still have a deep love for data organization, modeling, and management, even though I don’t get to do quite as much of it anymore.
I’m sure that some of my skills are rusty (my last DBA consulting gig was almost 10 years ago), but not so rusty that I don’t know true database badness when I see it, and I’ve seen tons of it over the years in open source applications put together by developers who a) are not database people, and b) don’t realize they are not a database person, or worse, b) think they are in fact a database person, and so refuse to consult an *actual* database person.
If you’re not sure if your database exhibits signs of badness, I’ve put together some pointers that I hope will help you down the road to database goodness. Data modeling and data handling in general is an area that is heavily debated, so I expect the flames. All I ask is that you reflect for a moment on the comment I’m making in the context of the current general state of data modeling in the open source universe before you decide to burn me at the stake.
‘name’ and ‘value’ columns
I’ve seen this in more software than I’d like to admit, and not all of it open source. Â It tends to crop up in places where there is an arbitrarily large number of attributes that each have a ‘name’ and a ‘value’. This seems like a great way to go at first, because instead of figuring out which attributes are assigned to a given object, you can just select ‘name’ and ‘value’ by some id that is presumably stored in the same table. Unfortunately, this is horribly bad form for a number of reasons:
- it doesn’t follow any normalization rules. Each column becomes functionally dependent on all of the others.
- you shatter the thought that tables represent entities and columns represent attributes and keys represent instances of those entities.
- You lose the ability to do even simple joins because what should be a column is now (maybe) a row.
That last item is the one that is likely to bite you first (and hardest). Suppose you have a user table like this:
id | username |
1 | jonesy |
2 | kermit |
Instead of storing the user preferences in the user table, you store it in a table like this:
id | uid | name | value |
1 | 1 | jonesy@somedomain.com | |
2 | 1 | bldg | COS |
3 | 1 | room | 101 |
4 | 2 | kermit@somedomain.com | |
5 | 2 | bldg | COS |
Great! Now, everyone in your user table has a desktop computer that’s been assigned to them. Like a good little DBA, you map the user id to the asset tag id for the machine, like so:
id | uid | tag |
1 | 1 | 00999 |
2 | 2 | 00839 |
Ok! Now, write me a query that will give me the user name and email address for each tag.
Ewwwwwww.
If you’re a coder, you can probably do a couple of selects into various arrays and loop over them or do some kind of merge with a callback function or something to figure out which names and emails go with each tag, but how efficient is that? Supposing your open source project is installed at some huge company that has lots of people who have lots of desktops. Tens of thousands, even. This would not do at all!
Also, this is no way to insure data integrity, because your ‘value’ column is forced to be some kind of VARCHAR or TEXT field, because you have no idea what type of data is going in there. It will differ depending on the ‘name’ column. Of course, the front end application code has to account for this. While they’re at it, the front end coders will also have to check the spelling of every single potential value that can possibly go into the ‘name’ column to make sure there aren’t separate ’email’ and ‘EMAIL’ values. And what if you decide that a user should only ever be assigned to a single building? Front end coders will have to validate for that, too, instead of using a built in UNIQUE constraint. This is just painful all over.
The right way to put a table together is to figure out what entity the table is about. A ‘user’ table is a table about users. Therefore, the primary key should be the primary identifier for ‘user’. The rest of the columns in the table are attributes of ‘user’. In the event that a user can have more than one value for a given attribute (for example, I might have more than one ’email’ value), make a separate table that maps user id’s to email addresses.
NOTE:
In a lot of cases, the unique identifier is a serial ‘id’ column. There are pitfalls with that as well, but just know that an autogenerated id field handed to you by the database should not be the only way of uniquely identifying a row in your table. You should think of the id field more as a handle to a row that you’ve uniquely identified using some other means whenever possible.
One column, multiple values
If you’re doing this, it’s going to cause you and your project big headaches later. Putting multiple values in a single column takes away the database’s ability to validate input, enforce various types of constraints, including UNIQUE and CHECK constraints, and eliminates the possibility of performing joins on that column. All of this means more front end code. In addition, it causes you to write more code on the front end to parse the values coming back from the database!
Multiple columns, one attribute
This is almost the opposite of the above: a single table with columns called, for example “email1” and “email2”. This is problematic because, of course, you may decide at some point that having three email addresses is just fine. In that case, you’ll have to add another column to the table, get the data in there for the people you have data for, and fill in NULLs or empty strings for the rest. It’s ugly.
Like many other data design problems, this is another one that limits your ability to perform JOIN operations to exploit relationships between various data entities. Think about our users again, and how we wanted to find user names and email addresses for each asset tag. Which email should you use? In this scenario, you’ll either do a JOIN involving only one of the email address fields, or you’ll do a SELECT for each ’emailN’ column you have, and it only goes downhill from there, as we’re back to scalability issues and general ugliness and inelegance.
The right thing to do here is create an email lookup table that has columns for userid and email address. Now you have no more NULL data to worry about, because there are only rows in this table for users who have email addresses. Also, a side effect of this design move is that users can have an arbitrary number of email addresses, because all it means is adding another row to the table.
NOTE:
To further normalize, you could actually create an ’email_address’ table that just maps email addresses to id’s, and then instead of having uid and email columns, you’d have uid and mailid columns. I’ve gone with the assumption here that, while a user can have any number of email addresses, each email address is generally mapped to a single user, unless your application is a mailing list manager 😉
Nothing is NULL
Using “NOT NULL” throughout the entire database causes as many problems as it solves, IMHO. NULL is, at least, something the code can check for consistently. Without a NULL, the coder is forced to know what the default value is for each column that you have made ‘NOT NULL’. It is not typically the case that every NOT NULL field in the db has a default value of ‘ ‘, because ‘ ‘ is not valid input for columns of non-string datatypes. Fields using date, time, numeric and float datatypes all have different non-” valid default values. Further – these valid default values may differ from db to db. I’m not saying that NULL’s are a coder’s (or, indeed, a dba’s) best friend, but it allows a level of consistency that’s hard to match. If a value isn’t there, “NULL” is a universally acceptable alternative regardless of the database or datatype in question.
One entity, multiple tables This one can be a little less obvious. Let’s look at a generic “user” to illustrate. A great many applications implement security hierarchies based on the type of user. So, for example, there are admin users, guest users, and a couple of user types in between. My contention is that all of these admins and guests are still users, and therefore should all be put in a table called “users”, which is probably a much smaller table than you’re imagining, since it holds only that user data which all the various types of users have in common, which is often just “person” data like name and email address. Data that is specific to a role within the application would be kept in a lookup table, to allow for the possibility of a single user having multiple roles within the application. This problem has bitten seemingly every open source CMS in existence at one time or another.
Onward and Upward
I hope this has helped at least one person get their data in order. I’m also sure I’ve left out dozens of scenarios and common database badness that I’ve just forgotten about. Don’t even get me started on data validation at the app level – I purposely avoided app-level data handling mistakes and stuck to modeling problems because that alone is a big enough world to get lost in.
Enjoy!