Categories
Database

The Metadata Problem With Databases

It is the end of 2004 and virtually everyone uses databases to back end gobs data. Lately there has even been talk of “the database” being legacy technology, a small piece of the more exciting whole of an application. Mix a little bit of this thinking, my recent attempts to store column regular expressions in PostgreSQL constraints and a lot of day time work being spent on in-house apps interfacing with databases and you end up with this question. Where should I put my metadata?

Let me make sure that I’m clear on exactly what metadata I’m interested in (for now). Fields in a table generally have five attributes that we are commonly interested in: type, length, precision, nullness (I don’t think that is a word, but you know what I mean) and comment. That is all fine and good, but I find myself wanting more, that is what got me started on using regular expressions in constraints in the first place. I figured that might be a good place to store another field attribute and get the added bonus of having the database enforce it also. To start putting pen to paper here is an example table that we will work with:

CREATE TABLE users (
  user_id SERIAL NOT NULL,
  username VARCHAR(16) NOT NULL,
  lastname VARCHAR(50) NOT NULL,
  firstname VARCHAR(50) NULL,
  email_addr VARCHAR(250) NULL
);

Now imagine that you are writing a front end to this particular table (it could be web based, but that doesn’t matter) what sort additional attributes (more metadata!) would be handy to have? Regular expression limitations were one the first things on my list. I mean it is great that a username can only be 16 characters long, but what if only want to allow lowercase a through z and digits? A simple regex takes care of this: ‘^[a-z0-9]+$’. We could conceivably want to impose additional limitations on lastname, firstname and email_addr as well. Moving on we discover that lastname is a great database column name, but a crummy field label to use in an application, so every where that the lastname field is used we hard code a nicer label, like ‘Last Name’. This leaves me with a really unpleasant feeling, I’d much rather be able to ask for correct label to use everywhere that I’m displaying data for lastname. This problem could potentially be solved using the column comment feature in databases. So far so good, we could potentially store these two types of metadata in the database itself.

We have pretty much used up on places to put additional metadata, so what happens when we need to store more? One example would be something like an extended help or hint (bubble help?) for a field. In my applications are work I would like to be able to use to this to explain what a field is in greater detail (like why an insert or an update will fail if you try to add a username that doesn’t match ‘^[a-z0-9]+$’). I could use the comment field for this, but then I’m already using that for something else remember? For now that is the extent of the additional metadata I’m interested, but it would not be unreasonable to have additional metadata that I would be interested in later on.

Let use assume for a moment that all of the above issues can be satisfied, there is another concern I have with solutions I’ve outlined so far: metadata duplication. Going back to our example table, say we want to limit lastname to match ‘^[a-zA-Z]+$’. Looking around we discover that also have a customers table with a lastname field that we would also want to constrain in the same way, so add the same constraint to the customers.lastname as we did to users.lastname. Later on we become more enlightened and realize that it is reasonable to have spaces in lastname so we change the regular expression to ‘^[a-zA-Z ]+$’, in two different places. I almost hate to put it this way, but now our metadata is becoming heavily non-normalized. With PostgreSQL this particular problem could be addressed by the use of domains because it supports check constraints, but that still would leave us with a shared label and long description (PostgreSQL domains don’t have a comment attribute).

After going down this road awhile I came to the conclusion that I wasn’t going to be able to store all of the metadata that I wanted to using built-in database features. My next thought was to put this in the business logic layer of an application. I even half convinced myself this might be a good idea, but then that left me with a database that wouldn’t be able to enforce the regular expressions I was interested in. So I went back to idea of finding my original goal of finding a way to store the metadata in the database (with regex constraint support). It turns out that the most obvious solution works with a little bit of tweaking and one gotcha.

What might be this obvious solution be? Why a metadata table of course! Something like this perhaps:

CREATE TABLE metadata (
  metadata_id SERIAL NOT NULL,
  column_name VARCHAR(100) NOT NULL,
  display_name VARCHAR(250) NULL,
  long_note TEXT NULL,
  regex VARCHAR(1024) NULL
);

With this table you could share common metadata and add to as needed in the future. Then I looked at how to make this work with check constraints, so I tried something like this:

ALTER TABLE users ADD CONSTRAINT
username ~ (
  SELECT regex
  FROM metadata
  WHERE column_name = 'username'
);

That idea got shot down real fast. Unfortunately you can not do sub-queries in constraints on PostgreSQL. For a moment I thought that I was going to be faced with defeat before I even got out the door. After looking over more examples of constraints it looked like PostgreSQL would allow the use of a user defined function, so I tried this:

CREATE FUNCTION metadata_regex(varchar)
RETURNS varchar AS '
  SELECT regex
  FROM metadata
  WHERE column_name = $1;
' LANGUAGE SQL;
ALTER TABLE users ADD CONSTRAINT
username_ck CHECK (
  username ~ metadata_regex('username')
);

For what ever reason this works where the plain sub-query does not. There are a few nice features to this approach, you can set a constraint on every column without creating the metadata for it. If the metadata_regex() function returns null then it is an insert/update will always succeed (as if there was no constraint there at all except for the small overhead of doing the lookup). This was you can fill in metadata later if you want to. My boss was quick to ask what happens when you change an existing regex that would invalidate existing data? The short answer is that the system will let you change the regex and not give an error even with data that would now be invalid. It will enforce the new regex when you insert new data or update old data. Depending on your needs this may be a feature or a big problem. If you need to catch these regex changes and throw errors if data would be invalid with new regex there are two possible solutions. One solution would be to write a trigger what would verify that existing data would not be invalidated by the new regex and throw an error if it did.

Another solution to the regex change problem is to create functions that serve as an API to the metadata; so all updates, inserts and deletes would be done via these functions instead of directly manipulating the metadata table. These functions would act in a similar way as the trigger described above, preventing changes that would invalidate existing data. I’m not wholly convinced that one is better than the other, but I am starting to lean towards the API idea because it allows you hide the details of how the metadata is stored. This technique also lead me to another feature of PostgreSQL, creating new types.

Combining the API idea with creating new types allows you run regular SQL queries using the functions like tables. For some examples of how this can work take a look at ‘Defining and returning Rowtypes‘ in the 19 May 2003 General Bits. There is another example at General Bits under ‘SQL 2003 Standards Approved‘ from 17 April 2004. Between these two examples you should start to see the possibilities here, by creating an API to all your additional metadata not only can you limit changes you can allow the database to enforce them, either via triggers or constraints and expose them to the application layer through traditional SQL queries.

My goal is to be able to add new metadata (attributes) to database columns, allowing not just the upper application layers to make use of it but the database also. After some research and pondering it looks like PostgreSQL has the features necessary to make this possible in an easy and extensible way.