Categories
Database

The Metadata Problem With Databases

It is the end of 2004 and virtually everyone uses databases to back end gobs data. Lately there has even been talk of “the database” being legacy technology, a small piece of the more exciting whole of an application. Mix a little bit of this thinking, my recent attempts to store column regular expressions in PostgreSQL constraints and a lot of day time work being spent on in-house apps interfacing with databases and you end up with this question. Where should I put my metadata?

Let me make sure that I’m clear on exactly what metadata I’m interested in (for now). Fields in a table generally have five attributes that we are commonly interested in: type, length, precision, nullness (I don’t think that is a word, but you know what I mean) and comment. That is all fine and good, but I find myself wanting more, that is what got me started on using regular expressions in constraints in the first place. I figured that might be a good place to store another field attribute and get the added bonus of having the database enforce it also. To start putting pen to paper here is an example table that we will work with:

CREATE TABLE users (
  user_id SERIAL NOT NULL,
  username VARCHAR(16) NOT NULL,
  lastname VARCHAR(50) NOT NULL,
  firstname VARCHAR(50) NULL,
  email_addr VARCHAR(250) NULL
);

Now imagine that you are writing a front end to this particular table (it could be web based, but that doesn’t matter) what sort additional attributes (more metadata!) would be handy to have? Regular expression limitations were one the first things on my list. I mean it is great that a username can only be 16 characters long, but what if only want to allow lowercase a through z and digits? A simple regex takes care of this: ‘^[a-z0-9]+$’. We could conceivably want to impose additional limitations on lastname, firstname and email_addr as well. Moving on we discover that lastname is a great database column name, but a crummy field label to use in an application, so every where that the lastname field is used we hard code a nicer label, like ‘Last Name’. This leaves me with a really unpleasant feeling, I’d much rather be able to ask for correct label to use everywhere that I’m displaying data for lastname. This problem could potentially be solved using the column comment feature in databases. So far so good, we could potentially store these two types of metadata in the database itself.

We have pretty much used up on places to put additional metadata, so what happens when we need to store more? One example would be something like an extended help or hint (bubble help?) for a field. In my applications are work I would like to be able to use to this to explain what a field is in greater detail (like why an insert or an update will fail if you try to add a username that doesn’t match ‘^[a-z0-9]+$’). I could use the comment field for this, but then I’m already using that for something else remember? For now that is the extent of the additional metadata I’m interested, but it would not be unreasonable to have additional metadata that I would be interested in later on.

Let use assume for a moment that all of the above issues can be satisfied, there is another concern I have with solutions I’ve outlined so far: metadata duplication. Going back to our example table, say we want to limit lastname to match ‘^[a-zA-Z]+$’. Looking around we discover that also have a customers table with a lastname field that we would also want to constrain in the same way, so add the same constraint to the customers.lastname as we did to users.lastname. Later on we become more enlightened and realize that it is reasonable to have spaces in lastname so we change the regular expression to ‘^[a-zA-Z ]+$’, in two different places. I almost hate to put it this way, but now our metadata is becoming heavily non-normalized. With PostgreSQL this particular problem could be addressed by the use of domains because it supports check constraints, but that still would leave us with a shared label and long description (PostgreSQL domains don’t have a comment attribute).

After going down this road awhile I came to the conclusion that I wasn’t going to be able to store all of the metadata that I wanted to using built-in database features. My next thought was to put this in the business logic layer of an application. I even half convinced myself this might be a good idea, but then that left me with a database that wouldn’t be able to enforce the regular expressions I was interested in. So I went back to idea of finding my original goal of finding a way to store the metadata in the database (with regex constraint support). It turns out that the most obvious solution works with a little bit of tweaking and one gotcha.

What might be this obvious solution be? Why a metadata table of course! Something like this perhaps:

CREATE TABLE metadata (
  metadata_id SERIAL NOT NULL,
  column_name VARCHAR(100) NOT NULL,
  display_name VARCHAR(250) NULL,
  long_note TEXT NULL,
  regex VARCHAR(1024) NULL
);

With this table you could share common metadata and add to as needed in the future. Then I looked at how to make this work with check constraints, so I tried something like this:

ALTER TABLE users ADD CONSTRAINT
username ~ (
  SELECT regex
  FROM metadata
  WHERE column_name = 'username'
);

That idea got shot down real fast. Unfortunately you can not do sub-queries in constraints on PostgreSQL. For a moment I thought that I was going to be faced with defeat before I even got out the door. After looking over more examples of constraints it looked like PostgreSQL would allow the use of a user defined function, so I tried this:

CREATE FUNCTION metadata_regex(varchar)
RETURNS varchar AS '
  SELECT regex
  FROM metadata
  WHERE column_name = $1;
' LANGUAGE SQL;
ALTER TABLE users ADD CONSTRAINT
username_ck CHECK (
  username ~ metadata_regex('username')
);

For what ever reason this works where the plain sub-query does not. There are a few nice features to this approach, you can set a constraint on every column without creating the metadata for it. If the metadata_regex() function returns null then it is an insert/update will always succeed (as if there was no constraint there at all except for the small overhead of doing the lookup). This was you can fill in metadata later if you want to. My boss was quick to ask what happens when you change an existing regex that would invalidate existing data? The short answer is that the system will let you change the regex and not give an error even with data that would now be invalid. It will enforce the new regex when you insert new data or update old data. Depending on your needs this may be a feature or a big problem. If you need to catch these regex changes and throw errors if data would be invalid with new regex there are two possible solutions. One solution would be to write a trigger what would verify that existing data would not be invalidated by the new regex and throw an error if it did.

Another solution to the regex change problem is to create functions that serve as an API to the metadata; so all updates, inserts and deletes would be done via these functions instead of directly manipulating the metadata table. These functions would act in a similar way as the trigger described above, preventing changes that would invalidate existing data. I’m not wholly convinced that one is better than the other, but I am starting to lean towards the API idea because it allows you hide the details of how the metadata is stored. This technique also lead me to another feature of PostgreSQL, creating new types.

Combining the API idea with creating new types allows you run regular SQL queries using the functions like tables. For some examples of how this can work take a look at ‘Defining and returning Rowtypes‘ in the 19 May 2003 General Bits. There is another example at General Bits under ‘SQL 2003 Standards Approved‘ from 17 April 2004. Between these two examples you should start to see the possibilities here, by creating an API to all your additional metadata not only can you limit changes you can allow the database to enforce them, either via triggers or constraints and expose them to the application layer through traditional SQL queries.

My goal is to be able to add new metadata (attributes) to database columns, allowing not just the upper application layers to make use of it but the database also. After some research and pondering it looks like PostgreSQL has the features necessary to make this possible in an easy and extensible way.

6 replies on “The Metadata Problem With Databases”

If you are looking for a prior example of this type of system, check out Progress. I has this metadata in the field definitions to support the 4GL. It has column names, output formats, help messages, validation expressions, and validation error messages. You’re right that changing the format of a common column can become a problem as every table with that column needs to be fixed.

It would be nice if somehow a common set of field metadata could be added directly to the field definitions in other databases.

Joseph,

I’d watch out, even with the new solution. Table lookup functions in constraints may load, but that doesn’t mean that they won’t create problems down the line. You’ve already discovered one limitation, changes to existing data. A second limitation will become apparent when you try to back-up and restore the database; the constraint as you’ve written it may not register dependencies correctly and thus may not restore.

Overall, what I’m saying is that, in the current PostgreSQL design spec, constraints are not supposed to point to external table references, and you can cause yourself headaches through doing this. Of course, you can always jump on PGSQL-HACKERS with a patch to better support constraints with external references, but you’ll have to argue it out. A lot of people say this is the job of database Assertions, a hypothetical feature.

In any case, in your search for metadata you missed a very useful feature … DOMAINs. These are a SQL92 feature that exists somewhere between a full datatype and a table column. As in the example above, you’d do:

CREATE DOMAIN USERNAME AS TEXT
CHECK VALUE ~* ‘^[a-zA-Z ]{3,16}$’;

Then you can create the table as:

CREATE TABLE users (
user_id SERIAL NOT NULL,
username USERNAME NOT NULL,
lastname VARCHAR(50) NOT NULL,
firstname VARCHAR(50) NULL,
email_addr VARCHAR(250) NULL
);

For fields requiring complex constraints, I find this approach a lot more self-documenting than the use of table constraints. Also, it means that the USERNAME domain can be re-used in other tables.

You’ll note that I’ve put the 3-to-16-character limit in the regex instead of using VARCHAR(16). I did this for several reasons:

1) The database needs to check the expression against the regex, and we might as well do ONE check, and not TWO.

2) It allowed me to do a *minimum* as well as maximum length.

3) I generally feel that VARCHAR is a legacy of limited-storage systems of the late 80’s and try to avoid it in modern applications, SQL-standard or not.

Also, I think you can COMMENT ON the DOMAIN, to give yourself some extra metadata.

–Josh Berkus
PostgreSQL Project
josh, who is at postgresql.org

Josh:

Excellent point about backup and restores, I didn’t really address that but it is certainly something to keep in mind. In my example, if you kept your metadata in another table that table would have to be the very first table loaded. Otherwise tables with real data in them would fail as soon as they tried to lookup constraint details. Perhaps one possible solution would be have PostgreSQL support a special metadata table that was marked as such so that the system could be given enough info backup and restore properly (ala foreign keys and the like).

If you’ll read through the article, I mention the use of DOMAINs as a possible solution to the problem. DOMAINs only solve part of the problem by allowing for a DOMAIN to be used in across multiple tables/columns but doesn’t provide the flexibility that I was looking for. Looks like you can’t add a comment to a DOMAIN. What happens when you want to add some additional metadata to a column? The way that DOMAINs work only add a small amount of new flexibility without going all the way.

Joseph,

Yes, I realize that the use of DOMAINS doesn’t give you the full flexibility which you are looking for. FWIW, the lack of COMMENT ON for DOMAINS is fixed for 8.0.

Implementing a metadata-extended-attributes table like you suggest would certainly be an interesting idea. However, I contend that if you want it to operate issue-free and perform well, some hacking of the PostgreSQL source code will have to be involved. When/if you do it, start a project on pgFoundry (www.pgfoundry.org); I’m sure others will be interested.

–Josh Berkus
PostgreSQL Project

you comment about the problems with a backup/restore, but wouldn’t a restore in the wrong order just result in (for this example) the metadata_regex just returning a null and therefor allowing the insert?

David,

In a word, no. Even though you are right about returning a null instead of a regex if the metadata table hadn’t been populated yet allowing for inserts to succeed, there is another dependancy layer, the lookup functions. The functions used to lookup the regex would need to be created first before they could be used in a constraint definition. Creating these functions will fail if the metadata table doesn’t exist.

So we would still be stuck unless PostgreSQL was given enough knowledge to properly order the metadata table and corresponding functions when doing a dump.

Leave a Reply

Your email address will not be published. Required fields are marked *