NotZed's Coding Diary Zucchi's GNOME blog

Data versioning

Here's a realtively simple way to implement data versioning in a database, in a way that should be scalable as well. It only needs a couple of support tables and a single function and can apply versioning across multiple data sets concurrently. It also supports branching and tagging using a natural model.

I got started thinking about this to implement undo in a structured Orbat editor, particularly in a multi-user context, but I refined it while researching ideas for a content management system. So the main example below is for a content management system which uses generic data containers rather than specific rows - but more complicated rows can also be accomodated storing any type of data.

The problem

Tables

All examples are for PostgreSQL

revision table

There is a single global revision table - it stores all revision information including branches and tags. It can also be used to store per-revision meta-data such as author and creation time.

create table rev (
  id serial not null primary key,
  branchid int not null references rev(id),
  name text,
  author text,
  ctime text,
  constraint rev_name_uc unique (name)
);

create index rev_branchid_idx on rev(branchid);

The root of the tree has an entry with id = branchid = 0. Revisions then strictly increase for every change to any data set. Because of this, in a real-world example you would probably use a bigserial for the primary key and bigint for the branchid.

Branches are specified by simply setting the branchid to a non-zero value, and multiple branches created in the same way. In this code branches and tags are the same thing - they both occur on the branch above the branch data, but it they could be specified in different ways using the same data structures.

Example:

revision  branch  tag       description/path to version
      0       0    trunk     0, root of the tree
      1       0              0.1
      2       0    abranch   0.2, root of 'abranch'
      3       2              0.2.1
      4       0              0.3
      5       2    aabranch  0.2.2, root of 'aabranch', off 'abranch'
      6       5              0.2.2.1

entry table

Each type of object to be versioned will need it's own entry table. This provides the master key for each object, and can be used to provide other immutable meta-data, such as the type of object (if that is appropriate).

The entry table isn't necessary if immutable meta-data isn't required, it could be replaced by a simple sequence, but it helps bind everything together with foreign keys if nothing else.

create table entry (
 id serial not null primary key,
 revid int not null references rev(id),
 classid int
);

In this example, we have the initial revision revid of the item, and a classid - some type identifier for an otherwise generic data item, for example whether it is a blog post or a news item in a cms. Other meta-data could include the owner - the creator is specified in the revision table.

data table

The data is stored in a data table which stores an entryid and a revid for each content record.

create table data (
  id serial primary key,
  entryid int not null references entry(id),
  revid int not null references rev(id),
  content text,
  constraint data_uc unique (entryid, revid)
);

create index data_entryid_idx on data(entryid);
create index data_revid_idx on data(revid);

Each revision has complete stand-alone content - no backward diffs or antything tricky. This just simplifies things a lot and speeds up most operations at the expense of disk space.

The explicit indices may not be required due to fact the unique constraint will create an index over the same terms.

function: branch_table

Only one function is required, and it is used to help when looking up branch information. Because the revision tree is stored as a flat table, this function is required to translate this into a more database-friendly format for queries.

This is really the secret sauce which makes any of this work.

create type branch_range as ( id int, branchid int, lo int, hi int );

CREATE OR REPLACE FUNCTION branch_table(ibranchid int)
  RETURNS setof branch_range AS
$BODY$
-- returns ranges of values for a branch
declare
  res branch_range;  
begin
  res.branchid := ibranchid;
  res.id := ibranchid;
  res.hi := 2147483647; -- 'max int'
  res.lo := ibranchid;
  return next res;
  while (res.branchid <> 0) loop
    res.branchid := branchid from rev where id=res.branchid;
    res.hi = res.lo;
    res.lo = res.branchid;
    return next res;
  end loop;
end;
$BODY$
  LANGUAGE 'plpgsql' STABLE;

Given a branchid this will spit out a table of ranges of revisions that can be used to find members of the branch. It relies on the fact that revisions are monotonically increasing across all branches.

Once you have the table above, the following SQL will list all revisions which belong to the branch. This can easily use indices and so will scale very well.

select * from rev,branch_table(<branchid>) as b
 where rev.branchid = b.branchid
   and rev.id >= b.lo
   and rev.id <= b.hi
 order by rev.id;

This function and the query above are impacted by the branch/tag mechanism used.

Operations

Ok that's the data structures out of the way, how is it used?

Creating a new item

Within the same transaction, the following operations need to be carried out:

  1. Allocate and insert a new revision entry into the rev table.
  2. Allocate and insert a new entry into the entry table with the new revision.
  3. Insert a new record into the data table with the new entryid and revision.

Steps 2 and 3 could be repeated to add as many items as required so they are all associated atomically with the same revision. The additions could also span mutliple tables and data types.

Creating a new revision

Within the same transaction, the following operations need to be carried out:

  1. Allocate an insert a new revision entry into the rev table.
  2. Insert a new record into the data table with the same entryid as the old one, but with a new revision and data.

Again, step 2 can be repeated for however many data items are changing in this revision.

Note that each revision is complete and stand-alone. Not having to generate or apply backward diffs makes this process much simpler at the cost of some data space.

Adding to a branch is as simple as setting the branchid in the new revision.

Creating a branch or tag

Tags and branches are the same thing. A new revision is created and the name is set to some non-null unique value. Or the name field can be set on an existing revision.

Branches only become branches once a new revision is added to the tag revision. Because of this it might be an idea to always require an empty new revision for every branch, that way there is no ambuguity about whether to include the data at the specific revision of the branch on the trunk and/or the branch.

An alternative and equally valid way to implement it would be to make branches new revisions which are already split from the trunk above. In this case you would create a new revision with the name set to the branch name, and the branchid set to the revision id of the parent trunk. This makes the branching more explicit - but I think the distinction is unecessary.

Note how this is a simple insert on a tiny table. Branching is for all intents, free.

Finding the latest content of an item on a given branch

Assuming some mechanism to find the entryid and branchid, the following query can find the latest content for that entry on that branch. Note that this assumes that data.id is monotonically increasing with the revision number, if that is not the case, then the version needs to be looked up separately first and the query gets a little more complicated (and I won't cover it here).

  select data.* from data
   where id = (select max(data.id)
                 from branch_table(<branchid>) as b
                 join rev
                   on rev.branchid = b.branchid
                  and rev.id >= b.lo
                  and rev.id <= b.hi
                 join data
                   on data.revid = rev.id
                  and data.entryid = <entryid>);

Finding all latest content on a given branch

Assuming some mechanism to find the branchid, the following query can find the latest content for all items currently live on the given branch. It is a small change (see change bars).

  select entry.*, data.* from data
    join entry on entry.id = data.entryid
|  where id in (select max(data.id)
                  from branch_table(<branchid>) as b
                  join rev
                    on rev.branchid = b.branchid
                   and rev.id >= b.lo
                   and rev.id <= b.hi
                  join data
                    on data.revid = rev.id
|             group by data.entryid);

Finding all content for a specific (tagged) version

If a tag is just an empty branch, then the previous will work fine. Otherwise you need slight modification to the query to limit the results (see the 'change bar' below). The same query is used to find the root of a branch which is implicitly tagged automatically.

  select entry.*, data.* from data
    join entry on entry.id = data.entryid
   where id in (select max(data.id)
                  from branch_table(<tagid>) as b
                  join rev
                    on rev.branchid = b.branchid
                   and rev.id >= b.lo
                   and rev.id <= b.hi
|                  and rev.id <= <tagid>
                  join data
                    on data.revid = rev.id
                   and data.entryid = <entryid>
              group by data.entryid);

Enhancements

Mutable Meta-data

Since the revision table is global and can thus be used to version different tables, each type of mutable data could be stored in it's own table. This would reduce any redundant copies of the data being stored when they don't change, at the expense of the overhead of having to perform branch matching separately on each table lookup.

It might be useful to seprate meta-data and content at least, if the meta-data is likely to change often and/or independently of the content.

User Supplied Keys

If there is some sort of user-supplied key, such as a 'node name' or 'filename', which can be changed over time, then some extra work is required to make sure it remains unique on each branch.

You cannot just use a unique constraint on the key, as you can and will have the same name on different branches. And you can't just have it on the key:branch ether, since names could come and go in an interleaved fashion.

Multiple tags

The mechanism presented doesn't allow multiple tags for the same revision. Since revisions are global and cheap, a new tag can be added to the head revision (of a given branch) by simply adding a new revision entry. For older revisions you could simply create a new empty revision and set the branchid to the revision of the conflicting tag.

Deleting

With no other mutable-meta-data mechanism, you could store a new NULL data record. If you had meta-data it could go there somewhere.

Merging

Merging of branches is a pretty important feature, but most of the mechanics are beyond the mere versioning issues. What isn't is tracking the merges that have taken place rather than forcing the user to use manual tagging or write numbers down. Perhaps an additional column or two could be added to track the merge operations in a graph. Given that revisions are global, a simple graph would cover whole branch merges or single file merges using the same logic.

The basic merge operation works something like this:

  1. For each item which has changed in the source branch/timespan
  2. For each item which is new in the source branch/timespan

Note that new items are copied over - you endup with a redundant copy of the data. This could be addressed by allowing multiple versions for each data item, but you would need another table to store that information.

A web-based CMS system provides some challenges here - you don't want to force the user to edit every item separately in one go, and the operation needs to run atomically. One way around this is probably to write conflicted entries to a new branch off the target branch, and let them fix it up, and repeat until done.

Concurrent Edits

This occurs when different sessions edit the same item and try to store them again on the same branch. This is mostly just the merging problem on single files.

One way to implement it in a scalable way would be to use a temporary branch ...

Temporary branches

For a web-based system you may want to keep storing mutliple revisions as the user saves draft copies, allowing powerful undo and review features. Each edit could create a per-user branch which keeps being added to as the session progresses. Once the draft is complete and ready for saving, you could just perform a merge of the temporary branch the same as for any other branch, and then delete the whole lot.

Actually it could probably be quite easily extended to allow edits of multiple items which are then stored concurrently.

Lightweight copies

Copies across branches are just that - copies. A lightweight copy could be implemented fairly simply by just allowing multiple revisions on data items. It would require an extra level of indirection through an association table, so adds more complexity to the queries (but not much).

Remove the revid from the data table, and create something like the following:

create table data_rev (
  dataid int not null references data(id),
  revid int not null references rev(id),
  constraint data_rev_uc unique (dataid, revid)
);

Copyright 2008 Michael Zucchi, All Rights Reserved.