Here's a realtively simple way to implement data versioning in a database, in a way that should be scalable as well. It only needs a couple of support tables and a single function and can apply versioning across multiple data sets concurrently. It also supports branching and tagging using a natural model.
I got started thinking about this to implement undo in a structured Orbat editor, particularly in a multi-user context, but I refined it while researching ideas for a content management system. So the main example below is for a content management system which uses generic data containers rather than specific rows - but more complicated rows can also be accomodated storing any type of data.
All examples are for PostgreSQL
There is a single global revision table - it stores all revision information including branches and tags. It can also be used to store per-revision meta-data such as author and creation time.
create table rev ( id serial not null primary key, branchid int not null references rev(id), name text, author text, ctime text, constraint rev_name_uc unique (name) ); create index rev_branchid_idx on rev(branchid);
The root of the tree has an entry with id = branchid = 0. Revisions
then strictly increase for every change to any data set. Because of
this, in a real-world example you would probably use
a bigserial for the primary key and bigint
for the branchid.
Branches are specified by simply setting the branchid to a non-zero value, and multiple branches created in the same way. In this code branches and tags are the same thing - they both occur on the branch above the branch data, but it they could be specified in different ways using the same data structures.
Example:
revision branch tag description/path to version
0 0 trunk 0, root of the tree
1 0 0.1
2 0 abranch 0.2, root of 'abranch'
3 2 0.2.1
4 0 0.3
5 2 aabranch 0.2.2, root of 'aabranch', off 'abranch'
6 5 0.2.2.1
The entry table isn't necessary if immutable meta-data isn't required, it could be replaced by a simple sequence, but it helps bind everything together with foreign keys if nothing else.
create table entry ( id serial not null primary key, revid int not null references rev(id), classid int );
In this example, we have the initial revision revid of
the item, and a classid - some type identifier for an
otherwise generic data item, for example whether it is a blog post or
a news item in a cms. Other meta-data could include the owner - the
creator is specified in the revision table.
The data is stored in a data table which stores an entryid and a revid for each content record.
create table data ( id serial primary key, entryid int not null references entry(id), revid int not null references rev(id), content text, constraint data_uc unique (entryid, revid) ); create index data_entryid_idx on data(entryid); create index data_revid_idx on data(revid);
Each revision has complete stand-alone content - no backward diffs or antything tricky. This just simplifies things a lot and speeds up most operations at the expense of disk space.
The explicit indices may not be required due to fact the unique constraint will create an index over the same terms.
Only one function is required, and it is used to help when looking up branch information. Because the revision tree is stored as a flat table, this function is required to translate this into a more database-friendly format for queries.
This is really the secret sauce which makes any of this work.
create type branch_range as ( id int, branchid int, lo int, hi int );
CREATE OR REPLACE FUNCTION branch_table(ibranchid int)
RETURNS setof branch_range AS
$BODY$
-- returns ranges of values for a branch
declare
res branch_range;
begin
res.branchid := ibranchid;
res.id := ibranchid;
res.hi := 2147483647; -- 'max int'
res.lo := ibranchid;
return next res;
while (res.branchid <> 0) loop
res.branchid := branchid from rev where id=res.branchid;
res.hi = res.lo;
res.lo = res.branchid;
return next res;
end loop;
end;
$BODY$
LANGUAGE 'plpgsql' STABLE;
Given a branchid this will spit out a table of ranges of
revisions that can be used to find members of the branch. It relies
on the fact that revisions are monotonically increasing across all
branches.
Once you have the table above, the following SQL will list all revisions which belong to the branch. This can easily use indices and so will scale very well.
select * from rev,branch_table(<branchid>) as b where rev.branchid = b.branchid and rev.id >= b.lo and rev.id <= b.hi order by rev.id;
This function and the query above are impacted by the branch/tag mechanism used.
Ok that's the data structures out of the way, how is it used?
Within the same transaction, the following operations need to be carried out:
rev
table.
entry table
with the new revision.
data table with the new
entryid and revision.
Steps 2 and 3 could be repeated to add as many items as required so they are all associated atomically with the same revision. The additions could also span mutliple tables and data types.
Within the same transaction, the following operations need to be carried out:
rev table.
data table with the same
entryid as the old one, but with a new revision and data.
Again, step 2 can be repeated for however many data items are changing in this revision.
Note that each revision is complete and stand-alone. Not having to generate or apply backward diffs makes this process much simpler at the cost of some data space.
Adding to a branch is as simple as setting the branchid
in the new revision.
Tags and branches are the same thing. A new revision is created and
the name is set to some non-null unique value. Or
the name field can be set on an existing revision.
Branches only become branches once a new revision is added to the tag revision. Because of this it might be an idea to always require an empty new revision for every branch, that way there is no ambuguity about whether to include the data at the specific revision of the branch on the trunk and/or the branch.
An alternative and equally valid way to implement it would be to make
branches new revisions which are already split from the trunk above.
In this case you would create a new revision with
the name set to the branch name, and
the branchid set to the revision id of the
parent trunk. This makes the branching more explicit - but I think
the distinction is unecessary.
Note how this is a simple insert on a tiny table. Branching is for all intents, free.
Assuming some mechanism to find the entryid and branchid, the following query can find the latest content for that entry on that branch. Note that this assumes that data.id is monotonically increasing with the revision number, if that is not the case, then the version needs to be looked up separately first and the query gets a little more complicated (and I won't cover it here).
select data.* from data
where id = (select max(data.id)
from branch_table(<branchid>) as b
join rev
on rev.branchid = b.branchid
and rev.id >= b.lo
and rev.id <= b.hi
join data
on data.revid = rev.id
and data.entryid = <entryid>);
Assuming some mechanism to find the branchid, the following query can find the latest content for all items currently live on the given branch. It is a small change (see change bars).
select entry.*, data.* from data
join entry on entry.id = data.entryid
| where id in (select max(data.id)
from branch_table(<branchid>) as b
join rev
on rev.branchid = b.branchid
and rev.id >= b.lo
and rev.id <= b.hi
join data
on data.revid = rev.id
| group by data.entryid);
If a tag is just an empty branch, then the previous will work fine. Otherwise you need slight modification to the query to limit the results (see the 'change bar' below). The same query is used to find the root of a branch which is implicitly tagged automatically.
select entry.*, data.* from data
join entry on entry.id = data.entryid
where id in (select max(data.id)
from branch_table(<tagid>) as b
join rev
on rev.branchid = b.branchid
and rev.id >= b.lo
and rev.id <= b.hi
| and rev.id <= <tagid>
join data
on data.revid = rev.id
and data.entryid = <entryid>
group by data.entryid);
Since the revision table is global and can thus be used to version different tables, each type of mutable data could be stored in it's own table. This would reduce any redundant copies of the data being stored when they don't change, at the expense of the overhead of having to perform branch matching separately on each table lookup.
It might be useful to seprate meta-data and content at least, if the meta-data is likely to change often and/or independently of the content.
If there is some sort of user-supplied key, such as a 'node name' or 'filename', which can be changed over time, then some extra work is required to make sure it remains unique on each branch.
You cannot just use a unique constraint on the key, as you can and will have the same name on different branches. And you can't just have it on the key:branch ether, since names could come and go in an interleaved fashion.
The mechanism presented doesn't allow multiple tags for the same
revision. Since revisions are global and cheap, a new tag can be
added to the head revision (of a given branch) by simply adding a new
revision entry. For older revisions you could simply create a new
empty revision and set the branchid to the revision of
the conflicting tag.
With no other mutable-meta-data mechanism, you could store a new NULL data record. If you had meta-data it could go there somewhere.
Merging of branches is a pretty important feature, but most of the mechanics are beyond the mere versioning issues. What isn't is tracking the merges that have taken place rather than forcing the user to use manual tagging or write numbers down. Perhaps an additional column or two could be added to track the merge operations in a graph. Given that revisions are global, a simple graph would cover whole branch merges or single file merges using the same logic.
The basic merge operation works something like this:
diff3
Note that new items are copied over - you endup with a redundant copy of the data. This could be addressed by allowing multiple versions for each data item, but you would need another table to store that information.
A web-based CMS system provides some challenges here - you don't want to force the user to edit every item separately in one go, and the operation needs to run atomically. One way around this is probably to write conflicted entries to a new branch off the target branch, and let them fix it up, and repeat until done.
This occurs when different sessions edit the same item and try to store them again on the same branch. This is mostly just the merging problem on single files.
One way to implement it in a scalable way would be to use a temporary branch ...
For a web-based system you may want to keep storing mutliple revisions as the user saves draft copies, allowing powerful undo and review features. Each edit could create a per-user branch which keeps being added to as the session progresses. Once the draft is complete and ready for saving, you could just perform a merge of the temporary branch the same as for any other branch, and then delete the whole lot.
Actually it could probably be quite easily extended to allow edits of multiple items which are then stored concurrently.
Copies across branches are just that - copies. A lightweight copy could be implemented fairly simply by just allowing multiple revisions on data items. It would require an extra level of indirection through an association table, so adds more complexity to the queries (but not much).
Remove the revid from the data table, and
create something like the following:
create table data_rev ( dataid int not null references data(id), revid int not null references rev(id), constraint data_rev_uc unique (dataid, revid) );