Why copy-on-write semantics and node-level-versioning are key to efficient snapshots

Life is subdued to constant evolution. So is our data, be it in research, business or personal information management. As such it is surprising that databases usually just keep the current state.


Usually database systems simply either overwrite data in-place or do a copy-on-write operation followed by a removal of the outdated data (the latter maybe some time later from a background process). Data however naturally evolves over time. It is often times of great value to keep the history. We for instance might record the payroll of an employee at the first of march in 2019. Let us say it is 5000€ / month. Then as of 15th April we notice, that the recorded payroll was wrong and correct it to 5300€. Now, what is the answer to what the payroll was on the 1st of march in 2019? Database Systems, which only preserve the most recent version do not even know that the payroll wasn’t right. Our answer to this question depends on what source we consider most authoritative: the record or reality? The fact that they disagree effectively splits this payroll event into two tracks of time, rendering neither source entirely accurate. Questions such as these might be easily answered by temporal database systems such as Sirix. We provide at all times the system/ transaction time, which is set, once a transaction commits (when is a fact valid in the database / the record). Application or valid time has to be set by the application itself (when is a fact valid in the real world/reality?). We first want to motivate the usage of Sirix through some example use cases and afterwards dive into technical aspects and concepts.

Data Audits

Thus, one usage scenario for Sirix is data auditing. We keep track of past revisions in our specialized index structure for (bi)temporal data and our novel sliding snapshot versioning algorithm, which balances read- and write-performance while avoiding any peaks (our system is very space efficient and depending on the versioning algorithm we only copy changed records plus maybe a few more during writes). Thus we for instance usually do not simply copy a whole database page if only a single record has changed; we also do not have to cluster data during every commit as in B+-trees and LSM-Trees), Furthermore, we can simply keep track of changes through hooks and our user defined secondary index structures to avoid potentially expensive diffing of whole resources (even though we can utilize hashes, which can optionally be built during update operations for integrity checks, and thus skip the comparison of whole sub-tree if the hashes and stable, unique node identifiers match). As already mention we can use an ID-based diff algorithm if we do not want to keep track of what changed in an additional index. We never allow to override or delete old revisions (that said we might look into how to prune old revisions if that is ever needed). A single read/write transaction appends data at all times. For sure you can revert to a specific revision and commit the new revision, but all revisions in-between will be accessible for data audits. Thus, we are able to help to answer who changed what and when.

Time travel queries

Data audits are about how specific records/nodes have changed. Time Travel queries can answer questions as these, but also allows to reconstruct a whole sub-tree as it looked at a specific time or during a specific time span, or how the whole document/resource changed over time. You might want to analyze the past to predict the future. Through additional temporal XPath axis and XQuery functions we encourage you to investigate how your data has evolved.

An example query might look like:

This opens the database mycol.jn and therein the document/resource mydoc.jn as it was seen by the database at a specific time, thus it loads a snapshot/specific revision. This example opens and queries a JSON-database/resource, but you can also create native XML-databases. We then search for statuses with a datetime, which is newer / greater than the 1st February of 2018, whereby the status didn’t exist in the previous revision. We’re then outputting the revision-number and the text of the found status (projection). You’re also able to open a resource in between a given time-span and receive a sequence of JSON items as a result, which represents a list of revisions in this time-span. This is just a simple example: We provide many more temporal functions.

Fixing application or human errors / simple undo/redo operations / rollbacks / reproduce experiments

You can simple revert to a specific revision/point in time where everything was in a known good state and commit the revision again, or you might simply select a specific record/node and correct the error and commit the new revision.

How are all these usage scenarios possible? Let’s turn our focus to the question why historical data hasn’t been retained in the past and how new storage advances in recent years made it possible, to build sophisticated solutions to help answer these questions without the hurdle, state-of-the-art systems bring.

Why storing historical data becomes feasible nowadays

As Marc Kramis points out in his paper “Growing Persistent Trees into the 21st Century”:

The switch to flash drives keenly motivates to shift from the “current state” paradigm towards remembering the evolutionary steps leading to this state.

From Marc Kramis: Modifications evolving the state of your data.

The main insight is that flash drives as for instance SSDs, which are common nowadays have zero seek time while not being able to do in-place modifications of the data. Flash drives are organized into pages and blocks. Due to their characteristics they are able to read data on a fine-granular page-level, but can only erase data at the coarser block-level. Furthermore, blocks first have to be erased, before they can be updated. Thus, updated data is written to another place. A garbage collector marks the data, which has been rewritten to the new place as erased at the previous block location, such that new data can be stored in the future. Metadata to find the data at the new location are updated.

Thus, it is possible to store fine granular changes nowadays which naturally evolve the state of the data.

Search/insertion/deletion-operations of a temporal database should be in logarithmic time (O(log(n)), to compete with commonly used index structures.

The temporal storage system is a temporal storage system, which has been inspired by the file system ZFS and the versioning system Git amongst others. It is based on copy-on-write semantics and node-level versioning and thus never overwrites data.

Let’s first define what a temporal database system is all about.

A temporal database is capable of storing and retrieving past states of your data efficiently. Typically it stores the transaction time, when the data actually is committed to the database. If we also store the valid time, that is how long a fact is true in the real world, we have a bitemporal relation, that is two time-axis, the transaction time and the valid time.

Sirix has been developed to answer questions such as: Give me last month’s history of the Dollar-Pound Euro exchange rate. What was the customers address on February 12th in 2019 as it was recorded back in the day? Did they move or did we correct an error? Did we have errors in the database, which were corrected later on?

Concepts / architecture of

We support N read-only transactions, which are bound to a single revision (each transaction might be started on any past revision) concurrently to one write transaction on a single resource. Our system thus is based on snapshot isolation. The write-transaction can revert the most recent revision to any past revision. Changes to this past revision can then be commit to create a new snapshot and thus a new revision. The revisions in-between are never lost. We wrote our storage manager from scratch, thus let us now shift our focus to the internals of Sirix.

The page-structure for one revision is depicted in the following figure. Note that during each transaction commit a new revision root page with its sub-tree is created whereas unchanged pages are simply referenced at their original file-offset positions.

Page-Structure of

UberPage: The UberPage is the main entry point. It contains header information about the configuration of the resource as well as a reference to an IndirectPage. The reference contains the offset of the IndirectPage in the data-file or the transaction-intent log and an in-memory pointer. An uber page is always written as the last page during a transcation-commit. Thus, even in case the transaction fails, we always have a valid, consistent state of the storage.

IndirectPage: IndirectPages are used to increase the fanout of the tree and in essence to be able to store and retrieve a large number of records (while only ever having to read a predefined number of pages once a record has to be read again). We currently store 512 references in the IndirectPage to either another layer of indirect pages or the data pages, either a RevisionRootPage or a RecordPage. A new level of indirect pages is added whenever we run out of the number of records we can store in the leaf pages (either revisions or records), which are referenced by the IndirectPages. The height of the current subtree, that is the number of levels of indirect pages is always stored in the respective subtree-root page. We borrowed the ideas from the filesystem ZFS and hash-array based tries as we also store checksums in parent database-pages/page-fragments, which in turn form a self-validating merkle-tree. As IndirectPages potentially may have many null-pointers we use a bitset to keep track of which array indices are really set and thus are able to store a compact array or list in-memory.

RevisionRootPage: The RevisionRootPage is the main entry point to a revision. It stores the author-ID, an optional commit-message and a timestamp in the form of the unix epoch (milliseconds since 1970). Furthermore it stores a reference to a PathPage, a CASPage (if it exists), a NamePage and an IndirectPage. The indirect page is the entry point to the data stored in the leaf RecordPages. The right subtree of the RevisionRootPage started by the IndirectPage actually is the main entry point to our data stored in the leaf nodes, the RecordPages once again. To support fast access to a RevisionRootPage we store a second file with just the offsets to specific revisions in a revisions-file, which is read into main-memory on startup.

The other pages are mainly used to store different kinds of index structures in their sub-trees (AVL-tree nodes are stored in the record pages) but are omitted here for brevity. More information can be found in

Transaction Commit

Snapshots, that is new revisions are created during every commit. Apart from a numerical revision-ID, the timestamp is serialized. A revision can afterwards be opened either by specifying the ID or the timestamp. Using a timestamp involves a binary-search on an array of timestamps, stored persistently in a second file and loaded in memory on startup. The search ends if either the exact timestamp is found or the closest revision to the given point in time. Data is never written back to the same place and thus not modified in-place. Instead, Sirix uses copy-on-write (COW) semantics at the node-/record-level (creates page-fragments and usually doesn’t copy whole pages). Every time a new version of the page has to be synced, modified records, which have changed as well as some of the unchanged records are written to a new location. Which records exactly are copied depends on the versioning algorithm used. It is thus especially well suited for flash-based drives as for instance SSDs. Changes to a resource within a database occur within the aforementioned resource-bound single write-transaction. Therefore, first a resource manager has to be opened on the specific resource to start single resource-wide transactions. Note, that we already started work on database wide transactions 🙂

Copy-on-write nature of

We assume that we have inserted/updated/deleted a record in the leftmost RecordPage. Depending on the versioning algorithm the modified record as well as some other records of the page are copied to a new page fragment. First, all changes are stored in an in-memory transaction (intent) log, which can be persisted on memory pressure, if needed. Second, during a transaction commit the page-structure of the current RevisionRootPage is serialized in a post-order traversal. All changed RecordPags are written to disk / a flash drive, starting with the left most. If other changed record pages exist underneath an indirect page, these are serialized before the IndirectPage, which points to the updated record pages. Then the IndirectPage which points to the updated revision root page is written. The indirect pages are written with updated references to the new persistent locations of the record pages. We also store checksums in the parent pointers as in ZFS, such that the storage in the future is able to detect data corruption and heal itself, once we partition and especially replicate the data. The whole page-structure is serialized in this manner. We also want to store an encryption key in the references in the future, to support encryption at rest similar to the approach used in OpenZFS.

Note, that we have to update the ancestor path of each changed RecordPage. However, storing indirect pages as well as the RevisionRootPage/CASPage, PathSummaryPage and the PathPage is cheap. We currently store copies of the NamePages, but in the future might also version these according to the chosen versioning algorithm, just like RecordPages, such that we do not need to copy the whole dictionaries and save storage costs thereof. Each reference, which doesn’t point to a new page or page-fragment is left unchanged. Thus, unchanged pages (which are also not on the ancestor-path of changed pages) are simply referenced at their respective position in the former revision and never rewritten.

One of the most distinctive features of Sirix is that we are versioning on a per revision/per record level. Thus, RecordPages are not simply copied to a new persistent location with all records in the page, even if only a single record has been modified. Instead, we implemented versioning algorithms known from backup systems and invented a novel sliding snapshot algorithm to overcome their shortcomings. The new record page fragment always contains a reference to the former version. Thus, our versioning algorithms are able to de-reference a fixed, predefined number of page-fragments at maximum to reconstruct a RecordPage in-memory.

Versioning algorithms for storing and retrieving node-level snapshots

As most database system we store at most a fixed number of nodes/records, that is the actual data per database-page (currently 512 records at most). The records themselves are of variable size. Overlong records, which exceed a predefined length in bytes are stored in additional overflow pages and only referenced in the record-pages.

We implemented a number of versioning strategies best known from backup systems for copy-on-write operations of record-pages. Namely we either copy

  • the full record-pages, that is any record in the page (full)
  • only the changed records in a record-page regarding the former version (incremental)
  • only the changed records in a record-page since a full page dump (differential)

Incremental-versioning is the other extreme and write-performance is best, as it stores the optimum (only changed records), but on the other hand reconstructing a page needs intermittent full snapshots of pages, such that the performance doesn’t deteriorate with each new revision of the page as the number of increments increases with each new version.

Differential-versioning tries to balance reads and writes a bit better, but is still not optimal. Each time records in a page are modified a new page is written, with all changed records since a past full dump of the page. This means that only ever two revisions of the page-fragment have to be read to reconstruct a record-page. However write-performance also deteriorates with each new revision of the page.

Incremental versioning in regards to write performance, due to the requirement of intermittent full dumps of the page results in write-peaks. Differential versioning also suffers from a similar problem. Without an intermittent full dump a lot of data would have to be duplicated on each new write.

Marc Kramis came up with the idea of a novel sliding snapshot algorithm, which balances read/write-performance to circumvent any write-peaks.

The algorithm makes use of a sliding windows. First, any changed record must be stored, second any record, which is older than a predefined length N of the window and which has not been changed during these N-revisions. Only these N-revisions at max have to be read. The fetching of the page-fragments could be done in parallel or we simply stop once the full-page has been reconstructed starting with the most recent revision. Probably the best high level overview of the algorithm can be found in Marc’s Thesis: Evolutionary Tree-Structured Storage: Concepts, Interfaces, and Applications

Once we made sure our storage system scaled linear for fetching old-revisions as well as the most recent revision and logarithmic for fetching and storing single records as well as whole revisions we focused our attention to upper layers.

If you like this, please give us some claps so more people see it or a star on Github… and most importantly check it out (I’d love to hear any suggestions, feedback, suggestions for future work, as for instance work on replication/partitioning, bug reports ;-), just everything… please get in touch) 🙂

read original article here