Interface

The buffer cache interface to the upper levels consists of nine functions.

int init_buffer(struct init_descriptor *init_descriptor)
Initializes the buffer cache for operation.
init_descriptor contains the database file name and possibly other information
returns zero for success or an Eris error code

int begin_transaction(int writep, struct transaction **transaction_out)
All other functions are executed within the context of a running transaction. Transactions are always started with this function.
writep TRANSACTION_READ_ONLY or TRANSACTION_WRITABLE.
transaction_out receives the started transaction handle
returns zero for success or an Eris error code

int commit_transaction(struct transaction *transaction)
Ends a transaction and makes the changes made within it persistent. Releases all locks and mappings in this transaction's context.
transaction a running transaction handle
returns zero for success or an Eris error code

int rollback_transaction(struct transaction *transaction)
Ends a transaction and discards the changes made within it. Releases all locks and mappings in this transaction's context.
transaction a running transaction handle
returns zero for success or an Eris error code; rollback probably can only fail when massive internal server datastructure corruption is underway

int alloc_logical(struct transaction *transaction, disk_pointer *logical_out, void **page_out)
Allocates an unused logical block to the running transaction. This operation is only allowed in a writable transaction. The logical block is allocated dirty, and is immediately mapped into memory.
transaction a running transaction handle
logical_out receives the newly allocated logical block number
page_out receives the memory address at which the block is mapped
returns zero for success or an Eris error code

int reference_logical(struct transaction *transaction, disk_pointer logical, int writep, void **page_out)
Map the given logical block in the running transaction into memory if it is not already mapped. Optionally mark it dirty so that it can be written to and retrieve a write lock instead of a read lock.
transaction a running transaction handle
logical logical block number
writep whether write access is requested
page_out receives the memory address at which the block is mapped
returns zero for success or an Eris error code

int release_logical(struct transaction *transaction, disk_pointer logical)
Release a mapping of a logical block in the running transaction.
transaction a running transaction handle
logical logical block number
returns zero for success or an Eris error code

int downgrade_logical(struct transaction *transaction, disk_pointer logical)
Cancel the request for write access to a mapping. The block must not been written to.
transaction a running transaction handle
logical logical block number
returns zero for success or an Eris error code

int free_logical(struct transaction *transaction, disk_pointer logical)
Free a logical block in the running transaction. Possible mappings are undone, and no further mappings of this logical block are possible. The logical block number may be recycled in another transaction after commit.
transaction a running transaction handle
logical logical block number
returns zero for success or an Eris error code

locking scheme

In the first implementation the only supported locking scheme will be the tree protocol. It means that the locking will proceed in the higher level tree (not the logical to physical mapping tree) from the root towards the leaves. The initially locked page may be any page. After that, only a child of a locked page may be locked. An unlocked page may not be relocked. Pages may be unlocked at any time.

This is a deadlock free protocol. It also means that we do not have to check in reference_logical whether the logical page can already be found in the mapping changes list. Because there is no random access to the mapping changes, a list is a sufficient data structure for them.

There is a case to be made against the tree protocol if we do not know at the first operation in a transaction what possible operations can follow in the transaction. Therefore there might be a change in the locking protocol in the future, even though any other protocol will certainly be more complex.

implementation

The block indices on the disk and the block indices used by the higher levels are separate. The former are called physical blocks and the latter logical blocks. The mapping from logical to physical is local to a database snapshot, of which there are at least one, but may be several.

The logical to physical mapping is represented on disk by a tree of physical blocks, the root of which is the snapshot root block. The root blocks have a serial number on them, the largest of which will be selected as the snapshot to be used during database recovery.

The memory representation of the mapping tree has indirection to allow for the tree being only partially in memory, and loaded in as needed. For each mapping block in memory, there is a corresponding directory header structure, which contains a pointer to the in memory copy of the mapping block and a pointer to an array of pointers to directory headers or page headers in the case of the leaf blocks of the mapping. All other references to the mapping block copy and the array of pointers go through the directory header.

When a mapping block is read into memory, directory headers are allocated for the block's children, and pointers to the directory headers are written to the corresponding array locations. If some of the blocks children are nulls, the corresponding pointer array locations will be nulls as well. The directory headers allocated will be initialized to contain null pointers. If the corresponding mapping block in the immediately preceding or following snapshot is already in memory, the directory headers are not allocated for those children which are shared between the two snapshots. Shared physical blocks between two snapshots will always be found at the same structural position in the mapping trees, and always with immediately consecutive snapshots. It is possible that many snapshots have the same physical block for a given logical block, but all those snapshots will be consecutive, so the check is only required for the immediately adjacent snapshots.

All writable transactions refer to the latest snapshot. Read-only transactions may refer to any snapshot. Snapshots have a reference count of referring transactions. If any non-latest snapshot reaches the reference count of zero, it is released. The latest snapshot is always retained. When a snapshot is released, the structure that it references and is not shared with immediately adjacent snapshots is freed both in memory and on disk.

Requesting a write mapping means that a new physical block will be allocated for the logical block and the modified data will be written to that new physical block, not the old one. The logical to physical mapping is not written to before commit, however. Before that, the changes to the mapping are stored in a per transaction mapping change list. At transaction end time, depending on whether the end is a commit or a rollback, the necessary writes and resource freeing is executed based on the contents of the change list.

reference_logical(c, 0, true, ...)

The mapping tree is incrementally read from disk to memory, and the horizontal links combining logically similar mapping headers in adjacent snapshots are kept in synch. In this case only one block had to be read before noticing that the lookup can continue along already read structure. The page header is marked to show that the block is now write locked. Previous existing read locks for read only transactions continue to exist, as the actual memory area that is written to is now allocated and initialized from the page hanging in the mapping tree, and hung in the transaction c change list waiting for it to either commit or roll back. No reads for writable transactions are allowed, to avoid serialization problems.


commit_transaction(c)

During the commit of the writing transaction c the changes from the transaction change list are written out to the mapping tree. The actual data page which has hung from the change list during the transaction is hung at the bottom of the mapping. The part of the mapping tree which is shared with the adjacent snapshot is replicated as the new physical block numbers are filled in from leaves of the mapping towards the root. After each level has had all the numbers filled in, its disk block copies are written to disk and the new disk addresses of those writes are filled in on the next higher level. After the root block has been written to disk, the transaction has reached durability.


commit_transaction(b)

When we commit the read only transaction b, all memory and disk that is referenced by b and b alone is freed. Horizontal links between logically similar headers in snapshots to either side are snapped over the removed structure if appropriate. When the same header is found via neighbourghing routes, as the page at bottom left in the sample diagram is, we know that we have reached structure that is still referenced, and further descent is unnecessary.


There are two bitmaps for storing the allocation state of physical and logical blocks. The bitmaps exist only in memory. They are reconstructed from the mapping tree during startup. This may be changed in future versions, as this strategy is only viable when the bitmaps will fit reasonably in memory and reading the entire mapping tree does not take too long.

data structures

struct transaction
next struct transaction * All transactions are in a doubly linked list sorted according to starting time.
prevstruct transaction *
snapshot struct snapshot * Not null. The logical to physical block mapping of this transaction is contained in the snapshot object.
writep int Boolean. Is this transaction allowed to modify the database.
writedp int Boolean. Has this transaction modified the database.
serial int Monotonically increasing serial number for this transaction.
map_changes struct map_change_page * Changes made to the logical to physical mapping during this transaction.
struct snapshot
next struct snapshot * All snapshots are in a doubly linked list sorted according to starting time.
prev struct snapshot *
map struct directory_header * Not null. The root of the logical to physical mapping tree.
ref_count int Number of referring transactions.
writedp int Boolean. Has this snapshot been modified. Can only be true for the latest snapshot.
struct header
memory1 void ** In directory headers an array of pointers to directory or page headers. The array can be non-existent, but if it exists, the pointers are all non-null, unless all logical blocks reachable through the pointer are unallocated in this snapshot.
disk_pointer * After memory is the in memory copy of the disk page.
memory2 void * In page headers simply the in memory copy of the page contents.
header int Contains the following fields:
dirtyp 0 Boolean. Used by commit to keep track of which directories have been modified and need to be allocated a new physical block and written to disk.
waitp 1 Boolean. Used to mark that there is a thread waiting for a lock on this page.
iowaitp 2 Boolean. Used to mark that the page in memory copy is being read from disk, and the access should wait for the read.
read_count 3- In page headers the read lock count.
next struct header * The logically similar header in the immediately newer snapshot, if any.
previous struct header * The logically similar header in the immediately older snapshot, if any.
struct map_change_page
next struct map_change_page * The next page of mapping changes.
valid int The number of valid mapping change entries on this page.
change struct map_change[] The mapping changes themselves.
struct map_change
logical disk_pointer Number of the logical block whose mapping has been changed.
physical disk_pointer Number of the physical block in the new mapping.
page void * In memory copy of the data belonging to the new physical block.