Interface
The buffer cache interface to the upper levels consists of nine functions.
int init_buffer(struct init_descriptor
*init_descriptor)
Initializes the buffer cache for
operation.
init_descriptor contains the database file
name and possibly other information
returns zero for
success or an Eris error code
int begin_transaction(int writep, struct transaction
**transaction_out)
All other functions are executed
within the context of a running transaction. Transactions are
always started with this function.
writep
TRANSACTION_READ_ONLY or TRANSACTION_WRITABLE.
transaction_out receives the started transaction
handle
returns zero for success or an Eris error code
int commit_transaction(struct transaction
*transaction)
Ends a transaction and makes the changes
made within it persistent. Releases all locks and mappings in
this transaction's context.
transaction a running
transaction handle
returns zero for success or an Eris
error code
int rollback_transaction(struct transaction
*transaction)
Ends a transaction and discards the
changes made within it. Releases all locks and mappings in this
transaction's context.
transaction a running
transaction handle
returns zero for success or an Eris
error code; rollback probably can only fail when massive internal
server datastructure corruption is underway
int alloc_logical(struct transaction *transaction,
disk_pointer *logical_out, void **page_out)
Allocates
an unused logical block to the running transaction. This
operation is only allowed in a writable transaction. The logical
block is allocated dirty, and is immediately mapped into
memory.
transaction a running transaction handle
logical_out receives the newly allocated logical block
number
page_out receives the memory address at which
the block is mapped
returns zero for success or an Eris
error code
int reference_logical(struct transaction *transaction,
disk_pointer logical, int writep, void **page_out)
Map
the given logical block in the running transaction into memory if
it is not already mapped. Optionally mark it dirty so that it can
be written to and retrieve a write lock instead of a read
lock.
transaction a running transaction handle
logical logical block number
writep whether
write access is requested
page_out receives the memory
address at which the block is mapped
returns zero for
success or an Eris error code
int release_logical(struct transaction *transaction,
disk_pointer logical)
Release a mapping of a logical
block in the running transaction.
transaction a running
transaction handle
logical logical block number
returns zero for success or an Eris error code
int downgrade_logical(struct transaction *transaction,
disk_pointer logical)
Cancel the request for write
access to a mapping. The block must not been written to.
transaction a running transaction handle
logical
logical block number
returns zero for success or an
Eris error code
int free_logical(struct transaction *transaction,
disk_pointer logical)
Free a logical block in the
running transaction. Possible mappings are undone, and no further
mappings of this logical block are possible. The logical block
number may be recycled in another transaction after commit.
transaction a running transaction handle
logical
logical block number
returns zero for success or an
Eris error code
locking scheme
In the first implementation the only supported locking scheme will be the tree protocol. It means that the locking will proceed in the higher level tree (not the logical to physical mapping tree) from the root towards the leaves. The initially locked page may be any page. After that, only a child of a locked page may be locked. An unlocked page may not be relocked. Pages may be unlocked at any time.
This is a deadlock free protocol. It also means that we do not have to check in reference_logical whether the logical page can already be found in the mapping changes list. Because there is no random access to the mapping changes, a list is a sufficient data structure for them.
There is a case to be made against the tree protocol if we do not know at the first operation in a transaction what possible operations can follow in the transaction. Therefore there might be a change in the locking protocol in the future, even though any other protocol will certainly be more complex.
implementation
The block indices on the disk and the block indices used by the higher levels are separate. The former are called physical blocks and the latter logical blocks. The mapping from logical to physical is local to a database snapshot, of which there are at least one, but may be several.
The logical to physical mapping is represented on disk by a tree of physical blocks, the root of which is the snapshot root block. The root blocks have a serial number on them, the largest of which will be selected as the snapshot to be used during database recovery.
The memory representation of the mapping tree has indirection to allow for the tree being only partially in memory, and loaded in as needed. For each mapping block in memory, there is a corresponding directory header structure, which contains a pointer to the in memory copy of the mapping block and a pointer to an array of pointers to directory headers or page headers in the case of the leaf blocks of the mapping. All other references to the mapping block copy and the array of pointers go through the directory header.
When a mapping block is read into memory, directory headers are allocated for the block's children, and pointers to the directory headers are written to the corresponding array locations. If some of the blocks children are nulls, the corresponding pointer array locations will be nulls as well. The directory headers allocated will be initialized to contain null pointers. If the corresponding mapping block in the immediately preceding or following snapshot is already in memory, the directory headers are not allocated for those children which are shared between the two snapshots. Shared physical blocks between two snapshots will always be found at the same structural position in the mapping trees, and always with immediately consecutive snapshots. It is possible that many snapshots have the same physical block for a given logical block, but all those snapshots will be consecutive, so the check is only required for the immediately adjacent snapshots.
All writable transactions refer to the latest snapshot. Read-only transactions may refer to any snapshot. Snapshots have a reference count of referring transactions. If any non-latest snapshot reaches the reference count of zero, it is released. The latest snapshot is always retained. When a snapshot is released, the structure that it references and is not shared with immediately adjacent snapshots is freed both in memory and on disk.
Requesting a write mapping means that a new physical block will be allocated for the logical block and the modified data will be written to that new physical block, not the old one. The logical to physical mapping is not written to before commit, however. Before that, the changes to the mapping are stored in a per transaction mapping change list. At transaction end time, depending on whether the end is a commit or a rollback, the necessary writes and resource freeing is executed based on the contents of the change list.
reference_logical(c, 0, true, ...)
The mapping tree is incrementally read from disk to memory, and the horizontal links combining logically similar mapping headers in adjacent snapshots are kept in synch. In this case only one block had to be read before noticing that the lookup can continue along already read structure. The page header is marked to show that the block is now write locked. Previous existing read locks for read only transactions continue to exist, as the actual memory area that is written to is now allocated and initialized from the page hanging in the mapping tree, and hung in the transaction c change list waiting for it to either commit or roll back. No reads for writable transactions are allowed, to avoid serialization problems.
commit_transaction(c)
During the commit of the writing transaction c the changes from the transaction change list are written out to the mapping tree. The actual data page which has hung from the change list during the transaction is hung at the bottom of the mapping. The part of the mapping tree which is shared with the adjacent snapshot is replicated as the new physical block numbers are filled in from leaves of the mapping towards the root. After each level has had all the numbers filled in, its disk block copies are written to disk and the new disk addresses of those writes are filled in on the next higher level. After the root block has been written to disk, the transaction has reached durability.
commit_transaction(b)
When we commit the read only transaction b, all memory and disk that is referenced by b and b alone is freed. Horizontal links between logically similar headers in snapshots to either side are snapped over the removed structure if appropriate. When the same header is found via neighbourghing routes, as the page at bottom left in the sample diagram is, we know that we have reached structure that is still referenced, and further descent is unnecessary.
There are two bitmaps for storing the allocation state of physical and logical blocks. The bitmaps exist only in memory. They are reconstructed from the mapping tree during startup. This may be changed in future versions, as this strategy is only viable when the bitmaps will fit reasonably in memory and reading the entire mapping tree does not take too long.
data structures
| struct transaction | ||
| next | struct transaction * | All transactions are in a doubly linked list sorted according to starting time. |
| prev | struct transaction * | |
| snapshot | struct snapshot * | Not null. The logical to physical block mapping of this transaction is contained in the snapshot object. |
| writep | int | Boolean. Is this transaction allowed to modify the database. |
| writedp | int | Boolean. Has this transaction modified the database. |
| serial | int | Monotonically increasing serial number for this transaction. |
| map_changes | struct map_change_page * | Changes made to the logical to physical mapping during this transaction. |
| struct snapshot | ||
| next | struct snapshot * | All snapshots are in a doubly linked list sorted according to starting time. |
| prev | struct snapshot * | |
| map | struct directory_header * | Not null. The root of the logical to physical mapping tree. |
| ref_count | int | Number of referring transactions. |
| writedp | int | Boolean. Has this snapshot been modified. Can only be true for the latest snapshot. |
| struct header | ||
| memory1 | void ** | In directory headers an array of pointers to directory or page headers. The array can be non-existent, but if it exists, the pointers are all non-null, unless all logical blocks reachable through the pointer are unallocated in this snapshot. |
| disk_pointer * | After memory is the in memory copy of the disk page. | |
| memory2 | void * | In page headers simply the in memory copy of the page contents. |
| header | int | Contains the following fields: |
| dirtyp | 0 | Boolean. Used by commit to keep track of which directories have been modified and need to be allocated a new physical block and written to disk. |
| waitp | 1 | Boolean. Used to mark that there is a thread waiting for a lock on this page. |
| iowaitp | 2 | Boolean. Used to mark that the page in memory copy is being read from disk, and the access should wait for the read. |
| read_count | 3- | In page headers the read lock count. |
| next | struct header * | The logically similar header in the immediately newer snapshot, if any. |
| previous | struct header * | The logically similar header in the immediately older snapshot, if any. |
| struct map_change_page | ||
| next | struct map_change_page * | The next page of mapping changes. |
| valid | int | The number of valid mapping change entries on this page. |
| change | struct map_change[] | The mapping changes themselves. |
| struct map_change | ||
| logical | disk_pointer | Number of the logical block whose mapping has been changed. |
| physical | disk_pointer | Number of the physical block in the new mapping. |
| page | void * | In memory copy of the data belonging to the new physical block. |