Suppose a transaction T requires subtxns at sites A, B, C. Site A is the coordinator. Consider the following sequence of events:
After the last operation in T, site A
Sites B and C...
Site A...
Site B writes its commit record
Site C writes its commit record
Recall that a subtxn can only enter the committed state from the ready state after receiving a commit message.
Now, consider the following scenarios:
Suppose C crashes before writing its commit record. What happens?
Suppose B crashes after writing its commit record. What happens?
Suppose B crashes before writing its ready record. What happens?
Suppose A crashes before writing its commit record. What happens?
Suppose A crashes after writing its commit record. What happens?
Recall that in a Relational DBMS, every relation has exactly one primary, or clustered, index. A given relation may have zero or more secondary, or unclustered, indices. The goal of both of these kinds of indices is to limit the number of disk pages that have to be read into memory to perform queries on the relation.
The primary index specifies the physical layout of the data on disk. Typically, the key used in creating the index is a relation’s primary key field. The value is the actual data record corresponding to that row. Data in the primary index is organized on disk contiguously, such that linear scans have optimal efficiency, and queries by the primary key field are as fast as possible.
A secondary index does not relate to how data is stored on disk. Rather, its keys are some other field of the relation, and its values are the primary key field of the corresponding record. This increases efficiency when searching for records by some field other than the primary key.
Let’s explore these concepts by considering the following table, which contains information about employees at a company:
employeeId | firstName | lastName |
---|---|---|
1 | John | Smith |
2 | Anne | Brown |
3 | Arthur | Miller |
4 | Robert | Wood |
5 | Jane | Petzol |
6 | Vlad | Markov |
7 | Alex | Breen |
8 | Cody | Doucette |
9 | Kylie | Moses |
10 | Helen | Summers |
11 | Brian | Leftwich |
12 | Mark | Andrews |
13 | Thomas | Brady |
14 | Abe | Godwin |
15 | Timothy | Young |
16 | Janice | Fournier |
Since employeeId is the primary key field, the primary index for this table might be a B+-tree that looks like the following:
Note that the actual data for the Employee relation would be stored in the leaf nodes of the above tree.
Let’s say we wanted to do a complete scan of the Employee table’s records. How many disk reads would this scan take?
Note that even though this is a significant number of disk reads, because each of the pages read is physically contiguous on disk, there would be a far shorter rotational delay in performing the scan, and each disk read would be shorter than an average disk read.
Now let’s say that it’s often useful to lookup an employee’s record by searching by the first name of that employee. If that was the case, we might find it useful to build a secondary index on the firstName field, like so:
Notice that the keys of this index are the first names of the employees, and the values are that employee’s id.
How many disk reads would it take to search for all Employee records where the employee’s first name is Timothy, leveraging both indices?
How many disk reads would the same search take if you were limited to just the primary index?
When is it a good idea to add a secondary index to your table?
When is it a good idea to remove a secondary index from your table?
How does removing a secondary index speed up UPDATEs?
Last updated on December 5, 2024.