E-66
Database Systems
  • Home
  • Lectures
  • Problem Sets
  • Sections
  • Syllabus
  • Schedule
  • Staff
  • Policies
  • Canvas
  • Gradescope
  • Ed Discussion

Section 5

Hash Tables

In addition to the B-trees and B+trees that we discussed last week, we can also store database keys and data in a hash table index. For an on-disk hash table, each bucket is (typically) a page, as is each overflow bucket. In a dynamic hash table, the number of buckets increases gradually in proportion to the number of items in the table, with minimal need for re-hashing existing items.

  1. Start with an empty hash table containing 2 buckets. Using linear dynamic hashing add the following values, in order, to the table: 1000, 972, 713, 12, 436, 113, 116. Use the hash function h(x) = x % 10 (x mod 10). Grow the table whenever the number of items becomes more than twice the number of buckets.

    First we calculate i and f. i = log2 n = log2 2 = 1. f = k/2n = 0/2·2 = 0, where k is the number of entries. With this formula for f, we must always ensure that f <= 1. We start with the following hash table:

    0 []
    1 []
    
    1. Insert 1000: h(1000) = 0 = 0000. The rightmost i bit of h(1000), where i = 1, is 0, so we add 1000 to 0. f is now 1/(2·2) = 1/4.

      0 [1000]
      1 []
      
    2. Insert 972. h(972) = 2 = 0010. The rightmost 1 bit is 0, so add 972 to 0. f = 2/4 = 1/2.

      0 [1000, 972]
      1 []
      
    3. Insert 713. h(713) = 3 = 0011. The rightmost bit is 1, so add 713 to 1. f = 3/4.

      0 [1000, 972]
      1 [713]
      
    4. Insert 12. h(12) = 2 = 0010. The rightmost bit is 0, so add 12 to 0. f = 4/4 = 1, so on the next insert we will have to add a bucket.

      0 [1000, 972, 12]
      1 [713]
      
    5. Insert 436. h(436) = 6 = 0110. However, when we insert 436 we have to grow the table by 1 bucket because with 436, f = 5/4 > 1. So, n += 1 yields n = 3.

      Now we must test to see whether we need to add to i. i = log2 3, 1 < i < 2, so we round up to i = 2. Now we grow the table by a bucket, add a leading zero to the current hash values, split the current bucket 0, and use i = 2 rightmost digits (10) to insert 436.

      00 [1000] (because 1000 hashes to 0000, or 00)
      01 [713] (we don't need to look at the contents of this bucket at all)
      10 [972, 12, 436] (**2 hashes to 0010 and **6 hashes to 0110, or 10)
      

      Note that f is now 5/(2·3) = 5/6.

    6. Insert 113. h(113) = 3 = 0011. The rightmost i = 2 bits are 11. However, there is no bucket 11 so we ignore the leading digit and use the bucket specified by the rightmost i - 1 bits. So, we insert 113 into bucket 01. Now f = 6/6 = 1, so on the next insert we’ll have to add a bucket.

      00 [1000]
      01 [713, 113]
      10 [972, 12, 436]
      
    7. Insert 116. h(116) = 6 = 0110. However, we must grow the table by 1 bucket, bucket 11 (and split bucket 01). Now n = 4. i = log2 4 = 2 still.

      Notice that we’re not splitting most efficiently since we’re splitting a bucket with only 2 values (and not even splitting it evenly). If we were limited to, say, 3 values per bucket, we would have to use an overflow bucket to store 116 because it is the fourth value with a hash of 10 (and so we have to “chain” on an overflow bucket). After this insert f = 7/(4·2) = 7/8.

      00 [1000]
      01 []
      10 [972, 12, 436]---[116]
      11 [713, 113] (They both hash to 0011 or 11.)
      

    Serializable, Conflict Serializable, Recoverable, and Cascadeless Schedules

Recall that:

  • A schedule is serializable if its effects are equivalent to the effects of some serial schedule.
  • A schedule is conflict serializable if it can be transformed into a serial schedule by swapping pairs of non-conflicting actions.
  • Two actions conflict if they involve the same data item and at least one of them is a write.
  • A schedule is recoverable if every transaction commits only after all transactions whose changes they read have committed.
  • A schedule is cascadeless when it avoids dirty reads: reads of data that have been modified by uncommitted transactions.

In the exercises below, determine whether or not the provided schedule has each of the above properties.

T1 T2
read(X)
read(X)
write(Y)
write(Y)
commit
commit
  1. Is this schedule...

    1. serializable?

      Yes, because its effects are equivalent to those of the serial schedule T1;T2—in both schedules, T1 and T2 read the initial value of X, neither transaction reads a value written by the other, and T2 performs the final write to Y.

    2. conflict serializable?

      Yes, it’s equivalent to the serial schedule in which T1 performs all of its operations and then T2 performs all of its operations. To see this, we note that T2’s read of X does not conflict with T1’s write to Y, because the two operations involve different data items. So, these two operations can be swapped to give us the serial schedule T1;T2 (note that the commits do not matter when it comes to determining conflict serializability).

    3. recoverable?

      Yes, because there are no dirty reads—i.e., there are no situations in which one transaction reads a value that has been written by another transaction that has yet to commit. Therefore, there’s no need to worry about the order in which the transactions commit.

    4. cascadeless?

      Yes, because there are no dirty reads.

T1 T2
read(X)
write(X)
read(X)
write(Y)
commit
commit
  1. Is this schedule...

    1. serializable?

      No. This schedule is not equivalent to either of the two possible serial schedules T1;T2 or T2;T1. It’s not equivalent to T1;T2 because, in our schedule, T1’s second read of X reads the value written by T2, but in the serial schedule T1;T2 it would read the initial value of X. It’s not equivalent to T2;T1 because, in our schedule, T1’s first read of X reads the initial value of X, but in the schedule T2;T1 it would read the value written by T2.

    2. conflict serializable?

      No. We can’t use swaps of consecutive non-conflicting operations to transform this schedule into a serial schedule. There are two different pairs of operations that conflict and cannot be swapped:

      • T1’s first read of X and T2’s write to X
      • T2’s write to X and T1’s read of X

      Since we cannot move T2’s write to X either before or after all of T1’s operations, there’s no way to produce a conflict-equivalent serial schedule.

    3. recoverable?

      No, because T1 performs a dirty read (reading a value of X written by T2 before T2 commits) and T1 commits before T2 does. If the system were to crash after T1 committed but before T2 committed—or if T2 were to abort after T1 committed—the system could be put into an inconsistent state, because the rollback of T2 should undo T2’s write of X, and yet T1 could have performed an action based on that write.

    4. cascadeless?

      No, because T1 performs a dirty read.

T1 T2
read(X)
write(X)
read(X)
write(Y)
commit
commit
  1. Is this schedule...

    1. serializable?

      No. Again, apart from the order of the commits this is the same schedule as the previous schedule, so it’s still not equivalent to either of the two possible serial schedules.

    2. conflict serializable?

      No. Changing the order of the commits doesn’t change the fact that we still can’t perform swaps of conflicting actions to get a serial schedule.

    3. recoverable?

      Yes. Although T1 still performs a dirty read in this schedule, it now commits after T2, which satisfies the requirement for a recoverable schedule. If T2’s write is undone by a crash or abort, T1 can also be rolled back, since it won’t have committed yet. While such a cascading rollback is undesirable, it prevents us from a possible inconsistent state.

    4. cascadeless?

      No, because T1 still performs a dirty read.

T1 T2 T3
read(A)
read(A)
read(B)
write(A)
read(B)
write(A)
write(A)
commit
commit
commit
  1. Is this schedule...

    1. serializable?

      Yes. Although the values of A read by T1 and T2 are different than the values that they would read in any serial schedule, that doesn’t matter, because the values of A that T1 and T2 write are not read by T3, and those values are overwritten by T3’s write of A. So although T1 and T2 might write different values of A under the serial schedule T1;T2;T3 or T2;T1:T3, those writes will not matter, since T3 doesn’t read their writes, and their writes aren’t visible at the end of any of these schedules.

    2. conflict serializable?

      No. We can see this by noting that we can’t move all of T1’s operations after all of T2’s operations, since we can’t swap T1’s w(A) with T2’s w(A) because they conflict, and we also can’t move all of T1’s operations before all of T2’s operations, since T1’s w(A) conflicts with T2’s r(A). This means that T1 can neither come before nor after T2, which means a conflict-equivalent serial schedule is impossible here.

    3. recoverable?

      Yes, because there are no dirty reads–i.e., there are no situations in which one transaction reads a value that has been written by another transaction that has yet to commit. Therefore, there’s no need to worry about the order in which the transactions commit.

    4. cascadeless?

      Yes, because there are no dirty reads.

T1 T2 T3 T4
1 read(X)
2 read(X)
3 write(Y)
4 read(Y)
5 read(Y)
6 write(X)
7 read(W)
8 write(Y)
9 read(W)
10 read(Z)
11 write(W)
12 read(Z)
13 write(Z)
  1. Draw a precedence graph to determine if this schedule is conflict serializable.

    We add an edge from transaction A to transaction B in the graph if transaction A would have to come before transaction B in a serial schedule. Transaction A would need to come before transaction B if there’s a pair of conflicting operations, and transaction A’s operation comes first in the schedule. For example:

    T1 T2
    write(X)
    read(X)

    Since T1’s write to X comes first in this schedule, and that write conflicts with T2’s read of X, T1 must come before T2 in the equivalent serial schedule.

    When we’re done drawing the graph, if we end up with a cycle, it means that we have an impossible situation (e.g., transaction A must come before transaction B, but transaction B must come before transaction A) and the schedule is not conflict serializable.

    As you can see, there are cycles in the graph: T1→T3→T4→T1 and T1→T2→T3→T4→T1. This means that the schedule is not conflict serializable.

Last updated on October 3, 2024.