Section 8

Distributed Locking and Replication

Recall that the fully distributed locking algorithm is parameterized by three numbers:

n, the total number of copies of a data item
x, the number of copies that must be locked to acquire a global exclusive lock, and
s, the number of copies that must be locked to acquire a global shared lock

Answer the following questions:

What are the constraints on these numbers?
The constraints are:
- x > n/2, which guarantees no two transactions can acquire a global exclusive lock at one time
- s > n - x, which guarantees that if one transaction has a global exclusive lock, another can’t have a global shared lock, and vice versa
Explain how write-all replication obeys these constraints.
Write-all replication means that:
- When writing, a transaction updates all n replicas (x = n)
- When reading, a transaction can access any replica (s = 1)
If we substitute for x and s using the above equations, the locking constraints become n > n/2 (which is true) and 1 > n - n (which is true). This implies that write-all replication always obeys the distributed locking discipline.
Explain how voting replication can obey these constraints.
When using voting with fully distributed locking, a majority of the copies should be written. In general, voting replication does not require this, but it is necessary when using fully distributed locking to prevent multiple writers working simultaneously.

Therefore, when using voting with fully distributed locking:
- When writing, a transaction updates a majority of the copies (x > n/2)
- When reading, a transaction reads over n - x copies (s > n - x)
Here the first equation above is the same as an original constraint, and the second equation above is equivalent to s > n - x, which is also the same as an original constraint. This implies that voting replication obeys the distributed locking discipline.
What are other possible values for n, s, and x?

There are many possible values that will fit the constraints, but some will be more efficient than others. For example, let n = 10, s = 6, and x = 7. Then all of the constraints are satisfied, but we are requiring that more shared locks need to be obtained than are strictly necessary.

An Application of Distributed Locking and Replication

Imagine that you are hired as a database administrator for a new chain of banks. Your job is to design a database to hold customer and account data. Your manager explains to you that he thinks asynchronous replication of data should be used between the various branches because he wants the database to be faster than those of competing banks. As a graduate of CSCI E-66, you know this could be a bad idea.

What is the primary reason that asynchronous replication in a bank database would be a bad idea?
Asynchronous replication is faster than synchronous replication, but cannot guarantee that the most up-to-date value is present. For a bank application this would be an especially poor choice, since if a customer did a balance query on an out-of-date, secondary copy of the account, they might think they had more money than was actually in the account (stored in the master replica). The customer then may overdraft.
Come up with an alternate distributed database scheme to present to your manager. In particular:
- Which would be the better form of synchronous replication: read-any/write-all or voting? Think about what must take place for balance queries and withdrawal/deposit transactions.
  
  Neither approach is clearly better. Balance queries only require reads, deposits only require writes, and withdrawals require both reads and writes. If withdrawals and deposits are more common than balance queries, it may make sense to use voting, since writes are very expensive in the read-any/write-all scheme.
- Which would be the best form of distributed concurrency control: centralized locking, primary-copy locking, or fully distributed locking? You can assume that speed is still important, and it is expected that there will be many bank branches.
  
  Primary-copy locking may be a good compromise in this case. A single point of failure and a bottleneck for lock requests is enough to rule out centralized locking. If there are a large number of branches, fully distributed locking may be too slow. Therefore, primary-copy locking may be the best choice between safety and speed.

MapReduce

Using pseudocode, design the mapper and reducer functions for counting the occurrences of words in a set of input files.
The pseudocode for the mapper function accepts a key and a line as input; the key in the mapper represents the offset of the line in the input file, and is not useful for this application. The line should be split into individual words, and then each word should be emitted with a count of 1.
```
mapper(key, line):
    words[] = split(line)
    for each word in words:
        emit(word, 1)
```
The reducer also accepts a (key, value) pair, where the key is a word and the value is a list of counts for that word. In this application, all of the counts will be 1. After the counts for the word are summed, the word and the total is output.
```
reducer(word, counts[]):
    total = 0
    for each count in counts:
        total += count
    emit(word, total)
```

Last updated on November 2, 2024.