Problem Set 4

Due before the start of lecture on November 26, 2024. See below for a summary of the policies regarding late submissions.

Preliminaries

Homework is due prior to the start of lecture. If it is submitted more than 10 minutes after the start of lecture, it will be considered late. There will be a 10% deduction for submissions made any time in the following seven days, up to the start of the next lecture. We will not accept any homework that is more than 7 days late. Plan your time carefully, and don’t wait until the last minute to begin an assignment. Starting early will give you ample time to ask questions and obtain assistance.

In your work on this assignment, make sure to abide by our policies on academic conduct.

If you have questions while working on this assignment, please attend office hours or post them on Ed Discussion.

Part I

40 points total

Creating the necessary folder

Create a subfolder called ps4 within your e66 folder, and put all of the files for this assignment in that folder.

Creating the necessary file

This part of the assignment will all be completed in a single PDF file. To create it, you should do the following:

Access the template that we have created by clicking on this link and signing into your Google account as needed.
When asked, click on the Make a copy button, which will save a copy of the template file to your Google Drive.
Select File->Rename, and change the name of the file to ps4_partI.
Add your work for all of the problems from Part I to this file.
Once you have completed Part I, choose File->Download->PDF document, and save the PDF file on your machine. The resulting PDF file (ps4_partI.pdf) is the one that you will submit. See the submission guidelines at the end of Part I.

Problem 1: Replication

30 points total

Assume that a database is replicated across 11 different sites.

If the system uses fully distributed locking, which of the following voting schemes (if any) would work? Explain briefly why each scheme would or would not work.
1. update 6 copies, read 5 copies
2. update 5 copies, read 8 copies
3. update 8 copies, read 4 copies
4. update 3 copies, read 10 copies
Now assume that the system uses primary-copy locking. Which of the following voting schemes (if any) would work? Explain briefly why each scheme would or would not work.
1. update 6 copies, read 5 copies
2. update 5 copies, read 8 copies
3. update 8 copies, read 4 copies
4. update 3 copies, read 10 copies
If your workload primarily involves reads, which of the above configurations (1a, 1b, 1c, 1d, 2a, 2b, 2c, or 2d) would be the best one to choose? Explain briefly why you think it would be best, and make sure to also mention any potential drawbacks that it could have.
If your workload primarily involves writes, which of the above configurations (1a, 1b, 1c, 1d, 2a, 2b, 2c, or 2d) would be the best one to choose? Explain briefly why you think it would be best, and make sure to also mention any potential drawbacks that it could have.

Problem 2: Distributed locking with update locks

10 points

In lecture, we showed how a distributed DBMS could create global exclusive locks and global shared locks by acquiring local locks for an appropriate number of replicas.

Earlier in the semester, we discussed a third type of lock that is sometimes used in a centralized DBMS: an update lock, which can be acquired for the read portion of a read-modify-write transaction. What if a distributed DBMS wanted to add this type of lock as well? Explain how it could create global exclusive, global shared, and global update locks by acquiring appropriate numbers of local exclusive, local shared, and local update locks. The global locks should have the same semantics as the equivalent locks in a centralized database in terms of when a lock can be acquired and how many transactions can hold a given type of lock at the same time.

Notes:

The rules that we provided for global exclusive locks and global shared locks in lecture were for a system that does not use global update locks. In this problem, you need to provide the rules for global exclusive, global shared, and global update locks in a system that does use update locks. In other words, you will need a rule (i.e., an inequality) and a corresponding explanation for each type of global lock.
Your inequalities should make use of the following variables:
- x = the number of local exclusive locks needed to acquire a global exclusive lock
- s = number of local shared locks needed to acquire a global shared lock
- u = number of local update locks needed to acquire a global update lock
In addition, you can use arithmetic operators (e.g., subtraction) and the functions min and max.
When it comes to global exclusive locks and global shared locks, their rules may be the same as the ones we presented in lecture for a system without update locks, or they may not be. In other words, you may or may not need to adjust the rules for global exclusive locks and/or global shared locks in light of the fact that you are now dealing with a system that has global update locks.

Submitting your work for Part I

Login to Gradescope by clicking the link in the left-hand navigation bar. Once you are in logged in, click on the box for CSCI E-66.

Submit your ps4_partI.pdf file using these steps:

If you still need to create the PDF file, open your file on Google Drive, choose File->Download->PDF document, and save the PDF file on your machine.
Click on the name PS 4: Part I in the list of assignments. You should see a pop-up window labeled Submit Assignment. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Choose the Submit PDF option, and then click the Select PDF button and find the ps4_partI.pdf that you created in step 1. Then click the Upload PDF button.
You should see an outline of the problems along with thumbnails of the pages from your uploaded PDF. For each problem in the outline:
- Click the title of the problem.
- Click the page(s) on which your work for that problem can be found.
As you do so, click on the magnifying glass icon for each page and doublecheck that the pages that you see contain the work that you want us to grade.
Once you have assigned pages to all of the problems in the question outline, click the Submit button in the lower-right corner of the window.
You should see a box saying that your submission was successful. Click the (x) button to close that box.
You can use the Resubmit button at the bottom of the page to resubmit your work as many times as needed before the final deadline.

Important

It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough time to do so, wait an hour or two and then try again. If you are unable to submit and it is close to the deadline, email your homework before the deadline to esaracin@bu.edu

Part II

60 points total

Problem 3: More practice with XML queries

25 points

This problem asks you to write some additional XPath and XQuery queries for the XML version of our movie database that you worked with in Problem Set 3.

Getting started

You should again use BaseX to compose and check your queries, and you should follow the same guidelines as the ones we gave you in PS 3.

Your queries should go in the following file:
ps4_queries.py

Put this file in your ps4 folder. If your browser doesn’t allow you to specify where the file should be saved, try right-clicking on the link above and choosing Save as... or Save link as..., which should produce a dialog box that allows you to choose the correct folder for the file.

The query problems

Write a standalone XPath expression (not a FLWOR expression) to find the names of all PG-13 movies in the database with a runtime of greater than 180 minutes. The result of your query should be a collection of name elements with the names of these movies.
Now write an XQuery FLWOR expression to find the same movies as the previous query, but this time you should:
- generate new elements of type long_pg13 whose values consist of the name of the movie, followed by a space, followed by the movie’s runtime in parentheses using the format shown below. For example:
```
<long_pg13>Titanic (194 minutes)</long_pg13>
```
  Note that we include the word minutes in the parentheses.
- order the results in order of increasing runtime.
Find all people who have won an Oscar two years in a row. (The types of the two Oscars don’t matter; all we care about is that they were won by the same person in consecutive years.) The result of your query should be one or more new elements of type back_to_back that consist of the following:
- the name child element of the person
- new child elements called first_win and second_win whose values consist of the type and year of the first Oscar and second Oscar respectively. In both cases, there should be a single space between the type and year and parentheses surrounding the year.
For example:
```
<back_to_back>
  <name>Alejandro Gonzalez Inarritu</name>
  <first_win>BEST-DIRECTOR (2015)</first_win>
  <second_win>BEST-DIRECTOR (2016)</second_win>
</back_to_back>
```
Hints:
- You don’t need to count anything here! Rather, your for clause should include two different components that each obtain information about Oscars, and you should test to see if the current combination of Oscars involve the same person and are from consecutive years.
- Predicates can include arithmetic operators, and this should allow you to ensure that the Oscars are from consecutive years.
For each of the top 10 grossing movies (i.e., movies with earnings ranks between 1 and 10), construct a new element of type top_grosser that includes the following:
- the name and earnings_rank child elements of the movie
- new child elements of type big_star that have as their value the name of one of the actors associated with that movie in the database.
For example:
```
    <top_grosser>
      <name>Star Wars: The Force Awakens</name>
      <earnings_rank>1</earnings_rank>
      <big_star>Harrison Ford</big_star>
      <big_star>Mark Hamill</big_star>
      <big_star>Carrie Fisher</big_star>
      <big_star>Adam Driver</big_star>
      <big_star>Daisy Ridley</big_star>
    </top_grosser>
```
Hint: You should use the contains function when checking if an ID is in a collection of IDREFS.

Problems 4-6: Processing distributed data using MapReduce

35 points total

Overview and format of the data

In this assignment, you will write three Java programs to run MapReduce jobs using Apache Hadoop.

Because it can be challenging to install and run Apache Hadoop locally, you will instead be running and testing your programs on Gradescope after eliminating any syntax errors in your code. You will NOT be running them on your own machine.

The Autograder will run Hadoop in local mode, which simulates a distributed cluster. However, the programs that you construct could also be run on a real cluster without modification – and they could handle much more data than we will be using in our testing!

Our problem domain is a social network database that contains information about users and their connections to other users in the network.

The data is stored in one or more plain-text files, where each line of a given file represents a single user and looks something like this:

18,Brown,Matthew,1989-11-05,mbrown@gmail.com,189,305,17,31;569121,235708,32087,188745,549575

From left to right, each line contains the following fields:

the user’s unique ID
the user’s last name
the user’s first name
the user’s birthdate
optionally, the user’s email address
optionally, a list of group IDs, separated by commas, specifying the groups the user belongs to
optionally, a list of user IDs, separated by commas, referring to the user’s friends

The fields are comma-separated, with the exception of a semicolon that appears before the friend list. However, if a user has no friends, the line will not contain a semicolon.

Here’s a subset of lines from the largest input file (users200k.txt), showing a number of different (though not all!) variations:

65,Lewis,James,1965-12-26,jlewis1965@hotmail.com,257,7,227;911255,176121,554966,671982,775492,834730,948609
80,Lawson,Emerald,1997-10-01,270,144,368,201,484,8;40146,44545,285685,734547,861038
201,White,Alexander,1958-08-19,awhite1958@yahoo.com;979442,33777,416988,823482,920887,929242

Getting started

You should begin by downloading the following zip file:
ps4_mapreduce.zip
Unzip/extract the contents of the file.
Depending on your system, after extracting the contents you will either have:
- a folder named ps4_mapreduce that contains all of the files that you need for the map-reduce problems
- an outer folder called ps4_mapreduce that contains an inner folder named ps4_mapreduce that contains all of the Java files that you need.
Take the ps4_mapreduce folder that actually contains the necessary files and drag it into your ps4 folder so that you can easily find and open it from within VS Code.
Launch VS Code on your laptop.
In VS Code, select the File->Open Folder or File->Open menu option, and use the resulting dialog box to find and open the problem6 folder that you created above – the one that contains the provided files. (Note: You must open the folder; it is not sufficient to simply open one of the Java files in the folder.)

The name of the folder should appear in the Explorer pane on the left-hand side of the VS Code window, along with a list of all of its contents.
Review the provided files. We have given you:
- WordCount.java - a sample MapReduce program that you are welcome to use as a template for your programs
- starter code for each of the problems (Problem4.java, Problem5.java, etc.)
- data - a folder containing sample input data files for the problems you will solve. They include users20.txt, which we use by default when testing your programs on Gradescope (see below).

IMPORTANT: You should NOT try to run any of these programs on your own machine! Rather, you should follow the procedures outlined below.

Compiling and testing your programs

As mentioned above, we will be running and testing your programs on Gradescope instead of on your own machine.

However, you should use still the compiler in VS Code to eliminate any syntax errors in your programs before you try to test them on Gradescope.

Finding and eliminating syntax errors

Click on the name of program in the Explorer pane to open it in the Editor window.
Open a Terminal window in VS Code, and click on the Problems tab at the top of the Terminal window.
Locate any error messages for the program you’re working on. They begin with a red circle with an X in it. Clicking on the error message will show you the corresponding line from your code.

Note: You may see some warning messages, which begin with a yellow triangle symbol that contains an exclamation mark (!). These can be safely ignored.
Make whatever changes are needed to eliminate all of the error messages.
Remember: do NOT try to run your code on your own machine!

Testing and debugging

Once you have eliminated the syntax errors, save the program in VS Code.
Upload the program to Gradescope. Each problem has its own Gradescope page, and you have two options that you can use for testing:
1. Upload just the program itself (i.e., the Java file). Doing so will run the program on a small input file and compare the results to the expected results.
2. Upload the program and one or more input files like the ones we give you in the data folder. Doing so will run the program on those input files and show you the results, but it will not assess their accuracy. By enabling you to create and use your own input files, this testing option will allow you to ensure that your code handles edge cases that may not be present in the default input file.
Review the results displayed in Gradescope.
Fix any logic errors and resubmit until you obtain the correct results. Important: To assist you in debugging, you can add temporary println statements to your code. If you do so, Gradescope will include the output of those statements as part of the results that it displays.

Important guidelines

Employ good programming style. Use appropriate indentation, select descriptive variable names, localize variables, insert blank lines between logical parts of your program, and add comments as necessary to explain what your code does.
Your nested classes for mappers and reducers should extend the built-in Mapper and Reducer classes, as discussed in lecture and shown in WordCount.java. Your mapper classes will override the inherited map() method, and your reducer classes will override the inherited reduce() method. You should not override or make use of the other inherited methods of the Mapper and Reducer classes.
You are welcome to include additional helper methods in your classes as needed, although doing so is not required. You may also include additional helper classes for a given problem, but they should go in the same file as the rest of the code for that problem, and they should also be nested static classes.
You will need to determine how to correctly parse each line in the input file. In particular, you will need to account for the optional fields in a given line. We encourage you to use the following methods as needed:
- the String method called split() that we discussed in lecture
- other String methods like indexOf() and substring(); however, you should not use anything that isn’t available in Java 8
- the following static methods for converting from a Java String object that is the string representation of a number to the corresponding numeric value:
  - Integer.parseInt()
  - Long.parseLong()
You may assume that the birthdate and email fields are well-formed; that is, the birthdate will always be of the form YYYY-MM-DD, where YYYY is the four-digit birth year, MM is the two-digit month, and DD is the two-digit day. The email, if present, will contain an occurence of the @ character to separate the user name from the domain.
You should limit yourself to the packages that we’ve imported at the top of each starter file. You must not use classes from any other Java package.
You should only use features of Java that were available in Java 8.
You should not use any global variables (i.e., static class variables) in your programs. Class constants (i.e., static final variables) are fine if needed.
Feel free to use these resources as needed:
- the full Java 8 API
- the Hadoop MapReduce API

Problem 4: Email domains

11 points

Write a MapReduce program to find the number of users for each email-address domain. The output of the program should be (key, value) pairs in which the key is an email-address domain and the value is the number of users that have addresses from that domain. For example:

icloud.com   500
gmail.com  250
...

(although you will probably get different numbers than the ones shown above!)

Notes:

In Problem4.java, we have given you a template for your code. You only need to implement the bodies of the map and reduce methods.
To allow for large numbers (i.e., long integers) in the final results, we’ve specified that the values output by the reducer should be of type LongWritable.
Use the birth-month counter program from lecture as a model for what you should do.
Don’t forget that not all user records include an email address. However, you may assume that at least one of the input records includes an email address.
You may assume that all email addresses include an @ symbol. If a given user doesn’t have an email address, the map method can simply return without writing anything.

Problem 5: Email domain with the most users

12 points

Write a MapReduce program to find the email domain with the most users. The final output of the program should be a single (key, value) pair representing the email domain with the most users and the number of users that it has. If there are multiple email domains that are tied for the most users, your program may report any one of them.

Notes:

You will need a chain of two MapReduce jobs for this problem. We discussed how to do this in lecture.
The second job will process the results of the first job. Because those results are stored in a text file, the key-value pairs passed into the second map function will have the following format:
- The key will be a file-offset value that you should ignore (just as any other map function that processes data stored in a text file should ignore its keys).
- The value will be a Text value consisting of one line from the results file produced by the first job. In other words, it will contain one of the key-value pairs written by the first job’s reduce function, with a single tab character ("\t") between the key and the value.
In the second job, the map function should ensure that all of the (key, value) pairs that it outputs have the same constant key. You may either use a string or an integer for this purpose. Using a constant key will ensure that all of these pairs go to a single reducer task, which will then be able to determine which email domain has the largest number of users.
We have given you templates for the two sets of mapper and reducer classes, but they are more limited than what we provided for the previous problem. In particular:
- The extends clauses for the mapper and reducer classes look like this:
```
extends <Object, Object, Object, Object>
```
  As needed, you should replace a given occurrence of Object with the class for the appropriate Hadoop data type. See the lecture notes for a reminder of the purpose of each of the four types.
- We haven’t given you headers for the map and reduce methods. Make sure that you use the appropriate types for the parameters of these methods, and that you include the same throws clauses that we used for those methods in the previous problem.
- In the main method, you should modify as needed the lines that specify the classes for the output keys and output values. You should not change the other lines in this method.

Problem 6: User with the most friends

12 points

Write a MapReduce program to find the user with the most friends. The final output of the program should be a single (key, value) pair representing the ID of the user with the most friends and the number of friends that this user has. If there are multiple users with the most friends, your program may report any one of them.

Unlike the previous problem, we only need a single map-reduce job, rather than a chain of two jobs. That’s because a given user has only one record in the data files, and thus the initial mapper can determine each user’s total number of friends on its own.

Notes:

You will need to take appropriate steps to ensure that all of key-value pairs output by the mapper tasks are processed by a single reducer task, which will then be able to determine which user has the largest number of friends.
As in the previous problem, you will need to select and fill in the appropriate classes in the extends clauses for the two classes and in the headers of the map and reduce methods.
In the main method, you should modify as needed the lines that specify the classes for the output keys and output values. You should not change the other lines in this method.
Don’t forget that not all user records include a list of friends. However, in order to handle input files in which no one has any friends, your map method should still write something for every user. That way, the reduce method will still receive some input, and it can conclude that the maximum number of friends is 0!
You may assume that long integers are not needed for the counts of the number of friends.

Submitting your work for Part II

You should submit only the following four files:

ps4_queries.py (for Problem 3)
Problem4.java
Problem5.java
Problem6.java

Each problem should be submitted to its own Gradescope page using the following steps:

Login to Gradescope and click on the box for CSCI E-66.
Click on the appropriate problem in the list of assignments. You should see a pop-up window with a box labeled DRAG & DROP. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Add your files to the box labeled DRAG & DROP. You can either drag and drop the files from their folder into the box, or you can click on the box itself and browse for the files.
Click the Upload button.
You should see a box saying that your submission was successful. Click the (x) button to close that box.
The Autograder will perform some tests. Once it is done, check the results to ensure that the tests were passed. If one or more of the tests did not pass, the name of that test will be in red, and there should be a message describing the failure. Based on those messages, make any necessary changes. Feel free to ask a staff member for help.

Notes:
- For Problem 3, you should see test results for each query. If you don’t see any results for a given query, it probably means that you have a syntax or logic error in your query, and you should attempt to fix it and resubmit.
- For Problem 3, you should keep making changes as needed until you get full credit for a given query. There will be no partial credit awarded for an incorrect query.
- For Problems 4-6, passing the preliminary tests does not guarantee that you will ultimately get full credit for the problem. Additional tests will be run later, and you should perform your own testing following the procedures outlined above to ensure that your code works correctly in all cases.
  
  In particular, don’t forget that you can upload your own input file(s) when you submit your program. We recommend that you try input files that include instances of special cases (e.g., users without email addresses or friend lists). To do so, you can use one or more of the text files that we have provided in the data subfolder. Either use one of those files in its entirety, or create a new file containing a subset of lines from those files so that you can focus on particular types of cases.
If needed, use the Resubmit button at the bottom of the page to resubmit your work. Important: Every time that you make a submission, you should submit all of the files for that Gradescope assignment, even if some of them have not changed since your last submission.
Near the top of the page, click on the box labeled Code. Then click on the name of each file to view its contents. Check to make sure that the files contain the code that you want us to grade.

Important

It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough time to do so, wait an hour or two and then try again. If you are unable to submit and it is close to the deadline, email your homework before the deadline to esaracin@bu.edu

Last updated on November 20, 2024.