Due before the start of lecture on November 26, 2024. See below for a summary of the policies regarding late submissions.
Homework is due prior to the start of lecture. If it is submitted more than 10 minutes after the start of lecture, it will be considered late. There will be a 10% deduction for submissions made any time in the following seven days, up to the start of the next lecture. We will not accept any homework that is more than 7 days late. Plan your time carefully, and don’t wait until the last minute to begin an assignment. Starting early will give you ample time to ask questions and obtain assistance.
In your work on this assignment, make sure to abide by our policies on academic conduct.
If you have questions while working on this assignment, please attend office hours or post them on Ed Discussion.
40 points total
Create a subfolder called ps4
within your
e66
folder, and put all of the files for this assignment in that folder.
This part of the assignment will all be completed in a single PDF file. To create it, you should do the following:
Access the template that we have created by clicking on this link and signing into your Google account as needed.
When asked, click on the Make a copy button, which will save a copy of the template file to your Google Drive.
Select File->Rename, and change the name of the file to
ps4_partI
.
Add your work for all of the problems from Part I to this file.
Once you have completed Part I, choose File->Download->PDF document,
and save the PDF file on your machine.
The resulting PDF file (ps4_partI.pdf
) is the one that you will
submit. See the submission guidelines at the end of Part I.
30 points total
Assume that a database is replicated across 11 different sites.
If the system uses fully distributed locking, which of the following voting schemes (if any) would work? Explain briefly why each scheme would or would not work.
Now assume that the system uses primary-copy locking. Which of the following voting schemes (if any) would work? Explain briefly why each scheme would or would not work.
If your workload primarily involves reads, which of the above configurations (1a, 1b, 1c, 1d, 2a, 2b, 2c, or 2d) would be the best one to choose? Explain briefly why you think it would be best, and make sure to also mention any potential drawbacks that it could have.
If your workload primarily involves writes, which of the above configurations (1a, 1b, 1c, 1d, 2a, 2b, 2c, or 2d) would be the best one to choose? Explain briefly why you think it would be best, and make sure to also mention any potential drawbacks that it could have.
10 points
In lecture, we showed how a distributed DBMS could create global exclusive locks and global shared locks by acquiring local locks for an appropriate number of replicas.
Earlier in the semester, we discussed a third type of lock that is sometimes used in a centralized DBMS: an update lock, which can be acquired for the read portion of a read-modify-write transaction. What if a distributed DBMS wanted to add this type of lock as well? Explain how it could create global exclusive, global shared, and global update locks by acquiring appropriate numbers of local exclusive, local shared, and local update locks. The global locks should have the same semantics as the equivalent locks in a centralized database in terms of when a lock can be acquired and how many transactions can hold a given type of lock at the same time.
Notes:
The rules that we provided for global exclusive locks and global shared locks in lecture were for a system that does not use global update locks. In this problem, you need to provide the rules for global exclusive, global shared, and global update locks in a system that does use update locks. In other words, you will need a rule (i.e., an inequality) and a corresponding explanation for each type of global lock.
Your inequalities should make use of the following variables:
x = the number of local exclusive locks needed to acquire a global exclusive lock
s = number of local shared locks needed to acquire a global shared lock
u = number of local update locks needed to acquire a global update lock
In addition, you can use arithmetic operators (e.g., subtraction)
and the functions min
and max
.
When it comes to global exclusive locks and global shared locks, their rules may be the same as the ones we presented in lecture for a system without update locks, or they may not be. In other words, you may or may not need to adjust the rules for global exclusive locks and/or global shared locks in light of the fact that you are now dealing with a system that has global update locks.
Login to Gradescope by clicking the link in the left-hand navigation bar. Once you are in logged in, click on the box for CSCI E-66.
Submit your ps4_partI.pdf
file using these steps:
If you still need to create the PDF file, open your file on Google Drive, choose File->Download->PDF document, and save the PDF file on your machine.
Click on the name PS 4: Part I in the list of assignments. You should see a pop-up window labeled Submit Assignment. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Choose the Submit PDF option, and then click the Select PDF
button and find the ps4_partI.pdf
that you created in step 1.
Then click the Upload PDF button.
You should see an outline of the problems along with thumbnails of the pages from your uploaded PDF. For each problem in the outline:
As you do so, click on the magnifying glass icon for each page and doublecheck that the pages that you see contain the work that you want us to grade.
Once you have assigned pages to all of the problems in the question outline, click the Submit button in the lower-right corner of the window.
You should see a box saying that your submission was successful.
Click the (x)
button to close that box.
You can use the Resubmit button at the bottom of the page to resubmit your work as many times as needed before the final deadline.
Important
It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough
time to do so, wait an hour or two and then try again. If you
are unable to submit and it is close to the deadline, email
your homework before the deadline to
esaracin@bu.edu
60 points total
25 points
This problem asks you to write some additional XPath and XQuery queries for the XML version of our movie database that you worked with in Problem Set 3.
You should again use BaseX to compose and check your queries, and you should follow the same guidelines as the ones we gave you in PS 3.
Your queries should go in the following file:
ps4_queries.py
Put this file in your ps4
folder. If your browser doesn’t allow you
to specify where the file should be saved, try right-clicking on the
link above and choosing Save as... or Save link as..., which
should produce a dialog box that allows you to choose the correct
folder for the file.
Write a standalone XPath expression (not a FLWOR
expression) to find the names of all PG-13 movies in the database
with a runtime of greater than 180 minutes.
The result of your query should be a collection of name
elements with the names of these movies.
Now write an XQuery FLWOR expression to find the same movies as the previous query, but this time you should:
generate new elements of type long_pg13
whose values
consist of the name of the movie, followed by a space,
followed by the movie’s runtime in parentheses
using the format shown below. For example:
<long_pg13>Titanic (194 minutes)</long_pg13>
Note that we include the word minutes
in the parentheses.
order the results in order of increasing runtime.
Find all people who have won an Oscar two years in a row. (The
types of the two Oscars don’t matter; all we care about is that
they were won by the same person in consecutive years.) The result
of your query should be one or more new elements of type
back_to_back
that consist of the following:
the name
child element of the person
new child elements called first_win
and second_win
whose values consist of the type and year of the first Oscar and
second Oscar respectively. In both cases, there should be a
single space between the type and year and
parentheses surrounding the year.
For example:
<back_to_back> <name>Alejandro Gonzalez Inarritu</name> <first_win>BEST-DIRECTOR (2015)</first_win> <second_win>BEST-DIRECTOR (2016)</second_win> </back_to_back>
Hints:
You don’t need to count anything here! Rather, your for
clause should include two different components that each
obtain information about Oscars, and you should test to see if
the current combination of Oscars involve the same person and
are from consecutive years.
Predicates can include arithmetic operators, and this should allow you to ensure that the Oscars are from consecutive years.
For each of the top 10 grossing movies (i.e., movies with
earnings ranks between 1 and 10), construct a new element of
type top_grosser
that includes the following:
name
and earnings_rank
child elements of the moviebig_star
that have as their
value the name of one of the actors associated with that
movie in the database.For example:
<top_grosser> <name>Star Wars: The Force Awakens</name> <earnings_rank>1</earnings_rank> <big_star>Harrison Ford</big_star> <big_star>Mark Hamill</big_star> <big_star>Carrie Fisher</big_star> <big_star>Adam Driver</big_star> <big_star>Daisy Ridley</big_star> </top_grosser>
Hint: You should use the contains
function when
checking if an ID is in a collection of IDREFS.
35 points total
In this assignment, you will write three Java programs to run MapReduce jobs using Apache Hadoop.
Because it can be challenging to install and run Apache Hadoop locally, you will instead be running and testing your programs on Gradescope after eliminating any syntax errors in your code. You will NOT be running them on your own machine.
The Autograder will run Hadoop in local mode, which simulates a distributed cluster. However, the programs that you construct could also be run on a real cluster without modification – and they could handle much more data than we will be using in our testing!
Our problem domain is a social network database that contains information about users and their connections to other users in the network.
The data is stored in one or more plain-text files, where each line of a given file represents a single user and looks something like this:
18,Brown,Matthew,1989-11-05,mbrown@gmail.com,189,305,17,31;569121,235708,32087,188745,549575
From left to right, each line contains the following fields:
The fields are comma-separated, with the exception of a semicolon that appears before the friend list. However, if a user has no friends, the line will not contain a semicolon.
Here’s a subset of lines from the largest input file (users200k.txt
),
showing a number of different (though not all!) variations:
65,Lewis,James,1965-12-26,jlewis1965@hotmail.com,257,7,227;911255,176121,554966,671982,775492,834730,948609 80,Lawson,Emerald,1997-10-01,270,144,368,201,484,8;40146,44545,285685,734547,861038 201,White,Alexander,1958-08-19,awhite1958@yahoo.com;979442,33777,416988,823482,920887,929242
You should begin by downloading the following zip file:
ps4_mapreduce.zip
Unzip/extract the contents of the file.
Depending on your system, after extracting the contents you will either have:
a folder named ps4_mapreduce
that contains all of the files that
you need for the map-reduce problems
an outer folder called ps4_mapreduce
that contains an inner
folder named ps4_mapreduce
that contains all of the Java files that
you need.
Take the ps4_mapreduce
folder that actually contains the necessary
files and drag it into your ps4
folder so that you can easily find
and open it from within VS Code.
Launch VS Code on your laptop.
In VS Code, select the File->Open Folder or File->Open menu
option, and use the resulting dialog box to find and open the
problem6
folder that you created above – the one that contains the
provided files. (Note: You must open the folder; it is not
sufficient to simply open one of the Java files in the folder.)
The name of the folder should appear in the Explorer pane on the left-hand side of the VS Code window, along with a list of all of its contents.
Review the provided files. We have given you:
WordCount.java
- a sample MapReduce program that you are welcome
to use as a template for your programs
starter code for each of the problems (Problem4.java
,
Problem5.java
, etc.)
data
- a folder containing sample input data files for the
problems you will solve. They include users20.txt
, which we
use by default when testing your programs on Gradescope (see
below).
IMPORTANT: You should NOT try to run any of these programs on your own machine! Rather, you should follow the procedures outlined below.
As mentioned above, we will be running and testing your programs on Gradescope instead of on your own machine.
However, you should use still the compiler in VS Code to eliminate any syntax errors in your programs before you try to test them on Gradescope.
Finding and eliminating syntax errors
Click on the name of program in the Explorer pane to open it in the Editor window.
Open a Terminal window in VS Code, and click on the Problems tab at the top of the Terminal window.
Locate any error messages for the program you’re working on. They begin with a red circle with an X in it. Clicking on the error message will show you the corresponding line from your code.
Note: You may see some warning messages, which begin with a
yellow triangle symbol that contains an exclamation mark (!
).
These can be safely ignored.
Make whatever changes are needed to eliminate all of the error messages.
Remember: do NOT try to run your code on your own machine!
Testing and debugging
Once you have eliminated the syntax errors, save the program in VS Code.
Upload the program to Gradescope. Each problem has its own Gradescope page, and you have two options that you can use for testing:
Upload just the program itself (i.e., the Java file). Doing so will run the program on a small input file and compare the results to the expected results.
Upload the program and one or more input files like the ones
we give you in the data
folder. Doing so will run the
program on those input files and show you the results, but it
will not assess their accuracy. By enabling you to create and
use your own input files, this testing option will allow you
to ensure that your code handles edge cases that may not be
present in the default input file.
Review the results displayed in Gradescope.
Fix any logic errors and resubmit until you obtain the correct
results. Important: To assist you in debugging, you can add
temporary println
statements to your code. If you do so,
Gradescope will include the output of those statements as part of
the results that it displays.
Employ good programming style. Use appropriate indentation, select descriptive variable names, localize variables, insert blank lines between logical parts of your program, and add comments as necessary to explain what your code does.
Your nested classes for mappers and reducers should extend the
built-in Mapper
and Reducer
classes, as discussed in lecture
and shown in WordCount.java
. Your mapper classes will override
the inherited map()
method, and your reducer classes will override
the inherited reduce()
method. You should not override or make
use of the other inherited methods of the Mapper
and Reducer
classes.
You are welcome to include additional helper methods in your classes as needed, although doing so is not required. You may also include additional helper classes for a given problem, but they should go in the same file as the rest of the code for that problem, and they should also be nested static classes.
You will need to determine how to correctly parse each line in the input file. In particular, you will need to account for the optional fields in a given line. We encourage you to use the following methods as needed:
the String
method called split()
that we discussed in lecture
other String
methods like indexOf()
and substring()
; however, you should not use anything that isn’t
available in Java 8
the following static methods for converting from a Java
String
object that is the string representation of a number
to the corresponding numeric value:
Integer.parseInt()
Long.parseLong()
You may assume that the birthdate and email fields are
well-formed; that is, the birthdate will always be of the form
YYYY-MM-DD, where YYYY is the four-digit birth year, MM is the
two-digit month, and DD is the two-digit day. The email, if present,
will contain an occurence of the @
character to separate the user
name from the domain.
You should limit yourself to the packages that we’ve imported at the top of each starter file. You must not use classes from any other Java package.
You should only use features of Java that were available in Java 8.
You should not use any global variables (i.e., static class variables) in your programs. Class constants (i.e., static final variables) are fine if needed.
Feel free to use these resources as needed:
11 points
Write a MapReduce program to find the number of users for each email-address domain. The output of the program should be (key, value) pairs in which the key is an email-address domain and the value is the number of users that have addresses from that domain. For example:
icloud.com 500 gmail.com 250 ...
(although you will probably get different numbers than the ones shown above!)
Notes:
In Problem4.java
, we have given you a template for your code.
You only need to implement the bodies of the map
and reduce
methods.
To allow for large numbers (i.e., long integers) in the final
results, we’ve specified that the values output by the reducer
should be of type LongWritable
.
Use the birth-month counter program from lecture as a model for what you should do.
Don’t forget that not all user records include an email address. However, you may assume that at least one of the input records includes an email address.
You may assume that all email addresses include an @
symbol.
If a given user doesn’t have an email address, the map
method can
simply return without writing anything.
12 points
Write a MapReduce program to find the email domain with the most users. The final output of the program should be a single (key, value) pair representing the email domain with the most users and the number of users that it has. If there are multiple email domains that are tied for the most users, your program may report any one of them.
Notes:
You will need a chain of two MapReduce jobs for this problem. We discussed how to do this in lecture.
The second job will process the results of the first job. Because those
results are stored in a text file, the key-value pairs passed into the
second map
function will have the following format:
The key will be a file-offset value that you should ignore
(just as any other map
function that processes data stored in
a text file should ignore its keys).
The value will be a Text
value consisting of one line from
the results file produced by the first job. In other words, it
will contain one of the key-value pairs written by the first
job’s reduce
function, with a single tab character ("\t"
)
between the key and the value.
In the second job, the map
function should ensure that all of the
(key, value) pairs that it outputs have the same constant key.
You may either use a string or an integer for this purpose. Using
a constant key will ensure that all of these pairs go to a
single reducer task, which will then be able to determine which
email domain has the largest number of users.
We have given you templates for the two sets of mapper and reducer classes, but they are more limited than what we provided for the previous problem. In particular:
The extends
clauses for the mapper and reducer classes
look like this:
extends <Object, Object, Object, Object>
As needed, you should replace a given occurrence of Object
with the class for the appropriate Hadoop data type. See the
lecture notes for a reminder of the purpose of each of the
four types.
We haven’t given you headers for the map
and reduce
methods. Make sure that you use the appropriate
types for the parameters of these methods, and that you
include the same throws
clauses that we used for those
methods in the previous problem.
In the main
method, you should modify as needed the lines
that specify the classes for the output keys and output
values. You should not change the other lines in this
method.
12 points
Write a MapReduce program to find the user with the most friends. The final output of the program should be a single (key, value) pair representing the ID of the user with the most friends and the number of friends that this user has. If there are multiple users with the most friends, your program may report any one of them.
Unlike the previous problem, we only need a single map-reduce job, rather than a chain of two jobs. That’s because a given user has only one record in the data files, and thus the initial mapper can determine each user’s total number of friends on its own.
Notes:
You will need to take appropriate steps to ensure that all of key-value pairs output by the mapper tasks are processed by a single reducer task, which will then be able to determine which user has the largest number of friends.
As in the previous problem, you will need to select and fill in
the appropriate classes in the extends
clauses for the two
classes and in the headers of the map
and reduce
methods.
In the main
method, you should modify as needed the lines
that specify the classes for the output keys and output
values. You should not change the other lines in this
method.
Don’t forget that not all user records include a list of friends.
However, in order to handle input files in which no one has
any friends, your map
method should still write something
for every user. That way, the reduce
method will still
receive some input, and it can conclude that the maximum number
of friends is 0!
You may assume that long integers are not needed for the counts of the number of friends.
You should submit only the following four files:
ps4_queries.py
(for Problem 3)Problem4.java
Problem5.java
Problem6.java
Each problem should be submitted to its own Gradescope page using the following steps:
Login to Gradescope and click on the box for CSCI E-66.
Click on the appropriate problem in the list of assignments. You should see a pop-up window with a box labeled DRAG & DROP. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Add your files to the box labeled DRAG & DROP. You can either drag and drop the files from their folder into the box, or you can click on the box itself and browse for the files.
Click the Upload button.
You should see a box saying that your submission was successful.
Click the (x)
button to close that box.
The Autograder will perform some tests. Once it is done, check the results to ensure that the tests were passed. If one or more of the tests did not pass, the name of that test will be in red, and there should be a message describing the failure. Based on those messages, make any necessary changes. Feel free to ask a staff member for help.
Notes:
For Problem 3, you should see test results for each query. If you don’t see any results for a given query, it probably means that you have a syntax or logic error in your query, and you should attempt to fix it and resubmit.
For Problem 3, you should keep making changes as needed until you get full credit for a given query. There will be no partial credit awarded for an incorrect query.
For Problems 4-6, passing the preliminary tests does not guarantee that you will ultimately get full credit for the problem. Additional tests will be run later, and you should perform your own testing following the procedures outlined above to ensure that your code works correctly in all cases.
In particular, don’t forget that you can upload your own input
file(s) when you submit your program. We recommend that you
try input files that include instances of special
cases (e.g., users without email addresses or friend lists).
To do so, you can use one or more of the text files that
we have provided in the data
subfolder. Either use one of
those files in its entirety, or create a new file containing
a subset of lines from those files so that you can focus on
particular types of cases.
If needed, use the Resubmit button at the bottom of the page to resubmit your work. Important: Every time that you make a submission, you should submit all of the files for that Gradescope assignment, even if some of them have not changed since your last submission.
Near the top of the page, click on the box labeled Code. Then click on the name of each file to view its contents. Check to make sure that the files contain the code that you want us to grade.
Important
It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough
time to do so, wait an hour or two and then try again. If you
are unable to submit and it is close to the deadline, email
your homework before the deadline to
esaracin@bu.edu
Last updated on November 20, 2024.