Due at 11:59 PM on Sunday, December 15, 2024. No late submissions will be accepted.
Note: There are no extra grad-credit problems for this problem set.
For this assignment, we will not accept any homework after Sunday, December 15, so that we can post solutions before the final exam. Plan your time carefully, and don’t wait until the last minute to begin an assignment. Starting early will give you ample time to ask questions and obtain assistance.
In your work on this assignment, make sure to abide by our policies on academic conduct.
If you have questions while working on this assignment, please attend office hours or post them on Ed Discussion.
50 points total
Create a subfolder called ps5
within your
e66
folder, and put all of the files for this assignment in that folder.
This part of the assignment will all be completed in a single PDF file. To create it, you should do the following:
Access the template that we have created by clicking on this link and signing into your Google account as needed.
When asked, click on the Make a copy button, which will save a copy of the template file to your Google Drive.
Select File->Rename, and change the name of the file to
ps5_partI
.
Add your work for all of the problems from Part I to this file.
Once you have completed Part I, choose File->Download->PDF document,
and save the PDF file on your machine.
The resulting PDF file (ps5_partI.pdf
) is the one that you will
submit. See the submission guidelines at the end of Part I.
20 points total; 4 points each part
Consider the following CREATE TABLE
commands for a portion of
a relational database maintained by a bookseller:
CREATE TABLE Author(id CHAR(5) PRIMARY KEY, name VARCHAR(64), dob DATE);

CREATE TABLE Book(isbn CHAR(13) PRIMARY KEY, title VARCHAR(128), publisher VARCHAR(64), num_pages INTEGER, genre VARCHAR(10), numInStock INTEGER);

CREATE TABLE Wrote(authorID CHAR(5), book CHAR(13),
    PRIMARY KEY (authorID, book),
    FOREIGN KEY (authorID) REFERENCES Author(id),
    FOREIGN KEY (book) REFERENCES Book(isbn));

CREATE TABLE Sales(date VARCHAR(10), time VARCHAR(5), book CHAR(13), numSold INTEGER,
    PRIMARY KEY (date, time, book),
    FOREIGN KEY (book) REFERENCES Book(isbn));
The first two tables store information about authors and books.
The third table (Wrote
) captures relationships between authors and books.
The fourth table (Sales
), combined with the numInStock
attribute
in Book
, allows the bookseller to keep track of sales and inventory.
You have been asked to create a database that stores this same information in MongoDB.
There are two types of relationships that the database needs to capture: those between authors and books, and those between books and their associated sales data.
One way to capture these relationships would be to take a purely reference-based approach that avoids embedding one type of document in another. Illustrate this approach by showing how the following tuples from the relational database would be represented as one or more documents in MongoDB:
from Author: ('44777', 'Ina Garten', '1948-02-02')
from Book: ('9781984822789', 'Go-To Dinners: A Barefoot Contessa Cookbook', 'Clarkson Potter', 256, 'cookbook', 30)
from Wrote: ('44777', '9781984822789')
from Sales: ('2022-11-22', '11:00', '9781984822789', 1)
Guidelines:
You will need to determine how many documents are needed, how many references are needed, and where the references should go. Make sure to take into account the factors relevant to data modeling that we discussed in lecture.
Your answer should take into account both sets of relationships that the database will need to capture, but the only concrete relationship(s) that your answer needs to include are those described by the tuples above.
We encourage you to review the questions related to this problem in the PS 5 FAQ.
Briefly describe one advantage and one disadvantage of the approach to data modeling taken in part 1.
Another way to capture the relationships would be to take an approach that allows you to embed one type of document in another. Illustrate this approach by showing how the tuples from part 1 would be captured as one or more documents in MongoDB. The guidelines from part 1 also apply here.
Briefly describe one advantage and one disadvantage of the approach to data modeling taken in part 3.
Consider the updates that need to happen when the bookseller attempts to sell one or more copies of a given book. Discuss the appropriateness of the data-modeling approaches from parts 1 and 3 in light of this particular use of the database.
18 points total
Consider the following sequence of log records written by a system that uses undo-redo logging:
| LSN | record contents |
| --- | --- |
| 0 | txn: 1; BEGIN |
| 10 | txn: 1; item: A; old: 100; new: 110; olsn: 0 |
| 20 | txn: 2; BEGIN |
| 30 | txn: 2; item: B; old: 200; new: 210; olsn: 0 |
| 40 | txn: 1; item: C; old: 300; new: 310; olsn: 0 |
| 50 | txn: 1; item: D; old: 400; new: 410; olsn: 0 |
| 60 | txn: 2; COMMIT |
| 70 | txn: 1; item: B; old: 210; new: 220; olsn: 30 |
| 80 | txn: 1; item: D; old: 410; new: 420; olsn: 50 |
(7 points) If a crash occurs and log record 80 is the last one
to make it to disk, what steps would be performed during recovery
if the system is performing undo-redo logging and the
on-disk datum LSNs are not consulted? (In other words, you
should assume that the system is not performing logical
logging, and thus you don’t need to worry about redoing or
undoing a change unnecessarily.) Complete the table provided in
ps5_partI
to show how each log record would be handled during
both the backward and forward passes.
(7 points) If a crash occurs and log record 80 is the last one to make it to disk, what steps would be performed during recovery if the system is performing undo-redo logging and the on-disk datum LSNs are consulted (i.e., the system is performing logical logging)?
Complete the table provided in ps5_partI
to show how each log
record would be handled during both the backward and forward
passes. You should assume that the datum LSNs at the start of
recovery are the following:
In addition, you should assume that the recovery subsystem does not perform any actions that the LSNs indicate are unnecessary.
(4 points) Now assume that a dynamic checkpoint had occurred between log records 30 and 40.
a. How (if at all) would that change which log records are considered during the backward pass? Explain briefly.
b. How (if at all) would that change which log records are considered during the forward pass? Explain briefly.
12 points total
Consider again the following sequence of log records:
| LSN | record contents |
| --- | --- |
| 0 | txn: 1; BEGIN |
| 10 | txn: 1; item: A; old: 100; new: 110; olsn: 0 |
| 20 | txn: 2; BEGIN |
| 30 | txn: 2; item: B; old: 200; new: 210; olsn: 0 |
| 40 | txn: 1; item: C; old: 300; new: 310; olsn: 0 |
| 50 | txn: 1; item: D; old: 400; new: 410; olsn: 0 |
| 60 | txn: 2; COMMIT |
| 70 | txn: 1; item: B; old: 210; new: 220; olsn: 30 |
| 80 | txn: 1; item: D; old: 410; new: 420; olsn: 50 |
(4 points) This log was created by a system that uses undo-redo logging. If a crash occurs and log record 80 is the last one to make it to disk, what are all possible on-disk values of each of the data items (A, B, C, and D) after the crash but before recovery?
(4 points) How would your answer to part 1 change if the system were using redo-only logging instead of undo-redo? Briefly explain the reason for any changes. (Note: You should assume that no two of the data items (A, B, C, D) are stored on the same page.)
(4 points) How would your answer to part 1 change if the system were using undo-only logging? Briefly explain the reason for any changes. (Here again, you should assume that no two of the data items are stored on the same page. In addition, you should assume that when the DBMS forces dirty database pages to disk, it forces only those pages that must go to disk in order for undo-only logging to work correctly.)
Once you have completed Part I in Google Drive, choose
File->Download->PDF document, and save the resulting file
(ps5_partI.pdf
) on your machine.
Log in to Gradescope and click on the box for CSCI E-66.
Click on the name PS 5: Part I in the list of assignments. You should see a pop-up window labeled Submit Assignment. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Choose the Submit PDF option, and then click the Select PDF
button and find the ps5_partI.pdf
that you created in step 1.
Then click the Upload PDF button.
You should see an outline of the problems along with thumbnails of the pages from your uploaded PDF. For each problem in the outline, click the thumbnail(s) of the page(s) that contain your work for that problem.
As you do so, click on the magnifying glass icon for each page and double-check that the pages that you see contain the work that you want us to grade.
Once you have assigned pages to all of the problems in the question outline, click the Submit button in the lower-right corner of the window.
You should see a box saying that your submission was successful.
Click the (x)
button to close that box.
You can use the Resubmit button at the bottom of the page to resubmit your work as many times as needed before the final deadline.
Important
It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough
time to do so, wait an hour or two and then try again. If you
are unable to submit and it is close to the deadline, email
your homework before the deadline to
esaracin@bu.edu
50 points total
In this part of the assignment, you will write queries for a MongoDB version of our movie database – one that uses the data model outlined in lecture.
Click on each of the following links, and save the files in
your ps5
folder:
If the browser doesn’t allow you to choose where to download the files, right-click each link and use Save link as... or the equivalent option.
Follow the appropriate set of instructions to get everything installed and properly configured:
Make sure that you have downloaded the files listed above and installed and configured the software.
Launch MongoDB Compass as needed and click on the Connect
button to connect to the localhost
MongoDB deployment.
Click on the Databases tab, and then click Create database.
In the window that pops up, enter imdb as the database name and movies as the collection name, and click Create Database.
Choose the imdb
database from the list of databases. You should
see an empty collection called movies
.
Click Create collection and create a new collection with the
name people
.
Click Create collection again and create a new collection with the
name oscars
.
To add documents to the movies
collection:
In MongoDB Compass, select the collection from the list of
collections for imdb
.
Scroll down as needed to find the Import Data button and click it (NOT the Add Data one).
Click Select a file... and choose the JSON file for the
collection that you downloaded above (e.g., for movies
, the
file is called movies.json
).
Click Import and then DONE.
You should see that the collection now includes the documents that were found in the JSON file.
Repeat step 7 to add documents to the oscars
and
people
collections.
Once you have everything installed and configured, you can perform queries by taking the following steps:
Launch MongoDB Compass.
Click the Connect button to connect to the MongoDB server that should
be running on localhost
.
Click on the MONGOSH
bar at the bottom of the window for MongoDB
Compass. Doing so should expand a pane for the MongoDB shell, and
you should see the following command prompt:
test>
Enter the following command from the shell prompt:
use imdb
This should change the shell’s prompt to the following:
imdb>
Enter the appropriate method calls for your queries; a sample call that you can use to test your setup is sketched after these steps. If you simply enter the method call, you should automatically see up to the first 20 results.
If there are more than 20 results, you can enter the following command to continue iterating through them:
it
If you’re using a Mac, you should disable smart quotes, because they may lead to errors in MongoDB and in our testing. There are instructions for doing so here.
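For example, once the prompt reads imdb>, you could test your setup with a simple method call like the one below. This is only an illustrative sketch, not the answer to any of the problems; it assumes that the movies collection includes a year field, as in the data model from lecture.

db.movies.find({ year: 2000 })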
ps5_queries.py
is a Python file, so you could use a Python IDE
to edit it, but a regular text editor like TextEdit or Notepad++
would also be fine. However, if you use a text editor, you must
ensure that you save it as a plain-text
file.
Construct the MongoDB method calls needed to solve the problems given
below. Test each method call in MONGOSH
to make sure that it works.
Once you have finalized the method call for a given problem, copy
the call into your ps5_queries.py
file, putting it between
the triple quotes provided for that query’s variable. We have
included a sample query to show you what the format of your
answers should look like.
Each of the problems must be solved by means of a single query
(i.e., a single method call). The results of the query should
include only the requested information, with no extraneous fields.
In particular, you should exclude the _id
field from the results
unless the problem indicates otherwise (see the example sketch after these guidelines).
You do not need to worry about the order of the fields in the results, nor the places in which line breaks or spaces appear.
Unless the problem indicates otherwise, you may only use aspects of the MongoDB query language that we discussed in lecture.
Your queries should only use information provided in the problem itself. In addition, they should work for any MongoDB database that follows the schema that we discussed in lecture.
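For instance, a projection of the following general form keeps a single field while suppressing the _id field. (The collection and field names here are placeholders for illustration, not names from the movie database.)

db.example.find({}, { someField: 1, _id: 0 })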
Make sure to follow the guidelines above.
Write a query to find the names and runtimes of all movies in the database that have an R rating and a runtime of less than 100 minutes.
Ariana DeBose won an Oscar in 2012 for her performance in West Side Story. There are a number of movies in our database whose names include the word “Story”. Find the names and years of those movies. Your query should not make any assumptions about where the word “Story” appears in the name.
Write a query to find the names of all people who have won a supporting actor/actress Oscar from 2015 to the present. The result documents should include the name of the person and the year in which the award was given.
Hints:
Because the name of the person is part of a subdocument in the relevant documents, you will need to use dot notation for the name of the field used to obtain the name, and you will need to surround the field name with quotes. (A generic dot-notation sketch appears after these hints.)
The name of the person will also end up in a subdocument in the results of the query. For example, here is what one of the result documents should look like:
{ year: 2019, person: { name: 'Mahershala Ali' } }
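As a generic illustration of dot notation (the collection and field names below are made up and are not part of the movie database), a condition on a field nested inside a subdocument would look like this:

db.sample.find({ 'info.city': 'Boston' })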
Write a query to find the names and dates of birth of all actors in the database who were born in 1972.
Write a query to find the names of all directors who have directed a movie in the database in which Daniel Radcliffe acted. A given person’s name should appear at most once in the result of the query. You should use one of the single-purpose aggregation methods covered in lecture, not an aggregation pipeline.
Write a query to compute, for each movie rating, the number of movies with that rating and their average runtime. The final result documents should have three fields:
rating for the movie rating
avg_runtime for the average runtime of movies with that rating
movie_count for the number of movies with that rating.
For example:
{ avg_runtime: 124, movie_count: 5, rating: 'M' }
Note: Some movies do not have a rating, and you should see a document
for those movies in which the rating is null
.
Write a query to find the name and date of birth of the youngest actor in the database – i.e., the one whose date of birth is largest.
Write a query to find the number of people in the database who
were born in the first decade of the current millennium – i.e.,
between '2000-01-01'
and '2009-12-31'
inclusive. The final result
should be a single document with a field called num_first_decade
whose value is the computed count.
Write a query to find all people who have directed 4 or more of the top-grossing movies in the database. The final result should consist of documents with two fields:
one called director
with the name of a director
one called num_top_grossers
with the number of their top-grossing movies.
Hints:
For the purposes of this query, a top-grossing movie is any
movie that has an earnings_rank
field.
You may assume that all people in the database have unique names.
Because a given movie may have multiple directors, one of the
stages in the pipeline will need to be an $unwind
stage that
“unwinds” the arrays of directors in the movie documents that
you are considering.
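As a generic illustration with made-up collection and field names: if a collection named sample contains the document { item: 'pen', tags: ['red', 'blue'] }, then the pipeline

db.sample.aggregate([ { $unwind: '$tags' } ])

produces one output document per element of tags: one in which tags is 'red' and one in which tags is 'blue'.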
Find the top-ten Oscar winners in the database – i.e., the 10 people who have won the most Oscars in the database, along with the number and names of the movies for which they won their awards. The final result documents should each have the following fields:
award_count for the number of Oscars they won
movies for an array containing the names of the movies for which they won the Oscars
big_winner for the name of the person.
For example:
{ award_count: 3, movies: [ 'Lincoln', 'There Will Be Blood', 'My Left Foot' ], big_winner: 'Daniel Day-Lewis' }
Hints:
In our database, documents for Best Picture Oscars do not
include a person
field. Despite that fact, you will still
need to take steps to exclude Best Picture Oscars from the
results of your query, or else you will end up with a result
document for them in which the name of the person is null
.
You may again assume that all people in the database have unique names.
Use the $addToSet
accumulator to build the array containing
the names of a person’s Oscar-winning movies.
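As a generic illustration with made-up collection and field names, the following $group stage collects the distinct title values for each author into an array:

db.sample.aggregate([ { $group: { _id: '$author', titles: { $addToSet: '$title' } } } ])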
Log in to Gradescope by clicking the link in the left-hand navigation bar, and click on the box for CSCI E-66.
Submit your ps5_queries.py
file using these steps:
Click on PS 5: Part II in the list of assignments. You should see a pop-up window with a box labeled DRAG & DROP. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Add your file to the box labeled DRAG & DROP. You can either drag and drop the file from its folder into the box, or you can click on the box itself and browse for the file.
Click the Upload button.
You should see a box saying that your submission was successful.
Click the (x)
button to close that box.
The Autograder will perform some tests on your file. Once it is done, check the results to ensure that the tests were passed. If one or more of the tests did not pass, the name of that test will be in red, and there should be a message describing the failure. Based on those messages, make any necessary changes. Feel free to ask a staff member for help.
Notes:
You should see test results for each query. If you don’t see any results for a given query, it probably means that you have a syntax or logic error in your query, and you should attempt to fix it and resubmit.
You should keep making changes as needed until you get full credit for a given query. There will be no partial credit awarded for an incorrect query.
Make sure that each query is logically correct, and that it will work for any instance of the movie database that follows the data model outlined in lecture. We reserve the right to ultimately run your queries on a slightly different version of the database to ensure that your queries are logically correct.
If needed, use the Resubmit button at the bottom of the page to resubmit your work.
Near the top of the page, click on the box labeled Code. Then click on the name of your file to view its contents. Check to make sure that the file contains the work that you want us to grade.
Important
It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough
time to do so, wait an hour or two and then try again. If you
are unable to submit and it is close to the deadline, email
your homework before the deadline to
esaracin@bu.edu
Last updated on December 2, 2024.