Due at 11:59 PM on Sunday, December 15, 2024. No late submissions will be accepted.
Note: There are no extra grad-credit problems for this problem set.
For this assignment, we will not accept any homework after Sunday, December 15, so that we can post solutions before the final exam. Plan your time carefully, and don’t wait until the last minute to begin an assignment. Starting early will give you ample time to ask questions and obtain assistance.
In your work on this assignment, make sure to abide by our policies on academic conduct.
If you have questions while working on this assignment, please attend office hours or post them on Ed Discussion.
50 points total
Create a subfolder called ps5
within your
e66
folder, and put all of the files for this assignment in that folder.
This part of the assignment will all be completed in a single PDF file. To create it, you should do the following:
Access the template that we have created by clicking on this link and signing into your Google account as needed.
When asked, click on the Make a copy button, which will save a copy of the template file to your Google Drive.
Select File->Rename, and change the name of the file to
ps5_partI
.
Add your work for all of the problems from Part I to this file.
Once you have completed Part I, choose File->Download->PDF document,
and save the PDF file on your machine.
The resulting PDF file (ps5_partI.pdf
) is the one that you will
submit. See the submission guidelines at the end of Part I.
20 points total; 4 points each part
Consider the following CREATE TABLE
commands for a portion of
a relational database maintained by a bookseller:
CREATE TABLE Author(id CHAR(5) PRIMARY KEY, name VARCHAR(64), dob DATE);

CREATE TABLE Book(isbn CHAR(13) PRIMARY KEY, title VARCHAR(128), publisher VARCHAR(64), num_pages INTEGER, genre VARCHAR(10), numInStock INTEGER);

CREATE TABLE Wrote(authorID CHAR(5), book CHAR(13),
    PRIMARY KEY (authorID, book),
    FOREIGN KEY (authorID) REFERENCES Author(id),
    FOREIGN KEY (book) REFERENCES Book(isbn));

CREATE TABLE Sales(date VARCHAR(10), time VARCHAR(5), book CHAR(13), numSold INTEGER,
    PRIMARY KEY (date, time, book),
    FOREIGN KEY (book) REFERENCES Book(isbn));
The first two tables store information about authors and books.
The third table (Wrote
) captures relationships between authors and books.
The fourth table (Sales
), combined with the numInStock
attribute
in Book
, allows the bookseller to keep track of sales and inventory.
You have been asked to create a database that stores this same information in MongoDB.
There are two types of relationships that the database needs to capture: those between authors and books, and those between books and their associated sales data.
One way to capture these relationships would be to take a purely reference-based approach that avoids embedding one type of document in another. Illustrate this approach by showing how the following tuples from the relational database would be represented as one or more documents in MongoDB:
from Author: ('44777', 'Ina Garten', '1948-02-02')
from Book: ('9781984822789', 'Go-To Dinners: A Barefoot Contessa Cookbook', 'Clarkson Potter', 256, 'cookbook', 30)
from Wrote: ('44777', '9781984822789')
from Sales: ('2022-11-22', '11:00', '9781984822789', 1)
Guidelines:
You will need to determine how many documents are needed, how many references are needed, and where the references should go. Make sure to take into account the factors relevant to data modeling that we discussed in lecture.
Your answer should take into account both sets of relationships that the database will need to capture, but the only concrete relationship(s) that your answer needs to include are those described by the tuples above.
We encourage you to review the questions related to this problem in the PS 5 FAQ.
Briefly describe one advantage and one disadvantage of the approach to data modeling taken in part 1.
Another way to capture the relationships would be to take an approach that allows you to embed one type of document in another. Illustrate this approach by showing how the tuples from part 1 would be captured as one or more documents in MongoDB. The guidelines from part 1 also apply here.
Briefly describe one advantage and one disadvantage of the approach to data modeling taken in part 3.
Consider the updates that need to happen when the bookseller attempts to sell one or more copies of a given book. Discuss the appropriateness of the data-modeling approaches from parts 1 and 3 in light of this particular use of the database.
18 points total
Consider the following sequence of log records written by a system that uses undo-redo logging:
| LSN | record contents |
| --- | --- |
| 0 | txn: 1; BEGIN |
| 10 | txn: 1; item: A; old: 100; new: 110; olsn: 0 |
| 20 | txn: 2; BEGIN |
| 30 | txn: 2; item: B; old: 200; new: 210; olsn: 0 |
| 40 | txn: 1; item: C; old: 300; new: 310; olsn: 0 |
| 50 | txn: 1; item: D; old: 400; new: 410; olsn: 0 |
| 60 | txn: 2; COMMIT |
| 70 | txn: 1; item: B; old: 210; new: 220; olsn: 30 |
| 80 | txn: 1; item: D; old: 410; new: 420; olsn: 50 |
(7 points) If a crash occurs and log record 80 is the last one
to make it to disk, what steps would be performed during recovery
if the system is performing undo-redo logging and the
on-disk datum LSNs are not consulted? (In other words, you
should assume that the system is not performing logical
logging, and thus you don’t need to worry about redoing or
undoing a change unnecessarily.) Complete the table provided in
ps5_partI
to show how each log record would be handled during
both the backward and forward passes.
(7 points) If a crash occurs and log record 80 is the last one to make it to disk, what steps would be performed during recovery if the system is performing undo-redo logging and the on-disk datum LSNs are consulted (i.e., the system is performing logical logging)?
Complete the table provided in ps5_partI
to show how each log
record would be handled during both the backward and forward
passes. You should assume that the datum LSNs at the start of
recovery are the following:
In addition, you should assume that the recovery subsystem does not perform any actions that the LSNs indicate are unnecessary.
(4 points) Now assume that a dynamic checkpoint had occurred between log records 30 and 40.
a. How (if at all) would that change which log records are considered during the backward pass? Explain briefly.
b. How (if at all) would that change which log records are considered during the forward pass? Explain briefly.
12 points total
Consider again the following sequence of log records:
| LSN | record contents |
| --- | --- |
| 0 | txn: 1; BEGIN |
| 10 | txn: 1; item: A; old: 100; new: 110; olsn: 0 |
| 20 | txn: 2; BEGIN |
| 30 | txn: 2; item: B; old: 200; new: 210; olsn: 0 |
| 40 | txn: 1; item: C; old: 300; new: 310; olsn: 0 |
| 50 | txn: 1; item: D; old: 400; new: 410; olsn: 0 |
| 60 | txn: 2; COMMIT |
| 70 | txn: 1; item: B; old: 210; new: 220; olsn: 30 |
| 80 | txn: 1; item: D; old: 410; new: 420; olsn: 50 |
(4 points) This log was created by a system that uses undo-redo logging. If a crash occurs and log record 80 is the last one to make it to disk, what are all possible on-disk values of each of the data items (A, B, C, and D) after the crash but before recovery?
(4 points) How would your answer to part 1 change if the system were using redo-only logging instead of undo-redo? Briefly explain the reason for any changes. (Note: You should assume that no two of the data items (A, B, C, D) are stored on the same page.)
(4 points) How would your answer to part 1 change if the system were using undo-only logging? Briefly explain the reason for any changes. (Here again, you should assume that no two of the data items are stored on the same page. In addition, you should assume that when the DBMS forces dirty database pages to disk, it forces only those pages that must go to disk in order for undo-only logging to work correctly.)
Once you have completed Part I in Google Drive, choose
File->Download->PDF document, and save the resulting file
(ps5_partI.pdf
) on your machine.
Log in to Gradescope and click on the box for CSCI E-66.
Click on the name PS 5: Part I in the list of assignments. You should see a pop-up window labeled Submit Assignment. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Choose the Submit PDF option, and then click the Select PDF
button and find the ps5_partI.pdf
that you created in step 1.
Then click the Upload PDF button.
You should see an outline of the problems along with thumbnails of the pages from your uploaded PDF. For each problem in the outline, click the thumbnail(s) of the page(s) that contain your work for that problem.
As you do so, click on the magnifying glass icon for each page and double-check that the pages that you see contain the work that you want us to grade.
Once you have assigned pages to all of the problems in the question outline, click the Submit button in the lower-right corner of the window.
You should see a box saying that your submission was successful.
Click the (x)
button to close that box.
You can use the Resubmit button at the bottom of the page to resubmit your work as many times as needed before the final deadline.
Important
It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough
time to do so, wait an hour or two and then try again. If you
are unable to submit and it is close to the deadline, email
your homework before the deadline to
esaracin@bu.edu
50 points total
In this part of the assignment, you will write queries for a MongoDB version of our movie database – one that uses the data model outlined in lecture.
Click on each of the following links, and save the files in
your ps5
folder:
If the browser doesn’t allow you to choose where to download the files, right-click each link and use Save link as... or the equivalent option.
Follow the appropriate set of instructions to get everything installed and properly configured:
Make sure that you have downloaded the files listed above and installed and configured the software.
Launch MongoDB Compass as needed and click on the Connect
button to connect to the localhost
MongoDB deployment.
Click on the Databases tab, and then click Create database.
In the window that pops up, enter imdb as the database name and movies as the collection name, and click Create Database.
Choose the imdb
database from the list of databases. You should
see an empty collection called movies
.
Click Create collection and create a new collection with the
name people
.
Click Create collection again and create a new collection with the
name oscars
.
To add documents to the movies
collection:
In MongoDB Compass, select the collection from the list of
collections for imdb
.
Scroll down as needed to find the Import Data button and click it (NOT the Add Data one).
Click Select a file... and choose the JSON file for the
collection that you downloaded above (e.g., for movies
, the
file is called movies.json
).
Click Import and then DONE.
You should see that the collection now includes the documents that were found in the JSON file.
Repeat step 7 to add documents to the oscars
and
people
collections.
Once you have everything installed and configured, you can perform queries by taking the following steps:
Launch MongoDB Compass.
Click the Connect button to connect to the MongoDB server that should
be running on localhost
.
Click on the MONGOSH
bar at the bottom of the window for MongoDB
Compass. Doing so should expand a pane for the MongoDB shell, and
you should see the following command prompt:
test>
Enter the following command from the shell prompt:
use imdb
This should change the shell’s prompt to the following:
imdb>
Enter the appropriate method calls for your queries; a sample call that you can use to test your setup is sketched after these steps. If you simply enter the method call, you should automatically see up to the first 20 results.
If there are more than 20 results, you can enter the following command to continue iterating through them:
it
If you’re using a Mac, you should disable smart quotes, because they may lead to errors in MongoDB and in our testing. There are instructions for doing so here.
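For example, once the prompt reads imdb>, you could test your setup with a simple method call like the one below. This is only an illustrative sketch, not the answer to any of the problems; it assumes that the movies collection includes a year field, as in the data model from lecture.

db.movies.find({ year: 2000 })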
ps5_queries.py
is a Python file, so you could use a Python IDE
to edit it, but a regular text editor like TextEdit or Notepad++
would also be fine. However, if you use a text editor, you must
ensure that you save it as a plain-text
file.
Construct the MongoDB method calls needed to solve the problems given
below. Test each method call in MONGOSH
to make sure that it works.
Once you have finalized the method call for a given problem, copy
the call into your ps5_queries.py
file, putting it between
the triple quotes provided for that query’s variable. We have
included a sample query to show you what the format of your
answers should look like.
Each of the problems must be solved by means of a single query
(i.e., a single method call). The results of the query should
include only the requested information, with no extraneous fields.
In particular, you should exclude the _id
field from the results
unless the problem indicates otherwise (see the example sketch after these guidelines).
You do not need to worry about the order of the fields in the results, nor the places in which line breaks or spaces appear.
Unless the problem indicates otherwise, you may only use aspects of the MongoDB query language that we discussed in lecture.
Your queries should only use information provided in the problem itself. In addition, they should work for any MongoDB database that follows the schema that we discussed in lecture.
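For instance, a projection of the following general form keeps a single field while suppressing the _id field. (The collection and field names here are placeholders for illustration, not names from the movie database.)

db.example.find({}, { someField: 1, _id: 0 })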
Make sure to follow the guidelines above.
Write a query to find the names and runtimes of all movies in the database that have an R rating and a runtime of less than 100 minutes.
Ariana DeBose won an Oscar in 2012 for her performance in West Side Story. There are a number of movies in our database whose names include the word “Story”. Find the names and years of those movies. Your query should not make any assumptions about where the word “Story” appears in the name.
Write a query to find the names of all people who have won a supporting actor/actress Oscar from 2015 to the present. The result documents should include the name of the person and the year in which the award was given.
Hints:
Because the name of the person is part of a subdocument in the relevant documents, you will need to use dot notation for the name of the field used to obtain the name, and you will need to surround the field name with quotes. (A generic dot-notation sketch appears after these hints.)
The name of the person will also end up in a subdocument in the results of the query. For example, here is what one of the result documents should look like:
{ year: 2019, person: { name: 'Mahershala Ali' } }
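As a generic illustration of dot notation (the collection and field names below are made up and are not part of the movie database), a condition on a field nested inside a subdocument would look like this:

db.sample.find({ 'info.city': 'Boston' })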
Write a query to find the names and dates of birth of all actors in the database who were born in 1972.
Write a query to find the names of all directors who have directed a movie in the database in which Daniel Radcliffe acted. A given person’s name should appear at most once in the result of the query. You should use one of the single-purpose aggregation methods covered in lecture, not an aggregation pipeline.
Write a query to compute, for each movie rating, the number of movies with that rating and their average runtime. The final result documents should have three fields:
rating for the movie rating
avg_runtime for the average runtime of movies with that rating
movie_count for the number of movies with that rating.
For example:
{ avg_runtime: 124, movie_count: 5, rating: 'M' }
Note: Some movies do not have a rating, and you should see a document
for those movies in which the rating is null
.
Write a query to find the name and date of birth of the youngest actor in the database – i.e., the one whose date of birth is largest.
Write a query to find the number of people in the database who
were born in the first decade of the current millennium – i.e.,
between '2000-01-01'
and '2009-12-31'
inclusive. The final result
should be a single document with a field called num_first_decade
whose value is the computed count.
Write a query to find all people who have directed 4 or more of the top-grossing movies in the database. The final result should consist of documents with two fields:
one called director
with the name of a director
one called num_top_grossers
with the number of their top-grossing movies.
Hints:
For the purposes of this query, a top-grossing movie is any
movie that has an earnings_rank
field.
You may assume that all people in the database have unique names.
Because a given movie may have multiple directors, one of the
stages in the pipeline will need to be an $unwind
stage that
“unwinds” the arrays of directors in the movie documents that
you are considering.
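As a generic illustration with made-up collection and field names: if a collection named sample contains the document { item: 'pen', tags: ['red', 'blue'] }, then the pipeline

db.sample.aggregate([ { $unwind: '$tags' } ])

produces one output document per element of tags: one in which tags is 'red' and one in which tags is 'blue'.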
Find the top-ten Oscar winners in the database – i.e., the 10 people who have won the most Oscars in the database, along with the number and names of the movies for which they won their awards. The final result documents should each have the following fields:
award_count for the number of Oscars they won
movies for an array containing the names of the movies for which they won the Oscars
big_winner for the name of the person.
For example:
{ award_count: 3, movies: [ 'Lincoln', 'There Will Be Blood', 'My Left Foot' ], big_winner: 'Daniel Day-Lewis' }
Hints:
In our database, documents for Best Picture Oscars do not
include a person
field. Despite that fact, you will still
need to take steps to exclude Best Picture Oscars from the
results of your query, or else you will end up with a result
document for them in which the name of the person is null
.
You may again assume that all people in the database have unique names.
Use the $addToSet
accumulator to build the array containing
the names of a person’s Oscar-winning movies.
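As a generic illustration with made-up collection and field names, the following $group stage collects the distinct title values for each author into an array:

db.sample.aggregate([ { $group: { _id: '$author', titles: { $addToSet: '$title' } } } ])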
Log in to Gradescope by clicking the link in the left-hand navigation bar, and click on the box for CSCI E-66.
Submit your ps5_queries.py
file using these steps:
Click on PS 5: Part II in the list of assignments. You should see a pop-up window with a box labeled DRAG & DROP. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Add your file to the box labeled DRAG & DROP. You can either drag and drop the file from its folder into the box, or you can click on the box itself and browse for the file.
Click the Upload button.
You should see a box saying that your submission was successful.
Click the (x)
button to close that box.
The Autograder will perform some tests on your file. Once it is done, check the results to ensure that the tests were passed. If one or more of the tests did not pass, the name of that test will be in red, and there should be a message describing the failure. Based on those messages, make any necessary changes. Feel free to ask a staff member for help.
Notes:
You should see test results for each query. If you don’t see any results for a given query, it probably means that you have a syntax or logic error in your query, and you should attempt to fix it and resubmit.
You should keep making changes as needed until you get full credit for a given query. There will be no partial credit awarded for an incorrect query.
Make sure that each query is logically correct, and that it will work for any instance of the movie database that follows the data model outlined in lecture. We reserve the right to ultimately run your queries on a slightly different version of the database to ensure that your queries are logically correct.
If needed, use the Resubmit button at the bottom of the page to resubmit your work.
Near the top of the page, click on the box labeled Code. Then click on the name of your file to view its contents. Check to make sure that the file contains the work that you want us to grade.
Important
It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough
time to do so, wait an hour or two and then try again. If you
are unable to submit and it is close to the deadline, email
your homework before the deadline to
esaracin@bu.edu
Last updated on December 2, 2024.