Project 3
Background
Due date: November 22 at 11:59pm
The goal of this assignment is to practice some of the skills we have been learning about in class around working with relational databases and machine learning paradigms. Specifically, you will create a new SQL database, write SQL queries to the database, explore and summarize the data, train a prediction model, and summarize your findings.
To submit your project
You need to create a private GitHub Classroom repository (only one per group) for you and your partner, which will be posted in CoursePlus. This creates an empty GitHub repository. You need to show all your code and submit both the .qmd
file and the rendered HTML file. Please include section headers for each of the components below. All plots should have titles, subtitles, captions, and human-understandable axis labels. The TAs will grade the contents in the GitHub Classroom repo by cloning the repo and checking for all the things described below.
Because you will work with a partner, please be sure to include the names, emails, and JHED IDs for both individuals in your submitted work.
Both of you should commit to the repository!
Data
You are welcome to pick your own datasets, but here are some ideas (which you are not limited to):
- nycflights13 dataset – contains data on flights that departed NYC in 2013, including tables for weather, planes, airports, and airlines. This dataset is useful for exploring what causes delays. To practice SQL queries, you could consider aggregations and group by queries across airlines or days of the week. To practice machine learning, you could predict flight delays. Also, here are two related datasets you might interesting too:
- nycflights for flights departing from NYC in the last year.
- anyflights for flights departing from any airport in any year.
- IMDB dataset – Contains data on movies, including tables for titles, ratings, cast, and crew. This dataset is useful for exploring movie information, reviews, and ratings. To practice SQL queries, you could consider nested queries, joins, filtering, and text-based queries on genres or keywords. To practice machine learning, you could predict ratings.
- Lahman Baseball dataset – contains historical baseball data, with tables for teams, players, games, and statistics. To practice SQL queries, you could consider aggregation, and time-based queries to calculate baseball statistics. To practice machine learning, you could predict the number of runs a team will score.
- World Bank Open datasets – contains extensive economic and social data on countries worldwide, such as population, GDP, education, life expectancy, and health indicators. To practice SQL queries, you could perform time-based queries such as filtering by regions or income levels, and aggregating metrics by country or year. To practice machine learning, you could predict life expectancy.
- You will either use an SQL database with at least three tables in it or you will make a SQL database with at least 3 tables in it. So you need to pick a dataset with at least 3 tables available.
- You will use this data to practice SQL queries (Part 1) and to practice building a machine learning model (Part 2). So you want to make sure you consider both parts while pondering which dataset to use (e.g. What kind of SQL queries might we use with this database? What kind of data visualization do we want to make? What would our prediction task be?).
Part 1
In this part, you and your partner will use the DBI
and RSQLite
packages to write SQL queries, explore and summarize the data, visualize the data, and summarize your findings.
Using the data you picked, choose a question to investigate. For example, if you are using the nycflights13
dataset, maybe you want to understand if there are certain days of the week that flights are more delayed than other days of the week. Describe what is the question you aim to answer with the data and what you want to visualize.
If it’s already a SQL database, read into R using the
DBI
andRSQLite
packages. If it’s not already a SQL database, create a new SQL database and use thedbWriteTable()
function to add at minimum three tables to the SQL database.Write three SQL queries using
dbGetQuery()
to explore and summarize the data. You must use at least five different SQL functions (e.g.SELECT
,GROUP BY
,DISTINCT
,SUM
, etc).Write two SQL queries to create new features from existing data. For example, if using
nycflights13
, you could think about how to- Bin departure times into time-of-day categories (e.g., morning, afternoon, evening).
- Lag features like the previous day’s average delay by carrier, which can be helpful for predictions.
- Merge additional weather data (such as hourly temperature, precipitation, etc., if available). e.g. this could be done using SQL joins.
Visualize your data by creating a plot with
ggplot2
. For example, if usingnycflights13
, you could think about how to visualize delays by month, carrier, or weather conditions.Report your findings. Provide a paragraph summarizing your methods and key findings. Include any limitations or potential biases in your analysis. Be sure to comment and organize your code so is easy to understand what you are doing.
At the end of the data analysis, list out each of the SQL functions you used to help the TA with respect to making sure you met all the requirements described above.
Part 2
In this part, you and your partner will use use caret
or tidymodels
to train a predictive model for a task or outcome, perform model evaluation, and summarize your findings. For this part, you can use the data directly in R as standard dataframes or tibbles. You do not need to build the models In this section, you must do the following when building your prediction model:
- Split the data into training and test sets.
- Choose a machine learning model appropriate for the task (e.g. consider if it’s a binary or continuous outcome and choose an appropriate model). For example, if you are using the
nycflights13
dataset, you could could pick- Logistic Regression (if predicting delay as a binary outcome, e.g., delay/no delay).
- Linear Regression (if predicting the length of delay in minutes).
- Decision Trees or Random Forests (for both binary classification or regression).
- Train the model using
caret
ortidymodels
in R using the training data. - Assess the model performance both the training and test datasets using metrics like Accuracy or area under the curve (AUC) (if classification) or root mean squared error (RMSE) (if regression).
Finally, report your findings. For example, if you are predicting flight delays, provide 1-2 practical recommendations for reducing delays. Broadly, provide a paragraph summarizing your methods and key findings. Include limitations or potential biases in training and evaluation your machine learning model. Be sure to comment and organize your code so is easy to understand what you are doing.