Project 3

Practicing machine learning paradigms and using SQL databases to predict flight delays

Author

Affiliation

Stephanie Hicks

Department of Biostatistics, Johns Hopkins

Published

November 12, 2024

Background

Due date: November 22 at 11:59pm

The goal of this assignment is to practice some of the skills we have been learning about in class around working with relational databases and machine learning paradigms. Specifically, you will create a new SQL database, write SQL queries to the database, explore and summarize the data, train a prediction model, and summarize your findings.

To submit your project

You need to create a private GitHub Classroom repository (only one per group) for you and your partner, which will be posted in CoursePlus. This creates an empty GitHub repository. You need to show all your code and submit both the .qmd file and the rendered HTML file. Please include section headers for each of the components below. All plots should have titles, subtitles, captions, and human-understandable axis labels. The TAs will grade the contents in the GitHub Classroom repo by cloning the repo and checking for all the things described below.

Important

Because you will work with a partner, please be sure to include the names, emails, and JHED IDs for both individuals in your submitted work.

Both of you should commit to the repository!

Data

You are welcome to pick your own datasets, but here are some ideas (which you are not limited to):

nycflights13 dataset – contains data on flights that departed NYC in 2013, including tables for weather, planes, airports, and airlines. This dataset is useful for exploring what causes delays. To practice SQL queries, you could consider aggregations and group by queries across airlines or days of the week. To practice machine learning, you could predict flight delays. Also, here are two related datasets you might interesting too:
- nycflights for flights departing from NYC in the last year.
- anyflights for flights departing from any airport in any year.
IMDB dataset – Contains data on movies, including tables for titles, ratings, cast, and crew. This dataset is useful for exploring movie information, reviews, and ratings. To practice SQL queries, you could consider nested queries, joins, filtering, and text-based queries on genres or keywords. To practice machine learning, you could predict ratings.
Lahman Baseball dataset – contains historical baseball data, with tables for teams, players, games, and statistics. To practice SQL queries, you could consider aggregation, and time-based queries to calculate baseball statistics. To practice machine learning, you could predict the number of runs a team will score.
World Bank Open datasets – contains extensive economic and social data on countries worldwide, such as population, GDP, education, life expectancy, and health indicators. To practice SQL queries, you could perform time-based queries such as filtering by regions or income levels, and aggregating metrics by country or year. To practice machine learning, you could predict life expectancy.

Important things to consider when thinking about which dataset to pick

You will either use an SQL database with at least three tables in it or you will make a SQL database with at least 3 tables in it. So you need to pick a dataset with at least 3 tables available.
You will use this data to practice SQL queries (Part 1) and to practice building a machine learning model (Part 2). So you want to make sure you consider both parts while pondering which dataset to use (e.g. What kind of SQL queries might we use with this database? What kind of data visualization do we want to make? What would our prediction task be?).

Part 1

In this part, you and your partner will use the DBI and RSQLite packages to write SQL queries, explore and summarize the data, visualize the data, and summarize your findings.

Using the data you picked, choose a question to investigate. For example, if you are using the nycflights13 dataset, maybe you want to understand if there are certain days of the week that flights are more delayed than other days of the week. Describe what is the question you aim to answer with the data and what you want to visualize.

If it’s already a SQL database, read into R using the DBI and RSQLite packages. If it’s not already a SQL database, create a new SQL database and use the dbWriteTable() function to add at minimum three tables to the SQL database.
Write three SQL queries using dbGetQuery() to explore and summarize the data. You must use at least five different SQL functions (e.g. SELECT, GROUP BY, DISTINCT, SUM, etc).
Write two SQL queries to create new features from existing data. For example, if using nycflights13, you could think about how to
- Bin departure times into time-of-day categories (e.g., morning, afternoon, evening).
- Lag features like the previous day’s average delay by carrier, which can be helpful for predictions.
- Merge additional weather data (such as hourly temperature, precipitation, etc., if available). e.g. this could be done using SQL joins.
Visualize your data by creating a plot with ggplot2. For example, if using nycflights13, you could think about how to visualize delays by month, carrier, or weather conditions.
Report your findings. Provide a paragraph summarizing your methods and key findings. Include any limitations or potential biases in your analysis. Be sure to comment and organize your code so is easy to understand what you are doing.

Important

At the end of the data analysis, list out each of the SQL functions you used to help the TA with respect to making sure you met all the requirements described above.

Part 2

In this part, you and your partner will use use caret or tidymodels to train a predictive model for a task or outcome, perform model evaluation, and summarize your findings. For this part, you can use the data directly in R as standard dataframes or tibbles. You do not need to build the models In this section, you must do the following when building your prediction model:

Split the data into training and test sets.
Choose a machine learning model appropriate for the task (e.g. consider if it’s a binary or continuous outcome and choose an appropriate model). For example, if you are using the nycflights13 dataset, you could could pick
- Logistic Regression (if predicting delay as a binary outcome, e.g., delay/no delay).
- Linear Regression (if predicting the length of delay in minutes).
- Decision Trees or Random Forests (for both binary classification or regression).
Train the model using caret or tidymodels in R using the training data.
Assess the model performance both the training and test datasets using metrics like Accuracy or area under the curve (AUC) (if classification) or root mean squared error (RMSE) (if regression).

Finally, report your findings. For example, if you are predicting flight delays, provide 1-2 practical recommendations for reducing delays. Broadly, provide a paragraph summarizing your methods and key findings. Include limitations or potential biases in training and evaluation your machine learning model. Be sure to comment and organize your code so is easy to understand what you are doing.