Database programming paradigms

Relational databases and SQL basics
Author
Affiliation

Department of Biostatistics, Johns Hopkins

Published

November 12, 2024

Pre-lecture activities

Important

For this lecture, we will use Unix shell, plus SQLite3 or DB Browser for SQLite.

You can see if the command-line tool sqlite3 (also known as “SQLite”) is already installed with

sqlite3 --version
3.43.2 2023-10-10 13:08:14 1b37c146ee9ebb7acd0160c0ab1fd11017a419fa8a3187386ed8cb32b709aapl (64-bit)

If not, you can follow the instructions here:

Also, you will need to install these R packages:

install.packages("DBI")
install.packages("RSQLite")
install.packages("dbplyr")

In addition, please read through

How much should I prepare for before class?

Please install the packages above and be comfortable with running the functions in the example code here:

Lecture

Acknowledgements

Material for this lecture was borrowed and adopted from

Learning objectives

Learning objectives

At the end of this lesson you will:

  • Explain the difference between a table, a record, and a field in relational databases
  • Explain the difference between a database and a database manager
  • Write a query to select all values for specific fields from a single table
  • Write queries that display results in a particular order
  • Write queries that eliminate duplicate values from data
  • Write queries that select records that satisfy user-specified conditions
  • Learn about the DBI, RSQLite, dbplyr packages for making SQL queries in R

Slides

Class activity

For the rest of the time in class, we will practice creating a SQL database and also writing SQL syntax to analyze a Beyoncé discography dataset.

Objectives of the activity
  • Practice creating a SQL database in R .
  • Practice filtering, grouping, and joining data using SQL commands.

For this in-class activity, you need to find a partner. Work together on the following problems. Use the following datasets related to Beyoncé’s songs for the activity.

songs <- data.frame(
  song_id = 1:10,
  title = c("Crazy in Love", "Halo", "Single Ladies", "Formation", "Irreplaceable", 
            "Drunk in Love", "Run the World", "Partition", "Love on Top", "Sorry"),
  album_id = c(1, 2, 1, 3, 2, 3, 1, 3, 2, 3),
  release_year = c(2003, 2008, 2008, 2016, 2006, 2013, 2011, 2013, 2011, 2016),
  streams_millions = c(500, 600, 700, 800, 900, 1000, 450, 350, 750, 900)
)

# Sample data frame for Beyoncé's albums
albums <- data.frame(
  album_id = 1:3,
  album_title = c("Dangerously in Love", "I Am... Sasha Fierce", "Lemonade"),
  release_year = c(2003, 2008, 2016)
)

Part 1: Create a SQL database

  • Create a new SQL database titled beyonce.sqlite using the DBI and RSQLite packages using the dbConnect() function.
  • Using the dbWriteTable() function, add the songs and albums datasets to the SQL database.
# This code to ensure that the lecture builds and there no SQL database at the start of the exercise
sql_beyonce <- here::here("lectures","05-database-programming", "data", "beyonce.sqlite")
if(file.exists(sql_beyonce)){
  file.remove(sql_beyonce)
}
[1] TRUE
library(DBI)
mydb <- dbConnect(drv = RSQLite::SQLite(), 
                  dbname = sql_beyonce)

dbWriteTable(conn = mydb, name = "songs", value = songs)
dbWriteTable(conn = mydb, name = "albums", value = albums)
dbListTables(conn = mydb)
[1] "albums" "songs" 

Part 2: Practice SQL queries

  • Write a SQL query that selects all columns from the songs table where the number of streams is greater than 600 million.
results <- dbGetQuery(mydb, "SELECT * 
                             FROM songs
                             WHERE streams_millions > 600;")
print(results)
  song_id         title album_id release_year streams_millions
1       3 Single Ladies        1         2008              700
2       4     Formation        3         2016              800
3       5 Irreplaceable        2         2006              900
4       6 Drunk in Love        3         2013             1000
5       9   Love on Top        2         2011              750
6      10         Sorry        3         2016              900
  • Use SQL to select the titles of Beyoncé’s songs released after 2010, along with the year they were released.
results <- dbGetQuery(mydb, "SELECT title, release_year 
                             FROM songs 
                             WHERE release_year > 2010;")
print(results)
          title release_year
1     Formation         2016
2 Drunk in Love         2013
3 Run the World         2011
4     Partition         2013
5   Love on Top         2011
6         Sorry         2016
  • Write a SQL query that counts the number of songs for each album.
results <- dbGetQuery(mydb, "SELECT album_id, COUNT(song_id) AS song_count
                             FROM songs
                             GROUP BY album_id;")
print(results)
  album_id song_count
1        1          3
2        2          3
3        3          4
  • Write a SQL query that calculates the total streams for each album. Select the album ID, album title, and total streams.
results <- dbGetQuery(mydb, "SELECT album_id, title, SUM(streams_millions) AS total_streams 
                             FROM songs 
                             GROUP BY album_id;")
print(results)
  album_id         title total_streams
1        1 Crazy in Love          1650
2        2          Halo          2250
3        3     Formation          3050
  • Identify the least streamed song for each release year.
results <- dbGetQuery(mydb, "SELECT release_year, title, MIN(streams_millions) AS min_streams
                             FROM songs
                             GROUP BY release_year;")
print(results)
  release_year         title min_streams
1         2003 Crazy in Love         500
2         2006 Irreplaceable         900
3         2008          Halo         600
4         2011 Run the World         450
5         2013     Partition         350
6         2016     Formation         800

Close connection when done

dbDisconnect(conn = mydb)

Part 3: Discussion

Discuss with your partner:

  • How does SQL make it easier to analyze and summarize data?
  • Which SQL query might you use to quickly identify Beyoncé’s most popular song based on streams?
  • How could these SQL skills be applied to other areas of data analysis?

Post-lecture

Additional practice

Here are some additional practice questions to help you think about the material discussed.

Questions

Using the survey.db database discussed in the relational databases and SQL basics pre-reading material:

  1. Use .schema to identify column that contains integers
  2. Write a query that selects only the name column from the Site table.
  3. Many people format queries in the following two ways. What style do you find easiest to read, and why?
SELECT personal, family FROM person;

or

select Personal, Family from PERSON;
  1. Write a query that selects distinct dates from the Visited table.
  2. Write a query that displays the full names of the scientists in the Person table, ordered by family name.