Project 2

Practicing functional programming and data collection paradigms
Author
Affiliation

Department of Biostatistics, Johns Hopkins

Published

November 5, 2024

Background

Due date: November 15 at 11:59pm

The goal of this assignment is to practice some of the skills we have been learning about in class around data collection paradigms and functional programming.

To submit your project

You need to create a private GitHub Classroom repository (only one per group) for you and your partner, using the link posted in CoursePlus; this creates an empty GitHub repository. You need to show all your code and submit both the .qmd file and the rendered HTML file. Please include section headers for each of the components below. All plots should have titles, subtitles, captions, and human-understandable axis labels. The TAs will grade the contents of the GitHub Classroom repo by cloning it and checking for all the things described below.

Because you will work with a partner, please be sure to include the names, emails, and JHED IDs for both individuals in your submitted work.

Part 1

Here, you and your partner will practice using an API and making data visualizations.

The API we will use is tidycensus (https://walker-data.com/tidycensus), which is an R package that allows users to interface with a select number of the US Census Bureau’s data APIs and return tidyverse-ready data frames, optionally with simple feature geometry included.
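To give a sense of what the package returns, a minimal call might look like the sketch below. The geography, variable code, and year are illustrative assumptions (B19013_001 is the ACS code for median household income); you will also need an API key, as described in the note below.

    library(tidycensus)

    # Illustrative example: median household income by state from the 5-year ACS.
    # Swap in the geography, variables, and year that fit your question.
    income_by_state <- get_acs(
      geography = "state",
      variables = "B19013_001",
      year      = 2022,
      survey    = "acs5"
    )

    # Returns a tidyverse-ready tibble with GEOID, NAME, variable, estimate, and moe columns
    head(income_by_state)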

The goal of this part is to create a set of data visualizations using the US Census Bureau’s data.

Important

To use this API, you must obtain an API key, which you can request for free from the US Census Bureau at https://api.census.gov/data/key_signup.html.
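Once you have a key, a one-time setup sketch looks like this (the key string is a placeholder):

    library(tidycensus)

    # Store your key; install = TRUE writes it to your .Renviron
    # so future R sessions can find it automatically.
    census_api_key("YOUR_CENSUS_API_KEY", install = TRUE)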

Tip

When you use an API, figure out what data you want to extract and then save it locally, so that you are not calling the API each time you knit or render your data analysis.

Most APIs limit the number of times you can ping them in a given hour, and your IP address can be blocked if you send too many requests within a short period of time.

Therefore, it is strongly suggested that you create a new local folder called data and save the output you extract from the tidycensus API there. That way, you can read the data in locally each time you knit/render the document, rather than pulling it from the API again on every render.
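One way to set this up is sketched below; the folder, file name, and variable are placeholders.

    library(tidycensus)

    # Create the data/ folder once, then only hit the API when the cached file is missing.
    if (!dir.exists("data")) dir.create("data")

    cache_file <- "data/acs_income_2022.rds"

    if (!file.exists(cache_file)) {
      acs_income <- get_acs(
        geography = "county",
        variables = "B19013_001",  # placeholder: median household income
        year      = 2022
      )
      saveRDS(acs_income, cache_file)
    } else {
      acs_income <- readRDS(cache_file)
    }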

  1. Choose a question to investigate. Describe the question you aim to answer with the data and what you want to visualize.

  2. Extract data from the tidycensus API. Use at least three different calls to the tidycensus API to extract different datasets; for example, these could span different years, locations, or variables. (A sketch of one possible workflow follows this list.)

  3. Clean the data. Include some form of data wrangling and data visualization using packages such as dplyr or tidyr. Other packages that might be helpful to you include lubridate, stringr, and forcats. You must use at least two functions from purrr.

  4. Visualize the data. Create data visualizations of your choice. However, your analysis should include at least three plots, using at least two different geom_*() functions from ggplot2 (or another package that provides geom_*() functions).

  5. Report your findings. Provide a paragraph summarizing your methods and key findings. Include any limitations or potential biases in pulling data from the API or in the analysis. Be sure to comment and organize your code so it is easy to understand what you are doing.
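To make steps 2 through 4 concrete, here is one possible shape for the workflow, sketched with placeholder years and a placeholder variable; your own question should drive the actual calls and plots. This sketch happens to use two purrr functions (map() and list_rbind()), which would satisfy the purrr requirement above.

    library(tidycensus)
    library(purrr)
    library(dplyr)
    library(ggplot2)

    years <- c(2015, 2018, 2021)  # placeholder years for the three API calls

    # Use purrr::map() to repeat the API call across years, then purrr::list_rbind() to combine
    acs_by_year <- map(years, function(yr) {
      get_acs(
        geography = "state",
        variables = "B19013_001",  # placeholder: median household income
        year      = yr
      ) |>
        mutate(year = yr)
    }) |>
      list_rbind()

    # One of at least three plots; remember to use at least two different geom_*() functions overall
    ggplot(acs_by_year, aes(x = factor(year), y = estimate)) +
      geom_boxplot() +
      labs(
        title    = "Median household income across states",
        subtitle = "ACS 5-year estimates for three placeholder years",
        x        = "Year",
        y        = "Median household income (USD)",
        caption  = "Source: US Census Bureau, via the tidycensus R package"
      )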

Part 2

In this part, you and your partner will use the rvest package to scrape data from a website, wrangle and analyze the data, and summarize your findings.

  1. Choose a website to scrape. Select a website with structured data in HTML tables or well-defined sections. Some examples could include:

    • A movie database like IMDb or Rotten Tomatoes (scraping movie titles, ratings, release years, etc.)
    • A job listing site like Indeed or LinkedIn (scraping job titles, companies, and locations)
    • A sports statistics site like ESPN or Baseball Reference (scraping team statistics, player info, etc.)
  2. Extract data with rvest. Here, you will want to identify the specific HTML elements or CSS selectors containing the data. Then, use rvest functions like read_html(), html_elements(), and html_text() or html_table() to retrieve the data. (A sketch follows this list.)

  3. Clean the data. Next, perform some basic wrangling, such as removing extra whitespace, handling missing values, and converting data types as needed. You might find the functions from dplyr or tidyr useful for any additional transformations, such as renaming columns, filtering rows, or creating new variables.

  4. Analyze the data. Perform a simple analysis of your choice. For example, you could

    • Count how many times specific words or themes appear.
    • Create a summary statistic (e.g., average rating, job salary, team win percentage).
    • Create a data visualization (e.g., bar chart, histogram) of an interesting metric.
  5. Report your findings. Provide a paragraph summarizing your methods and key findings. Include any limitations or potential biases in your scraping or analysis. Be sure to comment and organize your code so it is easy to understand what you are doing.
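As a rough sketch of steps 2 through 4, the code below scrapes the first HTML table from a page, cleans it, and computes one summary statistic. The URL, the CSS selector, and the title and rating columns are hypothetical placeholders; adapt them to the site you actually choose.

    library(rvest)
    library(dplyr)

    # Hypothetical URL; replace with the site you selected.
    page <- read_html("https://example.com/movies")

    # Grab the first table element on the page and parse it into a tibble
    movies <- page |>
      html_element("table") |>
      html_table()

    # Basic cleaning (assumes the table has `title` and `rating` columns):
    # trim whitespace, coerce types, and drop rows with missing ratings
    movies_clean <- movies |>
      mutate(
        title  = trimws(title),
        rating = as.numeric(rating)
      ) |>
      filter(!is.na(rating))

    # A simple summary statistic
    movies_clean |>
      summarise(mean_rating = mean(rating))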

Ensure ethical use of web scraping

Before scraping a site, it’s good practice to check whether the site allows scraping in its terms of service or to look at its robots.txt file. While robots.txt does not legally prevent access, it reflects the site’s preferred limits on automated access.

Some web scraping tools and libraries allow you to specify a “polite” setting that follows the robots.txt rules by default, which is often a good practice to follow.

The robots.txt file is typically located at the root of the domain (e.g., https://example.com/robots.txt) and follows a simple format for specifying rules.

For more information, check out https://en.wikipedia.org/wiki/Robots.txt.
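In R, one way to follow these practices is sketched below, assuming the robotstxt and polite packages are installed; the URL is a placeholder.

    library(robotstxt)
    library(polite)

    # Check whether robots.txt allows scraping this path
    paths_allowed("https://example.com/movies")

    # Or let polite consult robots.txt and rate-limit requests for you
    session <- bow("https://example.com/movies")
    page    <- scrape(session)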