Project 2
Background
Due date: November 15 at 11:59pm
The goal of this assignment is to practice some of the skills we have been learning about in class around data collection paradigms and functional programming.
To submit your project
You need to create a private GitHub Classroom repository (only one per group) for you and your partner, using the link that will be posted in CoursePlus. This creates an empty GitHub repository. You need to show all your code and submit both the .qmd file and the rendered HTML file. Please include section headers for each of the components below. All plots should have titles, subtitles, captions, and human-understandable axis labels. The TAs will grade the contents of the GitHub Classroom repo by cloning the repo and checking for all the things described below.
Because you will work with a partner, please be sure to include the names, emails, and JHED IDs for both individuals in your submitted work.
Part 1
Here, you and your partner will practice using an API and making data visualizations.
The API we will use is tidycensus (https://walker-data.com/tidycensus), an R package that allows users to interface with a select number of the US Census Bureau's data APIs and return tidyverse-ready data frames, optionally with simple feature geometry included.
The goal of this part is to create a set of data visualizations using the US Census Bureau’s data.
To use this API, you must obtain an API key from the US Census Bureau; the tidycensus documentation linked above describes how to request one.
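Once you have a key, you need to register it with tidycensus a single time. A minimal sketch, where "YOUR_KEY_HERE" is a placeholder (avoid committing your real key to the repository):

```r
# A minimal sketch of registering a Census API key with tidycensus.
# "YOUR_KEY_HERE" is a placeholder for the key you receive from the Census Bureau.
library(tidycensus)

# install = TRUE writes the key to your .Renviron so this only needs to run once
census_api_key("YOUR_KEY_HERE", install = TRUE)
```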
When you use an API, you want to figure out the data you want to extract and then save it locally so that you are not using the API each time you knit or render your data analysis.
Most APIs limit the number of times you can ping them in a given hour, and your IP address can be blocked if you ping the API too many times within a short period.
Therefore, it is strongly suggested that you create a new folder called data locally and save the output extracted from the tidycensus API there. That way, you can read the data in locally each time you knit/render the document, rather than pulling data from the API on every render.
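One way to do this is to wrap the API call in a check for a saved file, so the API is only queried when a local copy does not already exist. A minimal sketch, assuming an ACS median household income pull for Maryland counties (the variable code, state, year, and file name are placeholders for illustration):

```r
library(tidycensus)
library(readr)

data_path <- "data/md_income_2020.csv"

if (!file.exists(data_path)) {
  # Only hit the API when we do not already have a local copy
  dir.create("data", showWarnings = FALSE)
  md_income <- get_acs(
    geography = "county",
    variables = "B19013_001",  # median household income (ACS)
    state     = "MD",
    year      = 2020
  )
  write_csv(md_income, data_path)
} else {
  md_income <- read_csv(data_path)
}
```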
1. Choose a question to investigate. Describe the question you aim to answer with the data and what you want to visualize.
2. Extract data from the tidycensus API. Use at least three different calls to the tidycensus API to extract different datasets. For example, these could be across years, locations, or variables.
3. Clean the data. Include some form of data wrangling and data visualization using packages such as dplyr or tidyr. Other packages that might be helpful to you include lubridate, stringr, and forcats. You must use at least two functions from purrr.
4. Visualize the data. Create data visualizations of your choice. However, your analysis should include at least three plots, using at least two different geom_*() functions from ggplot2 (or another package with geom_*() functions).
5. Report your findings. Provide a paragraph summarizing your methods and key findings. Include any limitations or potential biases in pulling data from the API or the analysis. Be sure to comment and organize your code so it is easy to understand what you are doing. (A short sketch of steps 2-4 appears after this list.)
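The sketch below shows one way steps 2-4 might fit together, assuming a question about how median household income has changed over time; the variable code, years, state, and plot are illustrative placeholders, not a required analysis.

```r
library(tidycensus)
library(dplyr)
library(purrr)
library(ggplot2)

years <- c(2015, 2018, 2021)

# Step 2: three calls to the tidycensus API (one per year), combined with purrr
income_by_year <- years |>
  map(\(yr) get_acs(
    geography = "county",
    variables = "B19013_001",  # median household income (ACS), as an example
    state     = "MD",
    year      = yr
  ) |>
    mutate(year = yr)) |>
  list_rbind()

# Step 3: a second purrr function, here used to shorten county names for plotting
income_by_year <- income_by_year |>
  mutate(county = map_chr(strsplit(NAME, ","), 1))

# Step 4: one of the required plots; a full analysis would add more plots
# using other geom_*() functions
ggplot(income_by_year, aes(x = factor(year), y = estimate)) +
  geom_boxplot() +
  labs(
    title    = "Median household income across Maryland counties",
    subtitle = "ACS estimates pulled with tidycensus",
    caption  = "Source: US Census Bureau via the tidycensus API",
    x = "Year",
    y = "Median household income (USD)"
  )
```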
Part 2
In this part, you and your partner will use the rvest package to scrape data from a website, wrangle and analyze the data, and summarize your findings.
1. Choose a website to scrape. Select a website with structured data in HTML tables or well-defined sections. Some examples could include:
   - A movie database like IMDb or Rotten Tomatoes (scraping movie titles, ratings, release years, etc.)
   - A job listing site like Indeed or LinkedIn (scraping job titles, companies, and locations)
   - A sports statistics site like ESPN or Baseball Reference (scraping team statistics, player info, etc.)
2. Extract data with rvest. Here, you will want to identify the specific HTML elements or CSS selectors containing the data. Then, use rvest functions like read_html(), html_elements(), and html_text() or html_table() to retrieve the data.
3. Clean the data. Next, perform some basic wrangling, such as removing extra whitespace, handling missing values, and converting data types as needed. You might find the functions from dplyr or tidyr useful for any additional transformations, such as renaming columns, filtering rows, or creating new variables.
4. Analyze the data. Perform a simple analysis of your choice. For example, you could:
   - Count how many times specific words or themes appear.
   - Create a summary statistic (e.g., average rating, job salary, team win percentage).
   - Create a data visualization (e.g., bar chart, histogram) of an interesting metric.
5. Report your findings. Provide a paragraph summarizing your methods and key findings. Include any limitations or potential biases in your scraping or analysis. Be sure to comment and organize your code so it is easy to understand what you are doing. (A short sketch of steps 2-4 appears after this list.)
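The sketch below shows how steps 2-4 might look for a page containing a single HTML table of team statistics; the URL and the Wins column are assumptions for illustration, not a required analysis.

```r
library(rvest)
library(dplyr)
library(stringr)

# Step 2: read the page and pull the first table into a data frame
page <- read_html("https://example.com/team-stats")  # placeholder URL

stats_raw <- page |>
  html_element("table") |>
  html_table()

# Step 3: basic cleaning -- trim whitespace in column names and convert the
# (assumed) Wins column to numeric
stats_clean <- stats_raw |>
  rename_with(str_trim) |>
  mutate(Wins = as.numeric(Wins))

# Step 4: a simple summary statistic
stats_clean |>
  summarize(mean_wins = mean(Wins, na.rm = TRUE))
```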
Before scraping a site, it's good practice to check whether the site allows scraping in its terms of service and to look at the robots.txt file. While robots.txt does not legally prevent access, it reflects the site's preferred limits on automated access.
Some web scraping tools and libraries allow you to specify a "polite" setting that follows the robots.txt rules by default, which is often a good practice to follow.
It’s typically located at the root of the domain (e.g., https://example.com/robots.txt) and follows a simple format to specify rules.
For more information, check out https://en.wikipedia.org/wiki/Robots.txt.
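In R, one option is the polite package, which layers on top of rvest, checks robots.txt for you, and rate-limits requests. A minimal sketch with a placeholder URL:

```r
library(polite)
library(rvest)

# bow() reads the site's robots.txt and sets up a rate-limited session
session <- bow("https://example.com/team-stats", user_agent = "course project")

# scrape() only proceeds if robots.txt allows access to this path
page <- scrape(session)

page |>
  html_element("table") |>
  html_table()
```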