7  Scrape data from web pages

7.1 TODO

  • Strongly point to R4DS for basics.
  • Reconcile chapter with new LOs.

7.2 Introduction

(introduction will be written ~last)

Learning Objectives

After you read this chapter, you will be able to:

  • Decide whether to scrape data from a web page.
  • Use {polite} to responsibly scrape web pages.
  • Scrape complex data structures from web pages.
  • Scrape content that requires interaction.

Prerequisites

(prerequisites will be filled in as I write, if I decide to keep this section)

7.3 Should I scrape this data?

When you see data online that you think you could use, stop to answer these three questions:

  • Am I legally allowed to scrape this data?
  • Am I allowed to scrape this data automatically?
  • Do I need to scrape this data with code?

7.3.1 Am I legally allowed to scrape this data?

I am an R developer, not a lawyer, and none of this should be construed as legal advice. If you’re going to use the data in a commercial product, you may want to consult a lawyer. That said, these guidelines should get you started in most cases.

If you’re using the data for your own exploration or for nonprofit educational purposes, you’re almost certainly free to use the data however you like. Copyright cases tend to involve either making money off the work or making it harder for the owner of the work to make money from it.

Also check the site for legal disclaimers. These are usually located at the bottom of the page or somewhere on the site’s home page. Look for words or phrases like “Legal,” “License,” “Code of Conduct,” “Usage,” or “Disclaimers.” Sometimes the site explicitly grants the right to use the data, and such an explicit grant generally supersedes any more general legal protection.

If you’re going to publish (or otherwise share) the data, and you can’t find anything on the site giving you permission, you’ll have to decide whether your use case is allowed. In the United States, facts are not protected by copyright.1 TODO: DIFFERENT EXAMPLE, EVERYONE USES RECIPES! That’s why cookbooks and online recipes tend to devote as much space (or more) to stories as to the recipes themselves: the recipes are facts and thus don’t have copyright protection in the U.S. However, a collection of facts (such as the data you’re trying to scrape) can be protected in the U.S. if a person exercised at least a minimal degree of creativity in selecting or arranging that collection.

“These choices as to selection and arrangement, so long as they are made independently by the compiler and entail a minimal degree of creativity, are sufficiently original that Congress may protect such compilations through the copyright laws”2

Outside the U.S., protections may be stronger or weaker. For example, the European Union grants databases their own protection under its directive on the legal protection of databases. If you’re going to publish the data, investigate the legal requirements in your location.

7.3.2 Am I allowed to scrape this data automatically?

Even if it is legal for you to use the data, it might not be polite to scrape it automatically. For example, the site might have preferences about which pages can be accessed by code, or specific protections or guidelines for certain pages. Most websites list these preferences in a robots.txt file at the root of the site. The robots.txt file for the online version of this book, for instance, is available at https://jonthegeek.github.io/wapir/robots.txt. This file is often only a few lines long.

https://wapir.io/robots.txt
# TODO: Replace this with the final version for this book.

Sitemap: https://wapir.io/sitemap.xml
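
If you want to check these rules from R instead of reading the file by hand, one option (not settled on for this chapter) is the {robotstxt} package, which downloads and parses a site’s robots.txt. A minimal sketch, with the domain and path as placeholders for whatever page you actually plan to scrape:

library(robotstxt)

# Ask whether a generic crawler ("*") may fetch a given path on this site,
# according to the site's robots.txt.
paths_allowed(
  paths = "/index.html",
  domain = "wapir.io",
  bot = "*"
)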

Things to put into this section:

  • robots.txt (search for "User-agent: *" and the particular page you want to scrape)
  • Licenses & legal usage. IANAL.
  • Is it worth scraping? Will {datapasta} cover it?
  • Quickly identifying whether it will be hard.
  • Look for an API to use instead of scraping. Just introduce the idea then point to that chapter.
  • Even when robots.txt allows scraping, give tips on how to be a good citizen.
  • {polite} package? I think probably focus more on learning to read robots.txt and how to implement things in rvest (and later httr2), but I should at least mention {polite}. (A rough sketch of the {polite} workflow follows this list.)
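
As a preview of the {polite} workflow flagged in the last bullet (details to be pinned down when this section is written), here is a rough sketch: bow() introduces your scraper to the site and reads its robots.txt, and scrape() then fetches pages through that session while respecting the site’s crawl-delay preferences. The user agent string is a placeholder.

library(polite)
library(rvest)

# bow() reads the site's robots.txt and sets a user agent and crawl delay
# for the session.
session <- bow("https://wapir.io/", user_agent = "wapir book example")

# scrape() fetches the page through that session, honoring the crawl delay.
page <- scrape(session)

# The result is an ordinary {rvest}/{xml2} document, ready for the usual
# extraction functions.
page |> 
  html_element("title") |> 
  html_text2()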

7.4 How can I scrape a table of data from a web page?

  • Basic intro to {rvest}.
    • Scrape a static table in this repo where html_table works out of the box. It looks like it’ll get a single table without any fuss. (A rough sketch follows this list.)
    • BRIEFLY touch on data cleaning, but send them to R4DS for more details.
    • Show an example with multiple tables & how to deal with that (pre-SelectorGadget).
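
For reference while drafting, a minimal sketch of what that first example might look like. The URL is a stand-in (a Wikipedia page that happens to contain several tables) until the static table for this repo exists.

library(rvest)

# Placeholder page with several HTML tables; the eventual example will use a
# static table hosted alongside this book.
page <- read_html("https://en.wikipedia.org/wiki/List_of_HTTP_status_codes")

# A single table, without any fuss: html_table() converts it to a tibble.
page |> 
  html_element("table") |> 
  html_table()

# When a page has multiple tables, html_elements() grabs all of them and
# html_table() returns a list of tibbles to choose from.
all_tables <- page |> 
  html_elements("table") |> 
  html_table()
length(all_tables)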

7.5 How can I scrape more complex data from a web page?

  • Using SelectorGadget.
    • Start by reproducing the table scrape, without SelectorGadget (basically just do what )
    • A second table where html_table doesn’t work as well (weird headers or something). Maybe; the next example might be enough.
    • A web page with structured content that isn’t in a table. Like the Star Wars example in rvest, but something else. Maybe information about HTTP request methods? Oh, or Xpath rules! (A rough sketch using the rvest Star Wars page follows this list.)
  • CSS selectors
    • Start by digging into what SelectorGadget produced for the examples above/what it means.
    • Add an example where it’s way more straightforward to do things sequentially (structured data in a particular cell of a particular table, something like that).
    • CSS selectors vs Xpath. Hmm. Which is more straightforward? Probably teach Xpath because it lets you do all the things. Nope, CSS selectors can also do all the things, and they overlap with actual CSS which is useful in its own right! Consider teaching both, at least briefly, and point to MDN or w3schools for more. FWIW, rvest translates everything to xpath via selectr::css_to_xpath().
    • If Xpath: Note the difference between ".//" (search below the current node) and "//" (search anywhere in the document). You never want just "//" in a pipe, because it ignores previous steps (well, except MAYBE if you’re using a selection to go back and find something else)! Probably put this in a call out.
    • Maybe note that a lot of {rvest} is reexports from {xml2}?
    • Also note the flatten argument for xml2::xml_find_all()! By default it de-duplicates, so watch out if you’re trying to align lists. Make sure this behaves how I think it does, and, if so, provide an example where it matters. This appears to be the only place where I need to bring up {xml2} directly, but probably point it out for further reading.
    • html_attrs() (list all of the attributes) vs html_attr() (get a specific attribute). Similar to attributes() vs attr().
    • A con of CSS selectors: Can’t go back up the tree! I feel like this is probably important in some contexts. If there’s a legitimate case where this is true, use that as the reason to focus on Xpath.
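
A rough sketch of the pieces above, using the rvest Star Wars vignette page as the stand-in example (to be swapped out if a different page wins). It shows a CSS selector next to the equivalent Xpath, the ".//" vs "//" distinction, and html_attrs() vs html_attr().

library(rvest)

# Each film on this page lives in its own <section> element.
html <- read_html("https://rvest.tidyverse.org/articles/starwars.html")

# A CSS selector and the equivalent Xpath expression find the same nodes.
sections <- html |> html_elements("section")
html |> html_elements(xpath = ".//section")

# Within a selection, ".//" searches below the current nodes; a bare "//"
# would search the whole document again and ignore the earlier steps.
sections |> 
  html_element("h2") |> 
  html_text2()
sections |> 
  html_elements(xpath = ".//h2") |> 
  html_text2()

# html_attrs() lists every attribute of each node; html_attr() extracts one
# attribute by name (much like attributes() vs attr()).
links <- html |> html_elements("a")
links |> head(2) |> html_attrs()
links |> html_attr("href") |> head(5)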

Ideas to cover

  • Advanced {rvest} techniques.
    • Can I ~easily deploy something that requires a session?
    • Is the session stuff in {rvest} relatively stable? How might the {chromote} PR impact it?
    • Are there options other than {rvest} that I should explore? Are {RSelenium} and {chromote} worth a mention, or too different?
    • Bridge into {httr2}. Don’t write this yet, bridges should be written last!
    • Larger-than-memory scraping, splitting scrapes, etc.
  • Automation.
    • Refining selections toward “what” vs “how” & “where”, trying to make your selections less fragile.
    • Deduplication.
    • Error handling. (A rough sketch follows this list.)
    • Logging? Likely mention and point to do4ds for more details (or just a specific logging package).
    • Mention packages for notification that a job is finished (from {beepr} to an email or Slack message… but maybe use those to tease later chapters).
  • {targets}?
    • At least mention.
    • Dig into this to see how quickly I could introduce without it growing into a separate thing.
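
One possible shape for the deduplication and error-handling bullets above (the URLs and package choices are placeholders, not commitments): deduplicate the URL list up front, wrap the request in purrr::possibly() so a single failure doesn’t stop the run, and keep the failures for a retry or a log entry.

library(purrr)
library(rvest)

# Placeholder list of pages; in practice this might come from a sitemap or an
# index page, and could easily contain duplicates.
urls <- c(
  "https://rvest.tidyverse.org/articles/rvest.html",
  "https://rvest.tidyverse.org/articles/starwars.html",
  "https://rvest.tidyverse.org/articles/rvest.html"
)

# Deduplicate so no page is requested twice.
urls <- unique(urls)

# possibly() turns errors into a placeholder value, so one broken page
# doesn't kill the whole scrape.
read_html_safely <- possibly(read_html, otherwise = NULL)
pages <- map(urls, read_html_safely)

# Anything that failed is NULL; keep those URLs for a retry or a log entry.
failed <- urls[map_lgl(pages, is.null)]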

Another Exploration

Suggested by Emir Bilim on DSLC, let them know when it’s worked out! Also include a case like this in the Appendix!

library(rvest)
url <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=1"

## Annual only for now.

# Cache the read during experimentation so you don't hit it over and over.
raw_html <- rvest::read_html(url)

annual_cells <-
    raw_html |> 
    rvest::html_element(
        xpath = ".//div[@class='M(0) Whs(n) BdEnd Bdc($seperatorColor) D(itb)']"
    ) |> 
    rvest::html_text2() |> 
    # The text comes in as a single block separated by newlines (\n).
    stringr::str_split_1(stringr::fixed("\n"))

# The first cells are the column headers, and the last header cell is a date.
# Its position tells us how many cells make up each row of the table.
annual_date_cell_numbers <- stringr::str_which(
    annual_cells, "^\\d{1,2}/\\d{1,2}/\\d{4}$"
)
last_date_cell_number <- max(annual_date_cell_numbers)

annual <- 
    annual_cells |> 
    # We're going to transpose the table as we go; rows will become columns,
    # columns will become rows. It's tidier that way. The last date is our last
    # column.
    matrix(nrow = last_date_cell_number, byrow = FALSE) |> 
    as.data.frame() |> 
    janitor::row_to_names(row_number = 1) |> 
    tibble::as_tibble()

# That gets a good start on the "Annual" data; it just needs normal cleaning
# from there. It does NOT include the breakdowns yet, though. I believe both
# that and Quarterly will require some session fanciness. To be continued!

  1. See the U.S. Copyright Office Fair Use Index for a detailed discussion of the legal definition of fair use in the United States, and an index of related court decisions.

  2. Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991)