(introduction will be written ~last)
After you read this chapter, you will be able to:
(prerequisites will be filled in as I write, if I decide to keep this section)
When you see data online that you think you could use, stop to answer these three questions:
I am an R developer, not a lawyer, and none of this should be construed as legal advice. If you’re going to use the data in a commercial product, you may want to consult a lawyer. That said, these guidelines should get you started in most cases.
If you’re using the data for your own exploration or for nonprofit educational purposes, you’re almost certainly free to use it however you like. Copyright cases tend to involve either making money from the work or making it harder for the owner of the work to make money from it.
Also check the site for legal disclaimers. These are usually located at the bottom of the page, or somewhere on the site’s home page. Look for words or phrases like “Legal,” “License,” “Code of Conduct,” “Usage,” or “Disclaimers.” Sometimes the site explicitly grants the right to use the data; such a grant generally supersedes any more general legal protection.
If you’re going to publish (or otherwise share) the data, and you can’t find anything on the site giving you permission, you’ll have to decide whether your use case is allowed. In the United States, facts are not protected by copyright.¹ TODO: DIFFERENT EXAMPLE, EVERYONE USES RECIPES! That’s why cookbooks and online recipes tend to devote as much space (or more) to stories as to the recipes themselves; the recipes are facts and thus don’t have copyright protection in the U.S. However, a collection of facts (such as the data you’re trying to scrape) can be protected in the U.S. if that collection was selected by a person.
“These choices as to selection and arrangement, so long as they are made independently by the compiler and entail a minimal degree of creativity, are sufficiently original that Congress may protect such compilations through the copyright laws.”²
Outside the U.S., protections may be stronger or weaker. For example, the European Union established specific legal protections for databases in a directive on the legal protection of databases. If you’re going to publish the data, investigate legal requirements in your location.
Even if it is legal for you to use the data, it might not be polite to do so. For example, the site might have preferences about which pages can be accessed by code, or specific protections or guidelines for certain pages. Most websites list these preferences in a robots.txt file at the root of the site. For example, the robots.txt file for the online version of this book is available at https://jonthegeek.github.io/wapir/robots.txt. This file might contain only a few lines.
https://wapir.io/robots.txt:

```
# TODO: Replace this with the final version for this book.
Sitemap: https://wapir.io/sitemap.xml
```
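If you’d rather check these rules from R than read the file by hand, the {robotstxt} package can fetch and interpret a robots.txt file. This is just a minimal sketch (the path below is a placeholder, and this isn’t necessarily the approach the final chapter will use):

```r
library(robotstxt)

# Download and print a site's robots.txt file.
get_robotstxt("wapir.io")

# Ask whether a generic bot ("*") is allowed to fetch a particular path.
paths_allowed(paths = "/index.html", domain = "wapir.io", bot = "*")
```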
Things to put into this section (a rough sketch of several of these appears after the list):

- "User-agent: *" (and the particular page you want to scrape).
- `selectr::css_to_xpath()`.
- `".//"` (search below the current node) and `"//"` (search anywhere in the document). You never want just `"//"` in a pipe, because it ignores previous steps (well, except MAYBE if you’re using a selection to go back and find something else)! Probably put this in a call out.
- `flatten` argument for `xml2::xml_find_all()`! By default it de-duplicates, so watch out if you’re trying to align lists. Make sure this behaves how I think it does, and, if so, provide an example where it matters. This appears to be the only place where I need to bring up {xml2} directly, but probably point it out for further reading.
- `html_attrs()` (list all of the attributes) vs `html_attr()` (get a specific attribute). Similar to `attributes()` vs `attr()`.
- Suggested by Emir Bilim on DSLC, let them know when it’s worked out! Also include a case like this in the Appendix!
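Before the bigger example below, here is a minimal sketch of a few of those notes, using a made-up HTML fragment rather than a real site; it walks through `selectr::css_to_xpath()`, `".//"` vs `"//"`, the `flatten` argument, and `html_attrs()` vs `html_attr()`:

```r
library(rvest)

# A made-up page with two sections, each containing links.
page <- minimal_html('
  <div id="main">
    <a href="https://example.com/a">A</a>
    <a href="https://example.com/b">B</a>
  </div>
  <div id="footer">
    <a href="https://example.com/c">C</a>
  </div>
')

# selectr::css_to_xpath() shows the XPath equivalent of a CSS selector.
selectr::css_to_xpath("div a", prefix = ".//")

main <- html_element(page, "#main")

# ".//" searches below the current node, so this finds only the two links
# inside #main.
html_elements(main, xpath = ".//a")

# "//" searches the whole document, ignoring the earlier selection, so this
# finds all three links even though only #main was piped in.
html_elements(main, xpath = "//a")

# html_attrs() lists every attribute of each element; html_attr() pulls out
# one specific attribute.
links <- html_elements(main, "a")
html_attrs(links)
html_attr(links, "href")

# xml2::xml_find_all() applied to a nodeset combines the results into a
# single nodeset by default; flatten = FALSE instead returns one nodeset per
# input node, which keeps each div aligned with its own links.
divs <- html_elements(page, "div")
xml2::xml_find_all(divs, ".//a")
xml2::xml_find_all(divs, ".//a", flatten = FALSE)
```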
```r
library(rvest)

url <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=1"

## Annual only for now.

# Cache the read during experimentation so you don't hit it over and over.
raw_html <- rvest::read_html(url)

annual_cells <-
  raw_html |>
  rvest::html_element(
    xpath = ".//div[@class='M(0) Whs(n) BdEnd Bdc($seperatorColor) D(itb)']"
  ) |>
  rvest::html_text2() |>
  # The text comes in as a single block separated by newlines (\n).
  stringr::str_split_1(stringr::fixed("\n"))

# The first text to come in is the column headers. We want to go through the
# last one that's a date.
annual_date_cell_numbers <- stringr::str_which(
  annual_cells, "^\\d{1,2}/\\d{1,2}/\\d{4}$"
)
last_date_cell_number <- max(annual_date_cell_numbers)

annual <-
  annual_cells |>
  # We're going to transpose the table as we go; rows will become columns,
  # columns will become rows. It's tidier that way. The last date is our last
  # column.
  matrix(nrow = last_date_cell_number, byrow = FALSE) |>
  as.data.frame() |>
  janitor::row_to_names(row_number = 1) |>
  tibble::as_tibble()

# That gets a good start on the "Annual" data; it just needs normal cleaning
# from there. It does NOT include the breakdowns yet, though. I believe both
# that and Quarterly will require some session fanciness. To be continued!
```
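From there, one possible next cleaning step might look like the sketch below. The column names (`Breakdown` plus one column per line item) and the comma-separated number strings are assumptions about what the scrape returns, not something verified against the live page:

```r
# Sketch only: assumes the first column came through named "Breakdown" and
# the remaining columns hold strings like "123,456,000".
annual_clean <-
  annual |>
  dplyr::mutate(dplyr::across(-Breakdown, readr::parse_number))
```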
¹ See the U.S. Copyright Office Fair Use Index for a detailed discussion of the legal definition of fair use in the United States, and an index of related court decisions.

² Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991).