R – Import data from HTML web site
A good reference for this topic was: R for Data Science – Scraping
This post is just my notes while going through that ebook. I post it here for a quick overview when necessary. Examples here come from that source.
The recommended package is ‘rvest’ (as in ‘harvest’, i.e. scrape). It is installed as part of the Tidyverse but is not attached by library(tidyverse), so it needs to be loaded separately:
library(tidyverse)
library(rvest)
Firstly, import the HTML via ‘read_html’ like so:
html <- read_html("http://rvest.tidyverse.org/")
This returns the HTML document as an ‘xml_document’ object (rvest builds on the xml2 package).
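To see what kind of object comes back without depending on a live site, here is a minimal sketch that parses a literal HTML string instead of a URL. The string is made up for illustration and is not from the source.

```r
library(rvest)

# Literal HTML string, made up for illustration (no network access needed)
doc <- read_html("<html><body><p>Hello</p></body></html>")

# Inspect the class of the parsed document
class(doc)
inherits(doc, "xml_document")
```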
Then it is a matter of looking through the document (e.g. with the browser's inspector) to find the desired data and the elements, classes or ids needed to target it. Selection is done with ‘html_elements’, which returns a node set of every match, and ‘html_element’, which returns only the first match for each input (or a missing value if there is none). Both take CSS selectors, so classes and ids are written much as in jQuery (i.e. “.class” or “#id”).
html |> html_elements("b")
html |> html_element("b")
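The difference between the two functions is easiest to see on a small inline document. This is a rough sketch: the HTML fragment, the ‘human’ class and the ‘first’ id are made up for illustration.

```r
library(rvest)

# Hypothetical page fragment, made up for illustration
html <- minimal_html("
  <ul>
    <li id='first'><b>C-3PO</b> is a <i>droid</i></li>
    <li class='human'><b>Luke</b> is a <i>human</i></li>
    <li>Yoda has no bold name</li>
  </ul>
")

items <- html |> html_elements("li")      # node set with all three <li>

bold_all  <- items |> html_elements("b")  # every <b> that exists
bold_each <- items |> html_element("b")   # one result per <li>, missing where there is no <b>

length(bold_all)   # 2
length(bold_each)  # 3

# Class and id selectors
by_class <- html |> html_elements(".human")
by_id    <- html |> html_element("#first")
```

Because ‘html_element’ keeps one result per input, it is usually the safer choice when the pieces will later be lined up as columns.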
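The text content of an element is accessed via ‘html_text2’, which returns the text roughly as a user of the page would see it (whitespace collapsed, line breaks respected). ‘html_text’ is a more primitive function that returns the raw underlying text. This is an example of how it would be used, where ‘characters’ is a node set selected earlier with ‘html_elements’ in the source example: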
characters |> html_element(".weight") |> html_text2()
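To compare the two text functions side by side, here is a small sketch on a made-up paragraph (not from the source) containing a line break and repeated spaces.

```r
library(rvest)

# Hypothetical paragraph with a <br> and runs of spaces, made up for illustration
html <- minimal_html("<p>First line<br>second   line   with extra spaces</p>")

p <- html |> html_element("p")

raw   <- p |> html_text()    # raw text: <br> contributes nothing, spaces kept as written
shown <- p |> html_text2()   # browser-like text: <br> becomes a newline, spaces collapsed

writeLines(raw)
writeLines(shown)
```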
HTML can also be supplied directly with ‘minimal_html’, such as:
html <- minimal_html("<p><a href='https://link.com'>Random link</a></p>")
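html_attr() is used to return the value of an attribute of the selected elements. The attribute has to be read from the element that carries it, so select the link rather than the enclosing paragraph:
html |> html_elements("a") |> html_attr("href")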
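Since an attribute belongs to one particular element, the selector has to reach the ‘a’ tag itself. The sketch below uses made-up example.com links (not from the source) and includes a paragraph with no link to show how a missing value comes through.

```r
library(rvest)

# Hypothetical fragment, made up for illustration
html <- minimal_html("
  <p><a href='https://example.com/a'>Link A</a></p>
  <p><a href='https://example.com/b'>Link B</a></p>
  <p>No link here</p>
")

# One <a> per paragraph (missing for the last one), then read its href
hrefs <- html |>
  html_elements("p") |>
  html_element("a") |>
  html_attr("href")

hrefs

# Asking the <p> itself for href finds nothing, since <p> has no such attribute
on_p <- html |> html_elements("p") |> html_attr("href")
```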
Table data can be accessed with html_table(). It returns a tibble with the data found in the specified table.
html |> html_element(".mytable") |> html_table()
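For something that can be run offline, here is a minimal sketch with a made-up table. The ‘mytable’ class matches the line above, while the column names and values are assumptions for illustration.

```r
library(rvest)

# Hypothetical table, made up for illustration
html <- minimal_html("
  <table class='mytable'>
    <tr><th>name</th><th>weight_kg</th></tr>
    <tr><td>C-3PO</td><td>167</td></tr>
    <tr><td>Yoda</td><td>66</td></tr>
  </table>
")

tbl <- html |>
  html_element(".mytable") |>
  html_table()

tbl          # a tibble with one row per <tr> of data
names(tbl)   # column names taken from the <th> cells
```

By default html_table() tries to convert columns to sensible types, so the weights come back as numbers rather than text.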
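‘read_html’ only downloads the raw HTML and does not run JavaScript, so content that is generated or modified by scripts will not be in the returned document. Recent versions of rvest (1.0.4 onward) add ‘read_html_live’, which loads the page in a real browser for such cases.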