Scraping an interactive table in R with rvest

I'm trying to scrape the scrolling table from the following link: http://proximityone.com/cd114_2013_2014.htm
I'm using rvest but am having trouble finding the correct xpath for the table. My current code is as follows:
library(rvest)

url <- "http://proximityone.com/cd114_2013_2014.htm"
gis_data_html <- read_html(url)
table <- gis_data_html %>%
  html_node(xpath = '//span') %>%
  html_table()
Currently I get the error "no applicable method for 'html_table' applied to an object of class "xml_missing""
Anyone know what I would need to change to scrape the interactive table in the link?

So the problem you're facing is that rvest will read the source of a page, but it won't execute the javascript on the page. When I inspect the interactive table, I see
<textarea id="aw52-box-focus" class="aw-control-focus " tabindex="0"
onbeforedeactivate="AW(this,event)" onselectstart="AW(this,event)"
onbeforecopy="AW(this,event)" oncut="AW(this,event)" oncopy="AW(this,event)"
onpaste="AW(this,event)" style="z-index: 1; width: 100%; height: 100%;">
</textarea>
but when I look at the page source, "aw52-box-focus" doesn't exist. This is because it's created as the page loads via javascript.
You have a couple of options to deal with this. The 'easy' one is to use RSelenium and an actual browser to load the page, then grab the element after it's loaded. The other option is to read through the javascript, see where it's getting the data from, and tap into that source directly rather than scraping the table.
UPDATE
Turns out it's really easy to read the javascript - it's just loading a CSV file. The address is in plain text, http://proximityone.com/countytrends/cd114_acs2014utf8_hl.csv
The .csv doesn't have column headers, but those are in the <script> as well
var columns = [
"FirstNnme",
"LastName",
"Party",
"Feature",
"St",
"CD",
"State<br>CD",
"State<br>CD",
"Population<br>2013",
"Population<br>2014",
"PopCh<br>2013-14",
"%PopCh<br>2013-14",
"MHI<br>2013",
"MHI<br>2014",
"MHI<br>Change<br>2013-14",
"%MHI<br>Change<br>2013-14",
"MFI<br>2013",
"MFI<br>2014",
"MFI<br>Change<br>2013-14",
"%MFI<br>Change<br>2013-14",
"MHV<br>2013",
"MHV<br>2014",
"MHV<br>Change<br>2013-14",
"%MHV<br>Change<br>2013-14",
]
Programmatic Solution
Instead of digging through the javascript by hand (in case there are several such pages on this website you want), you can attempt this programmatically too. We read the page, get the <script> nodes, get the "text" (the script itself), and look for references to a csv file. Then we expand out the relative URL and read it in. This doesn't help with column names, but it shouldn't be too hard to extract those too (see the sketch after the code below).
library(rvest)

page = read_html("http://proximityone.com/cd114_2013_2014.htm")

# scripts that reference a .csv file
scripts = page %>%
  html_nodes("script") %>%
  html_text() %>%
  grep("\\.csv", ., value = TRUE)

# expand the relative URL and read the csv in
relCSV = stringr::str_extract(scripts, "\\.\\./.*?csv")
fullCSV = gsub("\\.\\.", "http://proximityone.com", relCSV)
data = read.csv(fullCSV, header = FALSE)
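As a rough follow-up, the column names can be pulled from the same <script> text, since they sit in a var columns = [...] block; this is only a sketch and assumes that block still appears in the page source:
colScript = page %>%
  html_nodes("script") %>%
  html_text() %>%
  grep("var columns", ., value = TRUE)

# isolate the bracketed list, then pull out each quoted header
colBlock = stringr::str_extract(colScript, "var columns\\s*=\\s*\\[[^\\]]*\\]")
colNames = stringr::str_match_all(colBlock, '"(.*?)"')[[1]][, 2]
colNames = gsub("<br>", " ", colNames)  # drop the HTML line breaks used for display

# assumes the header list matches the number of csv columns
names(data) = make.unique(colNames)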

Related

Unable to extract specific text located in a <span> object from LinkedIn while using rvest and Selenium

First of all, I'm only a beginner in R, so my apologies if this sounds like a dumb question.
Basically, I want to scrape the experience section on LinkedIn and extract the name of the position. As an example, I picked the profile of Hadley Wickham. As you can see in this screenshot, the data I need ("Chief Scientist") is located in a span object, and the span object itself is located within several div objects.
As a first attempt, I figured I'd just extract the text directly from the span objects using the code below. Unsurprisingly, it returned every piece of text that was in the other span objects as well.
role <- signals %>%
  html_nodes("span") %>%
  html_nodes(".visually-hidden") %>%
  html_text()
I can isolate the text I need by subsetting ("[ ]") the object, but I'm going to apply this code to several LinkedIn profiles and the position of the title will change depending on the page. So I thought, "OK, maybe I need to tell R that I want to target the span object located in the experience section and not the whole page," and that mentioning "#experience" in the code would make it pick only the span object I need. But it only returned an empty object.
role <- signals %>%
  html_nodes("#experience") %>%
  html_nodes("span") %>%
  html_nodes(".visually-hidden") %>%
  html_text()
I'm pretty sure I'm missing some steps here, but I can't figure out what. Maybe I need to specify each object that sits between "#experience" and "span" for this code to work, but I feel there must be a better and easier way. Hope this makes sense. I spent a lot of time trying to debug this and I'm not skilled enough in scraping to find a solution on my own.
As is, this requires RSelenium, since the data is rendered after the page loads rather than being present in the static HTML that rvest reads. Refer here on how to launch a browser (Chrome, Firefox, IE, etc.) with the driver object named remdr.
The below snippet opens Firefox but there are other ways to launch any other browser
library(RSelenium)

selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remdr <- remoteDriver(port = 4567L, browserName = "firefox")
Sys.sleep(runif(1, 10, 15))  # give the Selenium server a moment to start
remdr$open()
You might have to log in to LinkedIn since it won't allow viewing profiles without signing in. You will need to use the RSelenium clickElement and sendKeysToElement functions to operate the webpage.
remdr$navigate(url = 'https://www.linkedin.com/') # navigate to the link
username <- remdr$findElement(using = 'id', value = 'session_key') # find the username field
username$sendKeysToElement(list('your_USERNAME'))
password <- remdr$findElement(using = 'id', value = 'session_password') # find the password field
password$sendKeysToElement(list('your_PASSWORD'))
remdr$findElement(using = 'xpath', value = "//button[@class='sign-in-form__submit-button']")$clickElement() # find and click the Sign in button
Once the page is loaded, you can get the page source and use rvest functions to read between the HTML tags. You can use this extension to easily get xpath selectors for the text you want to scrape.
pgSrc <- remdr$getPageSource()
pgData <- read_html(pgSrc[[1]])
experience <- pgData %>%
  html_nodes(xpath = "//div[@class='text-body-medium break-words']") %>%
  html_text(trim = TRUE)
Output of experience:
> experience
[1] "Chief Scientist at RStudio, Inc."

Scraping frames in R without RSelenium?

I need to scrape the “manuscript received date” that is visible in the right-hand frame once you click “Information” on this page: https://onlinelibrary.wiley.com/doi/10.1002/jcc.26717. I tried the rvest script listed below, which has worked fine in similar situations. However, it does not work in this case, perhaps because of the click required to get to the publication history. I tried solving this by adding #pane-pcw-details to the url (https://onlinelibrary.wiley.com/doi/10.1002/jcc.26717#pane-pcw-details), but to no avail. Another option would be to use RSelenium, but perhaps there is a simpler workaround?
library(rvest)

link <- c("https://onlinelibrary.wiley.com/doi/10.1002/jcc.26717#pane-pcw-details")
wiley_output <- data.frame()
page = read_html(link)
revhist = page %>% html_node(".publication-history li:nth-child(5)") %>% html_text()
wiley_output = rbind(wiley_output, data.frame(link, revhist, stringsAsFactors = FALSE))
That data comes from an ajax call you can find in the network tab. It has a lot of querystring params, but you actually only need the identifier for the document, along with ajax=true, to ensure the server returns the data associated with the specified ajax action:
https://onlinelibrary.wiley.com/action/ajaxShowPubInfo?ajax=true&doi=10.1002/jcc.26717
library(rvest)
library(magrittr)
link <- 'https://onlinelibrary.wiley.com/action/ajaxShowPubInfo?ajax=true&doi=10.1002/jcc.26717'
page <- read_html(link)
page %>% html_node(".publication-history li:nth-child(5)") %>% html_text()
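If you want to fold this back into the data.frame loop from the question, a minimal sketch might look like the following (the dois vector is a placeholder; extend it with the documents you actually need):
get_pub_history <- function(doi) {
  # build the ajax URL for one document and pull the 5th publication-history entry
  link <- paste0("https://onlinelibrary.wiley.com/action/ajaxShowPubInfo?ajax=true&doi=", doi)
  read_html(link) %>%
    html_node(".publication-history li:nth-child(5)") %>%
    html_text()
}

dois <- c("10.1002/jcc.26717")  # placeholder list of DOIs
wiley_output <- data.frame(doi = dois,
                           revhist = vapply(dois, get_pub_history, character(1)),
                           stringsAsFactors = FALSE)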

Why is html_nodes() in R not giving me the desired output for this webpage?

I am looking to extract all the links for each episode on this webpage; however, I appear to be having difficulty using html_nodes(), where I haven't experienced such difficulty before. I am trying to use "." with the class names so that all elements matching that CSS are returned. The code below is meant to output the attributes of all those elements, but instead I get {xml_nodeset (0)}. I know what to do once I have the attributes in order to pull out the links specifically, but this step is proving a stumbling block for this website.
Here is the code I have begun in R:
episode_list_page_1 <- "https://jrelibrary.com/episode-list/"

episode_list_page_1 %>%
  read_html() %>%
  html_node("body") %>%
  html_nodes(".type-text svelte-fugjkr first-mobile first-desktop") %>%
  html_attrs()
rvest alone does not work here because this page uses javascript to insert another webpage into an iframe in order to display the information.
If you search the embedded script you will find a reference to this page: "https://datawrapper.dwcdn.net/eoqPA/66/", which redirects to "https://datawrapper.dwcdn.net/eoqPA/67/". This second page contains the data you are looking for as embedded JSON, generated via javascript.
The links to the shows are extractable, and searching this second page also turns up a link to a Google doc that is the full episode index:
library(rvest)
library(dplyr)
library(stringr)
page2 <- read_html("https://datawrapper.dwcdn.net/eoqPA/67/")

# find all of the links on the page:
str_extract_all(html_text(page2), 'https:.*?\\"')

# isolate the Google docs links
print(str_extract_all(html_text(page2), 'https://docs.*?\\"'))
#[[1]]
#[1] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/edit?usp=sharing"
#[2] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/export?format=csv&id=12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8"

Web Data Scraping issues - Auto Download

I want to automatically download all the whitepapers from this website: https://icobench.com/ico. When you enter each ICO's webpage, there is a whitepaper tab to click, which takes you to a pdf preview screen. I want to retrieve the pdf url from the page source using rvest, but nothing comes back after trying multiple inputs for the nodes.
An example of one ICO's css inspect:
<embed id="plugin" type="application/x-google-chrome-pdf"
  src="https://www.ideafex.com/docs/IdeaFeX_twp_v1.1.pdf"
  stream-url="chrome-extension://mhjfbmdgcfjbbpaeojofohoefgiehjai/9ca6571a-509f-4924-83ef-5ac83e431a37"
  headers="content-length: 2629762
  content-type: application/pdf
I've tried something like the following:
library(rvest)
library(stringr)

url <- "https://icobench.com/ico"
url <- str_c(url, '/hygh')
webpage <- read_html(url)
Item_html <- html_nodes(webpage, "content embed#plugin")
Item <- html_attr(Item_html, "src")
or
Item <- html_text(Item_html)
Item
But nothing comes back, anybody can help?
From the above example, I'm expecting to retrieve the embedded url pointing to the ICO's official website for the pdf whitepaper, e.g. https://www.ideafex.com/docs/IdeaFeX_twp_v1.1.pdf.
But as it's a Google Chrome plugin, it's not being retrieved by the rvest package. Any ideas?
A possible solution:
Using your example, I would change the selector to combine an id selector with an attribute = value selector, joined by a descendant combinator. This targets the whitepaper tab by id and the child link by its href attribute value, using the $ (ends with) operator to match the pdf.
library(rvest)
library(magrittr)
url <- "https://icobench.com/ico/hygh"
pdf_link <- read_html(url) %>% html_node(., "#whitepaper [href$=pdf]") %>% html_attr(., "href")
Faster option?
You could also target the object tag and its data attribute
pdf_link <- read_html(url) %>% html_node(., "#whitepaper object") %>% html_attr(., "data")
Explore which is fit for purpose across pages.
The latter is likely faster and seems to be used across the few sites I checked.
Solution for all icos:
You could put this in a function that receives a url as input (the url of each ICO); the function would return the pdf url, or some other specified value if no url is found or the css selector fails to match. You'd need to add some handling for that scenario. Then call that function over a loop of all ICO urls, as in the sketch below.
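A minimal sketch of that approach, assuming the same URL structure as the example above (the ico_urls vector is a placeholder you would fill with the real list):
library(rvest)

get_whitepaper_url <- function(ico_url) {
  # returns NA if the request fails or no pdf link matches the selector
  tryCatch(
    read_html(ico_url) %>% html_node("#whitepaper [href$=pdf]") %>% html_attr("href"),
    error = function(e) NA_character_
  )
}

ico_urls <- c("https://icobench.com/ico/hygh")  # placeholder; extend with all ICO urls
whitepapers <- vapply(ico_urls, get_whitepaper_url, character(1))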

Harvesting data with rvest retrieves no value from data-widget

I'm trying to harvest data using rvest (also tried using XML and selectr) but I am having difficulties with the following problem:
In my browser's web inspector the html looks like
<span data-widget="turboBinary_tradologic1_rate" class="widgetPlaceholder widgetRate rate-down">1226.45</span>
(Note: rate-down and 1226.45 are updated periodically.) I want to harvest the 1226.45, but when I run my code (below) it says there is no information stored there. Does this have something to do with the fact that it's a widget? Any suggestions on how to proceed would be appreciated.
library(rvest);library(selectr);library(XML)
zoom.turbo.url <- "https://www.zoomtrader.com/trade-now?game=turbo"
zoom.turbo <- read_html(zoom.turbo.url)
# Navigate to node
zoom.turbo <- zoom.turbo %>% html_nodes("span") %>% `[[`(90)
# No value
as.character(zoom.turbo)
html_text(zoom.turbo)
# Using XML and Selectr
doc <- htmlParse(zoom.turbo, asText = TRUE)
xmlValue(querySelector(doc, 'span'))
For websites that are difficult to scrape, for example where the content is dynamic, you can use RSelenium. With this package and a browser running in Docker, you are able to navigate websites with R commands.
I have used this method to scrape a website that had a dynamic login script, which I could not get to work with other methods.
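A minimal sketch of that setup, assuming a Selenium Firefox container is already running locally (for example via docker run -d -p 4445:4444 selenium/standalone-firefox) and that the rate span keeps the data-widget attribute shown in the question:
library(RSelenium)
library(rvest)

# connect to the Selenium server exposed by the Docker container
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.zoomtrader.com/trade-now?game=turbo")
Sys.sleep(5)  # give the widget's javascript time to populate the rate

# hand the rendered page source to rvest and pull the rate text
rate <- read_html(remDr$getPageSource()[[1]]) %>%
  html_node("span[data-widget='turboBinary_tradologic1_rate']") %>%
  html_text()

remDr$close()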
