Web scraping dynamic pages in R

I have changed the website used in this question to make it better. I am still facing similar issues: I can't do this with the rvest package alone, and maybe the answer will be easier to obtain with RSelenium. The website is http://ravimaailma.fi/cg/tulokset/20/ and I want to obtain the links in the main article listing that lead to the individual race results. The links look something like this: http://ravimaailma.fi/article/tulokset/pori-18-11-2017-tulokset/8718/
I'm trying to use plain rvest, as I thought that would be all that's needed here. SelectorGadget gives the CSS selector for the links as .article-title a, so my code is simply
library(rvest)
url <- "http://ravimaailma.fi/cg/tulokset/20/"
url %>%
  read_html() %>%
  html_nodes(".article-title a") %>%
  html_text()
This returns nothing. The website loads more results when you scroll down, but I thought I would at least get the first results out. The code below gives some links, and links 28:32 look promising, but I think they are links from the sidebar, not from the article list.
url %>%
  read_html() %>%
  html_nodes("a") %>%
  html_attr("href")
What am I doing wrong here, and can RSelenium help me?

Here is my partial answer. It still doesn't get everything, but maybe it helps someone. The code returns one link for the first result; I'm not sure why it isn't giving them all. I'm using
library(RSelenium)
rD <- rsDriver(port = 4444L, browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate("http://ravimaailma.fi/cg/tulokset/20/")
# grab the first matching link and read its href
elem <- remDr$findElement(using = "css selector", value = ".article-title a")
elemtxt <- elem$getElementAttribute("href")
# click the button to load more results
# button <- remDr$findElement(using = "id", value = "loadmore")
# button$clickElement()
remDr$close()
I haven't used the button click yet, but it seemed to be working as well. The only problem is that I can't get all the results from the site.
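To get more than the first link, two things seem worth trying: findElements() (plural) returns every node matching the selector rather than just the first, and the load-more button can be clicked in a loop before harvesting the hrefs. A rough, untested sketch along those lines, assuming the button keeps the id loadmore and a fixed number of clicks is enough:
library(RSelenium)
rD <- rsDriver(port = 4444L, browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate("http://ravimaailma.fi/cg/tulokset/20/")
# click "load more" a few times so additional results get rendered
for (i in 1:5) {
  button <- remDr$findElement(using = "id", value = "loadmore")
  button$clickElement()
  Sys.sleep(2)   # give the page time to append the new results
}
# findElements (plural) returns every matching node, not just the first one
elems <- remDr$findElements(using = "css selector", value = ".article-title a")
links <- unlist(lapply(elems, function(e) e$getElementAttribute("href")[[1]]))
remDr$close()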

[I'm not (yet) allowed to write comments, so I chose to make this post an answer]
RSelenium is not always necessary; you can also interact with a website directly using PhantomJS (see e.g. this example).
If you provide an example from the website instead of a local link to a .pdf, I can try to find out how to retrieve the data.
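For what it's worth, here is a rough sketch of the PhantomJS route, assuming the phantomjs binary is installed and on your PATH: a small script loads the page, waits a few seconds for the JavaScript to render the article list, writes the final HTML to a file, and rvest then parses that file as usual.
library(rvest)
# PhantomJS script: load the page, wait for the javascript, save the rendered HTML
js <- "var page = require('webpage').create();
var fs = require('fs');
page.open('http://ravimaailma.fi/cg/tulokset/20/', function () {
  window.setTimeout(function () {
    fs.write('rendered.html', page.content, 'w');
    phantom.exit();
  }, 3000);
});"
writeLines(js, "render.js")
system("phantomjs render.js")
read_html("rendered.html") %>%
  html_nodes(".article-title a") %>%
  html_attr("href")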

Related

HTML scraping with R and selectorgadget

I have a script below that works for simple HTML scraping, but nothing is returned for this particular site. I am new to using HTML with R and SelectorGadget, and I have other sites that work, so I am wondering why this one does not see the element. The picture below shows the path in the highlighted red box, and I am curious whether it is the # before the fancy-box that makes this hidden. Any tips and language corrections would be helpful, as I am still learning how to scrape HTML.
library(rvest)
library(dplyr)
library(tm)
library(stringi)
library(readr)
url <- read_html('https://www.draftkings.com/draft/contest/84207356')
rot <- url %>%
  html_nodes('.prize-payouts td+ td') %>%
  html_text()
roster <- data.frame(ROT = rot)
The website is using JavaScript to render the page. One solution is to download the data as JSON instead: if you examine the files under the Network tab of your browser's developer tools, you can see the requests the page makes.
This file should provide the information you are looking for:
library(jsonlite)
fromJSON("https://api.draftkings.com/contests/v1/contests/84207356?format=json")
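The exact layout of that JSON is not documented here, so it is easiest to explore it interactively once downloaded; the field names in the commented line below are placeholders only:
library(jsonlite)
contest <- fromJSON("https://api.draftkings.com/contests/v1/contests/84207356?format=json")
str(contest, max.level = 2)   # inspect the structure to locate the payout table
# e.g. payouts <- contest$contestDetail$payoutSummary   # hypothetical path; adjust after looking at str()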
Be sure to comply with the terms of service of this website.

Google Searching in R

This question is not so much about how to Google search in R (that has been discussed many times before) as about why it does not always work.
I found this code in another posted question here, and I recall it working perfectly: it would produce all the links in the search.
But now it does not work. For some reason the node is not there anymore when I pull the data into R, yet when I inspect the HTML in Chrome it is there while browsing; the inspector shows the h3 node, but it is not in what gets downloaded.
library(rvest)
ht <- read_html('https://www.google.co.in/search?q=guitar+repair+workshop')
links <- ht %>% html_nodes(xpath = '//h3/a') %>% html_attr('href')
# strip Google's /url?q= redirect wrapper and drop the trailing tracking parameters
gsub('/url\\?q=', '', sapply(strsplit(links[as.vector(grep('url', links))], split = '&'), '[', 1))
I get the following return:
character(0)
The Google page's display of links depends on your location/preferences, so maybe this is what is causing the issue?
It appears that the format switched very recently, maybe today, and that //h3 is no longer used. This produces what is intended, with one final extraneous result:
library(rvest)
ht <- read_html('https://www.google.co.in/search?q=guitar+repair+workshop')
links <- ht %>% html_nodes(xpath='//a') %>% html_attr('href')
gsub('/url\\?q=','',sapply(strsplit(links[as.vector(grep('url',links))],split='&'),'[',1))
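Assuming the extraneous entry is always the last one, it can simply be dropped after assigning the cleaned vector:
res <- gsub('/url\\?q=', '', sapply(strsplit(links[grep('url', links)], split = '&'), '[', 1))
head(res, -1)   # drop the final extraneous result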

Xpath found with Elements but not readable/scrapeable via rvest

I am trying to scrape the donation amounts listed on a set of donation websites. So in this example, I would like to get
$3, $10, $25, $100, $250, $1500, $2800
The xpath indicates that one of them should be
/html/body/div[1]/div[3]/div[2]/div/div[1]/div/div/
form/div/div[1]/div/div/ul/li[2]/label
and the css selector
li.btn--wrapper:nth-child(2) > label:nth-child(1)
Up to the following, I see something in the xml_nodeset:
library(rvest)
url <- "https://secure.actblue.com/donate/pete-buttigieg-announcement-day"
read_html(url) %>% html_nodes(
  xpath = '//*[@id="cf-app-target"]/div[3]/div[2]/div/div[1]/div/div'
)
Then when I add the second part of the xpath, it comes back blank. The same happens with
X %>% html_nodes("li")
which gives a bunch of things, but all the StyledButton__StyledAnchorButton-a7s38j-0 kEcVlT elements come back blank.
I have worked with rvest for a fair bit now, but this one is baffling. And I am not quite sure how RSelenium would help here, although I do know how to use it for screenshots and clicks. If it helps, the website also refuses to be captured in the Wayback Machine: there is only the background and nothing else.
I have even tried taking a screenshot with RSelenium and attempting OCR with tesseract and magick, but while other pages worked, this particular example fails spectacularly, because the text is white and in a rather nonstandard font. Yes, I have tried image_negate and image_resize to see if they helped, but it only showed that relying on OCR is a rather bad idea, as it depends on the screenshot size.
Any advice on how to best extract what I want in this situation? Thanks.
You can use a regex to extract the numbers from a script tag. You get a comma-separated character vector:
library(rvest)
library(stringr)
con <- url('https://secure.actblue.com/donate/pete-buttigieg-announcement-day?refcode=website', "rb")
page <- read_html(con)
res <- page %>%
  html_nodes(xpath = ".//script[contains(., 'preloadedState')]") %>%   # the script tag holding the page state
  html_text() %>% as.character %>%
  str_match_all(., '(?<="amounts":\\[)(\\d+,?)+')                      # grab the digits inside "amounts":[...]
print(res[[1]][, 1])
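If you want the amounts as numbers rather than one comma-separated string, a short follow-up step is:
amounts <- as.numeric(strsplit(res[[1]][1, 1], ",")[[1]])
amounts   # should give 3 10 25 100 250 1500 2800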

Scraping a HTML table in R, after changing a Javascript dropdown option

I am looking to scrape the main table from this website. I have managed to get the table into R and working, but the one problem is that the website defaults to PS4, while I want the data for Xbox (this is changed in the top-right dropdown menu).
Ideally there would be a way to pass a parameter in the URL that defines the platform, but I haven't been able to find anything about that.
Looking around, it seems that PhantomJS would be the best way to go, but I have no experience with JavaScript and I'm not sure how you would perform an action on the page and then scrape the resulting table.
This is currently all I have in terms of my main code scraping the data:
library(rvest)
url1 <- "https://www.futbin.com/19/players?page="
pge <- 1
tbl <- paste0(url1, pge) %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="repTb"]') %>%
  html_table()
Thanks in advance for any help.
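One possible way forward, sketched here without testing: let RSelenium click the platform switch and then hand the rendered page source back to rvest. The #platform-xbox selector below is a placeholder (inspect the real dropdown in the dev tools to get the actual one):
library(RSelenium)
library(rvest)
rD <- rsDriver(port = 4444L, browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate("https://www.futbin.com/19/players?page=1")
# "#platform-xbox" is a hypothetical id -- check the real selector in the dev tools
xbox <- remDr$findElement(using = "css selector", value = "#platform-xbox")
xbox$clickElement()
Sys.sleep(2)   # wait for the table to re-render
tbl <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="repTb"]') %>%
  html_table()
remDr$close()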

Triggering doPostBack javascript with RSelenium to scrape multi-page table

I am struggling to web-scrape data from a table which spans several pages; the pages are linked via JavaScript.
The data I am interested in is based on the website's search function:
url <- "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0"
I am able to download the first page with the rvest package:
library(rvest)
library(tidyverse)
NI <- read_html(url)
NI.res <- NI %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
NI.res <- NI.res[[1]][c(1:10), c(1:5)]
So far so good.
As far as I understand, the RSelenium package is the way forward for navigating websites with JavaScript, when HTML scraping via changing URLs is not possible. I installed the package and ran it in combination with Docker Toolbox (all working fine):
library(RSelenium)
shell('docker run -d -p 4445:4444 selenium/standalone-chrome')
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100",
                      port = 4445L,
                      browserName = "chrome")
remDr$open()
My hope was that by triggering the JavaScript I could navigate to the next page, repeat the rvest command, and obtain the data contained on, e.g., the 2nd, 3rd, etc. page (eventually this should be part of a loop or a purrr::map function).
Navigate to the table with search results (1st page):
remDr$navigate("http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/1989&td=01/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0")
Trigger the JavaScript. The content of the JavaScript call is taken from hovering with the mouse over the page index on the website (below the table). In the case below, the JavaScript leading to page 2 is triggered:
remDr$executeScript("__doPostBack('ctl00$MainContentPlaceHolder$SearchResultsGridView','Page$2');", args=list("dummy"))
Repeat the scraping with rvest
url <- "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0"
NI <- read_html(url)
NI.res <- NI %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
NI.res2 <- NI.res[[1]][c(1:10), c(1:5)]
Unfortunately, though, triggering the JavaScript appears not to work: the scraped results are again those from page 1, not from page 2. I might be missing something rather basic here, but I can't figure out what.
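Perhaps the problem is that read_html(url) fires a fresh GET request in a new session, so the postback triggered inside the Selenium browser never reaches it. A sketch of parsing the browser's current DOM instead, right after the executeScript() call above (not yet tested):
Sys.sleep(2)   # let the postback finish
page2 <- remDr$getPageSource()[[1]] %>%   # the HTML the browser is showing now
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
NI.res2 <- page2[[1]][c(1:10), c(1:5)]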
My attempt is partly informed by SO posts here, here and here. I also saw this post.
Context: eventually, in further steps, I will have to trigger a click on each single result/row which shows up on all pages and also scrape the information behind each entry. Hence, as far as I understand, RSelenium will be the main tool here.
Grateful for any hint!
UPDATE
I made 'some' progress following the approach suggested here. It is (a) still not doing everything I intend to do and (b) very likely not the most elegant way to do it, but maybe it is of some help to others or opens up a way forward. Note that this approach does not require RSelenium.
I basically created a loop over the JavaScript calls (page indices), each of which leads to another page of the table I want to scrape. The crucial detail is the __EVENTARGUMENT argument, to which I assign the respective page number (my knowledge of JS is basically zero).
# presumed setup from the linked approach: open a session and read the hidden ASP.NET form fields
pgsession <- html_session("http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0")
pgform <- html_form(pgsession)[[1]]
page.list <- list()
for (i in 2:15) {
  target <- paste0("Page$", i)
  # re-post the form with __EVENTTARGET/__EVENTARGUMENT pointing at the page we want
  page <- rvest:::request_POST(pgsession, "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0",
    body = list(
      `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
      `__EVENTTARGET` = "ctl00$MainContentPlaceHolder$SearchResultsGridView",
      `__EVENTARGUMENT` = target,
      `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
      `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
      `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value
    ),
    encode = "form")
  x <- read_html(page) %>%
    html_nodes(css = "#ctl00_MainContentPlaceHolder_SearchResultsGridView") %>%
    html_table(fill = TRUE) %>%
    as.data.frame()
  d <- x[1:(nrow(x) - 2), 1:5]   # drop the pager rows at the bottom of the table
  page.list[[i]] <- d
}
However, this code cannot trigger the JavaScript for, or navigate to, pages which are not visible in the page index below the table when the site is first opened (the index shows pages 1 to 11). Only pages 2 to 11 can be scraped with this loop; since the scripts for page 12 and subsequent pages are not visible, they cannot be triggered.
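For the pages the loop does reach, the per-page data frames can then be stacked into one table with dplyr (attached via tidyverse above); bind_rows() skips the empty first slot of page.list:
NI.all <- dplyr::bind_rows(page.list)   # pages 2 onwards; prepend the page-1 data frame from the first rvest scrape if needed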
