Triggering doPostBack javascript with RSelenium to scrape multi-page table - r

I am struggling to scrape data from a table that spans several pages; the pages are linked via JavaScript.
The data I am interested in is based on the website's search function:
url <- "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0"
I am able to download the first page with the rvest package:
library(rvest)
library(tidyverse)

NI <- read_html(url)
NI.res <- NI %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
NI.res <- NI.res[[1]][1:10, 1:5]
So far so good.
As far as I understand, the RSelenium package is the way forward for navigating websites that rely on JavaScript, i.e. when scraping via changing URLs is not possible. I installed the package and run it in combination with Docker Toolbox (all working fine):
library(RSelenium)

shell('docker run -d -p 4445:4444 selenium/standalone-chrome')

remDr <- remoteDriver(remoteServerAddr = "192.168.99.100",
                      port = 4445L,
                      browserName = "chrome")
remDr$open()
My hope was that by triggering the JavaScript I could navigate to the next page, repeat the rvest command, and obtain the data contained on the 2nd, 3rd, etc. page (eventually this should become part of a loop or a purrr::map call).
Navigate to the table with search results (1st page):
remDr$navigate("http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/1989&td=01/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0")
Trigger the JavaScript. The call itself is taken from hovering over the page index below the table on the website. In the case below, the JavaScript leading to page 2 is triggered:
remDr$executeScript("__doPostBack('ctl00$MainContentPlaceHolder$SearchResultsGridView','Page$2');", args=list("dummy"))
Repeat the scraping with rvest
url <- "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0"
NI <- read_html(url)
NI.res <- NI %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
NI.res2 <- NI.res[[1]][1:10, 1:5]
Unfortunately, triggering the JavaScript appears not to work: the scraped results are again those from page 1, not page 2. I might be missing something rather basic here, but I can't figure out what.
My attempt is partly informed by SO posts here, here and here. I also saw this post.
Context: Eventually, in further steps, I will have to trigger a click on each single finding/row that shows up across all pages and also scrape the information behind each entry. Hence, as far as I understand, RSelenium will be the main tool here.
Grateful for any hint!
UPDATE
I made some progress following the approach suggested here. It a) still does not do everything I intend to do and b) is very likely not the most elegant way of doing it, but maybe it is of some help to others or opens up a way forward. Note that this approach does not require RSelenium.
I basically created a loop over the page-index postbacks, one for each page of the table I want to scrape. The crucial detail is the __EVENTARGUMENT field, to which I assign the respective page number (my knowledge of JavaScript is basically zero).
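For completeness, the loop below assumes that a session has been opened and the page's hidden ASP.NET form fields have been captured beforehand. A minimal sketch of that setup (the object names pgsession, pgform and page.list are the ones the loop uses; I assume the first form on the page is the ASP.NET form):
library(rvest)

## open a session on the search-results url and grab the form that carries the
## hidden ASP.NET state fields (__VIEWSTATE, __EVENTVALIDATION, ...)
pgsession <- html_session(url)
pgform    <- html_form(pgsession)[[1]]

## container for the scraped pages
page.list <- list()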
for (i in 2:15) {
  target <- paste0("Page$", i)
  page <- rvest:::request_POST(pgsession,
                               "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0",
                               body = list(
                                 `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
                                 `__EVENTTARGET` = "ctl00$MainContentPlaceHolder$SearchResultsGridView",
                                 `__EVENTARGUMENT` = target,
                                 `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
                                 `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                                 `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value
                               ),
                               encode = "form")

  x <- read_html(page) %>%
    html_nodes(css = "#ctl00_MainContentPlaceHolder_SearchResultsGridView") %>%
    html_table(fill = TRUE) %>%
    as.data.frame()

  d <- x[1:(nrow(x) - 2), 1:5]   # drop the pager rows at the bottom of the table
  page.list[[i]] <- d
}
However, this code cannot reach the pages that are not visible in the page index below the table when the site is first opened (pages 1 to 11 are listed). Only pages 2 to 11 can be scraped with this loop; since the postbacks for page 12 and onwards are not listed, they cannot be triggered this way.
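One detail I have not tested: ASP.NET regenerates the hidden fields with every response, so refreshing them from each returned page before issuing the next POST may be necessary, e.g. at the end of the loop body:
## untested sketch: rebuild the form from the page just returned, so the next
## POST carries the current __VIEWSTATE / __EVENTVALIDATION values
pgform <- html_form(read_html(page))[[1]]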

Related

Scraping website with html_text() and html_node() gives empty return

I would like to scrape all addresses from a website. I can see the addresses on the website but somehow cannot navigate to them: after about three levels of nesting, navigation with html_nodes() no longer works and simply gets stuck.
The code that I execute is:
library(httr)
library(rvest)

mpreis <- "https://www.mpreis.at/sp/standorte"
addresses_vector <- c()

x <- GET(url = mpreis)
html_doc <- read_html(x)

addresses_vector <- c(addresses_vector,
                      html_doc %>%
                        rvest::html_nodes(xpath = "body/div/main") %>%
                        xml2::xml_find_all(xpath = "//*[contains(@class, 'c3-market-list__item-address')]") %>%
                        rvest::html_text())
I tried to navigate to the element I'm interested in using several different XPath and CSS notations, but html_text() always returns an empty vector because html_nodes() does not reach the intended node.
Any help on why I cannot navigate to the node with that class, or any other ideas on why the content I see on the website comes back empty in R, is appreciated!
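The only other idea I have had (not yet tried) is to let a real browser render the page via RSelenium and then parse the source it actually sees:
## untested idea: render the page in a real browser first, then parse the
## source that the browser actually sees
library(RSelenium)
library(rvest)

rD    <- rsDriver(browser = "chrome", port = 4567L)   # port chosen arbitrarily
remDr <- rD[["client"]]
remDr$navigate("https://www.mpreis.at/sp/standorte")
Sys.sleep(3)   # give the client-side rendering a moment

addresses <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(".c3-market-list__item-address") %>%   # class name taken from the xpath above
  html_text()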

R webscraping a slow/overburdened (?) website

I am web-scraping a website to collect data for research purposes using RSelenium, Docker and rvest.
I've built a script that automatically 'clicks' through the pages whose content I want to download. My problem is that when I run this script, the results change: the number of observations of the variable I'm interested in changes. It concerns about 50,000 observations, and when I run the script several times, the total number of observations differs by a few hundred.
I suspect it has something to do with the internet connection being too slow or with the website not loading quickly enough. When I change Sys.sleep(2) the results change too, but with no clear pattern as to whether higher values make it better or worse.
In the terminal I run:
docker run -d -p 4445:4444 selenium/standalone-chrome
Then my code looks something like this:
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()
remDr$navigate("url of website")

pages <- 100 # for example, I want information from the first hundred pages
variable <- vector("list", pages)
i <- 1

while (i <= pages) {
  variable[[i]] <- remDr$getPageSource()[[1]] %>%
    read_html(encoding = "UTF-8") %>%
    html_nodes("node that indicates the information I want") %>% # select the information I want
    html_text()

  element_next_page <- remDr$findElement(using = 'css selector', "node that indicates the 'next page' button") # select the button that goes to the next page
  element_next_page$sendKeysToElement(list(key = "enter")) # go to the next page

  Sys.sleep(2) # I believe this is done to not overload the website I'm scraping
  i <- i + 1
}
variable <- unlist(variable)
Somehow, running this multiple times keeps returning a different number of observations after I unlist variable.
Does anyone have similar experiences, or tips on what to do?
Thanks.
You could consider including the following code before extracting the text:
for (i in 1:100) {
  print(i)
  remDr$executeScript(paste0("scroll(0, ", i * 2000, ")"))
}
This code forces the browser to scroll through almost the whole page, which can help load sections that would otherwise not be loaded. This approach is used in the following post: How to webscrape texts that are contained into sublinks of a link in R?
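Another thing that might reduce the run-to-run differences (untested here) is to wait explicitly until the content you want is actually present before scraping, instead of relying on a fixed Sys.sleep(). A rough sketch of such a polling helper (the name wait_for_nodes is my own):
wait_for_nodes <- function(remDr, css, timeout = 10) {
  ## poll until at least one element matching `css` is present, or give up
  start <- Sys.time()
  while (as.numeric(difftime(Sys.time(), start, units = "secs")) < timeout) {
    found <- remDr$findElements(using = "css selector", css)
    if (length(found) > 0) return(TRUE)
    Sys.sleep(0.5)
  }
  FALSE
}

## e.g. call wait_for_nodes(remDr, "node that indicates the information I want")
## right after clicking the next-page button and before reading the page source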

How to pull a product link from customer profile page on Amazon

I'm trying to get the product link from a customer's profile page using R's rvest package.
I've referenced various questions on Stack Overflow, including here (could not read webpage with read_html using rvest package from r), but each time I try something I'm not able to return the correct result.
For example on this profile page:
https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8
I'd like to be able to return this link, with the end goal to extract the product id: B01A51S9Y2
https://www.amazon.com/Amagabeli-Stainless-Chainmail-Scrubber-Pre-Seasoned/dp/B01A51S9Y2?ref=pf_vv_at_pdctrvw_dp
library(dplyr)
library(rvest)
library(stringr)
library(httr)

# get url
url <- 'https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'
x <- GET(url, add_headers('user-agent' = 'test'))
page <- read_html(x)

page %>%
  html_nodes("[class='a-link-normal profile-at-product-box-link a-text-normal']") %>%
  html_text()

# I did a test to see if I could even find the href, with no luck
test <- page %>%
  html_nodes("#a-page") %>%
  html_text()
grepl("B01A51S9Y2", test)
Thanks for the tip @Qharr on RSelenium; that is helpful, but I'm still unsure how to extract the link or ASIN.
library(RSelenium)

driver <- rsDriver(browser = c("chrome"), port = 4574L, chromever = "77.0.3865.40")
rd <- driver[["client"]]
rd$open()
rd$navigate("https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_arp_d_gw_btm?ie=UTF8")

prod <- rd$findElement(using = "css", '.profile-at-product-box-link')
prod$getElementText()
This doesn't really return anything.
After adding getElementAttribute('href'), I was able to get the link:
prod <- rd$findElements(using = "css selector", '.profile-at-product-box-link')
for (link in 1:length(prod)) {
  print(prod[[link]]$getElementAttribute('href'))
}
That info is pulled in dynamically by a POST request the page makes, which your initial rvest request doesn't capture. This subsequent request returns, in JSON format, the content governing the ASINs, the product links, etc.
You can find it in the network tab of dev tools (F12): press F5 to refresh the page, then examine the network traffic.
It is not a simple POST request to mimic, so I would just go with RSelenium to let the page render, then use the CSS selector
.profile-at-product-box-link
to gather a webElements collection that you can loop over and extract the href attribute from.
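Putting that together, a short sketch of the loop-and-extract step (the /dp/ pattern used to pull the ASIN out of the href is an assumption about Amazon's URL layout):
prods <- rd$findElements(using = "css selector", ".profile-at-product-box-link")

## collect the hrefs and pull out the 10-character ASIN after /dp/
links <- unlist(lapply(prods, function(el) el$getElementAttribute("href")[[1]]))
asins <- sub(".*/dp/([A-Z0-9]{10}).*", "\\1", links)
asins   # e.g. "B01A51S9Y2"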

How to extract the inherent links from the webpage with my code (error: subscript out of bounds)?

I am rather new to web scraping but need data for my PhD project. For this, I am extracting data on different activities of MEPs from the European Parliament's website. Concretely, and this is where I have problems, I would like to extract the title, and especially the link underlying the title, of each speech from an MEP's personal page. I use code that has already worked fine several times, but here I do not succeed in getting the links, only the titles of the speeches. For the links I get the error message "subscript out of bounds". I am working with RSelenium because there are several load-more buttons on the individual pages that I have to click first before extracting the data (which makes rvest a complicated option as far as I can see).
I have basically been trying to solve this for days now and I really do not know how to get further. I have the impression that the CSS selector is not actually capturing the underlying link (as it extracts the title without problems), but the class has a compound name ("ep-a_heading ep-layout_level2"), so it is not possible to go that way either. I tried rvest as well (ignoring the problem I would then have with the load-more button), but I still do not get to those links.
library(RSelenium)
library(wdman)
library(rvest, warn.conflicts = FALSE)
library(stringr)

server <- phantomjs(port = 7005L)
browser <- remoteDriver(browserName = "phantomjs", port = 7005L)

## this is one of the urls I will use; there are others, constructed all
## the same way and all with the same problem
url <- 'http://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8'
browser$open()
browser$navigate(url)

## now I identify the load-more button and click on it as long as there
## is a "load more" button on the page
more <- browser$findElement(using = "css", value = ".erpl-activities-loadmore-button .ep_name")
while (!is.null(more)) {
  more$clickElement()
  Sys.sleep(1)
}
## I get an error message doing this in the end but it is working anyway
## (yes, I really am a beginner!)

## Now, what I want to extract are the title of the speech and most
## importantly: the URL.
links <- browser$findElements(using = "css", ".ep-layout_level2 .ep_title")
length(links)
## there are 128 speeches listed on the page

URL <- rep(NA, length(links))
Title <- rep(NA, length(links))

## after having created vectors to store the results, I apply the loop
## that has already worked fine many times to extract the data I want
for (i in 1:length(links)) {
  URL[i] <- links[[i]]$getElementAttribute('href')[[1]]
  Title[i] <- links[[i]]$getElementText()[[1]]
}
speeches <- data.frame(Title, URL)
For this example there are 128 speeches on the page, so in the end I would need a table with 128 titles and links. The code works fine when I only extract the titles, but for the URLs I get:
`"Error in links[[i]]$getElementAttribute("href")[[1]] : subscript out of bounds"`
Thank you very much for your help. I have already read many posts on subscript-out-of-bounds issues in this forum, but unfortunately I still couldn't solve the problem.
Have a great day!
I don't seem to have a problem using rvest to get that info; there is no need for the overhead of Selenium. You want to target the a tag child of that class, i.e. .ep-layout_level2 a, in order to be able to access an href attribute. The same selector would apply for Selenium.
library(rvest)
library(magrittr)
page <- read_html('https://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8')
titles <- page %>% html_nodes('.ep-layout_level2 .ep_title') %>% html_text() %>% gsub("\\r\\n\\t+", "", .)
links <- page %>% html_nodes('.ep-layout_level2 a') %>% html_attr(., "href")
results <- data.frame(titles,links)
Here you have a working solution based on the code you provided:
library(RSelenium)
library(wdman)
library(rvest, warn.conflicts = FALSE)
library(stringr)

server <- phantomjs(port = 7005L)
browser <- remoteDriver(browserName = "phantomjs", port = 7005L)

## this is one of the urls I will use; there are others, constructed all
## the same way and all with the same problem
url <- 'http://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8'
browser$open()
browser$navigate(url)

## now I identify the load-more button and click on it as long as it
## still appears in the page source
more <- browser$findElement(using = "class", value = "erpl-activity-loadmore-button")
while (grepl("erpl-activity-loadmore-button", more$getPageSource(), fixed = TRUE)) {
  more$clickElement()
  Sys.sleep(1)
}

## Now, what I want to extract are the title of the speech and most
## importantly: the URL.
links <- browser$findElements(using = "class", "ep-layout_level2")
## there are 128 speeches listed on the page

URL <- rep(NA, length(links))
Title <- rep(NA, length(links))

## after having created vectors to store the results, I apply the loop
## to extract the data I want
for (i in 1:length(links)) {
  l <- links[[i]]$findChildElement(using = "css", "a")
  URL[i] <- l$getElementAttribute('href')[[1]]
  Title[i] <- links[[i]]$getElementText()[[1]]
}
speeches <- data.frame(Title, URL)
speeches
The main differences are:
In the first findElement I use value = "erpl-activity-loadmore-button"; indeed, the documentation says that you cannot look for multiple class values at once.
The same applies when looking for the links.
In the final loop, you first need to select the link element (the a tag) inside the div you selected and then read its href attribute.
To answer your question about the error message after the while loop: once you have pressed the "Load more" button enough times it becomes invisible, but it still exists. So when you check !is.null(more) it is TRUE because the button still exists, but when you try to click it you get an error message because it is invisible. You can fix this by checking whether it is visible or not.
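A minimal way to write that visibility check (assuming RSelenium's isElementDisplayed() method, which returns a list wrapping a logical):
more <- browser$findElement(using = "class", value = "erpl-activity-loadmore-button")
## keep clicking only while the button is actually displayed
while (more$isElementDisplayed()[[1]]) {
  more$clickElement()
  Sys.sleep(1)
}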

WebScraping dynamic pages in R

I will change the website to make this question better. I am still facing similar issues: I can't use only the rvest package, and maybe the answer will be easier to obtain with RSelenium. Website: http://ravimaailma.fi/cg/tulokset/20/. I want to obtain the links from the main article that direct me to the individual race results. The links look something like this: http://ravimaailma.fi/article/tulokset/pori-18-11-2017-tulokset/8718/
I'm trying to use plain rvest, as I thought that would be all that is needed here. SelectorGadget gives the links' CSS as .article-title a, so my code is simply
url %>%
  read_html() %>%
  html_nodes(".article-title a") %>%
  html_text()
This returns nothing. The website loads more results when you scroll down, but I thought I would at least get the first results out. The code below gives some links, and links 28:32 look promising, but I think they are links from the sidebar, not from the article.
url %>%
  read_html() %>%
  html_nodes("a") %>%
  html_attr("href")
What am I doing wrong here, and can RSelenium help me?
Here is my partial answer; it still does not get everything, but maybe it helps someone. The code returns one link for the first result, and I am not sure why it isn't giving them all. I'm using
library(RSelenium)

rD <- rsDriver(port = 4444L, browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate("http://ravimaailma.fi/cg/tulokset/20/")

elem <- remDr$findElement(using = "css selector", value = ".article-title a")
elemtxt <- elem$getElementAttribute("href")

# Click button to load more results
# button <- remDr$findElement(using = "id", value = "loadmore")
# button$clickElement()

remDr$close()
I haven't used the button click yet, but it seemed to be working as well. The only problem is that I can't get all the results from the site.
[I'm not (yet) allowed to write comments, so I chose to make this post an answer]
RSelenium is not always necessary; you can also interact with a website directly through PhantomJS (see e.g. this example).
If you provide an example from the website instead of a local link to a .pdf, I can try to find out how to retrieve the data.
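If the missing results are only injected as you scroll, the scrolling trick shown further up could also be tried from RSelenium before collecting the links (a sketch, untested on this site):
## scroll down in steps so lazily loaded results get a chance to appear
for (i in 1:20) {
  remDr$executeScript(paste0("window.scrollTo(0, ", i * 2000, ");"))
  Sys.sleep(0.5)
}

elems <- remDr$findElements(using = "css selector", ".article-title a")
links <- unlist(lapply(elems, function(el) el$getElementAttribute("href")[[1]]))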
