R webscraping a slow/overburdened (?) website

I am webscraping a website to collect data for research purposes using RSelenium, Docker and rvest.
I've built a script that automatically 'clicks' through the pages whose content I want to download. My problem is that when I run this script, the results change: the number of observations of the variable I'm interested in changes. It concerns about 50,000 observations. When running the script several times, the total number of observations differs by a few hundred.
I'm thinking it has something to do with the internet connection being too slow or with the website not being able to load quickly enough... or something... When I change Sys.sleep(2) the results change too, but without a clear pattern of whether higher values make it better or worse.
In the R terminal I run:
docker run -d -p 4445:4444 selenium/standalone-chrome
Then my code looks something like this:
library(RSelenium)
library(rvest) # also provides %>% via magrittr

remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()
remDr$navigate("url of website")
pages <- 100 # for example, I want information from the first hundred pages
variable <- vector("list", pages)
i <- 1
while (i <= pages) {
  variable[[i]] <- remDr$getPageSource()[[1]] %>%
    read_html(encoding = "UTF-8") %>%
    html_nodes("node that indicates the information I want") %>% # select the information I want
    html_text()
  element_next_page <- remDr$findElement(using = 'css selector', "node that indicates the 'next page' button") # select the button that goes to the next page
  element_next_page$sendKeysToElement(list(key = "enter")) # go to the next page
  Sys.sleep(2) # I believe this is done to not overload the website I'm scraping
  i <- i + 1
}
variable <- unlist(variable)
Somehow, running this multiple times keeps returning different results in terms of the number of observations that remain when I unlist variable.
Does anyone have the same experience, or tips on what to do?
Thanks.

You could consider including the following code before extracting the text:
for (i in 1:100) {
  print(i)
  remDr$executeScript(paste0("scroll(0, ", i * 2000, ")"))
}
This code forces the browser to scroll "almost everywhere in the web page", which can help the page load sections that have not loaded yet. This approach is used in the following post: How to webscrape texts that are contained in sublinks of a link in R?
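A fixed Sys.sleep(2) can also be too short on a slow connection and needlessly long otherwise, which would explain the varying counts. As an alternative, you could poll the page until the number of matching nodes stops changing before extracting them. This is only a sketch built around the placeholder selector from the question (the helper name and the 20-second cap are arbitrary choices):
# Sketch: wait until the count of result nodes is stable before extracting
# (assumes rvest is loaded and uses the placeholder selector from the question)
wait_for_stable_nodes <- function(remDr, selector, max_wait = 20) {
  previous <- -1
  waited <- 0
  repeat {
    current <- remDr$getPageSource()[[1]] %>%
      read_html(encoding = "UTF-8") %>%
      html_nodes(selector) %>%
      length()
    if (current > 0 && current == previous) break  # count stopped changing
    if (waited >= max_wait) break                  # give up after max_wait seconds
    previous <- current
    Sys.sleep(1)
    waited <- waited + 1
  }
}
# inside the while loop, before extracting:
# wait_for_stable_nodes(remDr, "node that indicates the information I want")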

Related

Webscraping images in r and saving them into a zip file

I am trying to webscrape information from this website: https://www.nea.gov.sg/weather/rain-areas and download the 240km radar scans between 2022-07-31 01:00:00 (am) and 2022-07-31 03:00:00 (am) at five-minute intervals, inclusive of end points, then save the images to a zip file.
Edit: Is there a way to do it with just rvest, avoiding the use of for loops?
I've found out that the image address can be acquired by clicking on the image and selecting 'copy image address'. An example: https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_2022091920000000dBR.dpsri.png
I've noted that the string of numbers represents the date and time, so the ones I'd need would be 20220731xxxxxxx, where x would be the time. However, how would I then use this to webscrape?
Could someone provide some guidance? I can't even seem to find the radar scans for that day. Thank you.
You can consider the following code to save screenshots of the webpage:
library(RSelenium)
url <- "https://www.nea.gov.sg/weather/rain-areas"
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
web_Elem <- remDr$findElement("xpath", '//*[@id="rain-area-slider"]/div/button')
web_Elem$clickElement()
for (i in 1:10) {
  print(i)
  Sys.sleep(1)
  path_To_File <- paste0("C:/file", i, ".png")
  remDr$screenshot(display = FALSE, useViewer = TRUE, file = path_To_File)
}
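Since the original question also asks for a zip file, the saved screenshots could afterwards be bundled with utils::zip (a small sketch assuming the file paths used above and a zip utility available on the system):
# bundle the ten screenshots written above into a single archive
utils::zip(zipfile = "C:/radar_screenshots.zip",
           files = paste0("C:/file", 1:10, ".png"))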
Scraping the images from the website requires you to interact with the website (e.g. clicks), so we will use the RSelenium package for the task. You will also need to have Firefox installed on your system to be able to follow this solution.
1. Load packages
We will begin by loading our packages of interest:
# Load packages ----
pacman::p_load(
  httr,
  png,
  purrr,
  RSelenium,
  rvest,
  servr
)
2. Setup
Now, we need to start the Selenium server with firefox. The following code will start a firefox instance. Run it and wait for firefox to launch:
# Start Selenium driver with firefox ----
rsd <- rsDriver(browser = "firefox", port = random_port())
Now that the firefox browser (aka the client) is up, we want to be able to manipulate it with our code. So, let's create a variable (cl for client) that will represent it. We will use the variable to perform all the actions we need:
cl <- rsd$client
The first action we want to perform is to navigate to the website. Once you run the code, notice how Firefox goes to the website as a response to you running your R code:
# Navigate to the webpage ----
cl$navigate(url = "https://www.nea.gov.sg/weather/rain-areas")
Let's get scraping
Now we're going to begin the actual scraping! @EmmanuelHamel took the clever approach of simply clicking on the "play" button in order to launch the automatic "slideshow". He then took a screenshot of the webpage every second in order to capture the changes in the image. The approach I use is somewhat different.
In the code below, I identify the 13 steps of the slideshow (along the horizontal green bar) and I click on each "step" one after the other. After clicking on a step, I get the URL of the image, then I click on the other step... all the way to the 13th step.
Here I get the HTML element for each step:
# Get the selector for each of the 13 steps
rail_steps <- cl$findElements(using = "css", value = "div.vue-slider-mark")[1:13]
Then, I click on each element and get the image URL at each step. After you run this code, check how your code manipulates the webpage on the firefox instance, isn't that cool?
img_urls <- map_chr(rail_steps, function(step){
  cl$mouseMoveToLocation(webElement = step)
  cl$click()
  img_el <- cl$findElement(using = "css", value = "#rain_overlay")
  Sys.sleep(1)
  img_el$getElementAttribute(attrName = "src")[[1]]
})
Finally, I create an image folder img where I download and save the images:
# Create an image folder then download all images in it ----
dir.create("img")
walk(img_urls, function(img_url){
  GET(url = img_url) |>
    content() |>
    writePNG(target = paste0("img/", basename(img_url)))
})
Important
The downloaded images do not contain the background map of the webpage... only the points! You can download the background map and then lay the points on top of it (using image processing software, for example). Here is how to download the background map:
# Download the background map----
GET(url = "https://www.nea.gov.sg/assets/images/map/base-853.png") |>
content() |>
writePNG(target = "base_image.png")
If you want to combine the images programmatically, you may want to look into the magick package in R.
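For example, a minimal magick sketch for overlaying one of the downloaded radar frames on the background map could look like the following; the file names come from the code above, but the resize step and compositing parameters are assumptions you may need to adjust:
library(magick)

base    <- image_read("base_image.png")                        # background map downloaded above
overlay <- image_read(list.files("img", full.names = TRUE)[1]) # first downloaded radar frame

# force the overlay to the base map's dimensions, then stack it on top
info    <- image_info(base)
overlay <- image_resize(overlay, geometry = paste0(info$width, "x", info$height, "!"))
combined <- image_composite(base, overlay, operator = "over")
image_write(combined, path = "combined_1.png")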

Scrape webpage that does not change URL

I’m new to web scraping. I can do the very basic stuff of scraping pages using URLs and css selector tools with R. Now I have run into problems.
For hobby purposes I would like to be able to scrape the following URL:
 https://matchpadel.halbooking.dk/newlook/proc_baner.asp (a time slot booking system for sports)
However, the URL does not change when I navigate to different dates or addresses (‘Område’).
I have read a couple of similar problems suggesting inspecting the webpage, looking under ’Network’ and then ‘XHR’ or ‘JS’ to find the data source of the table, and getting the information from there. I am able to do this, but to be honest, I have no idea what to do from there.
I would like to retrieve data on which time slots are available across dates and addresses (the ‘Område’ dropdown on the webpage).
If anyone is willing to help me and my understanding, it would be greatly appreciated.
Have a nice day!
The website you have linked appears to be rendered dynamically with JavaScript. You need to extract your desired information using the RSelenium library, which opens a browser, and then choose your dropdown and get the data.
The sample code below fires up Firefox and navigates to your website. From there you can write code to select the different ‘Område’ dropdown options, get the rendered table with remdr$getPageSource(), and then use rvest functions to extract the data (see the sketch after the code).
# load libraries
library(RSelenium)
# open browser
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
Sys.sleep(2)
shell(selCommand, wait = FALSE, minimized = TRUE)
Sys.sleep(2)
remdr <- remoteDriver(port = 4567L, browserName = "firefox")
Sys.sleep(10)
remdr$open()
remdr$navigate(url = 'https://matchpadel.halbooking.dk/newlook/proc_baner.asp')
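As a rough continuation (the CSS selectors here are hypothetical placeholders; inspect the page to find the real ones), you could pick a dropdown option and then parse the rendered table from the driver's page source with rvest:
library(rvest)

# hypothetical: click an option in the 'Område' dropdown (replace the selector
# with the real one found by inspecting the page)
option <- remdr$findElement(using = "css", "select#omraade option[value='2']")
option$clickElement()
Sys.sleep(2) # give the page time to re-render

# parse the rendered HTML from the Selenium browser, not from the URL
page <- read_html(remdr$getPageSource()[[1]])
slots <- page %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)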

How to extract the inherent links from the webpage with my code (error: subscript out of bounds)?

I am rather new to webscraping but need data for my PhD project. For this, I am extracting data on different activities of MEPs from the European Parliament's website. Concretely, and where I have problems, I would like to extract the title and especially the link underlying the title of each speech from an MEP's personal page. I use code that has already worked fine several times, but here I only succeed in getting the title of each speech, not the link. For the links I get the error message "subscript out of bounds". I am working with RSelenium because there are several 'load more' buttons on the individual pages that I have to click before extracting the data (which makes rvest a complicated option as far as I can see).
I have basically been trying to solve this for days now, and I really do not know how to get further. I have the impression that the CSS selector is not actually capturing the underlying link (as it extracts the title without problems), but the class has a compound name ("ep-a_heading ep-layout_level2"), so it is not possible to go this way either. I tried rvest as well (ignoring the problem I would then have with the load-more button), but I still do not get to those links.
library(RSelenium)
library(wdman)
library(rvest, warn.conflicts=FALSE)
library(stringr)
server <- phantomjs(port = 7005L)
browser <- remoteDriver(browserName = "phantomjs", port = 7005L)
## this is one of the urls I will use, there are others, constructed all
## the same way and all with the same problem
url <- 'http://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8'
browser$open()
browser$navigate(url)
## now I identify the load more button and click on it as long as there
## is a "load more" button on the page
more <- browser$findElement(using = "css", value = ".erpl-activities-loadmore-button .ep_name")
while (!is.null(more)){
  more$clickElement()
  Sys.sleep(1)
}
## I get an error message doing this in the end but it is working anyway
##(yes, I really am a beginner!)
##Now, what I want to extract are the title of the speech and most
##importantly: the URL.
links <- browser$findElements(using="css", ".ep-layout_level2 .ep_title")
length(links)
## there are 128 Speeches listed on the page
URL <- rep(NA, length(links))
Title <- rep(NA, length(links))
## after having created vectors to store the results, I apply the loop
##function that had worked fine already many times to extract the data I
##want
for (i in 1:length(links)){
  URL[i] <- links[[i]]$getElementAttribute('href')[[1]]
  Title[i] <- links[[i]]$getElementText()[[1]]
}
speeches <- data.frame(Title, URL)
For this example there are 128 speeches on the page, so in the end I would need a table with 128 titles and links. The code works fine when I only try for the title, but for the URLs I get:
`"Error in links[[i]]$getElementAttribute("href")[[1]] : subscript out of bounds"`
Thank you very much for your help, I already read many posts on subscript out of bounds issues in this forum, but unfortunately I still couldn't solve the problem.
Have a great day!
I don't seem to have a problem using rvest to get that info. There is no need for the overhead of using Selenium. You want to target the a tag child of that class, i.e. .ep-layout_level2 a, in order to be able to access an href attribute. The same selector would apply for Selenium.
library(rvest)
library(magrittr)
page <- read_html('https://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8')
titles <- page %>% html_nodes('.ep-layout_level2 .ep_title') %>% html_text() %>% gsub("\\r\\n\\t+", "", .)
links <- page %>% html_nodes('.ep-layout_level2 a') %>% html_attr(., "href")
results <- data.frame(titles,links)
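A small optional follow-up: if any of the extracted hrefs turn out to be relative, xml2::url_absolute() can resolve them against the site root, for example:
library(xml2)
# resolve relative links against the site root (absolute URLs pass through unchanged)
links <- url_absolute(links, "https://www.europarl.europa.eu")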
Here you have a working solution based on the code you provided:
library(RSelenium)
library(wdman)
library(rvest, warn.conflicts=FALSE)
library(stringr)
server <- phantomjs(port=7005L)
browser <- remoteDriver(browserName = "phantomjs", port=7005L)
## this is one of the urls I will use, there are others, constructed all
##the same way and all with the same problem
url <- 'http://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8'
browser$open()
browser$navigate(url)
## now I identify the load more button and click on it as long as there
##is a "load more" button on the page
more <- browser$findElement(using = "class", value = "erpl-activity-loadmore-button")
while (grepl("erpl-activity-loadmore-button", more$getPageSource()[[1]], fixed = TRUE)){
  more$clickElement()
  Sys.sleep(1)
}
## I get an error message doing this in the end but it is working anyway
##(yes, I really am a beginner!)
##Now, what I want to extract are the title of the speech and most
##importantly: the URL.
links <- browser$findElements(using="class", "ep-layout_level2")
## there are 128 Speeches listed on the page
URL <- rep(NA, length(links))
Title <- rep(NA, length(links))
## after having created vectors to store the results, I apply the loop
##function that had worked fine already many times to extract the data I
##want
for (i in 1:length(links)){
  l <- links[[i]]$findChildElement(using = "css", "a")
  URL[i] <- l$getElementAttribute('href')[[1]]
  Title[i] <- links[[i]]$getElementText()[[1]]
}
speeches <- data.frame(Title, URL)
speeches
The main differences are:
- In the first findElement I use value = "erpl-activity-loadmore-button". Indeed, the documentation says that you cannot look for multiple class values at once.
- The same applies when looking for the links.
- In the final loop, you first need to select the link element inside the div you selected, and then read its href attribute.
To answer your question about the error message in the comments after the while loop: when you have pressed the "Load more" button enough times, it becomes invisible but still exists. So when you check !is.null(more) it is TRUE because the button still exists, but when you try to click it you get an error because it is invisible. You can fix this by checking whether it is visible or not, as sketched below.
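A minimal sketch of that visibility check, assuming the same button selector as above (isElementDisplayed() returns a list whose first element is TRUE/FALSE):
# keep clicking "Load more" only while the button is still visible
more <- browser$findElement(using = "class", value = "erpl-activity-loadmore-button")
while (isTRUE(more$isElementDisplayed()[[1]])) {
  more$clickElement()
  Sys.sleep(1)
}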

Triggering doPostBack javascript with RSelenium to scrape multi-page table

I am struggling to 'web-scrape' data from a table which spans over several pages. The pages are linked via javascript.
The data I am interested in is based on the website's search function:
url <- "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0"
I am able to download the first page with the rvest package:
library(rvest)
library(tidyverse)
NI <- read_html(url)
NI.res <- NI %>%
html_nodes("table") %>%
html_table(fill=TRUE)
NI.res <- NI.res[[1]][c(1:10),c(1:5)]
So far so good.
As far as I understand, the RSelenium package is the way forward for navigating websites with JavaScript, i.e. when HTML scraping via changing URLs is not possible. I installed the package and ran it in combination with the Docker Quicktool Box (all working fine):
library (RSelenium)
shell('docker run -d -p 4445:4444 selenium/standalone-chrome')
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100",
                      port = 4445L,
                      browserName = "chrome")
remDr$open()
My hope was that by triggering the JavaScript I could navigate to the next page, repeat the rvest command, and obtain the data contained on the 2nd, 3rd, etc. page (eventually this should be part of a loop or a purrr::map call).
Navigate to the table with search results (1st page):
remDr$navigate("http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/1989&td=01/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0")
Trigger the JavaScript. The content of the JavaScript call is taken from hovering with the mouse over the page index below the table on the website. In the case below, the JavaScript leading to page 2 is triggered:
remDr$executeScript("__doPostBack('ctl00$MainContentPlaceHolder$SearchResultsGridView','Page$2');", args=list("dummy"))
Repeat the scraping with rvest
url <- "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0"
NI <- read_html(url)
NI.res <- NI %>%
html_nodes("table") %>%
html_table(fill=TRUE)
NI.res2 <- NI.res[[1]][c(1:10),c(1:5)]
Unfortunately though, the triggering of the javascript appears not to work. The scraped results are again those from page 1 and not from page 2. I might be missing something rather basic here, but I can't figure out what.
My attempt is partly informed by SO posts here, here and here. I also saw this post.
Context: Eventually, in further steps, I will have to trigger a click on each single finding/row which shows up on all pages and also scrape the information behind each entry. Hence, as far as I understood, RSelenium will be the main tool here.
Grateful for any hint!
UPDATE
I made 'some' progress following the approach suggested here. It's (a) still not doing everything I intend to do and (b) very likely not the most elegant way to do it. But maybe it's of some help to others or opens up a way forward. Note that this approach does not require RSelenium.
I basically created a loop over each JavaScript call (page index) leading to another page of the table I want to scrape. The crucial detail is the __EVENTARGUMENT argument, to which I assign the respective page number (my knowledge of JS is basically zero).
# pgsession and pgform are the rvest session and form objects for the search
# page, created beforehand as in the linked approach
page.list <- list()
for (i in 2:15) {
  target <- paste0("Page$", i)
  page <- rvest:::request_POST(pgsession, "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0",
    body = list(
      `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
      `__EVENTTARGET` = "ctl00$MainContentPlaceHolder$SearchResultsGridView",
      `__EVENTARGUMENT` = target,
      `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
      `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
      `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value
    ),
    encode = "form")
  x <- read_html(page) %>%
    html_nodes(css = "#ctl00_MainContentPlaceHolder_SearchResultsGridView") %>%
    html_table(fill = TRUE) %>%
    as.data.frame()
  d <- x[1:(nrow(x) - 2), c(1:5)]
  page.list[[i]] <- d
}
However, this code is not able to trigger the JavaScript for, or go to, pages that are not visible in the page index below the table when the site is opened (which shows pages 1 - 11). Only pages 2 to 11 can be scraped with this loop. Since the scripts for page 12 and subsequent pages are not visible, they can't be triggered.
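For what it's worth, a likely reason the earlier RSelenium attempt kept returning page 1 is that read_html(url) performs a fresh HTTP request that is independent of the Selenium session; the re-scrape needs to parse the driver's current page source instead. A minimal sketch of that variant, assuming remDr has already navigated to the search URL as above (the selector and the __doPostBack call are the ones from the question):
# rvest and tidyverse are already loaded above
page.list <- list()
for (i in 2:15) {
  # trigger the ASP.NET postback for page i inside the Selenium browser
  remDr$executeScript(
    sprintf("__doPostBack('ctl00$MainContentPlaceHolder$SearchResultsGridView','Page$%d');", i),
    args = list("dummy")
  )
  Sys.sleep(2) # give the table time to refresh

  # parse the browser's current DOM rather than re-downloading the URL
  tab <- remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_nodes("#ctl00_MainContentPlaceHolder_SearchResultsGridView") %>%
    html_table(fill = TRUE)
  page.list[[i]] <- tab[[1]][1:10, 1:5]
}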

R - Waiting for page to load in RSelenium with PhantomJS

I put together a crude scraper that scrapes prices/airlines from Expedia:
# Load RSelenium and start the server
library(RSelenium)
rD <- rsDriver(browser = "phantomjs", verbose = FALSE)
# Assign the client
remDr <- rD$client
# Establish a wait for an element
remDr$setImplicitWaitTimeout(1000)
# Navigate to Expedia.com
appurl <- "https://www.expedia.com/Flights-Search?flight-type=on&starDate=04/30/2017&mode=search&trip=oneway&leg1=from:Denver,+Colorado,to:Oslo,+Norway,departure:04/30/2017TANYT&passengers=children:0,adults:1"
remDr$navigate(appurl)
# Give a crawl delay to see if it gives time to load web page
Sys.sleep(10) # Been testing with 10
###ADD JAVASCRIPT INJECTION HERE###
remDr$executeScript(?)
# Extract Prices
webElem <- remDr$findElements(using = "css", "[class='dollars price-emphasis']")
prices <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(prices)
# Extract Airlines
webElem <- remDr$findElements(using = "css", "[data-test-id='airline-name']")
airlines <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(airlines)
# close client/server
remDr$close()
rD$server$stop()
As you can see, I built in an implicit wait timeout and a Sys.sleep call so that the page has time to load in PhantomJS and so that I don't overload the website with requests.
Generally speaking, when looping over a date range, the scraper works well. However, when looping through 10+ dates consecutively, Selenium sometimes throws a StaleElementReference error and stops the execution. I know the reason for this is that the page has yet to load completely and the class='dollars price-emphasis' doesn't exist yet. The URL construction is fine.
Whenever the page successfully loads all the way, the scraper gets near 60 prices and flights. I'm mentioning this because there are times when the script returns only 15-20 entries (when checking this date normally on a browser, there are 60). Here, I conclude that I'm only finding 20 of 60 elements, meaning the page has only partially loaded.
I want to make this script more robust by injecting JavaScript that waits for the page to fully load prior to looking for elements. I know the way to do this is remDr$executeScript(), and I have found many useful JS snippets for accomplishing this, but due to limited knowledge in JS, I'm having problems adapting these solutions to work syntactically with my script.
Here are several solutions that have been proposed from Wait for page load in Selenium & Selenium - How to wait until page is completely loaded:
Base Code:
remDr$executeScript(
WebDriverWait wait = new WebDriverWait(driver, 20);
By addItem = By.cssSelector("class=dollars price-emphasis");, args = list()
)
Additions to base script:
1) Check for Staleness of an Element
# get the "Add Item" element
WebElement element = wait.until(ExpectedConditions.presenceOfElementLocated(addItem));
# wait the element "Add Item" to become stale
wait.until(ExpectedConditions.stalenessOf(element));
2) Wait for Visibility of element
wait.until(ExpectedConditions.visibilityOfElementLocated(addItem));
I have tried to use
remDr$executeScript("return document.readyState").equals("complete") as a check before proceeding with the scrape, but the page always shows as complete, even if it's not.
Does anyone have any suggestions about how I could adapt one of these solutions to work with my R script? Any ideas on how I could wait entirely for the page to load with nearly 60 found elements? I'm still leaning, so any help would be greatly appreciated.
Solution using while/tryCatch:
remDr$navigate("<webpage url>")
webElem <- NULL
while (is.null(webElem)) {
  webElem <- tryCatch({remDr$findElement(using = 'name', value = "<value>")},
                      error = function(e){NULL})
  # loop until element with name <value> is found in <webpage url>
}
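Applied to the Expedia example above, the same pattern could look like this (the CSS selector is the one from the question; the 0.5-second pause between attempts is an arbitrary choice):
remDr$navigate(appurl)
webElem <- NULL
while (is.null(webElem)) {
  # keep trying until at least one price element is present
  webElem <- tryCatch(remDr$findElement(using = "css", "[class='dollars price-emphasis']"),
                      error = function(e) NULL)
  Sys.sleep(0.5)
}
prices <- unlist(lapply(remDr$findElements(using = "css", "[class='dollars price-emphasis']"),
                        function(x) x$getElementText()))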
To tack on a bit more convenience to Victor's great answer, a common element on tons of pages is body which can be accessed via css. I also made it a function and added a quick random sleep (always good practice). This should work without you needing to assign the element on most web pages with text:
## use double arrow to assign to global environment permanently
# remDr <<- remDr
wetest <- function(sleepmin, sleepmax){
  remDr <- get("remDr", envir = globalenv())
  webElemtest <- NULL
  while (is.null(webElemtest)) {
    webElemtest <- tryCatch({remDr$findElement(using = 'css', "body")},
                            error = function(e){NULL})
    # loop until the body element is found on the current page
  }
  randsleep <- sample(seq(sleepmin, sleepmax, by = 0.001), 1)
  Sys.sleep(randsleep)
}
Usage:
remDr$navigate("https://bbc.com/news")
clickable <- remDr$findElements(using='xpath','//button[contains(@href,"")]')
clickable[[1]]$clickElement()
wetest(sleepmin=.5,sleepmax=1)
