Improve RSelenium loop, reduce time navigating to the page - r

I sometimes have to check that our webpages are loading correctly. This usually involves thousands of links. I wrote a for-loop to do this and would like to know if there is a way to cut down the time. I tried doing parallel processing and couldn't figure it out. I'm new to R and selenium is kind of a mystery to me. It seems that waiting for the $navigate() part is where it hangs. It takes about 3-5 sec for this to be complete before moving to the next one.
I tried having 10 browsers going at once and offset so that some steps could run while the page loads but it didn't seem to work. The last assignment I did took 11 hours for 8,200 links. If I could get this to go faster, I'd like to know how. Any help would be appreciated.
library(RSelenium)
#Reproducible data
URL <- read.table(header=TRUE, stringsAsFactors=FALSE,sep=",",text=
"Link,file,Corr
http://stackoverflow.com/questions,questions.png,0
http://www.google.com/,MATCH.png,0
http://stackoverflow.com/unanswered,unanswered.png,0
http://www.google.com,google.png,0")
#Starts Selenium server
checkForServer()
startServer()
#create browsers
for(i in 1:4)
{
name<- paste("remDr", i, sep="")
assign(name, remoteDriver$new())
get(name)$open()
get(name)$maxWindowSize()
}
#Go to site, & save screenshots
system.time(
for(i in seq(0,4, by=4))
{
#goes to link, takes 2.5 each
remDr1$navigate(URL$Link[i+1])
remDr2$navigate(URL$Link[i+2])
remDr3$navigate(URL$Link[i+3])
remDr4$navigate(URL$Link[i+4])
#waits until page loads or max 30 sec, usually takes no time
remDr1$setTimeout(type = "page load", milliseconds = 30000)
remDr2$setTimeout(type = "page load", milliseconds = 30000)
remDr3$setTimeout(type = "page load", milliseconds = 30000)
remDr4$setTimeout(type = "page load", milliseconds = 30000)
#saves screenshot
remDr1$screenshot(file = URL$file[i+1])
remDr2$screenshot(file = URL$file[i+2])
remDr3$screenshot(file = URL$file[i+3])
remDr4$screenshot(file = URL$file[i+4])
}
)
PS: Perhaps this interests someone: I then compare the screenshots to find the ones that match what I'm looking for
library(raster)
library(doSNOW)
library(doParallel)
#Create cluster for parallel processing
cl <- makeCluster(4)
registerDoSNOW(cl)
#Make a raster of the page I want to find copies of
MATCH <- raster("MATCH.png")
#for loop to compare pics
system.time(
URL$Corr <- unlist(
foreach(h=1:4, .packages = "raster", .verbose = T) %dopar%
{
comp <- raster(URL$file[h])
comp.samp <- round(resample(comp, MATCH, "bilinear"))
round(cor(getValues(MATCH),getValues(comp.samp), use="complete.obs"), 6)
})
)
MATCH <- URL$file[which(URL$Corr==1)]
MATCH #lists pages that match
stopCluster(cl)

Yes in java there a library called selnium grid. You can use selenium grid which can help you distribute the load into multiple browsers in multiple machines. You will need more than 1 machine ( Called nodes where you can run the tests on browsers (upto 5 chrome browser on each node) and fire the tests to a central hub which will control and manage these nodes. This is a good start. You can also find many questions in selenium-grid which can help you.

Related

memory not being released in future with nested plan(multisession)

Recently I've been playing with doing some parallel processing in R using future (and future.apply and furrr) which has been great mostly, but I've stumbled onto something that I can't explain. It's possible that this is a bug somewhere, but it may also be sloppy coding on my part. If anyone can explain this behavior it would be much appreciated.
The setup
I'm running simulations on different subgroups of my data. For each group, I want to run the simulation n times and then calculate some summary stats on the results. Here is some example code to reproduce my basic setup and demonstrate the issue I'm seeing:
library(tidyverse)
library(future)
library(future.apply)
# Helper functions
#' Calls out to `free` to get total system memory used
sys_used <- function() {
.f <- system2("free", "-b", stdout = TRUE)
as.numeric(unlist(strsplit(.f[2], " +"))[3])
}
#' Write time, and memory usage to log file in CSV format
#' #param .f the file to write to
#' #param .id identifier for the row to be written
mem_string <- function(.f, .id) {
.s <- paste(.id, Sys.time(), sys_used(), Sys.getpid(), sep = ",")
write_lines(.s, .f, append = TRUE)
}
# Inputs
fake_inputs <- 1:16
nsim <- 100
nrows <- 1e6
log_file <- "future_mem_leak_log.csv"
if (fs::file_exists(log_file)) fs::file_delete(log_file)
test_cases <- list(
list(
name = "multisession-sequential",
plan = list(multisession, sequential)
),
list(
name = "sequential-multisession",
plan = list(sequential, multisession)
)
)
# Test code
for (.t in test_cases) {
plan(.t$plan)
# loop over subsets of the data
final_out <- future_lapply(fake_inputs, function(.i) {
# loop over simulations
out <- future_lapply(1:nsim, function(.j) {
# in real life this would be doing simulations,
# but here we just create "results" using rnorm()
res <- data.frame(
id = rep(.j, nrows),
col1 = rnorm(nrows) * .i,
col2 = rnorm(nrows) * .i,
col3 = rnorm(nrows) * .i,
col4 = rnorm(nrows) * .i,
col5 = rnorm(nrows) * .i,
col6 = rnorm(nrows) * .i
)
# write memory usage to file
mem_string(log_file, .t$name)
# in real life I would write res to file to read in later, but here we
# only return head of df so we know the returned value isn't filling up memory
res %>% slice_head(n = 10)
})
})
# clean up any leftover objects before testing the next plan
try(rm(final_out))
try(rm(out))
try(rm(res))
}
The outer loop is for testing two parallelization strategies: whether to parallelize over the subsets of data or over the 100 simulations.
Some caveats
I realize that parallelizing over the simulations is not the ideal design, and also that chunking that data to send 10-20 simulations to each core would be more efficient, but that's not really the point here. I'm just trying to understand what is happening in memory.
I also considered that maybe plan(multicore) would be better here (though I'm sure if it would) but I'm more interested in figuring out what's happening with plan(multisession)
The results
I ran this on an 8-vCPU Linux EC2 (I can give more specs if people need them) and created the following plot from the results (plotting code at the bottom for reproducibility):
First off, plan(list(multisession, sequential)) is faster (as expected, see caveat above), but what I'm confused about is the memory profile. The total system memory usage remains pretty constant for plan(list(multisession, sequential)) which I would expect, because I assumed the res object is overwritten each time through the loop.
However, the memory usage for plan(list(sequential, multisession)) steadily grows as the program runs. It appears that each time through the loop the res object is created and then hangs around in limbo somewhere, taking up memory. In my real example this got large enough that it filled my entire (32GB) system memory and killed the process about halfway through.
Plot twist: it only happens when nested
And here's the part that really has me confused! When I changed the outer future_lapply to just regular lapply and set plan(multisession) I don't see it! From my reading of this "Future: Topologies" vignette this should be the same as plan(list(sequential, multisession)) but the plot doesn't show the memory growing at all (in fact, it's almost identical to plan(list(multisession, sequential)) in the above plot)
Note on other options
I actually originally found this with furrr::future_map_dfr() but to be sure it wasn't a bug in furrr, I tried it with future.apply::future_lapply() and got the results shown. I tried to code this up with just future::future() and got very different results, but quite possibly because what I coded up wasn't actually equivalent. I don't have much experience with using futures directly without the abstraction layer provided by either furrr or future.apply.
Again, any insight on this is much appreciated.
Plotting code
library(tidyverse)
logDat <- read_csv("future_mem_leak_log.csv",
col_names = c("plan", "time", "sys_used", "pid")) %>%
group_by(plan) %>%
mutate(
start = min(time),
time_elapsed = as.numeric(difftime(time, start, units = "secs"))
)
ggplot(logDat, aes(x = time_elapsed/60, y = sys_used/1e9, group = plan, colour = plan)) +
geom_line() +
xlab("Time elapsed (in mins)") + ylab("Memory used (in GB)") +
ggtitle("Memory Usage\n list(multisession, sequential) vs list(sequential, multisession)")

R foreach do parallel does not end after loop completion

My goal is to do some operation on a dataframe as follows
exp_info <- data.frame(location.Id = 1:1e7,
x = rnorm(10))
For each location, I want to do the square of the x variable and write the individual file as csv. My actual computation is lengthier and has other stuffs so this is a simplistic example. This is how I am parallelising my task:
library(doParallel)
myClusters <- parallel::makeCluster(6)
doParallel::registerDoParallel(myClusters)
foreach(i = 1:nrow(exp_info),
.packages = c("dplyr","data.table"),
.errorhandling = 'remove',
.verbose = TRUE) %dopar%
{
rowRef <- exp_info[i, ]
rowRef <- rowRef %>% dplyr::mutate(x.sq = x^2)
fwrite(rowRef, paste0(i,'_iteration.csv'))
}
When I look at my working directory, I have all the individual csv files (1e7 csv files)
written out which says the above code is successful. However, my foreach loop does not end
even if all the files are written out and I have to kill the job which also does not generate any error. Does anyone have any idea why this could possibly happen?
I'm experiencing something similar. I don't know the answer but will add this: the same code and operation works on one computer, but fails to exit the for each loop on another computer. Hope this provides some direction.

Maximizing speed of a loop/apply function

I am quite struggling with a huge data set at the moment.
What I would like to do is not very complicated, but the matter is that it is just too slow. In the first step, I need to check whether a website is active or not. For this intention, I used the following code (here with a sample of three API-pathes)
library(httr)
Updated <- function(x){http_error(GET(x))}
websites <- data.frame(c("https://api.crunchbase.com/v3.1/organizations/designpitara","www.twitter.com","www.sportschau.de"))
abc <- apply(websites,1,Updated)
I already noticed that a for loop is pretty much faster than the apply function. However, the full code (which has around 1MIllion APIs to check) still would take around 55 hours to be executed. Any help is appreciated :)
Alternatively, something like this would work for passing multiple libraries to the PSOCK cluster:
clusterEvalQ(cl, {
library(data.table)
library(survival)
})
The primary limiting factor will probably be the time taken to query the website. Currently, you're waiting for each query to return a result before executing the next one. The best way to speed up the workflow would be to execute batches of queries in parallel.
If you're using a Unix system you could try the following:
### Packages ###
library(parallel)
### On your example ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = 3))
### On a larger number of sites ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = detectCores())
### You can even go beyond your machine's core count ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = 40))
However, the precise number of threads at which you saturate your processor/internet connection is kind of dependent upon your machine and your connection.
Alternatively, if you're stuck on Windows:
### For a larger number of sites ###
cl <- makeCluster(detectCores(), type = "PSOCK")
clusterExport(cl, varlist = "websites")
clusterEvalQ(cl = cl, library(httr))
abc <- parSapply(cl = cl, X = websites[[1]], FUN = Updated, USE.NAMES = FALSE)
stopCluster(cl)
In the case of PSOCK clusters, I'm not sure whether there are any benefits to be had from exceeding your machine's core count, although I'm not a Windows person, and I welcome any correction.

Faster download for paginated nested JSON data from API in R?

I am downloading nested json data from the UN's SDG Indicators API, but using a loop for 50,006 paginations is waaayyy too slow to ever complete. Is there a better way?
https://unstats.un.org/SDGAPI/swagger/#!/Indicator/V1SdgIndicatorDataGet
I'm working in RStudio on a Windows laptop. Getting to the json nested data and structuring into a dataframe was a hard-fought win, but dealing with the paginations has me stumped. No response from the UN statistics email.
Maybe an 'apply' would do it? I only need data from 2004, 2007, and 2011 - maybe I can filter, but I don't think that would help the fundamental issue.
I'm probably misunderstanding the API structure - I can't see how querying 50,006 pages individually can be functional for anyone. Thanks for any insight!
library(dplyr)
library(httr)
library(jsonlite)
#Get data from the first page, and initialize dataframe and data types
page1 <- fromJSON("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data", flatten = TRUE)
#Get number of pages from the field of the first page
pages <- page1$totalPages
SDGdata<- data.frame()
for(j in 1:25){
SDGdatarow <- rbind(page1$data[j,1:16])
SDGdata <- rbind(SDGdata,SDGdatarow)
}
SDGdata[1] <- as.character(SDGdata[[1]])
SDGdata[2] <- as.character(SDGdata[[2]])
SDGdata[3] <- as.character(SDGdata[[3]])
#Loop through all the rest of the pages
baseurl <- ("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data")
for(i in 2:pages){
mydata <- fromJSON(paste0(baseurl, "?page=", i), flatten=TRUE)
message("Retrieving page ", i)
for(j in 1:25){
SDGdatarow <- rbind(mydata$data[j,1:16])
rownames(SDGdatarow) <- as.numeric((i-1)*25+j)
SDGdata <- rbind.data.frame(SDGdata,SDGdatarow)
}
}
I do get the data I want, and in a nice dataframe, but inevitably the query has a connection issue after a few hundred pages, or my laptop shuts down etc. It's about 5 seconds a page. 5*50,006/3600 ~= 70 hours.
I think I (finally) figured out a workable solution: I can set the # of elements per page, resulting in a manageable number of pages to call. (I also filtered for just the 3 years I want which reduces the data). Through experimentation I figured out that about 1/10th of the elements download ok, so I set the call to 1/10 per page, with a loop for 10 pages. Takes about 20 minutes, but better than 70 hours, and works without losing the connection.
#initiate the df
SDGdata<- data.frame()
# call to get the # elements with the years filter
page1 <- fromJSON("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011", flatten = TRUE)
perpage <- ceiling(page1$totalElements/10)
ptm <- proc.time()
for(i in 1:10){
SDGpage <- fromJSON(paste0("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011&pageSize=",perpage,"&page=",i), flatten = TRUE)
message("Retrieving page ", i, " :", (proc.time() - ptm) [3], " seconds")
SDGdata <- rbind(SDGdata,SDGpage$data[,1:16])
}

Recommendation when using Sys.sleep() in R with rvest

I am scraping thousands of webpages using the R package rvest. In order not to overload the server, I timed the Sys.sleep() function with 5 seconds.
It works nice until we reach a value of ~400 webpages scraped. However, beyond this value, I get nothing and all data is empty, although an error is not thrown.
I am wondering whether there is any possibility to modify the Sys.sleep() function to scrape 350 webpages by 5 seconds each, then wait for instance 5 minuts, then continue with another 350 webpages... and so on.
I was checking the Sys.sleep() function documentation, and only time appears as an argument. So, if this is not possible to be done with this function, is there any other possibility or function to deal with this problem when scraping a huge bunch of pages?
UPDATE WITH AN EXAMPLE
This is part of my code. The object links is composed of more than 8 thousand links.
title <- vector("character", length = length(links))
short_description <- vector("character", length = length(links))
for(i in seq_along(links)){
Sys.sleep(5)
aff_link <- read_html(links[i])
title[i] <- aff_link %>%
html_nodes("title") %>%
html_text()
short_description[i] <- aff_link %>%
html_nodes(".clp-lead__headline") %>%
html_text()
}
You could add a check on the modulus of a loop variable and do an extra sleep every N iterations. Example:
> for(i in 1:100){
message("Getting page ",i)
Sys.sleep(5)
if((i %% 10) == 0){
message("taking a break")
Sys.sleep(10)
}
}
Every 10 iterations the expression i %% 10 is TRUE and you get an extra 10 seconds sleep.
I can think of more complex solutions but this might work for you.
One other possibility is to check if a page returns any data, and if not, sleep twice as long and try again, repeating this a number of times. Here's some semi-pseudocode:
get_page = function(page){
sleep = 5
for(try in 1:5){
html = get_content(page)
if(download_okay(html)){
return(html)
}
sleep = sleep * 2
Sys.sleep(sleep)
}
return("I tried - but I failed!")
}
Some web page getters like CURL will do this automatically with the right options - there may be a way to wangle that into your code too.

Resources