I am scraping thousands of webpages with the R package rvest. To avoid overloading the server, I put a Sys.sleep() call of 5 seconds between requests.
It works fine until roughly 400 webpages have been scraped. Beyond that point, however, I get nothing back: all the data comes out empty, even though no error is thrown.
I am wondering whether there is any way to modify this so that I scrape 350 webpages at 5 seconds each, then wait for, say, 5 minutes, then continue with another 350 webpages, and so on.
I checked the Sys.sleep() documentation, and time is its only argument. So if this cannot be done with Sys.sleep(), is there another function or approach for dealing with this problem when scraping a large number of pages?
UPDATE WITH AN EXAMPLE
This is part of my code. The object links contains more than 8,000 links.
title <- vector("character", length = length(links))
short_description <- vector("character", length = length(links))

for(i in seq_along(links)){
  Sys.sleep(5)
  aff_link <- read_html(links[i])
  title[i] <- aff_link %>%
    html_nodes("title") %>%
    html_text()
  short_description[i] <- aff_link %>%
    html_nodes(".clp-lead__headline") %>%
    html_text()
}
You could add a check on the modulus of a loop variable and do an extra sleep every N iterations. Example:
for(i in 1:100){
  message("Getting page ", i)
  Sys.sleep(5)
  if((i %% 10) == 0){
    message("taking a break")
    Sys.sleep(10)
  }
}
Every 10 iterations i %% 10 is 0, so the condition (i %% 10) == 0 is TRUE and you get an extra 10-second sleep.
I can think of more complex solutions but this might work for you.
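Applied to the numbers in the question (350 pages at 5 seconds each, then a 5-minute break), a minimal sketch around the question's own loop would be:
library(rvest)

for(i in seq_along(links)){
  Sys.sleep(5)                  # normal 5-second pause between pages
  aff_link <- read_html(links[i])
  # ... extract title and short_description as in the question ...
  if((i %% 350) == 0){
    message("taking a break")
    Sys.sleep(5 * 60)           # extra 5-minute pause after every 350 pages
  }
}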
One other possibility is to check whether a page actually returns any data, and if not, sleep twice as long and try again, repeating this a number of times. Here's a sketch of that idea; the "download okay" check below simply looks for a <title> node, so adapt it to whatever signals an empty page on your site:
library(rvest)

get_page <- function(page){
  sleep <- 5
  for(try in 1:5){
    # a failed request becomes NULL instead of stopping the loop
    html <- tryCatch(read_html(page), error = function(e) NULL)
    # "download okay": the request worked and the page has a <title>
    if(!is.null(html) && length(html_nodes(html, "title")) > 0){
      return(html)
    }
    # back off: double the wait before the next attempt
    sleep <- sleep * 2
    Sys.sleep(sleep)
  }
  return("I tried - but I failed!")
}
Some HTTP clients, such as curl, will do this automatically with the right options; there may be a way to work that into your code too.
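For example, R's httr package exposes this kind of retry-with-backoff directly; a minimal sketch (the parameter values below are just illustrative, not tuned):
library(httr)
library(rvest)

# RETRY() repeats the request up to `times` attempts, increasing the pause
# between attempts starting from `pause_base` seconds, capped at `pause_cap`
resp <- RETRY("GET", links[i], times = 5, pause_base = 5, pause_cap = 120)
aff_link <- read_html(content(resp, as = "text"))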
Related
I use R to parse XML data from a website. I have a list of 20,000 rows with URLs from which I need to extract data. I have code that gets the job done using a for loop, but it's very slow (it takes approximately 12 hours). I thought of using parallel processing (I have access to several CPUs) to speed it up, but I cannot make it work properly. Would it be more efficient to use a data table instead of a data frame? Is there any way to speed the process up? Thanks!
for (i in 1:nrow(list)) {
  t <- xmlToDataFrame(xmlParse(read_xml(list$path[i]))) #Read the data into a file
  t$ID <- list$ID[i]
  emptyDF <- bind_rows(all, t) #Bind all into one file
  if (i / 10 == floor(i / 10)) {
    print(i)
  } #print every 10th value to monitor progress of the loop
}
This script should point you in the correct direction:
t <- list()
for (i in 1:nrow(list)) {
  tempdf <- xmlToDataFrame(xmlParse(list$path[i])) #Read the data into a file
  tempdf$ID <- list$ID[i]
  t[[i]] <- tempdf
  if (i %% 10 == 0) {
    print(i)
  } #print every 10th value to monitor progress of the loop
}
answer <- bind_rows(t) #Bind all into one file
Instead of a for loop, an lapply would also work here. Without any sample data, this is untested.
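For reference, a minimal lapply sketch of the same idea (also untested, assuming the same list data frame with path and ID columns as above):
library(XML)   # xmlParse, xmlToDataFrame
library(dplyr) # bind_rows

# build one data frame per URL, then bind them all at once at the end
t <- lapply(seq_len(nrow(list)), function(i) {
  tempdf <- xmlToDataFrame(xmlParse(list$path[i]))
  tempdf$ID <- list$ID[i]
  tempdf
})
answer <- bind_rows(t)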
I am downloading nested JSON data from the UN's SDG Indicators API, but looping over 50,006 paginated requests is far too slow to ever complete. Is there a better way?
https://unstats.un.org/SDGAPI/swagger/#!/Indicator/V1SdgIndicatorDataGet
I'm working in RStudio on a Windows laptop. Getting at the nested JSON data and structuring it into a data frame was a hard-fought win, but dealing with the pagination has me stumped. No response from the UN statistics email address.
Maybe an 'apply' would do it? I only need data from 2004, 2007, and 2011 - maybe I can filter, but I don't think that would help the fundamental issue.
I'm probably misunderstanding the API structure - I can't see how querying 50,006 pages individually can be functional for anyone. Thanks for any insight!
library(dplyr)
library(httr)
library(jsonlite)
#Get data from the first page, and initialize dataframe and data types
page1 <- fromJSON("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data", flatten = TRUE)
#Get number of pages from the field of the first page
pages <- page1$totalPages
SDGdata<- data.frame()
for(j in 1:25){
  SDGdatarow <- rbind(page1$data[j, 1:16])
  SDGdata <- rbind(SDGdata, SDGdatarow)
}
SDGdata[1] <- as.character(SDGdata[[1]])
SDGdata[2] <- as.character(SDGdata[[2]])
SDGdata[3] <- as.character(SDGdata[[3]])
#Loop through all the rest of the pages
baseurl <- ("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data")
for(i in 2:pages){
  mydata <- fromJSON(paste0(baseurl, "?page=", i), flatten = TRUE)
  message("Retrieving page ", i)
  for(j in 1:25){
    SDGdatarow <- rbind(mydata$data[j, 1:16])
    rownames(SDGdatarow) <- as.numeric((i - 1) * 25 + j)
    SDGdata <- rbind.data.frame(SDGdata, SDGdatarow)
  }
}
I do get the data I want, and in a nice data frame, but inevitably the query hits a connection issue after a few hundred pages, or my laptop shuts down, etc. It takes about 5 seconds per page: 5 * 50,006 / 3600 ≈ 70 hours.
I think I (finally) figured out a workable solution: I can set the number of elements per page, which results in a manageable number of pages to request. (I also filtered for just the three years I want, which reduces the data.) Through experimentation I found that about 1/10th of the elements download OK in a single request, so I set the call to fetch 1/10th per page, with a loop over 10 pages. It takes about 20 minutes, but that's better than 70 hours, and it works without losing the connection.
#initiate the df
SDGdata <- data.frame()
# call to get the # elements with the years filter
page1 <- fromJSON("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011", flatten = TRUE)
perpage <- ceiling(page1$totalElements / 10)
ptm <- proc.time()
for(i in 1:10){
  SDGpage <- fromJSON(paste0("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011&pageSize=", perpage, "&page=", i), flatten = TRUE)
  message("Retrieving page ", i, ": ", (proc.time() - ptm)[3], " seconds")
  SDGdata <- rbind(SDGdata, SDGpage$data[, 1:16])
}
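If the connection still drops partway through, one possible safeguard (the file names below are just illustrative) is to write each page to disk as it arrives, reusing perpage and the jsonlite call from above, so a restart can reuse what has already been downloaded instead of starting over:
for(i in 1:10){
  SDGpage <- fromJSON(paste0("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011&pageSize=", perpage, "&page=", i), flatten = TRUE)
  # save each page as it arrives so nothing is lost if the connection drops
  saveRDS(SDGpage$data[, 1:16], file = sprintf("sdg_page_%02d.rds", i))
}

# rebuild the full data frame from the saved pieces (possibly after a restart)
files <- sort(list.files(pattern = "^sdg_page_\\d+\\.rds$"))
SDGdata <- do.call(rbind, lapply(files, readRDS))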
To make a long story short, I'm trying to gather information on 6,500 users, so I wrote a loop. Below you can find an example with 10 artists. In this loop I'm trying to use an API call to gather information on all tracks of a user.
test <- fromJSON(getURL('http://api.soundcloud.com/users/52880864/tracks?client_id=0ab2657a7e5b63b6dbc778e13c834e3d&limit=200&offset=1&linked_partitioning=1'))
This short example shows a data frame with all the tracks uploaded by a user. When I use my loop I'd like to add all the data frames together so that I can process them with tapply. That way I can, for instance, see what the sum of all track likes is. However, two things are going wrong. First, when I run the loop, each user only shows one uploaded track. Second, I think I'm not combining the data frames properly. Could somebody please explain to me what I'm doing wrong?
id <- c(20376298, 63320169, 3806325, 12231483, 18838035, 117385796, 52880864, 32704993, 63975320, 95667573)
Partition1 <- paste0("'http://api.soundcloud.com/users/", id, "/tracks?client_id=0ab2657a7e5b63b6dbc778e13c834e3d&limit=200&offset=1&linked_partitioning=1'")
results <- vector(mode = "list", length = length(Partition1))
for (i in seq_along(Partition1)){
  message(paste0('Query #', i))
  tryCatch({
    result_i <- fromJSON(getURL(str_replace_all(Partition1[i], "'", "")))
    clean_i <- function(x) ifelse(is.null(x), NA, ifelse(length(x) == 0, NA, x))
    results[[i]] <- plyr::llply(result_i, clean_i) %>% as_data_frame
    if (i == 4) {
      stop('stop')
    }
  }, error = function(e){
    beepr::beep(1)
  })
  Sys.sleep(0.5)
}
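On the combining step, a minimal sketch of what the post describes, assuming results is the list built in the loop above (the column names in the tapply line are assumptions about the API response, not verified):
library(dplyr)

# stack the per-user data frames into one, recording which query each row came from
all_tracks <- bind_rows(results, .id = "query")

# e.g. sum of likes per user, assuming columns named favoritings_count and user_id exist
tapply(all_tracks$favoritings_count, all_tracks$user_id, sum, na.rm = TRUE)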
I sometimes have to check that our webpages are loading correctly. This usually involves thousands of links. I wrote a for loop to do this and would like to know if there is a way to cut down the time. I tried parallel processing but couldn't figure it out. I'm new to R, and Selenium is something of a mystery to me. It seems that waiting for the $navigate() step is where it hangs: it takes about 3-5 seconds to complete before moving on to the next link.
I tried having 10 browsers going at once, offset so that some steps could run while a page loads, but it didn't seem to work. The last assignment I did took 11 hours for 8,200 links. If I could make this go faster, I'd like to know how. Any help would be appreciated.
library(RSelenium)
#Reproducible data
URL <- read.table(header=TRUE, stringsAsFactors=FALSE,sep=",",text=
"Link,file,Corr
http://stackoverflow.com/questions,questions.png,0
http://www.google.com/,MATCH.png,0
http://stackoverflow.com/unanswered,unanswered.png,0
http://www.google.com,google.png,0")
#Starts Selenium server
checkForServer()
startServer()
#create browsers
for(i in 1:4){
  name <- paste("remDr", i, sep = "")
  assign(name, remoteDriver$new())
  get(name)$open()
  get(name)$maxWindowSize()
}
#Go to site, & save screenshots
system.time(
  for(i in seq(0, 4, by = 4)){
    #goes to link, takes 2.5 each
    remDr1$navigate(URL$Link[i + 1])
    remDr2$navigate(URL$Link[i + 2])
    remDr3$navigate(URL$Link[i + 3])
    remDr4$navigate(URL$Link[i + 4])
    #waits until page loads or max 30 sec, usually takes no time
    remDr1$setTimeout(type = "page load", milliseconds = 30000)
    remDr2$setTimeout(type = "page load", milliseconds = 30000)
    remDr3$setTimeout(type = "page load", milliseconds = 30000)
    remDr4$setTimeout(type = "page load", milliseconds = 30000)
    #saves screenshot
    remDr1$screenshot(file = URL$file[i + 1])
    remDr2$screenshot(file = URL$file[i + 2])
    remDr3$screenshot(file = URL$file[i + 3])
    remDr4$screenshot(file = URL$file[i + 4])
  }
)
PS: Perhaps this interests someone: I then compare the screenshots to find the ones that match what I'm looking for.
library(raster)
library(doSNOW)
library(doParallel)
#Create cluster for parallel processing
cl <- makeCluster(4)
registerDoSNOW(cl)
#Make a raster of the page I want to find copies of
MATCH <- raster("MATCH.png")
#for loop to compare pics
system.time(
  URL$Corr <- unlist(
    foreach(h = 1:4, .packages = "raster", .verbose = T) %dopar% {
      comp <- raster(URL$file[h])
      comp.samp <- round(resample(comp, MATCH, "bilinear"))
      round(cor(getValues(MATCH), getValues(comp.samp), use = "complete.obs"), 6)
    }
  )
)
MATCH <- URL$file[which(URL$Corr==1)]
MATCH #lists pages that match
stopCluster(cl)
Yes, in Java there is a tool called Selenium Grid. You can use Selenium Grid to distribute the load across multiple browsers on multiple machines. You will need more than one machine: nodes, where the tests run in browsers (up to 5 Chrome browsers on each node), and a central hub that controls and manages these nodes and to which you fire the tests. This is a good start. You can also find many questions under the selenium-grid tag that can help you.
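From R, RSelenium's remoteDriver can be pointed at a grid hub instead of a local server; a minimal sketch (the hub address below is an assumption for illustration, and 4444 is the hub's default port):
library(RSelenium)

# connect to a Selenium Grid hub; the hub hands the session to a free node
remDr <- remoteDriver(remoteServerAddr = "192.168.0.10",  # hypothetical hub IP
                      port = 4444L,
                      browserName = "chrome")
remDr$open()
remDr$navigate("http://www.google.com/")
remDr$screenshot(file = "google.png")
remDr$close()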
I am using rvest to webscrape in R, and I'm running into memory issues. I have a 28,625 by 2 data frame of strings called urls that contains the links to the pages I'm scraping. A row of the frame contains two related links. I want to generate a 28,625 by 4 data frame Final with information scraped from the links. One piece of information is from the second link in a row, and the other three are from the first link. The xpaths to the three pieces of information are stored as strings in the vector xpaths. I am doing this with the following code:
data <- rep("", 4 * 28625)
k <- 1
for (i in 1:28625) {
  name <- html(urls[i, 2]) %>%
    html_node(xpath = '//*[@id="seriesDiv"]/table') %>%
    html_table(fill = T)
  data[k] <- name[4, 3]
  data[k + 1:3] <- html(urls[i, 1]) %>%
    html_nodes(xpath = xpaths) %>%
    html_text()
  k <- k + 4
}
dim(data) <- c(4, 28625)
Final <- as.data.frame(t(data))
It works well enough, but when I open the task manager, I see that my memory usage has been monotonically increasing and is currently at 97% after about 340 iterations. I'd like to just start the program and come back in a day or two, but all of my RAM will be exhausted before the job is done. I've done a bit of research on how R allocates memory, and I've tried my best to preallocate memory and modify in place, to keep the code from making unnecessary copies of things, etc.
Why is this so memory intensive? Is there anything I can do to resolve it?
rvest has been updated to resolve this issue. See here:
http://www.r-bloggers.com/rvest-0-3-0/
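After updating, the deprecated html() calls in the question become read_html(); a minimal sketch of the first extraction, reusing the question's urls object and xpath:
library(rvest)

# read_html() replaces the deprecated html() in rvest >= 0.3.0
name <- read_html(urls[i, 2]) %>%
  html_node(xpath = '//*[@id="seriesDiv"]/table') %>%
  html_table(fill = TRUE)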